Preface


1 What Is The Companion?

The Princeton Companion to Applied Mathematics de- 
scribes what applied mathematics is about, why it 
is important, its connections with other disciplines, 
and some of the main areas of current research. It 
also explains what applied mathematicians do, which 
includes not only studying the subject itself but also 
writing about mathematics, teaching it, and influencing 
policy makers. 

The Companion differs from an encyclopedia in that 
it is not an exhaustive treatment of the subject, and it 
differs from a handbook in that it does not cover all 
relevant methods and techniques. Instead, the aim is 
to offer a broad but selective coverage that conveys 
the excitement of modern applied mathematics while 
also giving an appreciation of its history and the out- 
standing challenges. The Companion focuses on topics 
felt by the editors to be of enduring interest, and so it 
should remain relevant for many years to come. 

With online sources of information about mathemat- 
ics growing ever more extensive, one might ask what 
role a printed volume such as this has. Certainly, one 
can use Google to search for almost any topic in the 
book and find relevant material, perhaps on Wikipedia. 
What distinguishes The Companion is that it is a self- 
contained, structured reference work giving a consis- 
tent treatment of the subject. The content has been 
curated by an editorial board of applied mathemati- 
cians with a wide range of interests and experience, the 
articles have been written by leading experts and have 
been rigorously edited and copyedited, and the whole 
volume is thoroughly cross-referenced and indexed. 

Within each article, the authors and editors have tried 
hard to convey the motivation for each topic or concept 
and the basic ideas behind it, while avoiding unnec- 
essary detail. It is hoped that The Companion will be 
seen as a friendly and inspiring reference, containing 
both standard material and more unusual, novel, or 
unexpected topics. 


2 Scope 

It is difficult to give a precise definition of applied
mathematics, as discussed in what is applied mathematics? [I.1]
and, from a historical perspective, in the history
of applied mathematics [I.6]. The Companion
treats applied mathematics in a broad sense, and it 
cannot cover all aspects in equal depth. Some parts 
of mathematical physics are included, though a full 
treatment of modern fundamental theories is not given. 
Statistics and probability are not explicitly included, 
although a number of articles make use of ideas from 
these subjects, and in particular the burgeoning area of 
uncertainty quantification [II.34] brings together
many ideas from applied mathematics and statistics. 
Applied mathematics increasingly makes use of algo- 
rithms and computation, and a number of aspects at 
the interface with computer science are included. Some 
parts of discrete and combinatorial mathematics are 
also covered. 

3 Audience 

The target audience for The Companion is mathe- 
maticians at undergraduate level or above; students, 
researchers, and professionals in other subjects who 
use mathematics; and mathematically interested lay 
readers. Some articles will also be accessible to stu- 
dents studying mathematics at pre-university level. 

Prospective research students might use the book to 
obtain some idea of the different areas of applied math- 
ematics that they could work in. Researchers who reg- 
ularly attend seminars in areas outside their own spe- 
cialities should find that the articles provide a gentle 
introduction to some of these areas, making good pre- 
or post-seminar reading. 

In soliciting and editing the articles the editors aimed 
to maximize accessibility by keeping discussions at the 
lowest practical level. A good question is how much 
of the book a reader should expect to understand. 
Of course “understanding” is an imprecisely defined 
concept. It is one thing to read along with an argument
and find it plausible, or even convincing, but another 
to reproduce it on a blank piece of paper, as every 
undergraduate discovers at exam time. The very wide 
range of topics covered means that it would take a 
reader with an unusually broad knowledge to under- 
stand everything, but every reader from undergradu- 
ate level upward should find a substantial portion of 
the book accessible. 

4 Organization 

The Companion is organized in eight parts, which are 
designed to cut across applied mathematics in different 
ways. 

Part I, “Introduction to Applied Mathematics,” begins
by discussing what applied mathematics is and giving
examples of the use of applied mathematics in
everyday life. The language of applied mathematics [I.2]
then presents basic definitions, notation, and
concepts that are needed frequently in other parts of
the book, essentially giving a brief overview of some
key parts of undergraduate mathematics. This article
is not meant to be a complete survey, and many
later articles provide other introductory material themselves.
Methods of solution [I.3] describes some general
solution techniques used in applied mathematics.
Algorithms [I.4] explains the concept of an algorithm,
giving some important examples and discussing complexity
issues. The presence of this article in part I
reflects the increasing importance of algorithms in all
areas of applied mathematics. Goals of applied mathematical
research [I.5] describes the kinds of questions
and issues that research in applied mathematics
addresses and discusses some strategic aspects of carrying
out research. Finally, the history of applied
mathematics [I.6] describes the history of the subject
from ancient times up until the late twentieth century.

Part II, “Concepts,” comprises short articles that 
explain specific concepts and their significance. These 
are mainly concepts that cut across different models 
and areas and provide connections to other parts of the 
book. This part is not meant to be comprehensive, and 
many other concepts are well described in later articles 
(and discoverable via the index). 

Part III, “Equations, Laws, and Functions of Applied 
Mathematics,” treats important examples of what its 
title describes. The choice of what to include was based 
on a mix of importance, accessibility, and interest. 
Many equations, laws, and functions not contained in 
this part are included in other articles. 


Part IV, “Areas of Applied Mathematics,” contains 
longer articles giving an overview of the whole sub- 
ject and how it is organized, arranged by research 
area. The aim of this part is to convey the breadth, 
depth, and diversity of applied mathematics research. 
The coverage is not comprehensive, but areas that 
do not appear as or in article titles may neverthe- 
less be present in other articles. For example, there is 
no article on geoscience, yet earth system dynamics [IV.30],
inverse problems [IV.15], and imaging the earth using
Green’s theorem [VII.16] all cover
specific aspects of this area. Nor is there a part IV
article on numerical analysis, but this area is represented
by approximation theory [IV.9], numerical linear algebra
and matrix analysis [IV.10], continuous optimization
(nonlinear and linear programming) [IV.11], numerical
solution of ordinary differential equations [IV.12], and
numerical solution of partial differential equations [IV.13].

Part V, “Modeling,” gives a selection of mathemati- 
cal models, explaining how the models are derived and 
how they are solved. 

Part VI, “Example Problems,” contains short articles 
covering a variety of interesting applied mathematics 
problems. 

Part VII, “Application Areas,” comprises articles on 
connections between applied mathematics and other 
disciplines, including such diverse topics as integrated 
circuit (chip) design, medical imaging, and the screen- 
ing of luggage in airports. 

Part VIII, “Final Perspectives,” contains essays on 
broader aspects, including reading, writing, and type- 
setting mathematics; teaching applied mathematics; 
and how to influence government as a mathematician. 

The articles within a given part vary significantly in 
length. This should not be taken as an indication of the 
importance of the corresponding topic, as it is partly 
due to the number of pages that could be allocated to 
each article, as well as to how authors responded to 
their given page limit. 

The ordering of articles within a part is alphabeti- 
cal for parts II and III. For part IV some attempt was 
made to place related articles together and to place 
one article before another if there is a natural order 
in which to read the two articles. The ordering is never- 
theless somewhat arbitrary, and the reader should feel 
free to read the articles in any order. The articles within 
parts V-VIII are arranged only loosely by theme. 

The editors made an effort to encourage illustrations
and diagrams. Due to the cost of reproduction, color 
has been used only where necessary and is restricted 
to the color plates following page 364. 

Despite the careful organization, the editors expect 
that many readers will flick through the book to find 
something interesting, start reading, and by following 
cross-references navigate the book in an unpredictable 
fashion. This approach is perfectly reasonable, as is 
starting from a page found via the table of contents 
or the index. Whatever the reading strategy, we hope 
the book will be hard to put down! 

5 Relation to The Princeton 
Companion to Mathematics 

This Companion follows the highly successful Prince- 
ton Companion to Mathematics (PCM), edited by Gow- 
ers, Barrow-Green, and Leader (2008), which focuses on 
modern pure mathematics. We have tried to build on 
the PCM and avoid overlap with it. Thus we do not cover 
many of the basic mathematical concepts treated in 
parts I and II of the PCM, but rather assume the reader 
is familiar with them. 

Some crucial concepts that are already in the PCM 
are included here as well, but they are approached 
from a rather different viewpoint, typically with more 
discussion of applications and computational aspects. 

Some articles in the PCM, listed in table 1, could 
equally well have appeared here, and the editors there- 
fore made a decision not to solicit articles on these 
same general topics. However, particular aspects of 
several of these topics are included. 

6 How to Use This Book 

Authors were asked to make their articles as self- 
contained as possible and to define specific notation 
and technical terms. You should familiarize yourself 
with the material in the language of applied mathematics
[I.2], as this background is assumed for many
of the later articles. If you are unsure about notation
consult table 3 in I.2 or, for a definition, see if it is in
the index. The editors, with the help of a professional 
indexer, have tried very hard to produce a thorough and 
usable index, so there is a good chance that the index 
will lead you to a place in the book where a particular 
piece of notation or a definition is clarified. 

The extensive cross-references provide links between
articles. A phrase such as “this vector can be computed
by the fast Fourier transform [II.10]” indicates that
article 10 in part II contains more information on the
fast Fourier transform.

Table 1 Relevant articles from The Princeton Companion
to Mathematics whose topic is not duplicated here.

Title                                     PCM article number
Mathematics and Chemistry                 VII.1
Mathematical Biology                      VII.2
Wavelets and Applications                 VII.3
The Mathematics of Traffic in Networks    VII.4
Mathematics and Cryptography              VII.7
Mathematical Statistics                   VII.10
Mathematics and Medical Statistics        VII.11
Mathematics and Music                     VII.13

In the research literature it is normal to support 
statements by citations to items in a bibliography. It 
is a style decision of The Companion not to include 
citations in the articles. The articles present core know- 
ledge that is generally accepted in the field, and omit- 
ting citations makes for a smoother reading experience 
as the reader does not constantly have to refer to a list 
of references a few pages away. Many authors found it 
quite difficult to adopt this style, being so used to liberally
sprinkling \cite commands through their LaTeX
documents! Most articles have a further reading sec- 
tion, which provides a small number of sources that 
provide an entree into the literature on that topic. 

7 The Companion Project 

I was invited to lead the project in late 2009. After the 
editorial board was assembled and the format of the 
book and the outline of its contents were agreed at 
a meeting of the board in Manchester, invitations to 
authors began to go out in 2011. We aimed to invite 
authors who are both leaders in their field and excellent 
writers. We were delighted with the high acceptance 
rate. 

Ludwig Boltzmann was a contributor to the six- 
volume Encyclopedia of Mathematical Sciences (1898- 
1933) edited by Felix Klein, Wilhelm Meyer, and others. 
In 1905, he wrote, apropos of the selected author of an 
article: 

He must first be persuaded to promise a contribution; 
then, he must be instructed and pressed with all means 
of persuasion to write a contribution which fits into the 
general framework; and last, but not least, he must be 
urged to fulfill his promise in a timely matter. 

Over a hundred years later these comments remain
true, and one of the reasons for the long gestation 
period of a book such as this is that it does take a long 
time to sign up authors and collect articles. Occasion- 
ally, we were unable to find an author willing and able 
to deliver an article on a particular topic, so a small 
number of topics that we would have liked to include 
had to be omitted. 

Of the 165 authors, at least two had babies dur- 
ing the course of preparing their articles. Sadly, one 
author, David Broomhead, did not live to see the project 
completed. 

If the project has gone well, one of the reasons is 
the thoroughly professional and ever-cheerful support 
provided by Sam Clark of T&T Productions Ltd in Lon- 
don. Sam acted as project manager, copy editor, and 
typesetter, and made the process as smooth and pain- 
less for the editors and contributors as it could be. Sam 
played the same role for The Princeton Companion
to Mathematics, and his experience of that project was 
invaluable. 

Part I

Introduction to 
Applied Mathematics 


I.1 What Is Applied Mathematics?

Nicholas J. Higham 


1 The Big Picture 

Applied mathematics is a large subject that interfaces 
with many other fields. Trying to define it is problem- 
atic, as noted by William Prager and Richard Courant, 
who set up two of the first centers of applied mathemat- 
ics in the United States in the first half of the twentieth 
century, at Brown University and New York University, 
respectively. They explained that: 

Precisely to define applied mathematics is next to 
impossible. It cannot be done in terms of subject matter:
the borderline between theory and application is
highly subjective and shifts with time. Nor can it be 
done in terms of motivation: to study a mathematical 
problem for its own sake is surely not the exclusive 
privilege of pure mathematicians. Perhaps the best I 
can do within the framework of this talk is to describe 
applied mathematics as the bridge connecting pure 
mathematics with science and technology. 

Prager (1972) 

Applied mathematics is not a definable scientific field
but a human attitude. The attitude of the applied sci- 
entist is directed towards finding clear cut answers 
which can stand the test of empirical observation. To 
obtain the answers to theoretically often insuperably 
difficult problems, he must be willing to make com- 
promises regarding rigorous mathematical complete- 
ness; he must supplement theoretical reasoning by 
numerical work, plausibility considerations and so on.

Courant (1965) 

Garrett Birkhoff offered the following view in 1977,
with reference to the mathematician and physicist Lord
Rayleigh (John William Strutt, 1842-1919):

Essentially, mathematics becomes “applied” when it is
used to solve real-world problems “neither seeking nor
avoiding mathematical difficulties” (Rayleigh).

Figure 1 The main steps in solving
a problem in applied mathematics.

Rather than define what applied mathematics is, one 
can describe the methods used in it. Peter Lax stated of 
these methods, in 1989, that: 

Some of them are organic parts of pure mathemat- 
ics: rigorous proofs of precisely stated theorems. 
But for the greatest part the applied mathematician 
must rely on other weapons: special solutions, asymp- 
totic description, simplified equations, experimenta- 
tion both in the laboratory and on the computer. 

Here, instead of attempting to give our own definition 
of applied mathematics we describe the various facets 
of the subject, as organized around solving a problem. 
The main steps are described in figure 1. Let us go 
through each of these steps in turn. 


Modeling a problem. Modeling is about taking a physical
problem and developing equations (differential,
difference, integral, or algebraic) that capture the
essential features of the problem and so can be used
to obtain qualitative or quantitative understanding of 
its behavior. Here, “physical problem” might refer to 
a vibrating string, the spread of an infectious disease, 
or the influence of people participating in a social net- 
work. Modeling is necessarily imperfect and requires 
simplifying assumptions. One needs to retain enough 
aspects of the system being studied that the model 
reproduces the most important behavior but not so 
many that the model is too hard to analyze. Different 
types of models might be feasible (continuous, discrete, 
stochastic), and for a given type there can be many 
possibilities. Not all applied mathematicians carry out 
modeling; in fact, most join the process at the next step. 

Analyzing the mathematical problem. The equations 
formulated in the previous step are now analyzed and, 
ideally, solved. In practice, an explicit, easily evalu- 
ated solution usually cannot be obtained, so approxi- 
mations may have to be made, e.g., by discretizing a dif- 
ferential equation, producing a reduced problem. The 
techniques necessary for the analysis of the equations 
or reduced problem may not exist, so this step may 
involve developing appropriate new techniques. If ana- 
lytic or perturbation methods have been used then the 
process may jump from here directly to validation of 
the model. 

Developing algorithms. It may be possible to solve 
the reduced problem using an existing algorithm: a
sequence of steps that can be followed mechanically 
without the need for ingenuity. Even if a suitable algo- 
rithm exists it may not be fast or accurate enough, may 
not exploit available structure or other problem fea- 
tures, or may not fully exploit the architecture of the 
computer on which it is to be run. It is therefore often 
necessary to develop new or improved algorithms. 

Writing software. In order to use algorithms on a 
computer it is necessary to implement them in soft- 
ware. Writing reliable, efficient software is not easy, 
and depending on the computer environment being tar- 
geted it can be a highly specialized task. The necessary 
software may already be available, perhaps in a package 
or program library. If it is not, software is ideally devel- 
oped and documented to a high standard and made 
available to others. In many cases the software stage 
consists simply of writing short programs, scripts, or
notebooks that carry out the necessary computations
and summarize the results, perhaps graphically. 

Computational experiments. The software is now 
run on problem instances and solutions obtained. The 
computations could be numeric or symbolic, or a mix- 
ture of the two. 

Validation of the model. The final step is to take the 
results from the experiments (or from the analysis, if 
the previous three steps were not needed), interpret 
them (which may be a nontrivial task), and see if they 
agree with the observed behavior of the original sys- 
tem. If the agreement is not sufficiently good then the 
model can be modified and the loop through the steps 
repeated. The validation step may be impossible, as the 
system in question may not yet have been built (e.g., a 
bridge or a building). 

Other important tasks for some problems, which 
are not explicitly shown in our outline, are to cali- 
brate parameters in a model, to quantify the uncer- 
tainty in these parameters, and to analyze the effect 
of that uncertainty on the solution of the problem. 
These steps fall under the heading of uncertainty
quantification [II.34].

Once all the steps have been successfully completed 
the mathematical model can be used to make predic- 
tions, compare competing hypotheses, and so on. A 
key aim is that the mathematical analysis gives new 
insights into the physical problem, even though the 
mathematical model may be a simplification of it. 

A particular applied mathematician is most likely to 
work on just some of the steps; indeed, except for rela- 
tively simple problems it is rare for one person to have 
the skills to carry out the whole process from modeling 
to computer solution and validation. 

In some cases the original problem may have been 
communicated by a scientist in a different field. A sig- 
nificant effort can be required to understand what the 
mathematical problem is and, when it is eventually 
solved, to translate the findings back into the language 
of the relevant field. Being able to talk to people out- 
side mathematics is therefore a valuable skill for the 
applied mathematician. 

It would be wrong to give the impression that all 
applied mathematics is done in the context of model- 
ing. Frequently, a mathematical problem will be tack- 
led because of its inherent interest (see the quote from 
Prager above) with the hope or expectation that a rel- 
evant application will be found. Indeed some applied 
mathematicians spend their whole careers working in
this way. There are many examples of mathemati- 
cal results that provide the foundations for impor- 
tant practical applications but were developed without 
knowledge of those applications (sections 3.1 and 3.2 
provide such examples). 

Before the twentieth century, applied mathematics 
was driven by problems in astronomy and mechan- 
ics. In the twentieth century physics became the main 
driver, with other areas such as biology, chemistry, eco- 
nomics, engineering, and medicine also providing many 
challenging mathematical problems from the 1950s 
onward. With the massive and still-growing amounts 
of data available to us in today’s digital society we can 
expect information, in its many guises, to be an increas- 
ingly important influence on applied mathematics in 
the twenty-first century. 

For more on the definition and history of applied 
mathematics, including the development of the term 
“applied mathematics,” see the article history of
applied mathematics [I.6].

2 Applied Mathematics and Pure Mathematics 

The question of how applied mathematics compares 
with pure mathematics is often raised and has been 
discussed by many authors, sometimes in controversial 
terms. We give a few highlights. 

Paul Halmos wrote a 1981 paper provocatively titled 
“Applied mathematics is bad mathematics.” However, 
much of what Halmos says would not be disputed by 
many applied mathematicians. For example: 

Pure mathematics can be practically useful and applied 
mathematics can be artistically elegant.... 

Just as pure mathematics can be useful, applied math- 
ematics can be more beautifully useless than is some- 
times recognized.... 

Applied mathematics is an intellectual discipline, not 
a part of industrial technology....

Not only, as is universally admitted, does the applied 
need the pure, but, in order to keep from becoming 
inbred, sterile, meaningless, and dead, the pure needs 
the revitalization and the contact with reality that only 
the applied can provide. 

G. H. Hardy’s book A Mathematician’s Apology (1940) 
is well known as a defense of mathematics as a 
subject that can be pursued for its own sake and 
beauty. As such it contains some criticism of applied 
mathematics: 


But is not the position of an ordinary applied mathe- 
matician in some ways a little pathetic? If he wants to 
be useful, he must work in a humdrum way, and he can- 
not give full play to his fancy even when he wishes to 
rise to the heights. “Imaginary” universes are so much 
more beautiful than this stupidly constructed “real” 
one; and most of the finest products of an applied 
mathematician’s fancy must be rejected, as soon as 
they have been created, for the brutal but sufficient 
reason that they do not fit the facts. 

Halmos and Hardy were pure mathematicians. Ap- 
plied mathematicians C. C. Lin and L. A. Segel offer 
some insights in the introductory chapter of their clas- 
sic 1974 book Mathematics Applied to Deterministic 
Problems in the Natural Sciences: 

The differences in motivation and objectives between 
pure and applied mathematics— and the consequent 
differences in emphasis and attitude— must be fully 
recognized. In pure mathematics, one is often deal- 
ing with such abstract concepts that logic remains the 
only tool permitting judgment of the correctness of 
a theory. In applied mathematics, empirical verifica- 
tion is a necessary and powerful judge. However ... in 
some cases (e.g., celestial mechanics), rigorous theo- 
rems can be proved that are also valuable for practical 
purposes. On the other hand, there are many instances 
in which new mathematical ideas and new mathemati- 
cal theories are stimulated by applied mathematicians 
or theoretical scientists. 

They also opine that: 

Much second-rate pure mathematics is concealed be- 
neath the trappings of applied mathematics (and vice 
versa). As always, knowledge and taste are needed if 
quality is to be assured. 

The applied versus pure discussion is not always 
taken too seriously. Chandler Davis quotes the applied 
mathematician Joseph Keller as saying, “pure mathe- 
matics is a subfield of applied mathematics”! 

The discussion can also focus on where in the spec- 
trum a particular type of mathematics lies. An inter- 
esting story was told in 1988 by Clifford Truesdell 
of his cofounding in 1952 of the Journal of Rational 
Mechanics and Analysis (which later became Archive for 
Rational Mechanics and Analysis). He explained that 

In those days papers on the foundation of continuum 
mechanics were rejected by journals of mathematics 
as being applied, by journals of “applied” mathematics 
as being physics or pure mathematics, by journals of 
physics as being mathematics, and by all of them as too 
long, too expensive to print, and of interest to no one. 

3 Applied Mathematics in Everyday Life

We now give three examples of applied mathemat- 
ics in use in everyday life. These examples were cho- 
sen because they can be described without delving 
into too many technicalities and because they illus- 
trate different characteristics. Some of the terms used 
in the descriptions are explained in the language of
applied mathematics [I.2].

3.1 Searching Web Pages 

In the early to mid-1990s, the early days of the World
Wide Web, search engines would find Web pages that
matched a user’s search query and would order the 
results by a simple criterion such as the number of 
times that the search query appears on a page. This 
approach became unsatisfactory as the Web grew in 
size and spammers learned how to influence the search 
results. From the late 1990s onward, more sophisti- 
cated criteria were developed, based on analysis of 
the links between Web pages. One of these is Google’s 
PageRank algorithm [VI.9]. Another is the hyperlink- 
induced topic search (HITS) algorithm of Kleinberg. 

The HITS algorithm is based on the idea of deter- 
mining hubs and authorities. Authorities are Web pages 
with many links to them and for which the linking pages 
point to many authorities. For example, the New York 
Times home page or a Wikipedia article on a popular 
topic might be an authority. Hubs are pages that point 
to many authorities. An example might be a page on 
a programming language that provides links to useful 
pages about that language but that does not necessar- 
ily contain much content itself. The authorities are the 
pages that we would like to rank higher among pages 
that match a search term. However, the definition of 
hubs and authorities is circular, as each depends on 
the other. 

To resolve this circularity, associate an authority
weight $x_i$ and a hub weight $y_i$ with page $i$, with both
weights nonnegative. Let there be $n$ pages to be considered
(in practice this is a much smaller number than
the total number of pages that match the search term).
Define an $n \times n$ matrix $A = (a_{ij})$ by $a_{ij} = 1$ if there
is a hyperlink from page $i$ to page $j$ and by $a_{ij} = 0$
otherwise. Let us make initial guesses $x_i^{(0)} = 1$ and
$y_i^{(0)} = 1$, for $i = 1, 2, \dots, n$. It is reasonable to update
the authority weight $x_i$ for page $i$ by replacing it by the
sum of the weights of the hubs that point to it. Similarly,
the hub weight $y_i$ for page $i$ can be replaced by
the sum of the weights of the authorities to which it
points. In equations, these updates can be written as
$x_i^{(1)} = \sum_{a_{ji} \ne 0} y_j^{(0)}$ and $y_i^{(1)} = \sum_{a_{ij} \ne 0} x_j^{(1)}$; note that in
the latter equation we are using the updated authority
weights, and the sums are over those $j$ for which
$a_{ji}$ or $a_{ij}$ is nonzero, respectively. This process can be
iterated:

$$x_i^{(k+1)} = \sum_{a_{ji} \ne 0} y_j^{(k)}, \qquad y_i^{(k+1)} = \sum_{a_{ij} \ne 0} x_j^{(k+1)}, \qquad k = 0, 1, 2, \dots.$$

The circular definition of hubs and authorities has
been turned into an iteration. The iteration is best analyzed
by rewriting it in matrix-vector form. Defining
the $n$-vectors

$$x_k = [x_1^{(k)}, x_2^{(k)}, \dots, x_n^{(k)}]^T, \qquad y_k = [y_1^{(k)}, y_2^{(k)}, \dots, y_n^{(k)}]^T$$

and recalling that the elements of $A$ are 0 or 1, we can
rewrite the iteration as

$$x_{k+1} = A^T y_k, \qquad y_{k+1} = A x_{k+1}, \qquad k = 0, 1, 2, \dots,$$

where $A^T = (a_{ji})$ is the transpose of $A$. Combining the
two formulas into one gives $x_{k+1} = A^T y_k = A^T (A x_k) = (A^T A) x_k$.
Hence the $x_k$ are generated by repeatedly
multiplying by the matrix $A^T A$. Each element of $A^T A$ is
either zero or a positive integer, so the powers of $A^T A$
will usually grow without bound. In practice we should
therefore normalize the vectors $x_k$ and $y_k$ so that the
largest element is 1; this avoids overflow and has no
effect on the relative sizes of the components, which is
all that matters. Our iteration is then

$$x_{k+1} = c_k^{-1} A^T A x_k,$$

where $c_k$ is the largest element of $A^T A x_k$. If the
sequences $x_k$ and $c_k$ converge, say to $x_*$ and $c_*$, respectively,
then $A^T A x_* = c_* x_*$. This equation says that $x_*$
is an eigenvector of $A^T A$ with corresponding eigenvalue
$c_*$. A similar argument shows that, if the normalized
sequence of vectors $y_k$ converges, then it must be to
an eigenvector of $AA^T$.
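
The normalized iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the article: the function name `hits`, the fixed iteration count, and the link matrix `A_example` are assumptions introduced here, and the matrix is a hypothetical four-page example rather than the one shown in the book's figure.

```python
def hits(A, iterations=50):
    """Normalized HITS iteration for a 0/1 link matrix A (a list of rows).

    A[i][j] == 1 means page i links to page j. Returns (x, y), the
    authority and hub weights, each scaled so its largest element is 1.
    """
    n = len(A)
    x = [1.0] * n  # authority weights, initial guess of all ones
    y = [1.0] * n  # hub weights, initial guess of all ones

    def normalize(v):
        largest = max(v)
        # Leave an all-zero vector alone to avoid dividing by zero.
        return [vi / largest for vi in v] if largest > 0 else v

    for _ in range(iterations):
        # Authority update, x = A^T y: sum hub weights of pages linking to i.
        x = normalize([sum(A[j][i] * y[j] for j in range(n)) for i in range(n)])
        # Hub update, y = A x: sum updated authority weights of pages i links to.
        y = normalize([sum(A[i][j] * x[j] for j in range(n)) for i in range(n)])
    return x, y


# Hypothetical link structure (an assumption for illustration only):
# page 1 links to pages 2 and 3, and page 3 links to pages 2 and 4.
A_example = [
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]

authorities, hubs = hits(A_example)
print("authorities:", authorities)
print("hubs:       ", hubs)
```

For this hypothetical matrix the iteration settles after one pass on authority weights (0, 1, 0.5, 0.5) and hub weights (1, 0, 1, 0), so page 2 is the top authority and pages 1 and 3 are the top hubs. A practical implementation would stop when successive vectors agree to a tolerance instead of running a fixed number of iterations.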

Figure 2 Graph representing four Web pages
and the links between them.

This process of repeated multiplication by a matrix is
known as the power method [IV.10 §5.5]. The Perron-Frobenius
theorem [IV.10 §11.1] can be used to show
that, provided the matrix $A^T A$ has a property called irreducibility,
it has a unique eigenvalue of largest magnitude
and this eigenvalue is real and positive, with an
associated eigenvector $x$ having positive entries. Convergence
theory for the power method then shows that
$x_k$ converges to a multiple of $x$. Therefore, the vectors
$x_*$ and $y_*$ required for the HITS algorithm are
the eigenvectors corresponding to the dominant eigenvalues
of $A^T A$ and $AA^T$, respectively. Another interpretation
of these vectors is that they are the right
and left singular vectors [II.32] corresponding to the
dominant singular value of $A$.

We give a simple example to illustrate the algorithm. Consider the network of four Web pages shown in figure 2 in a directed graph [II.16] representation. The corresponding matrix is

[Matrix $A$: entries not legible in the source.]

The HITS algorithm produces

[Vectors $x$ and $y$: entries not legible in the source.]

This tells us that page 2 is the highest-ranked authority and pages 1 and 3 are the jointly highest-ranked hubs, which is an intuitively reasonable ranking.
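As an illustration, the normalized iteration can be sketched in a few lines of Python. The adjacency matrix below is a hypothetical four-page network chosen for this sketch (it is not the matrix of figure 2), with entry $(i,j)$ equal to 1 when page $i$ links to page $j$; the function names `hits` and `normalize` and the fixed iteration count are likewise assumptions.

```python
def normalize(v):
    """Scale a vector so that its largest element is 1."""
    largest = max(v)
    return [e / largest for e in v]


def hits(A, iterations=50):
    """Return (authority weights x, hub weights y) for adjacency matrix A.

    A[i][j] is 1 if page i links to page j and 0 otherwise.
    """
    n = len(A)
    y = [1.0] * n  # initial hub weights
    for _ in range(iterations):
        # x_{k+1} = A^T y_k: sum the weights of the hubs pointing to each page.
        x = normalize([sum(A[j][i] * y[j] for j in range(n)) for i in range(n)])
        # y_{k+1} = A x_{k+1}: sum the weights of the authorities each page points to.
        y = normalize([sum(A[i][j] * x[j] for j in range(n)) for i in range(n)])
    return x, y


# Hypothetical network: page 1 -> 2 and 3; page 2 -> 3; page 4 -> 3.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
x, y = hits(A)
print("authorities:", [round(e, 4) for e in x])
print("hubs:       ", [round(e, 4) for e in y])
```

For this network the iteration singles out page 3 as the strongest authority and page 1 as the strongest hub, which matches what one would expect from the link structure.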

This Web search example illustrates the steps in fig- 
ure 1. The first step, modeling, yields the notion of 
hubs and authorities. The production of the iteration 
for the weights is the analysis step, which for this prob- 
lem overlaps with the third step of developing algo- 
rithms, because the iteration is a practical means for 
computing the hubs and authorities. However, other 
algorithms could be used for this purpose, so much 
more can be done in the third step. Implementing the 
iteration in software (the fourth step) is straightfor- 
ward if we assume the matrix A is given. The compu- 
tational experiments in the fifth step should use real 
data from particular Web searches. Probably the most 
difficult step is the final one: validating the model. The 


reason for this is that there is no natural quantitative 
measure of the goodness of a ranking based on the 
computed authorities; this assessment requires human 
judgment. 

A final point to note is that further insight into the HITS algorithm can be obtained using graph theory [II.16]. In general, there may be multiple ways to analyze a problem and it may be necessary to employ more than one technique to obtain a complete understanding.

3.2 Digital Imaging 

Image retouching refers to the process of changing 
a digital image to make it more visually pleasing by 
removing a color cast, creatively adjusting color and 
contrast, smoothing out wrinkles on a subject’s face 
in a portrait, and so on. In the days of film cameras, 
image retouching was carried out by professionals on 
scanned images using expensive hardware and soft- 
ware. Since the advent of digital cameras retouching 
has become something that anyone can attempt, even 
on a smartphone with a suitable app. An operation 
called “cloning” is of particular interest. Cloning is the 
operation of copying one area of an image over another 
and is commonly used to remove defects or unwanted 
elements from an image, such as dust spots on the cam- 
era sensor, litter in a landscape, or someone’s limbs 
intruding into the edge of an image. 

Cloning tools in modern software use sophisticated mathematics to blend the image fragment being copied into the target area. Let us represent the image as a function $f$ of two variables, where $f(x,y)$ is an RGB (red, green, blue) triple corresponding to the point $(x,y)$. In practice, an image is a discrete grid of points and the RGB values are integers. Our goal is to replace an open target region $\Omega$ by a source region that has the same shape and size but is in a different location in the image, and that therefore corresponds to a translation $(\delta x, \delta y)$. If we simply copy the source region into the target, the result will not be convincing visually, as it neither preserves texture nor matches up well around the boundary $\partial\Omega$ (see figure 3). To alleviate these problems we can replace $f$ inside $\Omega$ by the function $g$ defined by the partial differential equation (PDE) problem

\[ \Delta g(x,y) = \Delta f(x + \delta x, y + \delta y), \quad (x,y) \in \Omega, \]
\[ g(x,y) = f(x,y), \quad (x,y) \in \partial\Omega, \]

where $\Delta = \partial^2/\partial x^2 + \partial^2/\partial y^2$ is the Laplace operator [III.18]. We are forcing $g$ to be identical to $f$ on the boundary of the target region, but inside the region the Laplacian of $g$ has to match that of $f$ in the source region. This is a Poisson equation with Dirichlet boundary condition. Since the Laplace operator is connected with diffusion effects in equilibrium, we can think of the pixels in the image diffusing to form a more convincing visual result. Another interpretation, which shows that the solution has optimal smoothness in a certain sense, is that we are minimizing $\int_\Omega \|\nabla g - \nabla f\|^2 \, d\Omega$, where $\nabla = [\partial/\partial x \;\; \partial/\partial y]^T$ is the gradient operator and $\|\cdot\|$ is the $L_2$-norm [I.2 §19.3]. In practice, the PDE is solved by a numerical method [IV.13].
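To make the idea concrete, here is a minimal sketch of one possible numerical method: the five-point finite-difference Laplacian on a grayscale pixel grid, solved by Gauss-Seidel sweeps. The function names, the rectangular target region, the sweep count, and the test image are all assumptions made for illustration; production cloning tools use far more sophisticated solvers.

```python
def laplacian(u, i, j):
    """Five-point discrete Laplacian of the grid u at pixel (i, j)."""
    return u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j]


def poisson_clone(f, rows, cols, shift, sweeps=500):
    """Fill the target pixels (rows x cols) of image f from a shifted source.

    Inside the target the discrete Laplacian of the result g matches that of
    the source region; on the surrounding ring of pixels g equals f.
    """
    dr, dc = shift
    g = [row[:] for row in f]
    # Right-hand side: Laplacian of f evaluated in the source region.
    rhs = {(i, j): laplacian(f, i + dr, j + dc) for i in rows for j in cols}
    # Initial guess: the naive copy of source onto target.
    for i in rows:
        for j in cols:
            g[i][j] = f[i + dr][j + dc]
    # Gauss-Seidel sweeps; pixels outside the target are never modified,
    # so they act as the Dirichlet boundary values.
    for _ in range(sweeps):
        for i in rows:
            for j in cols:
                neighbors = g[i - 1][j] + g[i + 1][j] + g[i][j - 1] + g[i][j + 1]
                g[i][j] = (neighbors - rhs[(i, j)]) / 4.0
    return g


# Hypothetical 12 x 12 test image whose Laplacian is the same everywhere.
f = [[float(i * i + 3 * j) for j in range(12)] for i in range(12)]
rows, cols = range(2, 8), range(2, 8)
g = poisson_clone(f, rows, cols, shift=(3, 3))
naive_error = max(abs(f[i + 3][j + 3] - f[i][j]) for i in rows for j in cols)
blend_error = max(abs(g[i][j] - f[i][j]) for i in rows for j in cols)
print(naive_error, blend_error)
```

In this artificial example the source and target have identical Laplacians, so the blended result reproduces the original target values, whereas the naive copy is visibly offset.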

Adobe Photoshop, the digital image manipulation software originally released in 1990, introduced a new feature called the healing brush in 2002. It carries out the sophisticated cloning just outlined but solves the biharmonic equation $\Delta^2 g(x,y) = \Delta^2 f(x + \delta x, y + \delta y)$, where $\Delta^2 = \partial^4/\partial x^4 + 2\partial^4/(\partial x^2 \partial y^2) + \partial^4/\partial y^4$, because this has been found to provide better matching of derivatives on the boundary and thereby produces a better blending of the source into the area containing the target. In this case we are minimizing $\int_\Omega (\Delta g - \Delta f)^2 \, d\Omega$. See figure 3.

The idea of using the Poisson equation or the bihar- 
monic equation to fill in gaps in two-dimensional data 
appears in other application areas too, such as in map- 
ping and contouring of geophysical data, where it was 
proposed as early as the 1950s. 

A related problem is how to detect when an image has 
been subject to cloning. Applications include checking 
the veracity of images appearing in the media or used 
as evidence in legal cases and checking the eligibility of 
entries to a photographic competition in which image 
manipulation is disallowed. The same ideas that make 
cloning possible also enable its use to be detected. 
For example, even though the values in the target and 
source areas are different, the Laplacian or biharmonic 
operators yield the same values, so a systematic search 
can be done to find different areas with this property. 
One complication is that if the image has been compressed after cloning, as is typically done with JPEG compression [VII.7 §5], the uncompressed image contains noise that causes the derivatives in question no longer to match exactly. Having to deal with noise is a common requirement in applied mathematics.

3.3 Computer Arithmetic 

In modern life we rely on computers to carry out arith- 
metic calculations for a wide range of tasks, including 


computation of financial indices, aircraft flight paths, 
and utility bills. We take it for granted that these cal- 
culations produce the “correct result,” but they do not 
always do so, and there are infamous examples where 
inaccurate results have had disastrous consequences. 
Applied mathematicians and computer scientists have 
been responsible for ensuring that today’s computer 
calculations are more reliable than ever. 

Probably the most important advance in computer arithmetic in the last fifty years was the development of the 1985 IEEE standard for binary floating-point arithmetic [II.13]. Prior to the development of the standard, different computers used different implementations of floating-point arithmetic that were of varying quality. For example, on some computers of the 1980s a test such as

    if x /= y
        f = f/(x - y)
    end

could fail with division by zero because the difference $x - y$ evaluated to zero even though $x$ was not equal to $y$. This failure is not possible in IEEE-compliant arithmetic.
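The guarantee can be observed directly in Python, whose floats are IEEE double-precision numbers on essentially all current platforms. The sketch below picks two distinct numbers near the bottom of the normalized range (the particular values are chosen for illustration); their difference is a subnormal number rather than zero, which is the feature of the standard that rules out the failure.

```python
import sys

tiny = sys.float_info.min  # smallest positive normalized double, 2**-1022
x = 1.5 * tiny
y = tiny

diff = x - y  # exactly 2**-1023, a subnormal number, not zero
print(x != y, diff != 0.0, diff)

f = 1.0
if x != y:
    f = f / (x - y)  # safe: a nonzero difference is guaranteed
print(f)
```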

Some features of the IEEE standard are more unexpected. For example, it includes a special number inf that represents $\infty$ and satisfies the natural relations. For example, 1/0 evaluates to inf, 1/inf evaluates to 0, and 1 + inf and inf + inf produce inf. If a program does divide by zero then this does not halt the program and it can be tested for and appropriate action taken. But division by zero does not necessarily signify a problem. For example, if we evaluate the function

\[ f(x) = \frac{1}{1 + \dfrac{1}{1 - x}} \]

at $x = 1$, IEEE arithmetic correctly yields $f(1) = 1/(1 + \infty) = 0$. In 1997 the USS Yorktown, a guided missile cruiser, was paralyzed ("dead in the water" according to the commander) for more than two hours when a crewman entered a zero into a database and a division by zero was triggered that caused the program to crash. The incident could have been avoided if IEEE standard features such as inf had been fully utilized.
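A small Python sketch shows these relations at work. One caveat: Python deliberately raises `ZeroDivisionError` for float division by zero instead of returning inf, so the helper `ieee_div` below (a hypothetical name introduced for this sketch) restores the IEEE default behavior. The function `f` assumes the form $f(x) = 1/(1 + 1/(1-x))$.

```python
import math


def ieee_div(a, b):
    """Divide two floats, returning inf or nan as IEEE default arithmetic would."""
    try:
        return a / b
    except ZeroDivisionError:
        if a == 0.0 or math.isnan(a):
            return math.nan  # 0/0 has no meaningful value
        # Nonzero divided by zero: infinity with the appropriate sign.
        return math.copysign(math.inf, a) * math.copysign(1.0, b)


def f(x):
    return ieee_div(1.0, 1.0 + ieee_div(1.0, 1.0 - x))


print(ieee_div(1.0, 0.0))          # inf
print(1.0 / math.inf)              # 0.0
print(1.0 + math.inf, math.inf + math.inf)
print(f(1.0))                      # the division by zero is harmless here
```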

The development of the IEEE standard took many years and required much mathematical analysis of the various options. The benefits brought by the standard include increased portability of programs between different computers and an improved ability for mathematicians to understand the way algorithms will behave when implemented in floating-point arithmetic.

Figure 3 Close-up of part of a metal sign with a hot spot from a reflection. (a) Original image showing source and target regions. (b) The result of copying source to target. (c) The result of one application of Photoshop healing brush with same target area. In practice, multiple applications of the healing brush would be used with smaller, overlapping target areas.

In IEEE double-precision arithmetic, numbers are rep- 
resented to a precision equivalent to about sixteen 
significant decimal digits. In many situations in life, 
results are needed to far fewer figures and a final result 
must be rounded. For example, a conversion from euros 
to dollars producing an answer $110.89613 might be 
rounded up to $110.90: the nearest amount in whole 
cents. A bank paying the dollars into a customer’s 
account might prefer to round down to $110.89 and 
keep the remainder. However, deciding on the rules 
for rounding was not so simple when the euro was 
founded in 1997. A twenty-nine-page document was 
needed to specify precisely how conversions among 
the fifteen currencies of the member states and the 
euro should be done. Its pronouncements included 
how many significant figures each individual conver- 
sion rate should have (six was the number that was cho- 
sen), how rounding should be done (round to the near- 
est six-digit number), and how ties should be handled 
(always round up). 
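The effect of these choices is easy to explore with Python's `decimal` module, which lets the rounding rule be stated explicitly. The dollar amount is the one from the text; the conversion rate `1.234565` is a made-up value chosen so that it is an exact tie at six significant figures.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN, ROUND_HALF_UP

amount = Decimal("110.89613")
cent = Decimal("0.01")

nearest = amount.quantize(cent, rounding=ROUND_HALF_UP)  # round to nearest cent
down = amount.quantize(cent, rounding=ROUND_DOWN)        # always round down
print(nearest, down)

# Rounding a (hypothetical) rate to six significant figures, ties rounded up.
ties_up = Context(prec=6, rounding=ROUND_HALF_UP).create_decimal("1.234565")
# The same tie under round-half-even, shown for contrast.
ties_even = Context(prec=6, rounding=ROUND_HALF_EVEN).create_decimal("1.234565")
print(ties_up, ties_even)
```

Working in decimal rather than binary floating point matters here because a value such as 0.01 has no exact binary representation, so ties cannot be detected reliably with ordinary floats.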

Even when rounding should be straightforward it 
is often carried out incorrectly. In 1982 the Van- 
couver Stock Exchange established an index with an 
initial value of 1000. After twenty-two months the 
index had been hitting lows in the 520s, despite the 
exchange apparently performing well. The index was 
recorded to three decimal places and it was discov- 
ered that the computer program calculating the index 
always rounded down, hence always underestimating 
the index. Upon recalculation (presumably with round 
to nearest) the index almost doubled. 
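The mechanism is easy to reproduce in a toy simulation. The sketch below is not a reconstruction of the actual index: the number of updates, the size of the changes, and the random seed are all assumptions. It simply recomputes a recorded value many times, once rounding down to three decimal places and once rounding to nearest, using integer arithmetic so that the rounding is exact.

```python
import random


def simulate(updates=100_000, seed=0):
    """Return (exact, rounded-down, rounded-to-nearest) values of a toy index."""
    rng = random.Random(seed)
    exact = 1_000_000_000   # exact index in millionths (1000.000000)
    truncated = 1_000_000   # recorded index in thousandths, always rounded down
    rounded = 1_000_000     # recorded index in thousandths, rounded to nearest
    for _ in range(updates):
        delta = rng.randint(-5000, 5000)  # zero-mean change, in millionths
        exact += delta
        truncated = (truncated * 1000 + delta) // 1000
        rounded = (rounded * 1000 + delta + 500) // 1000
    return exact / 1e6, truncated / 1e3, rounded / 1e3


exact, truncated, rounded = simulate()
print(exact, truncated, rounded)
```

Each rounding down loses about half a unit in the third decimal place on average, and the losses accumulate, whereas the errors from rounding to nearest largely cancel.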

In 2006 athlete Justin Gatlin was credited with a 
new world record of 9.76 seconds for the 100 meters. 
Almost a week after the race the time was changed to 
9.77 seconds, meaning that he had merely equaled the 
existing record held by Asafa Powell. The reason for the 


change was that his recorded time of 9.766 had incor- 
rectly been rounded down to the nearest hundredth of 
a second instead of up as the International Association 
of Athletics Federations rules require. 

4 What Do Applied Mathematicians Do? 

Applied mathematicians can work in academia, indus- 
try, or government research laboratories. Their work 
may involve research, teaching, and (especially for 
more senior mathematicians) administrative tasks such 
as managing teams of people. They usually spend only 
part of their time doing mathematics in the traditional 
sense of sitting with pen and paper scribbling formu- 
las on paper and trying to solve equations or prove 
theorems. Under the general heading of research, a lot 
of time is spent writing papers, books, grant propos- 
als, reports, lecture notes, and talks; attending semi- 
nars, conferences, and workshops; writing and running 
computer programs; reading papers in the research lit- 
erature; refereeing papers submitted to journals and 
grant proposals submitted to funding bodies; and com- 
menting on draft papers and theses written by Ph.D. 
students. 

Mathematics can be a lonely endeavor: one may be 
working on different problems from one’s colleagues or 
may be the only mathematician in a company. Although 
some applied mathematicians prefer to work alone, 
many collaborate with others, often in faraway coun- 
tries. Collaborations are frequently initiated through 
discussions at conferences, though sometimes papers 
are coauthored by people who have never met, thanks 
to the ease of email communication. 

Applied mathematics societies provide an impor- 
tant source of identity and connectivity, as well as 
opportunities for networking and professional devel- 
opment. They mostly focus on particular countries or 
regions, an exception being the Society for Industrial 
and Applied Mathematics (SIAM), based in Philadelphia. 
SIAM is the largest applied mathematics organization
in the world and has a strong international outlook,
with about one-third of its members residing outside 
the United States. A mathematician’s activities are fre- 
quently connected with societies, whether it be through 
publishing in or editing their journals, attending their 
conferences, or keeping up with news through their 
magazines and newsletters. Most societies offer greatly 
reduced membership fees (sometimes free member- 
ship) for students. 

Applied mathematicians can be part of multidisci- 
plinary teams. Their skills in problem solving, thinking 
logically, modeling, and programming are sought after 
in other subjects, such as medical imaging, weather 
prediction, and financial engineering. 

In the business world, applied mathematics can be 
invisible because it is called “analytics,” “modeling,” or 
simply generic “research.” But whatever their job title, 
applied mathematicians play a crucial role in today’s 
knowledge-based economy. 

5 What Is the Impact of Applied Mathematics? 

The impact of applied mathematics is illustrated in 
many articles in this volume, and in this section we pro- 
vide just a brief overview, concentrating on the impact 
outside mathematics itself. 

Applied mathematics provides the tools and algorithms to enable understanding and predictive modeling of many aspects of our planet, including weather [V.18] (for which the accuracy of forecasts has improved greatly in recent decades), atmosphere and the oceans [IV.30], tsunamis [V.19], and sea ice [V.17]. In many cases the models are used to inform policy makers.

At least two mathematical algorithms are used by most of us almost every day. The fast Fourier transform [II.10] is found in any device that carries out signal processing, such as a smartphone. Photos that we take on our cameras or view on a computer screen are usually stored using JPEG compression [VII.7 §5].

X-ray tomography devices, ranging from airport luggage scanners [VII.19] to human body scanners [VII.9], rely on the fast and accurate solution of inverse problems [IV.15], which are problems in which we need to recover information about the internals of a system from (noisy) measurements taken outside the system.

Investments are routinely made on the basis of math- 
ematical models, whether for individual options or col- 
lections of assets (portfolios): see financial mathe- 
matics [V.9] and portfolio theory [V.10]. 


The clever use of mathematical modeling offers a 
competitive advantage in sports, such as yacht rac- 
ing [V.2], swimming [V.2], and formula one racing 
[V.3], where small improvements can be the difference 
between success and failure. 


I.2 The Language of Applied Mathematics

Nicholas J. Higham 


This article provides background on the notation, ter- 
minology, and basic results and concepts of applied 
mathematics. It therefore serves as a foundation for the 
later articles, many of which cross-reference it. 

In view of the limited space, the material has been 
restricted to that common to many areas of applied 
mathematics. A number of later articles provide their 
own careful introduction to the language of their par- 
ticular topic. 

1 Notation 

Table 1 lists the Greek alphabet, which is widely used to denote mathematical variables. Note that almost always $\delta$ and $\varepsilon$ are used to denote small quantities, and $\pi$ is used as a variable as well as for $\pi = 3.14159\ldots$.

Mathematics has a wealth of notation to express com- 
monly occurring concepts. But notation is both a bless- 
ing and a curse. Used carefully, it can make mathe- 
matical arguments easier to read and understand. If 
overused it can have the opposite effect, and often 
it is better to express a statement in words than in 
symbols (see mathematical writing [VIII.1]). Table 2 
gives some notation that is common in informal con- 
texts such as lectures and is occasionally encountered 
in this book. Table 3 summarizes basic notation used 
throughout the book. 

2 Complex Numbers 

Most applied mathematics takes place in the set of complex numbers, $\mathbb{C}$, or the set of real numbers, $\mathbb{R}$. A complex number $z = x + iy$ has real and imaginary parts $x = \operatorname{Re} z$ and $y = \operatorname{Im} z$ belonging to $\mathbb{R}$, and the imaginary unit $i$ denotes $\sqrt{-1}$. The imaginary unit is sometimes written as $j$, e.g., in electrical engineering and in the programming language Python [VII.11].

We can represent complex numbers geometrically in the complex plane, in which a complex number $a + ib$ is represented by the point with coordinates $(a,b)$ (see figure 1). The corresponding diagram is called the Argand diagram. Important roles are played by the right half-plane $\{z : \operatorname{Re} z \ge 0\}$ and the left half-plane $\{z : \operatorname{Re} z \le 0\}$. If we exclude the pure imaginary numbers ($\operatorname{Re} z = 0$) from these sets we obtain the open half-planes. Euler's formula, $e^{i\theta} = \cos\theta + i\sin\theta$, is fundamental.

Table 1 The Greek alphabet. Where an uppercase Greek letter is the same as the Latin letter it is not shown.

    α          alpha        ν          nu
    β          beta         ξ, Ξ       xi
    γ, Γ       gamma        ο          omicron
    δ, Δ       delta        π, ϖ, Π    pi
    ε          epsilon      ρ, ϱ       rho
    ζ          zeta         σ, ς, Σ    sigma
    η          eta          τ          tau
    θ, ϑ, Θ    theta        υ, Υ       upsilon
    ι          iota         φ, ϕ, Φ    phi
    κ          kappa        χ          chi
    λ, Λ       lambda       ψ, Ψ       psi
    μ          mu           ω, Ω       omega

Table 2 Other notation.

    ⇒    Implies             ∃    There exists
    ⇐    Implied by          ∄    There does not exist
    ⇔    If and only if      ∀    For all

Figure 1 Complex plane with $z = a + ib = re^{i\theta}$.

The polar form of a complex number is $z = re^{i\theta}$, where $r \ge 0$ and the argument $\arg z = \theta$ are real, and $\theta$ can be restricted to any interval of length $2\pi$, such as $[0, 2\pi)$ or $(-\pi, \pi]$. The complex conjugate of $z = x + iy$ is $\bar{z} = x - iy$, sometimes written $z^*$. The modulus, or absolute value, is $|z| = (z\bar{z})^{1/2} = (x^2 + y^2)^{1/2} = r$.

Figure 2 Spherical coordinates.

Complex arithmetic is defined in terms of real arithmetic according to the following rules, for $z_1 = x_1 + iy_1$ and $z_2 = x_2 + iy_2$:

\[ z_1 \pm z_2 = x_1 \pm x_2 + i(y_1 \pm y_2), \]
\[ z_1 z_2 = x_1 x_2 - y_1 y_2 + i(x_1 y_2 + x_2 y_1), \]
\[ \frac{z_1}{z_2} = \frac{x_1 x_2 + y_1 y_2}{x_2^2 + y_2^2} + i\,\frac{x_2 y_1 - x_1 y_2}{x_2^2 + y_2^2}. \]

In polar form multiplication and division become notationally simpler: if $z_1 = r_1 e^{i\theta_1}$ and $z_2 = r_2 e^{i\theta_2}$ then $z_1 z_2 = r_1 r_2 e^{i(\theta_1 + \theta_2)}$ and $z_1/z_2 = (r_1/r_2) e^{i(\theta_1 - \theta_2)}$.
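These rules can be checked against Python's built-in complex type. In the sketch below a complex number is stored as a pair `(x, y)` of real and imaginary parts; the function names and the sample values are choices made for this illustration.

```python
import cmath


def multiply(z1, z2):
    """Product of complex numbers given as (real, imaginary) pairs."""
    (x1, y1), (x2, y2) = z1, z2
    return (x1 * x2 - y1 * y2, x1 * y2 + x2 * y1)


def divide(z1, z2):
    """Quotient z1/z2 of complex numbers given as (real, imaginary) pairs."""
    (x1, y1), (x2, y2) = z1, z2
    denom = x2 * x2 + y2 * y2
    return ((x1 * x2 + y1 * y2) / denom, (x2 * y1 - x1 * y2) / denom)


z1, z2 = (3.0, 4.0), (1.0, -2.0)
print(multiply(z1, z2), complex(*z1) * complex(*z2))
print(divide(z1, z2), complex(*z1) / complex(*z2))

# Polar form: moduli multiply and arguments add.
r1, t1 = cmath.polar(complex(*z1))
r2, t2 = cmath.polar(complex(*z2))
polar_product = cmath.rect(r1 * r2, t1 + t2)
print(polar_product)
```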

3 Coordinate Systems 

We are used to specifying a point in two dimensions by its $x$- and $y$-coordinates, and a point in three dimensions by its $x$-, $y$-, and $z$-coordinates. These are called Cartesian coordinates. In two dimensions we can also use polar coordinates, which are as described in the previous section if we identify $(x,y)$ with $x + iy$. Spherical coordinates, illustrated in figure 2, are an extension of polar coordinates to three dimensions. Here, $(x,y,z)$ is represented by $(r, \theta, \phi)$, where

\[ x = r\sin\theta\cos\phi, \quad y = r\sin\theta\sin\phi, \quad z = r\cos\theta, \]

with nonnegative radius $r$ and angles $\theta$ and $\phi$ in the ranges $0 \le \theta \le \pi$ and $0 \le \phi < 2\pi$.
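The conversion between the two systems can be sketched as follows. The function names are assumptions of this sketch, and the inverse map simply returns zero angles at the origin, where the angles are not uniquely defined.

```python
import math


def to_cartesian(r, theta, phi):
    """Spherical (r, theta, phi) to Cartesian (x, y, z)."""
    return (
        r * math.sin(theta) * math.cos(phi),
        r * math.sin(theta) * math.sin(phi),
        r * math.cos(theta),
    )


def to_spherical(x, y, z):
    """Cartesian (x, y, z) to spherical (r, theta, phi), 0 <= theta <= pi."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return (0.0, 0.0, 0.0)  # angles are arbitrary at the origin
    cos_theta = max(-1.0, min(1.0, z / r))  # guard against rounding outside [-1, 1]
    theta = math.acos(cos_theta)
    phi = math.atan2(y, x) % (2.0 * math.pi)  # map (-pi, pi] to [0, 2*pi)
    return (r, theta, phi)


point = to_cartesian(2.0, 1.0, 4.0)
print(point)
print(to_spherical(*point))
```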

Cylindrical coordinates provide another three-dimensional coordinate system. Here, polar coordinates are used in the $xy$-plane and $z$ is retained, so $(x,y,z)$ is represented by $(r, \theta, z)$.

Table 3 Notation frequently used in this book.

Notation | Meaning | Example
$\mathbb{R}$, $\mathbb{C}$ | The real numbers, the complex numbers |
$\mathbb{R}^n$, $\mathbb{R}^{m\times n}$ | The real $n$-vectors and real $m \times n$ matrices; similarly for $\mathbb{C}^n$ and $\mathbb{C}^{m\times n}$ |
$\operatorname{Re} z$, $\operatorname{Im} z$ | Real and imaginary parts of the complex number $z$ |
$\mathbb{Z}$, $\mathbb{N}$ | The integers, $\{0, \pm 1, \pm 2, \dots\}$, and the positive integers, $\{1, 2, \dots\}$ |
$i = 1, 2, \dots, n$ | The integer variable $i$ takes on the values 1, 2, 3, and so on, up to $n$; also written $1 \le i \le n$ and $i = 1\colon n$ |
$\approx$ | Approximately equal; also written $\simeq$ | $\pi \approx 3.14$
$\in$ | Belongs to | $x \in \mathbb{R}$, $n \in \mathbb{Z}$
$\equiv$ | Identically equal to: $f \equiv 0$ means that $f$ is the zero function, that is, $f$ is zero for all values, not just some values, of its argument(s) |
$n!$ | Factorial, $n! = n(n-1)\cdots 1$ |
$\to$ | Tends to, or converges to | $n \to \infty$
$\sum$ | Summation | $\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3$
$\prod$ | Product | $\prod_{i=1}^{3} x_i = x_1 x_2 x_3$
$\ll$, $\gg$ | Much less than, much greater than | $n \gg 1$, $0 \le \varepsilon \ll 1$
$\delta_{ij}$ | Kronecker delta: $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \ne j$ |
$[a,b]$, $(a,b)$, $[a,b)$ | The closed interval $\{x : a \le x \le b\}$, the open interval $\{x : a < x < b\}$, and the half-closed, half-open interval $\{x : a \le x < b\}$ |
$f\colon P \to Q$ | The function $f$ maps the set $P$ to the set $Q$, that is, $x \in P$ implies $f(x) \in Q$ |
$f'$, $f''$, $f'''$, $f^{(k)}$ | First, second, third, and $k$th derivatives of the function $f$ |
$\dot{f}$, $\ddot{f}$ | First and second derivatives of the function $f$ |
$C[a,b]$ | Real-valued continuous functions on $[a,b]$ | $f \in C[a,b]$
$C^k[a,b]$ | Real-valued functions with continuous derivatives of order $0, 1, \dots, k$ on $[a,b]$ | $f \in C^2[a,b]$
$L^2[a,b]$ | The functions $f\colon \mathbb{R} \to \mathbb{R}$ such that the Lebesgue integral $\int_a^b f(x)^2 \, dx$ exists |
$f \circ g$ | Composition of functions: $(f \circ g)(x) = f(g(x))$ | $e^{x^2} = e^x \circ x^2$
$:=$, $=:$ | Definition of a variable or function, to distinguish from mathematical equality | $y' = 1 + y^4 =: f(y)$


4 Functions 

A function $f$ is a rule that assigns for each value of $x$ a unique value $f(x)$. It can be thought of as a black box that takes an input $x$ and produces an output $y = f(x)$. A function is sometimes called a mapping. If we write $y = f(x)$ then $y$ is the dependent variable and $x$ is the independent variable, also called the argument of $f$.

For some functions there is not a unique value of $f(x)$ for a given $x$, and these multivalued functions are not true functions unless restrictions are imposed. For example, consider $y = \log x$, which in general denotes any solution of the equation $e^y = x$. There are infinitely many solutions, which can be written as $y = y_0 + 2\pi i k$ for $k \in \mathbb{Z}$, where $y_0$ is the principal logarithm, defined as the logarithm whose imaginary part lies in $(-\pi, \pi]$. The principal logarithm is often the one that is needed in practice and is usually the one computed by software. Multivalued functions of a complex variable can be elegantly handled using Riemann surfaces [IV.1 §2] and branch cuts [IV.1 §2].
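Python's `cmath.log` is one example of software that returns the principal logarithm. The short sketch below, with an arbitrarily chosen sample value, confirms that the imaginary part lies in $(-\pi, \pi]$ and that adding integer multiples of $2\pi i$ gives further solutions of $e^y = x$.

```python
import cmath
import math

x = complex(-2.0, 3.0)  # arbitrary sample value
y0 = cmath.log(x)       # principal logarithm
print(y0, -math.pi < y0.imag <= math.pi)

# Every y0 + 2*pi*i*k is also a logarithm of x.
others = [y0 + 2j * math.pi * k for k in range(-2, 3)]
print(all(cmath.isclose(cmath.exp(y), x) for y in others))

# On the negative real axis the principal value has imaginary part +pi.
print(cmath.log(-1.0))
```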

A function is linear if the independent variable appears only to the first power. Thus the function $f(x) = ax + b$, where $a$ and $b$ are constants, is linear in $x$. In some contexts, e.g., in convex optimization, $ax + b$ is called an affine function and the term linear is reserved for $f(x) = ax$, for which $f(tx) = tf(x)$ for all $t$.

A function $f$ is odd if $f(x) = -f(-x)$ for all $x$ and it is even if $f(x) = f(-x)$ for all $x$. For example, the sine function is odd, whereas $x^2$ and $|x|$ are even.

It is worth noting the distinction between the function $f$ and its value $f(x)$ at a particular point $x$. Sometimes this distinction is blurred; for example, one might write "the function $f(u, v)$," in order to emphasize the symbols being used for the independent variables.

Functions with more than one independent variable are called multivariate functions. For ease of notation the independent variables can be collected into a vector. For example, the multivariate function $f(u,v) = \cos u \sin v$ can be written $f(x) = \cos x_1 \sin x_2$, where $x = [x_1, x_2]^T$.

5 Limits and Continuity 

The notion of a function converging to a limit as its argument approaches a certain value seems intuitively obvious. For example, the statement that $x^2 \to 4$ as $x \to 2$, where the symbol "$\to$" means tends to or converges to, is clearly true, as can be seen by considering the graph of $x^2$. However, we need to make the notion of convergence precise because a large number of definitions are built on it.

Let $f$ be a real function of a real variable. We say that $f(x) \to \ell$ as $x \to a$, and we write $\lim_{x \to a} f(x) = \ell$, if for every $\varepsilon > 0$ there is a $\delta > 0$ such that $0 < |x - a| < \delta$ implies $|f(x) - \ell| < \varepsilon$. In other words, by choosing $x$ close enough to $a$, $f(x)$ can be made as close as desired to $\ell$. Showing that the definition holds in a particular case boils down to determining $\delta$ as a function of $\varepsilon$.
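For the example $x^2 \to 4$ as $x \to 2$, one valid choice (among many) is $\delta = \min(1, \varepsilon/5)$: if $|x - 2| < \delta \le 1$ then $|x + 2| < 5$, so $|x^2 - 4| = |x - 2|\,|x + 2| < 5\delta \le \varepsilon$. The sketch below spot-checks this choice numerically; the function name and the sample points are assumptions, and a finite sample is of course an illustration rather than a proof.

```python
def delta_for(eps):
    """A delta that works for f(x) = x**2 at a = 2 (one valid choice)."""
    return min(1.0, eps / 5.0)


def check(eps, samples=1000):
    """Test |x**2 - 4| < eps at sample points with 0 < |x - 2| < delta."""
    delta = delta_for(eps)
    for k in range(1, samples):
        offset = delta * k / samples  # strictly between 0 and delta
        for x in (2.0 - offset, 2.0 + offset):
            if not abs(x * x - 4.0) < eps:
                return False
    return True


results = [check(eps) for eps in (10.0, 1.0, 0.1, 1e-3, 1e-6)]
print(results)
```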

It is implicit in this definition that $\ell$ is finite. We say that $f(x) \to \infty$ as $x \to a$ if for every $\rho > 0$ there is a $\delta > 0$ such that $|x - a| < \delta$ implies $f(x) > \rho$.

In practice, mathematicians rarely prove existence of a limit by exhibiting the appropriate $\delta = \delta(\varepsilon)$ in these definitions. For example, one would argue that $\tan x \to \infty$ as $x \to \pi/2$ because $\sin x \to 1$ and $\cos x \to 0$ as $x \to \pi/2$. However, the definition might be used if $f$ is an implicitly defined function whose behavior is not well understood.

We can also define one-sided limits, in which the limiting value of $x$ is approached from the right or the left. For the right-sided limit $\lim_{x \to a+} f(x) = \ell$, the definition of limit is modified so that $0 < |x - a| < \delta$ is replaced by $a < x < a + \delta$, and the left-sided limit $\lim_{x \to a-} f(x)$ is defined analogously. The standard limit exists if and only if the right- and left-sided limits exist and are equal.

The function $f$ is continuous at $x = a$ if $f(a)$ exists and $\lim_{x \to a} f(x) = f(a)$.


The definitions of limit and continuity apply equally well to functions of a complex variable. Here, the condition $|x - a| < \delta$ places $x$ in a disk of radius less than $\delta$ in the complex plane instead of an open interval on the real axis.

The function $f$ is continuous on $[a,b]$ if it is continuous at every point in that interval. A more restricted form of continuity is Lipschitz continuity. The function $f$ is Lipschitz continuous on $[a,b]$ if

\[ |f(x) - f(y)| \le L|x - y| \quad \text{for all } x, y \in [a,b] \]

for some constant $L$, which is called the Lipschitz constant. This definition, which is quantitative as opposed to the purely qualitative usual definition of continuity, is useful in many settings in applied mathematics. A function may, however, be continuous without being Lipschitz continuous, as $f(x) = x^{1/2}$ on $[0,1]$ illustrates.
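The failure of Lipschitz continuity for the square root can be seen numerically: taking $y = 0$, the quotient $|f(x) - f(0)|/|x - 0| = x^{-1/2}$ exceeds any proposed constant $L$ as $x$ approaches 0. A brief sketch, with the sample points chosen arbitrarily:

```python
import math


def quotient(x, y=0.0):
    """Difference quotient |sqrt(x) - sqrt(y)| / |x - y| for x != y."""
    return abs(math.sqrt(x) - math.sqrt(y)) / abs(x - y)


for x in (1e-2, 1e-4, 1e-6, 1e-8):
    print(x, quotient(x))  # grows like 1/sqrt(x), so no single L can work
```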

A sequence $a_1, a_2, a_3, \dots$ of real or complex numbers, written $\{a_n\}$, has limit $c$ if for every $\varepsilon > 0$ there is a positive integer $N$ such that $|a_n - c| < \varepsilon$ for all $n \ge N$. We write $c = \lim_{n \to \infty} a_n$. An infinite series $\sum_{i=1}^{\infty} a_i$ converges if the sequence of partial sums $\sum_{i=1}^{n} a_i$ converges.

6 Bounds 

In applied mathematics we are often concerned with deriving bounds for quantities of interest. For example, we might wish to find a constant $u$ such that $f(x) \le u$ for all $x$ on a given interval. Such a $u$, if it exists, is called an upper bound. Similarly, a lower bound is a constant $\ell$ such that $f(x) \ge \ell$ for all $x$ on the interval. Of particular interest is the least upper bound, also called the supremum or sup, which is the smallest possible upper bound. The supremum might not actually be attained, as illustrated by the function $f(x) = x/(1+x)$ on $[0, \infty)$, which has supremum 1. The infimum, or inf, is the greatest possible lower bound.

A function that has an upper (or lower) bound is said 
to be bounded above (or bounded below). If the func- 
tion is bounded both above and below it is said to be 
bounded. A function that is not bounded is unbounded. 

Determining whether a certain function, perhaps a function of several variables or one defined in a function space [II.15], is bounded can be nontrivial and it is often a crucial step in proving the convergence of a process or determining the quality of an approximation.

Physical considerations sometimes imply that a function is bounded. For example, a function that represents energy must be nonnegative.

Figure 3 A convex function, illustrating the inequality (1).

7 Sets and Convexity

Three types of sets in $\mathbb{R}$ or $\mathbb{C}$ are commonly used in applied mathematics.

An open set is a set such that for every point in the set there is an open disk around it lying entirely in the set. An open disk (or open ball) around a point $a$ in $\mathbb{R}$ or $\mathbb{C}$ is the set of all points $z$ satisfying $|z - a| < \varepsilon$ for some specified $\varepsilon > 0$. For example, $\{z \in \mathbb{C} : |z| < 1\}$ is an open disk. In $\mathbb{R}$, an open disk reduces to an open interval $(a - \varepsilon, a + \varepsilon)$. A closed set is a set that is the complement of an open set; that is, it comprises all the points that are not in some open set. For example, $\{z \in \mathbb{C} : |z| \le 1\}$ is a closed disk, the complement of the open set $\{z \in \mathbb{C} : |z| > 1\}$.

A bounded set is one for which there is a constant $M$ such that $|x| \le M$ for all $x$ in the set. A set is compact if it is closed and bounded.

A convex set is a set for which the line joining any two points $x$ and $y$ in the set lies in the set, that is, $\alpha x + (1 - \alpha)y$ is in the set for all $\alpha \in [0,1]$. A related notion is that of a convex function. A real-valued function $f$ is convex on a convex set $S$ if

\[ f(\alpha x + (1 - \alpha)y) \le \alpha f(x) + (1 - \alpha)f(y) \qquad (1) \]

for all $x, y \in S$ and $\alpha \in [0,1]$. This inequality says that on the interval defined by $x$ and $y$ the function $f$ lies below the line joining $f(x)$ and $f(y)$ (see figure 3). An example of a convex function is $f(x) = x^2$ on the real line. A concave function is one satisfying (1) with the inequality reversed.
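Inequality (1) can be spot-checked numerically on a grid of sample points. The sketch below does so for $f(x) = x^2$ and, for contrast, for the concave function $-x^2$; the function name, the grid, and the small rounding tolerance are assumptions, and passing such a check is evidence of convexity rather than a proof.

```python
def satisfies_convexity(f, points, alphas, tol=1e-12):
    """Check inequality (1) for all sampled x, y, and alpha."""
    return all(
        f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + tol
        for x in points
        for y in points
        for a in alphas
    )


points = [k / 4 for k in range(-20, 21)]  # samples in [-5, 5]
alphas = [k / 10 for k in range(11)]      # samples in [0, 1]

convex_ok = satisfies_convexity(lambda x: x * x, points, alphas)
concave_ok = satisfies_convexity(lambda x: -x * x, points, alphas)
print(convex_ok, concave_ok)
```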

8 Order Notation 

We write x » y to mean that x is approximately 
equal to y. The accuracy of the approximation may 
be implied by the context or the way y is written. For 


example, the statement that n ~ 3.14 implies that the 
approximation is correct to two decimal places. 

The big-oh and little-oh notations, O ( - ) and o ( - ) , are 
used to give information about the relative behavior of 
two functions. We write 

(i) /(z) = 0(g(z)) as z - oo (or z - 0) if |/(z)| ^ 
c\g(z)\ for some constant c for all sufficiently 
large |z| (or all sufficiently small |z|); 

(u) /(z) = o(g(z)) as z — oo (or z — 0) if 

f(z)/g(z) — ■ 0 as z — oo (or z — 0). 

In both cases, g is usually a well-understood function 
and / is a function whose behavior we are trying to 
understand. 

To illustrate: z 3 + z 2 + z + 1 = 0(z 3 ) as z — oo and 
z 3 + z 2 + z = O(z) as z — • 0, while z = o(e z ) as z — oo. 

Big-oh notation is frequently used when comparing
the cost of algorithms measured as a function of problem
size. For example, the cost of multiplying two $n \times n$
matrices by the usual formulas is $n^3 + q(n)$ additions
and $n^3$ multiplications, where $q$ is a quadratic function.
We can say that matrix multiplication costs $2n^3 + O(n^2)$
operations, or simply $O(n^3)$ operations.

We write $f(z) \sim g(z)$ (in words, “$f(z)$ twiddles
$g(z)$”) if $f(z)/g(z)$ tends to 1 as $z$ tends to some quantity
$z_0$ (sometimes the ratio is required only to tend to
a finite, nonzero limit). For example, $\sin z \sim z$ as $z \to 0$,
$\sum_{i=1}^n i^2 \sim n^3/3$ as $n \to \infty$, and $n! \sim \sqrt{2\pi n}\,(n/e)^n$
as $n \to \infty$, the last approximation, called Stirling’s
approximation, being good even for small $n$.
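The claim that Stirling's approximation is already good for small n is easy to check numerically. The following is a minimal sketch; the function name `stirling` is ours and purely illustrative.

```python
import math

def stirling(n: int) -> float:
    """Stirling's approximation sqrt(2*pi*n) * (n/e)**n to n!."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

# The ratio n!/stirling(n) should approach 1 as n grows.
for n in (1, 2, 5, 10, 20):
    ratio = math.factorial(n) / stirling(n)
    print(f"n = {n:2d}   n!/approximation = {ratio:.5f}")
```

Even at n = 1 the ratio is within ten percent of 1, and the relative error shrinks roughly like 1/(12n).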


9 Calculus 


The rate of change of a quantity is a fundamental con- 
cept. The rate of change of the distance of a moving 
object from a given point is its speed, and the rate of 
change of speed is acceleration. In economics, inflation 
is the rate of change of a price index. The rate of change 
of a function is its derivative. Let $f$ be a real function of
a real variable. Intuitively, the rate of change of $f$ at $x$ is
obtained by making a small change in $x$ and taking the
ratio of the corresponding change in $f$ to the change in
$x$, that is, $(f(x + \epsilon) - f(x))/\epsilon$ for small $\epsilon$. In order to
get a unique quantity we take the limit as $\epsilon \to 0$, which
gives the derivative

$$\frac{df}{dx} = f'(x) = \lim_{\epsilon \to 0} \frac{f(x + \epsilon) - f(x)}{\epsilon}.$$


The derivative may or may not exist. For example, the
absolute value function $f(x) = |x|$ is not differentiable
at the origin because the left- and right-sided limits
are different: at $x = 0$, $\lim_{\epsilon \to 0^-} (f(x + \epsilon) - f(x))/\epsilon = -1$ and
$\lim_{\epsilon \to 0^+} (f(x + \epsilon) - f(x))/\epsilon = 1$ (see figure 4). Higher
derivatives are defined by applying the definition recursively;
thus $f''(x)$ is the derivative of $f'(x)$.

Figure 4 The absolute value function, $|x|$.

Figure 5 The function $f(x)$ and the tangent $g$ to $f$ at
$x = a$. The tangent is $g(x) = f'(a)(x - a) + f(a)$.

A graphical interpretation of the derivative is that it
is the slope of the tangent to the curve $y = f(x)$ (see
figure 5).

Another way to write the definition of derivative is as

$$f(x + \epsilon) - f(x) - f'(x)\epsilon = o(\epsilon).$$

This definition has the benefit of generalizing naturally
to function spaces [II.15], where it yields the Fréchet
derivative.
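The limit definition can be explored numerically by shrinking the step. The sketch below uses a helper we call `forward_difference` (an illustrative name) and takes sin as the test function, whose derivative cos is known; the last column is the remainder divided by the step, which should tend to zero.

```python
import math

def forward_difference(f, x, eps):
    """Difference quotient (f(x + eps) - f(x)) / eps."""
    return (f(x + eps) - f(x)) / eps

x = 1.0
exact = math.cos(x)  # derivative of sin at x
for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    quotient = forward_difference(math.sin, x, eps)
    # (f(x+eps) - f(x) - f'(x)*eps) / eps equals quotient - exact
    print(f"eps = {eps:.0e}   quotient = {quotient:.6f}   remainder/eps = {quotient - exact:.2e}")

# For |x| at the origin the one-sided quotients disagree, so no derivative exists.
print(forward_difference(abs, 0.0, 1e-3), forward_difference(abs, 0.0, -1e-3))
```

In floating-point arithmetic the step cannot be made arbitrarily small, because rounding errors in the subtraction eventually dominate.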

A zero derivative identifies stationary points of a
function, with the type of stationary point (maximum,
minimum, or saddle point, the last also called a point of
inflection) being determined by the second and possibly
higher derivatives. This can be seen with the aid
of a Taylor series about the point $a$ of interest:

$$f(x) = f(a) + f'(a)(x - a) + f''(a)\frac{(x - a)^2}{2!} + f'''(a)\frac{(x - a)^3}{3!} + \cdots.$$

If $f'(a) = 0$ and $f''(a) \ne 0$ then, for $x$ sufficiently
close to $a$, $f(x) \approx f(a) + f''(a)(x - a)^2/2$, and so $a$ is
a maximum point if $f''(a) < 0$ and a minimum point if
$f''(a) > 0$. If $f'(a) = f''(a) = 0$ then we need to look
at the higher-order derivatives to determine the nature
of the stationary point; in particular, if $f'''(a) \ne 0$ then
$a$ is a saddle point (see figure 6).

Figure 6 The function $f(x) = x^3 - x^4$ with a saddle
point at $x = 0$ and a maximum at $x = 3/4$.
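The derivative test can be applied mechanically to the function of figure 6. In this sketch the derivatives of x^3 - x^4 are worked out by hand, and `classify` is an illustrative helper that only looks at the second and third derivatives.

```python
def classify(d2: float, d3: float) -> str:
    """Classify a stationary point from the values f''(a) and f'''(a)."""
    if d2 < 0:
        return "maximum"
    if d2 > 0:
        return "minimum"
    if d3 != 0:
        return "saddle point"
    return "undetermined (look at higher derivatives)"

# Derivatives of f(x) = x**3 - x**4.
d1 = lambda x: 3 * x**2 - 4 * x**3
d2 = lambda x: 6 * x - 12 * x**2
d3 = lambda x: 6 - 24 * x

for a in (0.0, 0.75):
    assert d1(a) == 0.0  # both points are stationary
    print(f"x = {a}: {classify(d2(a), d3(a))}")
```

The two points are classified as a saddle point and a maximum, respectively, in agreement with the figure caption.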

The error in truncating a Taylor series is captured in
the Taylor series with remainder:

$$f(x) = \sum_{k=0}^{n} f^{(k)}(a)\frac{(x - a)^k}{k!} + f^{(n+1)}(\xi)\frac{(x - a)^{n+1}}{(n+1)!},$$

where $\xi$ is an unknown point on the interval with endpoints
$a$ and $x$. For $n = 0$ this reduces to $f(x) - f(a) =
f'(\xi)(x - a)$, which is the mean-value theorem.

Rigorous statements of results must include assump- 
tions about the smoothness of the functions involved, 
that is, how many derivatives are assumed to exist. For 
example, the Taylor series with remainder is valid if $f$ is
$(n+1)$-times continuously differentiable on an interval
containing $x$ and $a$. In applied mathematics we often
avoid clutter by writing “for smooth functions $f$” to
indicate that the existence of continuous derivatives up
to some order is assumed. Underlying such a statement
might be some known minimal assumption on $f$, or just
the knowledge that the existence of continuous derivatives
of all orders is sufficient and that less restrictive
conditions can usually be derived if necessary.
Sometimes, when deriving or using results, it is not possible
to check smoothness conditions and one simply car- 
ries on anyway (“making compromises,” as mentioned 
in the quote by Courant on page 1). It may be possible 
to verify by other means that an answer obtained in a 
nonrigorous way is valid. 

For a function f(x,y) of two variables, partial 
derivatives with respect to each of the two variables are 
defined by holding one variable constant and varying 
the other: 


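$$\frac{\partial f}{\partial x} = \lim_{\epsilon \to 0} \frac{f(x + \epsilon, y) - f(x, y)}{\epsilon},$$

$$\frac{\partial f}{\partial y} = \lim_{\epsilon \to 0} \frac{f(x, y + \epsilon) - f(x, y)}{\epsilon}.$$

Higher derivatives are defined recursively. For example,

$$\frac{\partial^2 f}{\partial x^2} = \lim_{\epsilon \to 0} \frac{\frac{\partial f}{\partial x}(x + \epsilon, y) - \frac{\partial f}{\partial x}(x, y)}{\epsilon},$$

$$\frac{\partial^2 f}{\partial x\,\partial y} = \lim_{\epsilon \to 0} \frac{\frac{\partial f}{\partial x}(x, y + \epsilon) - \frac{\partial f}{\partial x}(x, y)}{\epsilon},$$

$$\frac{\partial^2 f}{\partial y\,\partial x} = \lim_{\epsilon \to 0} \frac{\frac{\partial f}{\partial y}(x + \epsilon, y) - \frac{\partial f}{\partial y}(x, y)}{\epsilon}.$$
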

Common abbreviations are $f_x = \partial f/\partial x$, $f_{xy} =
\partial^2 f/(\partial x\,\partial y)$, $f_{yy} = \partial^2 f/\partial y^2$, and so on. As long as
they are continuous the two mixed second-order partial
derivatives are equal: $f_{xy} = f_{yx}$.

For a function of $n$ variables, $F\colon \mathbb{R}^n \to \mathbb{R}$, a Taylor
series takes the form, for $x, a \in \mathbb{R}^n$,

$$F(x) = F(a) + \nabla F(a)^T(x - a) + \tfrac{1}{2}(x - a)^T\nabla^2 F(a)(x - a) + \cdots,$$

where $\nabla F(x) = (\partial F/\partial x_j) \in \mathbb{R}^n$ is the gradient vector
and $\nabla^2 F(x) = (\partial^2 F/(\partial x_i\,\partial x_j)) \in \mathbb{R}^{n \times n}$ is the symmetric
Hessian matrix, with $x_j$ denoting the $j$th component
of the vector $x$. The symbol $\nabla$ is called nabla. Stationary
points of $F$ are zeros of the gradient and their nature
(maximum, minimum, or saddle point) is determined
by the eigenvalues of the Hessian (see continuous
optimization [IV.11 §2]).

Now we return to functions of a single (real) variable.
The indefinite integral of $f(x)$ is $\int f(x)\,dx$, while
integrating between limits $a$ and $b$ gives the definite
integral $\int_a^b f(x)\,dx$. The definite integral can be interpreted
as the area under the curve $f(x)$ between $a$
and $b$. The inverse of differentiation is integration, as
shown by the fundamental theorem of calculus, which
states that, if $f$ is continuous on $[a, b]$, then the function
$g(x) = \int_a^x f(t)\,dt$ is differentiable on $(a, b)$ and
$g'(x) = f(x)$. Generalizations of the fundamental theorem
of calculus to functions of more than one variable
are given in section 24.

For functions of two or more variables there are 
other kinds of integrals. When there are two variables, 
x and y, we can integrate over regions in the xy- 
plane (double integrals) or along curves in the plane 
(line integrals). For functions of three variables, x, y, 
and z, there are more possibilities. We can integrate 
over volumes (triple integrals) or over surfaces or along 
curves within xyz-space. As the number of variables 
increases, so does the number of different kinds of inte- 
grals. Multidimensional calculus shows how these dif- 
ferent integrals can be calculated, used, and related. 
The number of variables can be very large (e.g., in math- 
ematical finance) and the curse of dimensionality 
[I.3 §2] poses major challenges for numerical evaluation.
Numerical integration in more than one dimension
is an active area of research, and Monte Carlo
methods and quasi-Monte Carlo methods are among 
the methods in use. 
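To give a flavor of the Monte Carlo idea, the following sketch estimates an integral over the ten-dimensional unit cube by averaging the integrand at uniformly random points. The integrand, the sample size, and the function name `monte_carlo_integral` are illustrative choices of ours; the integrand was picked because its exact integral is known.

```python
import random

def monte_carlo_integral(f, dim: int, samples: int, seed: int = 0) -> float:
    """Estimate the integral of f over [0, 1]^dim by averaging f at random points."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    total = 0.0
    for _ in range(samples):
        point = [rng.random() for _ in range(dim)]
        total += f(point)
    return total / samples

# Sum of squares of the coordinates; its integral over [0, 1]^dim is dim/3.
f = lambda x: sum(t * t for t in x)
estimate = monte_carlo_integral(f, dim=10, samples=20000)
print("estimate:", estimate, "  exact:", 10 / 3)
```

The error of such an estimate decreases like the reciprocal of the square root of the number of samples, at a rate that does not depend on the dimension, which is why the approach remains usable when the number of variables is large.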

The product rule gives a formula for the derivative of
a product of two functions:

$$\frac{d}{dx}f(x)g(x) = f'(x)g(x) + f(x)g'(x).$$

Integrating this equation gives the rule for integration
by parts:

$$\int f(x)g'(x)\,dx = f(x)g(x) - \int f'(x)g(x)\,dx.$$

In many problems functions are composed: the argument
of a function is another function. Consider the
example $f(x) = g(h(x))$. We would hope to be able to
determine the derivative of $f$ in terms of the derivatives
of $g$ and $h$. The chain rule provides the necessary
formula: $f'(x) = h'(x)g'(h(x))$. An equivalent formulation
is that, if $f$ is a function of $u$, which is itself a
function of $x$, then

$$\frac{df}{dx} = \frac{df}{du}\frac{du}{dx}.$$

For example, if $f(x) = \sin x^2$ then with $f = \sin u$
and $u = x^2$ we have $df/dx = 2x\cos x^2$.

10 Ordinary Differential Equations 

A differential equation is an equation containing one or 
more derivatives of an unknown function. It provides a 
relation among a function, its rate of change, and (possibly)
higher-order rates of change. The independent
variable usually represents a spatial coordinate ($x$) or
time ($t$). The differential equation may be accompanied
by additional information about the function, called 
boundary conditions or initial conditions, that serve to 
uniquely determine the solution. A solution to a differ- 
ential equation is a function that satisfies the equation 
for all values of the independent variables (perhaps in 
some region) and also satisfies the required boundary 
conditions or initial conditions. A differential equation 
can express a law of motion, a conservation law, or con- 
centrations of constituents of a chemical reaction, for 
example. 

Ordinary differential equations (ODEs) contain just 
one independent variable. The simplest nontrivial ODE 
is $dy/dt = ay$, where $y = y(t)$ is a function of $t$. This
equation is linear in $y$ and it is first order because only
the first derivative of $y$ appears. The general solution
is $y(t) = ce^{at}$, where $c$ is an arbitrary constant. To
determine $c$, some value of $y$ must be supplied, say
$y(0) = y_0$, whence $c = y_0$.

A general first-order ODE has the form $y' = f(t, y)$
for some function $f$ of two variables. The initial-value
problem supplies an initial condition and asks for $y$ at
later times:

$$y' = f(t, y), \quad a \le t \le b, \quad y(a) = y_a.$$

A specific example is the Riccati equation

$$y' = t^2 + y^2, \quad 0 \le t \le 1, \quad y(0) = 0,$$

which is nonlinear because of the appearance of $y^2$.
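An initial-value problem of this kind can be solved approximately by stepping forward in time. The sketch below uses the classical fourth-order Runge-Kutta method, which is our choice for illustration and not something prescribed by the text; the function name `rk4` and the step count are likewise assumptions.

```python
import math

def rk4(f, t0, y0, t_end, steps):
    """Integrate y' = f(t, y) from t0 to t_end with the classical Runge-Kutta method."""
    h = (t_end - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y

# The Riccati problem y' = t^2 + y^2, y(0) = 0, integrated to t = 1.
riccati = lambda t, y: t**2 + y**2
y1 = rk4(riccati, 0.0, 0.0, 1.0, steps=100)
print("approximation to y(1):", y1)

# Sanity check on the linear problem y' = a*y, whose solution is y0*exp(a*t).
print(rk4(lambda t, y: -2.0 * y, 0.0, 1.0, 1.0, steps=100), math.exp(-2.0))
```

The linear test problem is included because its exact solution is known, so the accuracy of the integrator can be judged before trusting it on the nonlinear equation.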

For an example of a second-order ODE initial-value 
problem, that is, one involving y", consider a mass m 
attached to a vertical spring and to a damper, as shown 
in figure 7. Let $y = y(t)$ denote how much the spring
is stretched from its natural length at time $t$. Balancing
forces using Newton’s second law (force equals mass
times acceleration) and Hooke’s law [III.15] gives

$$my'' = mg - ky - cy',$$

where $k$ is the spring constant, $c$ is the damping constant,
and $g$ is the gravitational constant. With prescribed
values for $y(0)$ and $y'(0)$ this is an initial-value
problem. More generally, the spring might also
be subjected to an external force $f(t)$, in which case
the equation of motion becomes

$$my'' + cy' + ky = mg + f(t).$$

Second-order ODEs also arise in electrical networks.
Consider the flow of electric current $I(t)$ in a simple
RLC circuit composed of an inductor with inductance
$L$, a resistor with resistance $R$, a capacitor with capacitance
$C$, and a source with voltage $v_S$, as illustrated
in figure 8. The Kirchhoff voltage law states that the
sum of the voltage drops around the circuit equals the
input voltage, $v_S$. The voltage drops across the resistor,
inductor, and capacitor are $RI$, $L\,dI/dt$, and $Q/C$,
respectively, where $Q(t)$ is the charge on the capacitor,
so

$$L\frac{dI}{dt} + RI + \frac{Q}{C} = v_S(t).$$

Since $I = dQ/dt$, this equation can be rewritten as the
second-order ODE

$$L\frac{d^2Q}{dt^2} + R\frac{dQ}{dt} + \frac{Q}{C} = v_S(t).$$

Figure 7 A spring system with damping.

Figure 8 A simple RLC electric circuit.


The unknown function y may have more than one 
component, as illustrated by the predator-prey model 
derived by Lotka and Volterra in the 1920s. In a pop- 
ulation of rabbits (the prey) and foxes (the predators) 
let $r(t)$ be the number of rabbits at time $t$ and $f(t)$ the
number of foxes at time $t$. The model is

$$\frac{dr}{dt} = r - \alpha rf, \quad r(0) = r_0,$$

$$\frac{df}{dt} = -f + \alpha rf, \quad f(0) = f_0.$$

The $rf$ term represents an interaction between the
foxes and the rabbits (a fox eating a rabbit) and the
parameter $\alpha \ge 0$ controls the amount of interaction.
For $\alpha = 0$ there is no interaction and the solution is
$r(t) = r_0e^t$, $f(t) = f_0e^{-t}$: the foxes die from starvation
and the rabbits go forth and multiply, unhindered.
The aim is to investigate the behavior of the solutions
for various parameters $\alpha$ and starting populations $r_0$
and $f_0$.

As we have described it, the predator-prey model has 
the apparent contradiction that $r$ and $f$ are integers by
definition yet the solutions to the differential equation
are real-valued. The way around this is to assume that
$r$ and $f$ are large enough for the error in representing
them by continuous variables to be small. 
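The behavior of the predator-prey model for different parameters can be explored numerically. The sketch below integrates the system with a classical fourth-order Runge-Kutta scheme (our choice of method); the starting populations, the value of the interaction parameter, and the function names are illustrative assumptions. The run without interaction is compared with the exact exponential solutions.

```python
import math

def lotka_volterra(alpha):
    """Right-hand side of r' = r - alpha*r*f, f' = -f + alpha*r*f."""
    def rhs(t, u):
        r, f = u
        return (r - alpha * r * f, -f + alpha * r * f)
    return rhs

def rk4_system(rhs, u0, t_end, steps):
    """Classical fourth-order Runge-Kutta for a system, starting at t = 0."""
    h = t_end / steps
    t, u = 0.0, tuple(u0)
    for _ in range(steps):
        k1 = rhs(t, u)
        k2 = rhs(t + h / 2, tuple(ui + h / 2 * ki for ui, ki in zip(u, k1)))
        k3 = rhs(t + h / 2, tuple(ui + h / 2 * ki for ui, ki in zip(u, k2)))
        k4 = rhs(t + h, tuple(ui + h * ki for ui, ki in zip(u, k3)))
        u = tuple(ui + h / 6 * (a + 2 * b + 2 * c + d)
                  for ui, a, b, c, d in zip(u, k1, k2, k3, k4))
        t += h
    return u

# No interaction: compare with r0*e^t and f0*e^(-t) at t = 1.
r1, f1 = rk4_system(lotka_volterra(0.0), (300.0, 150.0), 1.0, 200)
print(r1, 300 * math.e, f1, 150 / math.e)

# With interaction (0.01 is an arbitrary illustrative value), populations at t = 5.
print(rk4_system(lotka_volterra(0.01), (300.0, 150.0), 5.0, 2000))
```

Repeating the second call for a range of end times traces out how the two populations rise and fall in turn.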

A boundary-value problem specifies the function at
more than one value of the independent variable, as in
the two-point boundary-value problem

$$y'' = f(t, y, y'), \quad a \le t \le b, \quad y(a) = y_a, \quad y(b) = y_b.$$

An example is the Thomas-Fermi equation

$$y'' = t^{-1/2}y^{3/2}, \quad y(0) = 1, \quad y(\infty) = 0,$$

which arises in a semiclassical description of the charge
density in atoms of high atomic number. Another example,
this time of third order, is the Blasius equation
[IV.28 §7.2]

$$2y''' + yy'' = 0, \quad y(0) = y'(0) = 0, \quad y'(\infty) = 1,$$

which describes the boundary layer in a fluid flow. 

A special type of ODE boundary-value problem is the
Sturm-Liouville problem

$$-(p(x)y'(x))' + q(x)y(x) = \lambda r(x)y(x), \quad x \in [a, b], \quad y(a) = y(b) = 0.$$

This is an eigenvalue problem, meaning that the aim
is to determine values of the parameter $\lambda$ for which
the boundary-value problem has a solution that is not 
identically zero. 

11 Partial Differential Equations

Many important physical processes are modeled by par- 
tial differential equations (PDEs): differential equations 
containing more than one independent variable. We 
summarize a few key equations and basic concepts. 
We write the equations in forms where the unknown $u$
has two space dimensions, $u = u(x, y)$, or one space
dimension and one time dimension, $u = u(x, t)$. Where
possible, the equations are given in parameter-free
form, a form that is obtained by the process of
nondimensionalization [II.9]. Recall the abbreviations
$u_t = \partial u/\partial t$, $u_{xx} = \partial^2 u/\partial x^2$, etc.

Laplace’s equation [III.18] is

$$u_{xx} + u_{yy} = 0.$$

The left-hand side of the equation is the Laplacian of
$u$, written $\Delta u$. This equation is encountered in electrostatics
(for example), where $u$ is the potential function.
The equation $\Delta u = f$, for a given function $f(x, y)$, is
known as Poisson’s equation.

To define a problem with a unique solution it is nec- 
essary to augment the PDE with conditions on the solu- 
tion: either boundary conditions for static problems 
or, for time-dependent problems, initial conditions. In 
the former class there are three main types of bound- 
ary conditions, with the problem being to determine u 
inside the boundary of a closed region. 

• Dirichlet conditions, in which the function u is 
specified on the boundary. 

• Neumann conditions, where the inner product (see
section 19.1) of the gradient

$$\nabla u = [\partial u/\partial x, \partial u/\partial y]^T$$

with the normal to the boundary is specified.

• Cauchy conditions, which comprise a combination 
of Dirichlet and Neumann conditions. 

For time-dependent problems, which are known as evo- 
lution problems and represent equations of motion, 
initial conditions at the starting time, usually taken to 
be t = 0, are needed, the number of initial conditions 
depending on the highest order of time derivative in 
the PDE. 

The wave equation [III.31] is

$$u_{tt} = u_{xx}.$$

It describes linear, nondispersive propagation of a
wave, represented by the wave function $u$, e.g., a vibrating
string. Two initial conditions, prescribing $u(x, 0)$
and $u_t(x, 0)$, for example, are needed to determine $u$.

The heat equation [III.8] (diffusion equation) is

$$u_t = u_{xx}, \qquad (2)$$

which describes the diffusion of heat in a solid or the
spread of a disease in a population. An initial condition
prescribing $u$ at $t = 0$ is usual. When a term $f(x, t, u)$
is added to the right-hand side of (2) the equation
becomes a reaction-diffusion equation.

The advection-diffusion equation is

$$u_t + vu_x = u_{xx},$$

where $v$ is a given function of $x$ and $t$. Again, $u$ is usually
given at $t = 0$. For $v = 0$ this is just the heat equation.
This PDE models the convection (or transport) of
a quantity such as a pollutant in the atmosphere.

The general linear second-order PDE

$$au_{xx} + 2bu_{xt} + cu_{tt} = f(x, t, u, u_x, u_t) \qquad (3)$$

is classified into different types according to the (constant)
coefficients of the second derivatives. Let $d =
ac - b^2$, which is the determinant of the symmetric
matrix $\begin{bmatrix} a & b \\ b & c \end{bmatrix}$.

• If d > 0 the PDE is elliptic. These PDEs, of which 
the Laplace equation is a particular case, are asso- 
ciated with equilibrium or steady-state processes. 
The independent variables are denoted by x and 
y instead of x and t. 

• If d = 0 the PDE is parabolic. This is an evolution 
problem governing a diffusion process. The heat 
equation is an example. 

• If d < 0 the PDE is hyperbolic. This is an evolution 
problem, governing wave propagation. The wave 
equation is an example. 

Some elliptic PDEs and parabolic PDEs have maxi- 
mum principles, which say that the solution must take 
on its maximum value on the boundary of the domain 
over which it is defined. 

In (3) we took a, b, and c to be constants, but they 
may also be specified as functions of x and t, in which 
case the nature of the PDE can change as x and t vary 
in the domain. For example, the Tricomi equation
[III.30]

$$u_{xx} + xu_{yy} = 0$$

is hyperbolic for x < 0, elliptic for x > 0, and parabolic 
for x = 0. 
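The classification rule amounts to checking the sign of $d = ac - b^2$. A minimal sketch, with `classify_pde` as an illustrative name, applied to the equations mentioned above:

```python
def classify_pde(a: float, b: float, c: float) -> str:
    """Type of a*u_xx + 2*b*u_xt + c*u_tt = f from the sign of d = a*c - b**2."""
    d = a * c - b * b
    if d > 0:
        return "elliptic"
    if d < 0:
        return "hyperbolic"
    return "parabolic"

print(classify_pde(1, 0, 1))    # Laplace: u_xx + u_yy = 0
print(classify_pde(1, 0, 0))    # heat: only u_xx appears among the second derivatives
print(classify_pde(1, 0, -1))   # wave: u_xx - u_tt = 0
for x in (-1.0, 0.0, 1.0):      # Tricomi: a = 1, b = 0, c = x
    print(x, classify_pde(1, 0, x))
```

For the Tricomi equation the coefficient c varies with position, so the same function is simply evaluated point by point.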

The PDEs stated so far are all linear. Nonlinear PDEs, 
in which the unknown function appears nonlinearly, 
are of great practical importance. Examples are the 
Korteweg-de Vries equation [III.16]

$$u_t + uu_x + u_{xxx} = 0,$$

the Cahn-Hilliard equation [III.5]

$$u_t = \Delta(-u + u^3 + \epsilon^2\Delta u),$$

and Fisher’s equation

$$u_t = u_{xx} + u(1 - u),$$

a reaction-diffusion equation that describes pattern 
formation [IV.27] and the propagation of genes in a 
population. 

PDEs also occur in the form of eigenvalue prob- 
lems. A famous example is the eigenvalue problem 
corresponding to the Laplace equation: 

$$\Delta u + \lambda u = 0$$

on a membrane $\Omega$, with boundary conditions that $u$
vanishes on the boundary of $\Omega$. A nonzero solution
$u$ is called an eigenfunction and $\lambda$ is the corresponding
eigenvalue. In a 1966 paper titled “Can one hear
the shape of a drum?” Mark Kac asked the question of
whether one can determine $\Omega$ given all the eigenvalues.
In other words, do the frequencies at which a drum 
vibrates uniquely determine its shape? It was shown 
in a 1992 paper by Gordon, Webb, and Wolpert that the 
answer is no in general. 

Higher-order PDEs also arise. For example, fluid 
dynamics problems involving surface tension forces 
are generally modeled by PDEs in space and time with 
fourth-order derivatives in space. The same is true of 
the Euler-Bernoulli equation for a beam, which has the
form

$$\rho A\frac{\partial^2 u}{\partial t^2} + EI\frac{\partial^4 u}{\partial x^4} = f(x, t),$$

where $u(x, t)$ is the vertical displacement of the beam
at time $t$ and position $x$ along the beam, $\rho$ is the density
of the beam, $A$ its cross-sectional area, $E$ is Young’s
modulus, $I$ is the second moment of inertia, and $f(x, t)$
is an applied force.


12 Other Types of Differential Equations 

Delay differential equations are differential equations 
in which the derivative of the unknown function y at 
time t (in general, a vector function) depends on past 
values of $y$ and/or its derivatives. For example, $y'(t) =
\lambda y(t - 1)$ is a delay differential equation analogue of
the familiar $y'(t) = \lambda y(t)$. Looking for a solution of
the form $y(t) = e^{\omega t}$ leads to the equation $\omega e^{\omega} = \lambda$,
whose solutions are given by the Lambert W function
[III.17].

Integral equations [IV.4] contain the unknown
function inside an integral. Examples are Fredholm
equations, which are of the form either

$$\int_0^1 K(x, y)f(y)\,dy = g(x),$$

where $K$ and $g$ are given and the task is to find $f$, or

$$\lambda\int_0^1 K(x, y)f(y)\,dy + g(x) = f(x),$$

where $\lambda$ is an eigenvalue and again $f$ is unknown. These
two types of equations are analogous to a matrix linear
system $Kf = g$ and an eigenvalue problem $(I -
\lambda K)f = g$, respectively. Integro-differential equations
involve both integrals and derivatives (see, for example,
modeling a pregnancy testing kit [VII.18 §2]).

Fractional differential equations contain fractional
derivatives. For example, $(d/dx)^{1/2}$ is defined to be an
operator such that applying $(d/dx)^{1/2}$ twice in succession
to a function $f(x)$ is the same as differentiating it
once (that is, applying $d/dx$).

Differential-algebraic equations (DAEs) are systems
of equations that contain both differential and algebraic
equations. For example, the DAE

$$x'' = -2\lambda x,$$

$$y'' = -2\lambda y - g,$$

$$x^2 + y^2 = L^2$$

describes the coordinates of an infinitesimal ball of
mass 1 at the end of a pendulum of length $L$, where
$g$ is the gravitational constant and $\lambda$ is the tension in
the rod. DAEs often arise in the form $My' = f(t, y)$,
where the matrix $M$ is singular.

13 Recurrence Relations 

Recurrence relations are the discrete counterpart of 
differential equations. They define a sequence $x_0, x_1,
x_2, \dots$ recursively, by specifying $x_n$ in terms of earlier
terms in the sequence. Such equations are also
called difference equations, as they arise when derivatives
in differential equations are replaced by finite
differences [II.11].

A famous recurrence is the three-term recurrence 
that defines the Fibonacci numbers: 

$$f_n = f_{n-1} + f_{n-2}, \quad n \ge 2, \quad f_0 = f_1 = 1.$$

With these starting values the recurrence has the explicit solution
$f_n = (\phi^{n+1} - (-\phi)^{-(n+1)})/\sqrt{5}$, where $\phi = (1 + \sqrt{5})/2$ is the golden
ratio. An example of a two-term recurrence is $f(n) =
nf(n - 1)$, with $f(0) = 1$, which defines the factorial
function $f(n) = n!$. Both the examples so far are linear
recurrences, but in some recurrences the earlier terms
appear nonlinearly, as in the logistic recurrence
[III.19] $x_{n+1} = \mu x_n(1 - x_n)$.
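A recurrence can be evaluated directly and compared with a closed form. In the sketch below, `fib` iterates the three-term recurrence from the starting values $f_0 = f_1 = 1$, and `binet` evaluates the golden-ratio expression $(\phi^n - (-\phi)^{-n})/\sqrt{5}$; both names are illustrative. That expression generates the sequence started from 0, 1, so with starting values 1, 1 its index has to be shifted by one.

```python
import math

def fib(n: int) -> int:
    """n-th term of f_n = f_{n-1} + f_{n-2} with f_0 = f_1 = 1."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def binet(n: int) -> float:
    """(phi**n - (-phi)**(-n)) / sqrt(5), with phi the golden ratio."""
    phi = (1 + math.sqrt(5)) / 2
    return (phi**n - (-phi) ** (-n)) / math.sqrt(5)

# binet(0), binet(1), ... is 0, 1, 1, 2, 3, 5, ..., so fib(n) matches binet(n + 1).
for n in range(10):
    print(n, fib(n), round(binet(n + 1)))
```

The rounding is needed only because the closed form is evaluated in floating-point arithmetic; in exact arithmetic it yields integers.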

Although one can evaluate the terms in a recur- 
rence one often needs an explicit formula for the gen- 
eral solution of the recurrence. Recurrence relations 
have a theory analogous to that of differential equa- 
tions, though it is much less frequently encountered in 
courses and textbooks than it was fifty years ago. 

The elements in a recurrence can be functions as well 
as numbers. Most transcendental functions that carry 
subscripts satisfy a recurrence. For example, the Bessel
function [III.2] $J_n(x)$ of order $n$ satisfies the three-term
recurrence

$$J_{n+1}(x) = \frac{2n}{x}J_n(x) - J_{n-1}(x).$$

An important source of three-term recurrences is
orthogonal polynomials [II.29].

14 Polynomials 

Polynomials are one of the simplest and most familiar 
classes of functions and they find wide use in applied 
mathematics. A degree-n polynomial 

$$p_n(x) = a_0 + a_1x + \cdots + a_nx^n$$

is defined by its $n + 1$ coefficients $a_0, \dots, a_n \in \mathbb{C}$ (with
$a_n \ne 0$). Addition of two polynomials is carried out by
adding the corresponding coefficients. Thus, if $q_n(x) =
b_0 + b_1x + \cdots + b_nx^n$ then $p_n(x) + q_n(x) = a_0 +
b_0 + (a_1 + b_1)x + \cdots + (a_n + b_n)x^n$. Multiplication is
carried out by expanding the product term by term and
collecting like powers of $x$:

$$p_n(x)q_n(x) = a_0b_0 + (a_0b_1 + a_1b_0)x + \cdots + (a_0b_n + a_1b_{n-1} + \cdots + a_nb_0)x^n + \cdots + a_nb_nx^{2n}.$$

The coefficient of $x^n$, $\sum_{i=0}^n a_ib_{n-i}$, is the convolution
of the vectors $a = [a_0, a_1, \dots, a_n]^T$ and $b =
[b_0, b_1, \dots, b_n]^T$. Polynomial division is also possible.
Dividing $p_n$ by $q_m$ with $m \le n$ results in

$$p_n(x) = q_m(x)g(x) + r(x), \qquad (4)$$

where the quotient $g$ and remainder $r$ are polynomials
and the degree of $r$ is less than that of $q_m$.
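Both operations are short to express on coefficient lists. In this sketch a polynomial is stored as the list of its coefficients in increasing order of power; the function names `poly_mul` and `poly_divmod` are illustrative, and the division is ordinary long division, assuming the leading coefficient of the divisor is nonzero.

```python
def poly_mul(a, b):
    """Coefficients of the product; a[i] and b[i] multiply x**i."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj  # contributes to the coefficient of x**(i+j)
    return c

def poly_divmod(p, q):
    """Quotient g and remainder r with p = q*g + r and deg r < deg q."""
    r = list(p)
    g = [0] * max(len(p) - len(q) + 1, 1)
    for k in range(len(p) - len(q), -1, -1):
        g[k] = r[k + len(q) - 1] / q[-1]  # eliminate the current leading term
        for j, qj in enumerate(q):
            r[k + j] -= g[k] * qj
    return g, r[: len(q) - 1]

p = [1, 2, 3]   # 1 + 2x + 3x^2
q = [4, 5]      # 4 + 5x
print(poly_mul(p, q))
print(poly_divmod(poly_mul(p, q), q))   # recovers p with zero remainder
print(poly_divmod([1, 0, 1], [-1, 1]))  # x^2 + 1 divided by x - 1
```

Each entry of the product is a sum of the form appearing in the convolution, which is why fast convolution algorithms can be used to multiply polynomials of high degree.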

The fundamental theorem of algebra says that a
degree-$n$ polynomial $p_n$ has a root in $\mathbb{C}$; that is, there
exists $z_1 \in \mathbb{C}$ such that $p_n(z_1) = 0$. If we take $q_m(x) =
x - z_1$ in (4) then we have $p_n(x) = (x - z_1)g(x) + r(x)$,
where $\deg r < 1$, so $r$ is a constant. But setting $x = z_1$
we see that $0 = p_n(z_1) = r$, so $p_n(x) = (x - z_1)g(x)$
and $g$ clearly has degree $n - 1$. Repeating this argument
inductively on $g$, we end up with a factorization
$p_n(x) = a_n(x - z_1)(x - z_2)\cdots(x - z_n)$, which shows
that $p_n$ has $n$ roots in $\mathbb{C}$ (not necessarily distinct). If
the coefficients of $p_n$ are real it does not follow that the
roots are real, and indeed there may be no real roots at
all, as the polynomial $x^2 + 1$ shows; however, nonreal
roots must occur in complex conjugate pairs $x_j \pm \mathrm{i}y_j$.

Three basic problems associated with polynomials 
are as follows. 

Evaluation: given the polynomial (specified by its coefficients),
find its value at a given point. A standard
way of doing this is Horner's method [I.4 §6].
Interpolation: given the values of a degree-$n$ polynomial
at a set of $n + 1$ distinct points, find its coefficients.
This can be done by various interpolation
schemes [I.3 §3.1].
Root finding: given the polynomial and the ability to
evaluate it, find its roots. This is a classic problem 
with a vast literature, including methods specific to 
polynomials and specializations of general-purpose 
nonlinear equation solvers. 
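As an illustration of the evaluation problem, here is a minimal sketch of Horner's method, which rewrites the polynomial in nested form so that a degree-n polynomial is evaluated with n multiplications and n additions. The function name and the coefficient ordering (constant term first) are our conventions.

```python
def horner(coeffs, x):
    """Evaluate a_0 + a_1*x + ... + a_n*x**n, with coeffs = [a_0, ..., a_n]."""
    result = 0
    for a in reversed(coeffs):
        result = result * x + a  # nested form: (...(a_n*x + a_{n-1})*x + ...)*x + a_0
    return result

print(horner([1, 2, 3], 2))    # 1 + 2*2 + 3*2**2
print(horner([1, 0, 1], 1j))   # x**2 + 1 at the nonreal root x = i
```

The same routine works unchanged for complex arguments, as the second call shows.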


15 Rational Functions


A rational function is the ratio of two polynomials:

$$r_{mn}(x) = \frac{p_m(x)}{q_n(x)} = \frac{\sum_{i=0}^m a_ix^i}{\sum_{i=0}^n b_ix^i}, \quad a_m, b_n \ne 0.$$

Rational functions are more versatile than polynomials
as a means of approximating other functions. As $x$
grows larger, a polynomial of degree 1 or higher necessarily
blows up to infinity. In contrast, a rational function
$r_{mm}$ with equal-degree numerator and denominator
is asymptotic to $a_m/b_m$ as $x \to \infty$, while for
$m < n$, $r_{mn}(x)$ converges to zero as $x \to \infty$. Moreover,
a rational function has poles: certain finite values of $x$
for which it is infinite (the roots of the denominator
polynomial $q_n$).

The representation of a rational function as a ratio 
of polynomials is just one of several possibilities. We 
can write r mn in partial fraction form, for example. If 
$m < n$ and $q_n$ has distinct roots $x_1, \dots, x_n$, then

$$r_{mn}(x) = \sum_{i=1}^n \frac{c_i}{x - x_i} \qquad (5)$$

for some $c_1, \dots, c_n$. One reason to put a rational function
in partial fraction form is in order to integrate it,
since the integral of (5) is immediate: $\int r_{mn}(x)\,dx =
\sum_{i=1}^n c_i\log|x - x_i| + C$, where $C$ is a constant.

An important class of rational functions is the Padé
approximants to a given function $f$, which are defined
by the property that $r_{mn}(x) - f(x) = O(x^k)$ with $k$
as large as possible. Since $r_{mn}$ has $m + n + 1$ degrees
of freedom (one having been lost due to the division),
generically $k = m + n + 1$, but $k$ can be smaller or
larger than this value (see approximation theory
[IV.9 §2.4]). When $n = 0$, so that the denominator is
constant, a Padé approximant reduces
to a truncated Taylor series.


16 Special Functions 

Applied mathematicians make much use of functions 
that are not polynomial or rational, though they may 
ultimately use polynomial or rational approximations 
to such functions. A larger class of functions is the
elementary functions, which are made up of polynomials,
rationals, the exponential, the logarithm, and
all functions that can be obtained from these by addition,
subtraction, multiplication, division, composition,
and the taking of roots. Another important class is
the transcendental functions: those that are not algebraic,
that is, that are not the solution $f(x)$ of an equation
$p(x, f(x)) = 0$, where $p(x, y)$ is a polynomial in
$x$ and $y$ with integer coefficients. Examples of transcendental
functions include the exponential, the logarithm,
the trigonometric functions, and the hyperbolic
functions.

In solving problems we talk about the ability to obtain 
the solution in closed form, which is an informal con- 
cept meaning that the solution is expressed in terms of 
elementary functions or functions that are “well under- 
stood,” in that they have a significant literature and 
good algorithms exist for computing them. 

The special functions [IV.7] provide a large set of
examples of well-understood functions. They arise in
different areas, such as physics, number theory, and
probability and statistics, often as the solution to a
second-order ODE or as the integral of an elementary
function. A general example is the hypergeometric
function [IV.7 §5]

$$F(a, b; c; x) = 1 + \frac{ab}{c}x + \frac{a(a + 1)b(b + 1)}{2!\,c(c + 1)}x^2 + \cdots = \sum_{i=0}^{\infty} \frac{(a)_i(b)_i}{(c)_i}\frac{x^i}{i!}.$$

Here, $a, b, c \in \mathbb{R}$, $c$ is not zero or a negative integer,
and $(a)_i = a(a + 1)\cdots(a + i - 1)$ for $i \ge 1$, with
$(a)_0 = 1$. The hypergeometric function is a solution
of the second-order differential equation

$$x(1 - x)w''(x) + (c - (a + b + 1)x)w'(x) - abw(x) = 0.$$

The hypergeometric functions contain many interesting
special cases, such as $F(a, b; b; x) = (1 - x)^{-a}$ and
$F(1, 1; 2; x) = -x^{-1}\log(1 - x)$.
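The two special cases can be checked numerically by summing the series. The sketch below is a plain partial sum, valid for $|x| < 1$; the function name `hyp2f1` and the number of terms are illustrative choices, and no attempt is made at the careful evaluation a library routine would provide.

```python
import math

def hyp2f1(a, b, c, x, terms=200):
    """Partial sum of the hypergeometric series, intended for |x| < 1."""
    total, term = 0.0, 1.0  # term holds (a)_i (b)_i / (c)_i * x**i / i!
    for i in range(terms):
        total += term
        term *= (a + i) * (b + i) / (c + i) * x / (i + 1)
    return total

x = 0.5
print(hyp2f1(1, 1, 2, x), -math.log(1 - x) / x)
print(hyp2f1(0.3, 1.7, 1.7, x), (1 - x) ** -0.3)
```

Each term is obtained from the previous one by a single multiplication, which avoids forming the rising factorials and the factorial explicitly.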

Other special functions include the following. 


• The error function

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt,$$

which is closely related to the standard normal
distribution in probability and statistics.

• The gamma function [III.13]

$$\Gamma(z) = \int_0^{\infty} e^{-t}t^{z-1}\,dt,$$

which satisfies $\Gamma(n) = (n - 1)!$ for positive integers
$n$ and so generalizes the factorial function.
Note that the argument $z$ is a complex number.
Figure 9 is modeled on a classic, hand-drawn plot
of the gamma function in the complex plane from
the book Tables of Functions with Formulas and
Curves by Eugene Jahnke and Fritz Emde, first
published in 1909.

• Bessel functions [III.2], the Lambert W function
[III.17], elliptic functions, and the Riemann
zeta function [IV.7 §4].

Figure 9 The gamma function in the complex plane. The
height of the surface is $|\Gamma(z)|$. The function has poles at
the negative integers; in this plot the infinite peaks have
been truncated at different heights.

The class of special functions can be enlarged by 
identifying useful functions, giving them a name, 
studying their properties, and deriving algorithms and 
software for evaluating them. Of the examples men- 
tioned above, the most recent is the Lambert W func- 
tion, whose significance was realized, and to which the 
name was given, only in the 1990s. 

17 Power Series

A power series is an infinite expansion of the form 

$$a_0 + a_1z + a_2z^2 + a_3z^3 + \cdots,$$

where $z$ is a complex variable and the $a_i$ are complex
constants. Results from complex analysis [IV.1 §5]
tell us that such a series has a radius of convergence
$R$ such that the series converges for $|z| < R$, diverges
for $|z| > R$, and may either converge or diverge for
$|z| = R$. For example, the power series $1 + z + z^2 + \cdots$
converges for $|z| < 1$, and inside this disk it agrees with
the function $f(z) = (1 - z)^{-1}$. More generally, a power-series
expansion can be taken about an arbitrary point
$z_0$: $a_0 + a_1(z - z_0) + a_2(z - z_0)^2 + a_3(z - z_0)^3 + \cdots$.

Some functions have power series with an infinite
radius of convergence, $R = \infty$. Perhaps the most
important example is the exponential:

$$e^z = 1 + z + \frac{z^2}{2!} + \frac{z^3}{3!} + \cdots.$$

Suppose a function $f$ has a power-series expansion
$f(z) = a_0 + a_1z + a_2z^2 + a_3z^3 + \cdots$. Then $f(0) = a_0$
and differentiating gives $f'(z) = a_1 + 2a_2z + 3a_3z^2 +
\cdots$ and, hence, on setting $z = 0$, $a_1 = f'(0)$. What
we have just done is to differentiate this infinite series
term by term, something that in general is of dubious
validity but in this case is justified because a power
series can always be differentiated term by term within
its radius of convergence. Continuing in this way we
find that all the $a_i$ are given by derivatives of $f$ evaluated at the
origin, $a_i = f^{(i)}(0)/i!$, and the expansion can be written as the Taylor
series expansion

$$f(z) = f(0) + f'(0)z + f''(0)\frac{z^2}{2!} + f'''(0)\frac{z^3}{3!} + \cdots.$$


18 Matrices and Vectors 

A matrix is an $m \times n$ (read as “$m$-by-$n$”) array of real
or complex numbers, written as

$$A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}.$$

The element at the intersection of row $i$ and column $j$
is $a_{ij}$. The matrix is square if $m = n$ and rectangular
otherwise. A vector is a matrix with one row or column:
an $m \times 1$ matrix is a column vector and a $1 \times n$ matrix is
a row vector. A number is often referred to as a scalar
in order to distinguish it from a vector or matrix.

The sets of $m \times n$ matrices and $n \times 1$ vectors over
$\mathbb{R}$ are denoted by $\mathbb{R}^{m \times n}$ and $\mathbb{R}^n$, respectively, and
similarly for $\mathbb{C}$.

A notation that is common, though not ubiquitous, 
in applied mathematics employs uppercase letters for 
matrices and lowercase letters for vectors or, when 
subscripted, matrix elements. Similarly, matrices or 
vectors are sometimes written in boldface. 

What distinguishes a matrix from a mere array of 
numbers is the algebraic operations defined on it. 
For two matrices $A$, $B$ of the same dimensions, addition
is defined element-wise: $C = A + B$ means that
$c_{ij} = a_{ij} + b_{ij}$ for all $i$ and $j$. Multiplication by a scalar
is defined in the natural way, so $C = \alpha A$ means that
$c_{ij} = \alpha a_{ij}$ for all $i$ and $j$. However, matrix multiplication
is not defined element-wise. If $A$ is $m \times r$ and $B$ is
$r \times n$ then the product $C = AB$ is $m \times n$ and is defined
by

$$c_{ij} = \sum_{k=1}^{r} a_{ik} b_{kj}.$$

This formula can be obtained as follows. Write $B =
[b_1, b_2, \dots, b_n]$, where $b_j$ is the $j$th column of $B$; this
is a partitioning of $B$ into its columns. Then $AB =
A[b_1, b_2, \dots, b_n] = [Ab_1, Ab_2, \dots, Ab_n]$, where each
$Ab_j$ is a matrix-vector product. Matrix-vector products
$Ax$ with $x$ an $r \times 1$ vector are in turn defined by

$$Ax = [a_1, a_2, \dots, a_r]
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_r \end{bmatrix}
= x_1 a_1 + x_2 a_2 + \cdots + x_r a_r,$$

so that $Ax$ is a linear combination of the columns of $A$.
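The column-oriented view of the product can be sketched in a few lines of code. The following is a minimal illustration, assuming matrices are stored as Python lists of rows; the names `matvec` and `matmul` are chosen for this sketch. The product A x is accumulated as a linear combination of the columns of A, and A B is then built one column at a time.

```python
def matvec(A, x):
    """Return A x as a linear combination of the columns of A."""
    m, r = len(A), len(x)
    y = [0.0] * m
    for k in range(r):            # y = x_1 a_1 + ... + x_r a_r
        for i in range(m):
            y[i] += x[k] * A[i][k]
    return y


def matmul(A, B):
    """Return C = A B, built one column at a time: column j of C is A b_j."""
    r, n = len(B), len(B[0])
    columns = [matvec(A, [B[k][j] for k in range(r)]) for j in range(n)]
    # Transpose the list of columns back into a list of rows.
    return [list(row) for row in zip(*columns)]


A = [[1, 2], [3, 4], [5, 6]]      # 3 x 2
B = [[1, 0, 2], [0, 1, 3]]        # 2 x 3
C = matmul(A, B)                  # 3 x 3
```

Each entry of C agrees with the element-wise formula, the sum over k of a_ik b_kj; the two descriptions differ only in the order in which the same multiplications are carried out.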

Matrix multiplication is not commutative: $AB \ne BA$
in general, as is easily checked for $2 \times 2$ matrices. In
some contexts the commutator (or Lie bracket) $[A, B] =
AB - BA$ plays a role.
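A small check of noncommutativity, using a pair of 2 x 2 matrices chosen for this sketch (they are not taken from the text), might look as follows. The helper names `matmul` and `commutator` are likewise illustrative.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def commutator(A, B):
    """Return the commutator AB - BA."""
    AB, BA = matmul(A, B), matmul(B, A)
    n = len(A)
    return [[AB[i][j] - BA[i][j] for j in range(n)] for i in range(n)]


A = [[0, 1], [0, 0]]
B = [[0, 0], [1, 0]]
AB = matmul(A, B)             # [[1, 0], [0, 0]]
BA = matmul(B, A)             # [[0, 0], [0, 1]]
bracket = commutator(A, B)    # [[1, 0], [0, -1]]
```

The commutator is the zero matrix exactly when the two matrices commute, which is not the case here.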

A linear system Ax = b expresses the vector b as 
a linear combination of the columns of A. When A is 
square and of dimension n, this system provides n lin- 
ear equations for the n components of x. The system 
has a unique solution when A is nonsingular, that is, 
when $A$ has an inverse. An inverse of a square matrix
$A$ is a matrix $A^{-1}$ such that $AA^{-1} = A^{-1}A = I$, where
$I$ is the identity matrix, which has ones on the diagonal
and zeros everywhere else. We can write $I = (\delta_{ij})$,
where $\delta_{ij}$ is the Kronecker delta defined in table 3. The
inverse is unique when it exists. If $A$ is nonsingular then
$x = A^{-1}b$ is the solution to $Ax = b$. While this formula
is useful mathematically, in practice one almost never 
solves a linear system by inverting A and then multiply- 
ing the right-hand side by the inverse. Instead, Gauss- 
ian elimination [IV.10 §2] with some form of pivoting 
is used. 
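To make the last remark concrete, here is a minimal sketch of Gaussian elimination with partial pivoting followed by back substitution, written for small dense systems stored as lists of rows. It is an illustration under those assumptions, not the algorithm as presented in the article cited above; the function name `solve` and the test system are chosen for this example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A is a square matrix given as a list of rows; neither A nor b is modified.
    """
    n = len(A)
    # Work on an augmented copy [A | b].
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: bring the largest entry in column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
b = [8, -11, -3]
x = solve(A, b)   # the exact solution is (2, 3, -1)
```

No inverse is ever formed: the system is reduced to triangular form and solved directly, which is both cheaper and usually more accurate than computing the inverse and multiplying.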

Transposition turns an $m \times n$ matrix into an $n \times m$ one
by interchanging the rows and columns: $C = A^T \iff
c_{ij} = a_{ji}$ for all $i$ and $j$. Conjugate transposition also
conjugates the elements: $C = A^* \iff c_{ij} = \overline{a}_{ji}$ for all
$i$ and $j$. The conjugate transpose of a product satisfies
a useful reverse-order law: $(AB)^* = B^*A^*$.

Matrices can have a variety of different structures 
that can be exploited both in theory and in computation.
A matrix $A \in \mathbb{R}^{n \times n}$ is upper triangular if $a_{ij} = 0$
for $i > j$, lower triangular if $A^T$ is upper triangular, and
diagonal if $a_{ij} = 0$ for $i \ne j$. For $n = 3$, such matrices
have the forms

$$\begin{bmatrix} \times & \times & \times \\ 0 & \times & \times \\ 0 & 0 & \times \end{bmatrix}, \quad
\begin{bmatrix} \times & 0 & 0 \\ \times & \times & 0 \\ \times & \times & \times \end{bmatrix}, \quad
\begin{bmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{bmatrix},$$

respectively, where $\times$ denotes a possibly nonzero entry;
the third matrix is abbreviated $\mathrm{diag}(d_1, d_2, d_3)$. The
matrix $A \in \mathbb{R}^{n \times n}$ is symmetric if $A^T = A$, while
$A \in \mathbb{C}^{n \times n}$ is Hermitian if $A^* = A$. If in addition
the quadratic form $x^TAx$ (or $x^*Ax$) is always positive
for nonzero vectors in $\mathbb{R}^n$ (or $\mathbb{C}^n$), then $A$ is
positive-definite. The term self-adjoint is sometimes
used instead of symmetric or Hermitian. Also fundamental
is the notion of orthogonality: $A \in \mathbb{R}^{n \times n}$ is
orthogonal if $A^TA = I$, and $A \in \mathbb{C}^{n \times n}$ is unitary if
$A^*A = I$. These properties mean that the inverse of
$A$ is its (conjugate) transpose, but deeper properties
of unitary matrices such as preservation of angles,
norms, etc., under multiplication are what make them
so important.

Structures can correspond to the pattern of the elements.
A Toeplitz matrix has constant diagonals, made
up from $2n - 1$ parameters $a_i$, $i = -(n - 1), \dots, n - 1$.
Thus a $5 \times 5$ Toeplitz matrix has the form

$$\begin{bmatrix}
a_0 & a_1 & a_2 & a_3 & a_4 \\
a_{-1} & a_0 & a_1 & a_2 & a_3 \\
a_{-2} & a_{-1} & a_0 & a_1 & a_2 \\
a_{-3} & a_{-2} & a_{-1} & a_0 & a_1 \\
a_{-4} & a_{-3} & a_{-2} & a_{-1} & a_0
\end{bmatrix}.$$


Toeplitz matrices arise in signal processing [IV.35]. 
A circulant matrix is a special type of Toeplitz matrix 
in which each row is a cyclic permutation (one ele- 
ment to the right) of the row above. Circulant matrices 
have many special properties, including that an explicit 
formula exists for their inverses and their eigenvalues. 
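The explicit eigenvalue formula is not spelled out here, so the following sketch relies on a standard fact rather than on anything stated in the text: for a circulant matrix with first row (c_0, ..., c_{n-1}), each nth root of unity w gives an eigenvector (1, w, w^2, ..., w^{n-1}) with eigenvalue c_0 + c_1 w + ... + c_{n-1} w^{n-1}. The function names and the sample row are illustrative.

```python
import cmath


def circulant(c):
    """Circulant matrix whose first row is c; each row shifts right by one."""
    n = len(c)
    return [[c[(j - i) % n] for j in range(n)] for i in range(n)]


def circulant_eigenpairs(c):
    """Eigenvalues and eigenvectors of circulant(c) from the roots of unity."""
    n = len(c)
    pairs = []
    for k in range(n):
        w = cmath.exp(2j * cmath.pi * k / n)      # an nth root of unity
        lam = sum(c[m] * w**m for m in range(n))  # eigenvalue
        v = [w**j for j in range(n)]              # eigenvector
        pairs.append((lam, v))
    return pairs


c = [4, 1, 0, 2]
C = circulant(c)
# Largest entry of C v - lambda v over all eigenpairs; should be near zero.
residual = max(
    abs(sum(C[i][j] * v[j] for j in range(len(c))) - lam * v[i])
    for lam, v in circulant_eigenpairs(c)
    for i in range(len(c))
)
```

Because the eigenvectors do not depend on the entries of the matrix, all circulant matrices of a given size share the same eigenvectors, which is one reason they are so convenient to work with.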

A Hamiltonian matrix is a $2n \times 2n$ matrix of the block
form

$$\begin{bmatrix} A & F \\ G & -A^* \end{bmatrix},$$

where $A$, $F$, and $G$ are $n \times n$ matrices and $F$ and $G$
are Hermitian. Hamiltonian matrices play an important 
role in control theory [III.25]. 

The determinant of an $n \times n$ matrix $A$ is a scalar that
can be defined inductively by

$$\det(A) = \sum_{j=1}^{n} (-1)^{i+j} a_{ij} \det(A_{ij})$$

for any $i \in \{1, 2, \dots, n\}$, where $A_{ij}$ denotes the
$(n - 1) \times (n - 1)$ matrix obtained from $A$ by deleting row $i$ and
column $j$, and $\det(a) = a$ for a scalar $a$. This formula
is called the expansion by minors because $\det(A_{ij})$ is
a minor of $A$. The determinant is sometimes written
with vertical bars, as $|A|$. Although determinants came
before matrices historically, determinants have only a 
minor role in applied mathematics. 

The quantity obtained by modifying the definition
of determinant to remove the $(-1)^{i+j}$ term is the permanent,
which is the sum of all possible products of
n elements of A in which exactly one is taken from 
each row and each column. The permanent arises in 
combinatorics and in quantum mechanics. 
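The inductive definitions of the determinant and the permanent translate directly into recursive code. The sketch below expands along the first row and is intended only to mirror the definitions; its cost grows like n!, so it is not a practical way to compute determinants of large matrices. The function names and the sample matrix are choices made for this illustration.

```python
def minor(A, i, j):
    """Matrix obtained from A by deleting row i and column j (0-based)."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]


def det(A):
    """Determinant by expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det(minor(A, 0, j))
               for j in range(len(A)))


def permanent(A):
    """Same expansion with the sign factor removed."""
    if len(A) == 1:
        return A[0][0]
    return sum(A[0][j] * permanent(minor(A, 0, j)) for j in range(len(A)))


A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
d = det(A)          # -3
p = permanent(A)    # 463
```

With 0-based indices the sign attached to the entry in row 0 and column j is (-1)**j, which has the same parity as the sign (-1)^(i+j) in the 1-based formula with i = 1.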

19 Vector Spaces and Norms 

A vector space is a mathematical structure in which a 
linear combination of elements can be taken, with the 
result remaining in the vector space. A vector space V 
has a binary operation, which we will write as addition, 
that is associative, is commutative, and has an identity 
(the “zero vector,” written 0) and additive inverses. In 
other words, for any $a, b, c \in V$ we have $(a + b) + c =
a + (b + c)$, $a + b = b + a$, $a + 0 = a$, and there is a
$d$ such that $a + d = 0$. There is also an underlying set
of scalars, $\mathbb{R}$ or $\mathbb{C}$, such that $V$ is closed under scalar
multiplication. Moreover, for all $x, y \in V$ and scalars
$\alpha$, $\beta$ we have $\alpha(x + y) = \alpha x + \alpha y$, $(\alpha + \beta)x = \alpha x + \beta x$,
and $\alpha(\beta x) = (\alpha\beta)x$.

A vector space can take many possible forms. For 
example, the set of real-valued functions on an interval 
[a, b] is a vector space over R, and the set of polyno- 
mials of degree less than or equal to n with complex 
coefficients is a vector space over C. Most importantly, 
the sets of n-vectors with real or complex coefficients 
are vector spaces over R and C, respectively. 

An important concept is that of linear independence.
Vectors $v_1, v_2, \dots, v_n$ in $V$ are linearly independent if
no nontrivial linear combination of them is zero, that
is, if the equation $\alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_n v_n = 0$ holds
only when the scalars $\alpha_i$ are all zero. If a collection of
vectors is not linearly independent then it is linearly 
dependent. 

Given vectors $v_1, v_2, \dots, v_n$ in $V$ we can form their
span, which is the set of all possible linear combina- 
tions of them. A linearly independent collection of vec- 
tors whose span is V is a basis for V, and any vector 
in V can be written uniquely as a linear combination of 
these vectors. 


The number of vectors in a basis for $V$ is the dimension
of $V$, written $\dim V$, and it can be finite or infinite.
The vector space of functions mentioned above is
infinite dimensional, while the vector space of polynomials
of degree at most $n$ has dimension $n + 1$, with
a basis being $1, x, x^2, \dots, x^n$ or any other sequence of
polynomials of degrees $0, 1, 2, \dots, n$.

A subspace of a vector space V is a subset of V that 
is itself a vector space under the same operations of 
addition and scalar multiplication. 

19.1 Inner Products 

Some vector spaces can be equipped with an inner product,
which is a function $\langle x, y \rangle$ of two arguments that
satisfies the conditions (i) $\langle x, x \rangle \ge 0$ and $\langle x, x \rangle = 0$
if and only if $x = 0$, (ii) $\langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle$,
(iii) $\langle \alpha x, y \rangle = \alpha \langle x, y \rangle$, and (iv) $\langle x, y \rangle = \overline{\langle y, x \rangle}$ for all
$x, y, z \in V$ and scalars $\alpha$. The usual (Euclidean) inner
product on $\mathbb{R}^n$ is $\langle x, y \rangle = x^Ty$; on $\mathbb{C}^n$ the conjugate
transpose must be used: $\langle x, y \rangle = x^*y$. For the vector
space $C[a, b]$ of real-valued continuous functions on
$[a, b]$ an inner product is

$$\langle f, g \rangle = \int_a^b w(x) f(x) g(x) \, dx, \tag{6}$$

where $w(x)$ is some given, positive weight function,
while for the vector space of $n$-vectors of the form
$[f(x_1), f(x_2), \dots, f(x_n)]^T$ for fixed points $x_i \in [a, b]$
and real-valued functions $f$ an inner product is

$$\langle f, g \rangle = \sum_{i=1}^{n} w_i f(x_i) g(x_i), \tag{7}$$

where the $w_i$ are positive weights. Note that (7) is not an
inner product on the space of real-valued continuous
functions because $\langle f, f \rangle = 0$ implies only that $f(x_i) =
0$ for all $i$ and not that $f = 0$.

The vector space $\mathbb{R}^n$ with the Euclidean inner product
is known as $n$-dimensional Euclidean space.

19.2 Orthogonality 

Two vectors $u$, $v$ in an inner product space are orthogonal
if $\langle u, v \rangle = 0$. For $\mathbb{R}^n$ and $\mathbb{C}^n$ this is just the usual
notion of orthogonality: $u^Tv = 0$ and $u^*v = 0$, respectively.
A set of vectors $\{u_i\}$ forms an orthonormal set
if $\langle u_i, u_j \rangle = \delta_{ij}$ for all $i$ and $j$.

For an inner product space with inner product (6)
or (7), useful examples of orthogonal functions are
orthogonal polynomials [II.29], which have the
important property that they satisfy a three-term recurrence
relation. For example, the Chebyshev polynomials
$T_k$ satisfy $T_0(x) = 1$, $T_1(x) = x$, and

$$T_{k+1}(x) = 2xT_k(x) - T_{k-1}(x), \quad k \ge 1, \tag{8}$$

and they are orthogonal on $[-1, 1]$ with respect to the
weight function $(1 - x^2)^{-1/2}$:

$$\int_{-1}^{1} \frac{T_i(x) T_j(x)}{(1 - x^2)^{1/2}} \, dx = 0, \quad i \ne j.$$


Another commonly occurring class of orthogonal 
polynomials is the Legendre polynomials $P_k$, which are
orthogonal with respect to $w(x) = 1$ on $[-1, 1]$ and
satisfy the recurrence

$$P_{k+1}(x) = \frac{2k + 1}{k + 1} x P_k(x) - \frac{k}{k + 1} P_{k-1}(x), \tag{9}$$

with $P_0(x) = 1$ and $P_1(x) = x$, when they are normalized
so that $P_k(1) = 1$.

Figure 10 plots some Chebyshev polynomials and
Legendre polynomials on $[-1, 1]$. Both sets of polynomials
are odd for odd degrees and even for even
degrees. The values of the Chebyshev polynomials
oscillate between $-1$ and $1$, which is explained by the
fact that $T_k(x) = \cos(k\theta)$, where $\theta = \cos^{-1} x$.
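A three-term recurrence gives a simple way to evaluate these polynomials. The sketch below evaluates the Chebyshev polynomials from T_0 = 1, T_1 = x, T_{k+1} = 2x T_k - T_{k-1} and the Legendre polynomials from P_0 = 1, P_1 = x, (k + 1) P_{k+1} = (2k + 1) x P_k - k P_{k-1}, and compares the Chebyshev values with cos(k arccos x). The function names and the evaluation point are chosen for this illustration.

```python
import math


def chebyshev(k, x):
    """Evaluate T_k(x) with the three-term recurrence."""
    t_prev, t = 1.0, x            # T_0(x), T_1(x)
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t


def legendre(k, x):
    """Evaluate P_k(x), normalized so that P_k(1) = 1."""
    p_prev, p = 1.0, x            # P_0(x), P_1(x)
    if k == 0:
        return p_prev
    for n in range(1, k):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p


x = 0.3
# Largest deviation of the recurrence from cos(k * arccos(x)) for k = 0..10.
cheb_error = max(abs(chebyshev(k, x) - math.cos(k * math.acos(x)))
                 for k in range(11))
```

For moderate degrees the two ways of evaluating T_k agree to roughly rounding error; the recurrence has the advantage that it produces all lower-degree values along the way.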

A beautiful theory surrounds orthogonal polynomials
and their relations to various other areas of
mathematics, including Padé approximation, spectral
theory, and matrix eigenvalue problems.

If $\phi_1, \phi_2, \dots$ is an orthogonal system, that is,
$\langle \phi_i, \phi_j \rangle = 0$ for $i \ne j$, then the $\phi_i$ are necessarily
linearly independent. Moreover, in an expansion

$$f(x) = \sum_{i=1}^{\infty} a_i \phi_i(x) \tag{10}$$

there is an explicit formula for the $a_i$. To determine it,
we take the inner product of this equation with $\phi_j$ and
use the orthogonality:

$$\langle f, \phi_j \rangle = \sum_{i=1}^{\infty} a_i \langle \phi_i, \phi_j \rangle = a_j \langle \phi_j, \phi_j \rangle,$$

so that $a_j = \langle f, \phi_j \rangle / \langle \phi_j, \phi_j \rangle$.

An important example of an orthogonal system of 
functions that are not polynomials is $1$, $\cos x$, $\sin x$,
$\cos(2x)$, $\sin(2x)$, $\cos(3x)$, ..., which are orthogonal
with respect to the weight function $w(x) = 1$ on
$[-\pi, \pi]$, and for this basis (10) is a Fourier series
expansion. 


19.3 Norms 

A common task is to approximate an element of a vec- 
tor space V by the closest element in a subspace S. To 
define “closest” we need a way to measure the size of 
a vector. A norm provides such a measure. 


A norm is a mapping $\|\cdot\|$ from $V$ to the nonnegative
real numbers such that $\|x\| = 0$ precisely when $x = 0$,
$\|\alpha x\| = |\alpha| \, \|x\|$ for all scalars $\alpha$ and $x \in V$, and the
triangle inequality $\|x + y\| \le \|x\| + \|y\|$ holds for all
$x, y \in V$. There are many possible norms, and on a
finite-dimensional vector space all are equivalent in the
sense that for any two norms $\|\cdot\|$ and $\|\cdot\|'$ there are
positive constants $c_1$ and $c_2$ such that $c_1 \|x\|' \le \|x\| \le
c_2 \|x\|'$ for all $x \in V$.

An example of a norm on $C[a, b]$ is

$$\|f\|_\infty = \max_{x \in [a, b]} |f(x)|, \tag{11}$$

known as the $L_\infty$-norm, the supremum norm, the maximum
norm, or the uniform norm. For $p \in [1, \infty)$,

$$\|f\|_p = \left( \int_a^b |f(x)|^p \, dx \right)^{1/p}$$

is the $L_p$-norm on the space $L_p[a, b]$ of functions
for which the (Lebesgue) integral is finite. Important
special cases are the $L_2$-norm and the $L_1$-norm.

In an inner product space the natural norm is $\|x\| =
\langle x, x \rangle^{1/2}$, and indeed the $L_2$-norm corresponds to the
inner product (6) with unit weight function. A very
useful inequality involving this norm is the Cauchy-Schwarz
inequality

$$|\langle x, y \rangle|^2 \le \langle x, x \rangle \langle y, y \rangle = \|x\|^2 \|y\|^2$$

for all $x, y \in V$. This inequality shows that we can
define the angle $\theta$ between two vectors $x$ and $y$ by
$\cos\theta = \langle x, y \rangle / (\|x\| \, \|y\|) \in [-1, 1]$. Thus orthogonality
corresponds to an angle $\theta = \pm\pi/2$.

Several different norms are commonly used on the
vector spaces $\mathbb{R}^n$ and $\mathbb{C}^n$. The vector $p$-norm is defined
for real $p$ by

$$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}, \quad 1 \le p < \infty.$$

It includes the important special cases

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|,$$

$$\|x\|_2 = \left( \sum_{i=1}^{n} |x_i|^2 \right)^{1/2} = (x^*x)^{1/2},$$

$$\|x\|_\infty = \max_{1 \le i \le n} |x_i|.$$
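These definitions are straightforward to evaluate. The following minimal sketch computes the p-norm of a vector stored as a Python list, treating p = infinity as the largest absolute entry; the function name `p_norm` and the sample vector are choices made for this illustration.

```python
def p_norm(x, p):
    """Vector p-norm for 1 <= p < infinity, or p = float('inf')."""
    if p == float("inf"):
        return max(abs(xi) for xi in x)
    if p < 1:
        raise ValueError("p must satisfy p >= 1")
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)


x = [3, -4]
one_norm = p_norm(x, 1)               # 7.0
two_norm = p_norm(x, 2)               # 5.0
inf_norm = p_norm(x, float("inf"))    # 4
```

For this vector the three values decrease as p increases, which is an instance of the general fact that the p-norm of a fixed vector is nonincreasing in p.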

The 2-norm is Euclidean length. The 1-norm is sometimes
called the “Manhattan” or “taxi cab” norm, as
when $x, y \in \mathbb{R}^2$ contain the coordinates of two locations
in Manhattan (which has a regular grid of streets),
$\|x - y\|_1$ measures the distance by taxi cab from $x$
to $y$. Figure 11 shows the boundaries of the unit balls
$\{x \in \mathbb{R}^n : \|x\| = 1\}$ for the latter three $p$-norms. The
very different shapes of the unit balls suggest that
the appropriate choice of norm will depend on the
problem, as is the case, for example, in data fitting
[IV.9 §3.2].

Figure 10 Selected (a) Chebyshev polynomials $T_k(x)$ and (b) Legendre polynomials $P_k(x)$ on $[-1, 1]$.

Figure 11 The boundary of the unit ball in $\mathbb{R}^2$ for the 1-, 2-, and $\infty$-norms.

Related to norms is the notion of a metric, defined 
on a set $M$ called a metric space. A metric on $M$ is a
nonnegative function $d$ such that $d(x, y) = d(y, x)$
(symmetry), $d(x, z) \le d(x, y) + d(y, z)$ (the triangle
inequality), and for all $x, y, z \in M$, $d(x, y) = 0$ precisely
when $x = y$. An example of a metric on the set
of positive real numbers is $d(x, y) = |\log(x/y)|$. For a
normed vector space, the function $d(x, y) = \|x - y\|$
is always a metric, so a normed vector space is always
a metric space. 


19.4 Convergence 

We say that a sequence of points $x_1, x_2, \dots$, each
belonging to a normed vector space $V$, converges to a
limit $x_* \in V$, written $\lim_{i \to \infty} x_i = x_*$ (or $x_i \to x_*$ as
$i \to \infty$), if for any $\epsilon > 0$ there exists a positive integer
$N$ such that $\|x_* - x_i\| < \epsilon$ for all $i \ge N$.

The sequence is a Cauchy sequence if for any $\epsilon > 0$
there exists a positive integer $N$ such that $\|x_i - x_j\| < \epsilon$
for all $i, j \ge N$. A convergent sequence is a Cauchy
sequence, but whether or not the converse is true 
depends on the space V. 

A normed vector space is complete if every Cauchy 
sequence in V has a limit in V. A complete normed vec- 
tor space is called a Banach space. In a Banach space we 
can therefore prove convergence of a sequence with- 
out knowing its limit by showing that it is a Cauchy 
sequence. 

A complete inner product space is called a Hilbert 
space. The spaces $\mathbb{R}^n$ and $\mathbb{C}^n$ with the Euclidean inner
product are standard examples of Hilbert spaces. 

20 Operators 

An operator is a mapping from one vector space, U, to 
another, V (possibly the same one). A linear operator 
(or linear transformation) A is an operator such that 
$A(\alpha_1 x_1 + \alpha_2 x_2) = \alpha_1 A x_1 + \alpha_2 A x_2$ for all scalars $\alpha_1$, $\alpha_2$
and vectors $x_1, x_2 \in U$. For example, the differentiation
operator is a linear operator that maps the vector space 
of polynomials of degree at most n to the vector space 
of polynomials of degree at most n - 1. 

A natural measure of the size of a linear operator A 
mapping $U$ to $V$ is the induced norm (also called the
operator norm or subordinate norm),

$$\|A\| = \max \left\{ \frac{\|Ax\|}{\|x\|} : x \in U, \ x \ne 0 \right\},$$

where on the right-hand side $\|\cdot\|$ denotes both a norm
on U (in the denominator) and a norm on V (in the 
numerator). For the rest of this section we assume that 
U = V for simplicity. If ||A|| is finite then A is said to 
be a bounded linear operator. On a finite-dimensional 
vector space all linear operators are bounded. 

The definition of an operator norm yields the inequalities
$\|Ax\| \le \|A\| \, \|x\|$ (immediate) and $\|AB\| \le
\|A\| \, \|B\|$ (using the previous inequality), both of which
are indispensable.

The operator A maps vectors in U to other vectors in 
$U$, and it may change the norm by as much as $\|A\|$. For
some vectors, called eigenvectors, it is only the norm,
and not the direction, that changes. A nonzero vector
$v$ is an eigenvector, with eigenvalue $\lambda$, if $Av =
\lambda v$. Eigenvalues and eigenvectors play an important
role in many areas of applied mathematics and appear 
in many places in this book. For example, spectral 
theory [IV.8] is about the eigenvalues and eigenvec- 
tors of linear operators on appropriate function spaces. 
The adjective spectral comes from spectrum, which is 
a set that contains the eigenvalues of an operator. 

On taking norms in the relation $Av = \lambda v$ we obtain
$|\lambda| \, \|v\| = \|Av\| \le \|A\| \, \|v\|$, and since $\|v\| \ne 0$ it
follows that $|\lambda| \le \|A\|$. Thus all the eigenvalues
of the operator $A$ lie in a disk of radius $\|A\|$ centered at
the origin. This is an example of a localization result.

An invariant subspace of an operator A that maps a 
vector space U to itself is a subspace X of U such that 
$AX$ is a subset of $X$, so that $x \in X$ implies $Ax \in X$.
An eigenvector is the special case of a one-dimensional 
invariant subspace. 

For $n \times n$ matrices, the eigenvalue equation $Av = \lambda v$
says that $A - \lambda I$ is a singular matrix, which is equivalent
to the condition $p(\lambda) = \det(\lambda I - A) = 0$. The
polynomial $p$ is the characteristic polynomial of $A$, and
since it has degree $n$ it follows from the fundamental
theorem of algebra (section 14) that it has $n$ roots
in the complex plane, which are the eigenvalues of $A$.
Whether there are $n$ linearly independent eigenvectors
associated with the eigenvalues depends on $A$ and can
be elegantly answered in terms of the Jordan canonical
form [II.22]. For real symmetric and complex Hermitian
matrices, the eigenvalues are all real and there
is a set of n linearly independent eigenvectors, which 
can be taken to be orthonormal. If A is in addition 
positive-definite, then the eigenvalues are all positive. 


For matrices on $\mathbb{C}^{m \times n}$ the operator matrix norms corresponding
to the 1, 2, and $\infty$ vector norms have explicit
formulas:

$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{m} |a_{ij}|, \quad \text{“max column sum,”}$$

$$\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^{n} |a_{ij}|, \quad \text{“max row sum,”}$$

$$\|A\|_2 = (\rho(A^*A))^{1/2}, \quad \text{spectral norm,}$$

where the spectral radius

$$\rho(B) = \max\{ |\lambda| : \lambda \text{ is an eigenvalue of } B \}.$$

Another useful formula is $\|A\|_2 = \sigma_{\max}(A)$, where
$\sigma_{\max}(A)$ is the largest singular value [II.32] of $A$.
A further matrix norm that is commonly used is the
Frobenius norm, given by

$$\|A\|_F = \left( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_{ij}|^2 \right)^{1/2} = (\mathrm{trace}(A^*A))^{1/2},$$

where the trace of a square matrix is the sum of its
diagonal elements. Note that $\|A\|_F$ is just the 2-norm
of the vector obtained by stringing the columns of $A$
out into one long vector. The Frobenius norm is not
induced by any vector norm, as can be seen by taking
$A$ as the identity matrix: every induced norm gives
$\|I\| = 1$, whereas $\|I\|_F = n^{1/2}$ for the $n \times n$ identity.
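The norms that have explicit formulas in terms of the entries are easy to compute directly. The sketch below evaluates the 1-norm (largest column sum of absolute values), the infinity-norm (largest row sum), and the Frobenius norm for a matrix stored as a list of rows, and then checks the remark about the identity matrix. The function names and sample matrices are illustrative; the 2-norm is omitted because it requires singular values or eigenvalues.

```python
import math


def norm_1(A):
    """Induced 1-norm: the largest column sum of absolute values."""
    return max(sum(abs(row[j]) for row in A) for j in range(len(A[0])))


def norm_inf(A):
    """Induced infinity-norm: the largest row sum of absolute values."""
    return max(sum(abs(a) for a in row) for row in A)


def norm_fro(A):
    """Frobenius norm: the 2-norm of all entries strung into one vector."""
    return math.sqrt(sum(abs(a) ** 2 for row in A for a in row))


A = [[1, -2], [3, 4]]
n1, ninf, nf = norm_1(A), norm_inf(A), norm_fro(A)   # 6, 7, sqrt(30)

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# An induced norm gives ||I|| = 1, but the Frobenius norm gives sqrt(3).
fro_identity = norm_fro(identity)
```

The induced norms of the 3 x 3 identity computed by `norm_1` and `norm_inf` are both 1, while its Frobenius norm is the square root of 3.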

21 Linear Algebra

Associated with a matrix $A \in \mathbb{C}^{m \times n}$ are four important
subspaces, two in $\mathbb{C}^m$ and two in $\mathbb{C}^n$: the ranges and
the null spaces of $A$ and $A^*$. The range of $A$ is the set
of all linear combinations of the columns: $\mathrm{range}(A) =
\{Ax : x \in \mathbb{C}^n\}$. The null space of $A$ is the set of vectors
annihilated by $A$: $\mathrm{null}(A) = \{x \in \mathbb{C}^n : Ax = 0\}$.

The two most important laws of linear algebra are 

$$\dim \mathrm{range}(A) = \dim \mathrm{range}(A^*),$$

$$\dim \mathrm{range}(A) + \dim \mathrm{null}(A) = n,$$

where dim denotes dimension. These equalities can
be proved in various ways, one of which is via the
SINGULAR VALUE DECOMPOSITION [II.32].

Suppose $x \in \mathrm{null}(A)$. Then $x$ is orthogonal to every
row of $A$ and hence is orthogonal to the subspace
spanned by the rows of $A$. Since the rows of $A$ are the
columns of $A^*$, it follows that $\mathrm{null}(A)$ is orthogonal to
$\mathrm{range}(A^*)$, where two subspaces are said to be orthogonal
if every vector in one of the subspaces is orthogonal
to every vector in the other. In fact, it can be shown
that $\mathrm{null}(A)$ and $\mathrm{range}(A^*)$ together span $\mathbb{C}^n$, and this
implies that $\dim \mathrm{range}(A^*) + \dim \mathrm{null}(A) = n$, which
can also be obtained by combining the two laws.

The rank of A is the maximum number of linearly 
independent rows or columns of A. The rank plays an 
important role in linear equation problems. For exam- 
ple, a linear system Ax = b has a solution if and only if 
A and the augmented matrix [A b] have the same rank. 

The Fredholm alternative says that the equation 
Ax = b has a solution if and only if b*v = 0 for every 
vector v satisfying A*v = 0. This is a special case of 
more general versions of the alternative, e.g., in integral
equations [IV.4 §3]. The “only if” part is easy,
since if $A^*v = 0$ and $Ax = b$ then $b^*v = (v^*b)^* =
(v^*Ax)^* = ((A^*v)^*x)^* = 0$. For the “if” part, suppose
$b^*v = 0$ for every vector $v$ such that $A^*v = 0$.
The latter equation says that $v \in \mathrm{null}(A^*)$, and from
what we have just seen this means that v is orthogonal 
to range(A). So every vector orthogonal to range(A) is 
orthogonal to b, which means that b is in range(A) and 
so Ax = b has a solution. 


22 Condition Numbers

A condition number of a problem measures the sensitivity
of the solution to perturbations in the data. For
some problems there is not a unique solution and the
problem can be regarded as infinitely sensitive; such
problems fall into the class of ill-posed problems
[I.5 §1.2]. Consider a function $f$ mapping a vector space
to itself such that $f(x)$ is defined in some neighborhood
of $x$. A (relative) condition number for $f$ at $x$ is
defined by

$$\mathrm{cond}(f, x) = \lim_{\epsilon \to 0} \sup_{\|\delta x\| \le \epsilon \|x\|} \frac{\|f(x + \delta x) - f(x)\|}{\epsilon \|f(x)\|}.$$

The condition number cond measures by how much,
at most, small changes in the data can be magnified in
the function value when both changes are measured in
a relative sense. This definition implies that

$$\frac{\|f(x + \delta x) - f(x)\|}{\|f(x)\|} \le \mathrm{cond}(f, x) \frac{\|\delta x\|}{\|x\|} + o(\|\delta x\|) \tag{12}$$

and so provides an approximate perturbation bound
for small perturbations $\delta x$. In practice, $\delta x$ in the latter
bound might represent inherent errors in the data from
a physical experiment or rounding errors when the data
is stored on a computer.

A problem is said to be ill-conditioned if its condition
number is large and well-conditioned if its condition
number is small, where the meaning of “large” and
“small” depends on the context.

For many problems, explicit expressions can be
obtained for the condition number. For a continuously
differentiable function $f \colon \mathbb{R} \to \mathbb{R}$, $\mathrm{cond}(f, x) =
|x f'(x) / f(x)|$. For the problem of matrix inversion,
$f(A) = A^{-1}$, the condition number turns out to be
$\kappa(A) = \|A\| \, \|A^{-1}\|$ for any matrix norm; this is known
as the condition number of $A$ with respect to inversion.
For the linear system $Ax = b$, with data the
matrix $A$ and vector $b$, the condition number is also
essentially $\kappa(A)$.

One role of the condition number is to provide a link
between the residual of an approximate solution of an
equation and the error of that approximation. This is
most easily seen for a nonsingular linear system $Ax =
b$. For any approximate solution $\hat{x}$ the residual satisfies
$r = b - A\hat{x} = A(x - \hat{x})$, so the error is related to the
residual by $x - \hat{x} = A^{-1}r$, which leads to the upper
bound $\|x - \hat{x}\| \le \kappa(A) \|r\| / \|A\|$.

23 Stability


The term “stability” is widely used in applied math- 
ematics, with different meanings that depend on the 
context. A general meaning is that errors introduced in 
the initial stages of a process do not grow (or at least 
are bounded) as the process evolves. Here “process” 
could mean an iteration, a recurrence, or the evolution 
of a time-dependent differential equation. Stability is 
usually a necessary attribute and so a lot of effort is 
put into analyzing whether processes are stable or not. 
Discussions of stability can be found throughout this 
book. 

Here we focus on numerical stability, in the context
of evaluating a function $y = f(x)$ in floating-point
arithmetic by some given algorithm, where $x$ and $y$ are
scalars. If $\hat{y}$ is an approximation to $y$ then one measure
of its quality is the forward error $y - \hat{y}$, which is often
called, simply, “the error.” The forward error is usually
unknown and may be difficult to estimate. As an alternative
we can ask whether we can perturb the data $x$ so
that $\hat{y}$ is the exact solution to the perturbed problem;
that is, can we find a $\delta x$ such that $\hat{y} = f(x + \delta x)$? In
general, there may be many such $\delta x$; the smallest possible
value of $|\delta x|$ is called the backward error. If the
backward error is sufficiently small relative to the pre- 
cision of the underlying arithmetic, then the algorithm 
is said to be backward stable. 

It can be much easier to analyze the backward error 
than the forward error. Backward error analysis originates
in NUMERICAL LINEAR ALGEBRA [IV.10 §8], where
the underlying errors are rounding errors, but it has
been used in various other contexts, including in the 
numerical solution of ordinary differential equations. 
Once the backward error is known, the forward error 
can be bounded by using the inequality (12), provided 
that an estimate of the relevant condition number is 
available. 
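The relation between backward error, condition number, and forward error can be seen in a small scalar example. The sketch below is an illustration under stated assumptions: it uses f(x) = sqrt(x), for which the relative condition number |x f'(x) / f(x)| equals 1/2, and a deliberately crude approximation 1.4142 to the square root of 2. The variable names and the choice of function are not from the text.

```python
import math


def cond_sqrt(x):
    """Relative condition number |x f'(x) / f(x)| for f(x) = sqrt(x)."""
    derivative = 0.5 / math.sqrt(x)
    return abs(x * derivative / math.sqrt(x))   # equals 1/2 for every x > 0


x = 2.0
y = math.sqrt(x)
y_hat = 1.4142                      # a deliberately crude approximation to y

# y_hat is the exact square root of x + dx, so dx is the backward error.
dx = y_hat**2 - x
rel_backward = abs(dx) / abs(x)
rel_forward = abs(y - y_hat) / abs(y)

# First-order bound: forward error <~ condition number * backward error.
bound = cond_sqrt(x) * rel_backward
```

Here the relative forward error and the bound agree to first order in the perturbation; the small discrepancy between them corresponds to the higher-order term in the perturbation bound.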


24 Vector Calculus 

While n-dimensional vector spaces, with n possibly 
infinite, are the appropriate setting for much applied 
mathematics, the world we live in is three dimensional 
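and so three coordinates are enough in many situations,
such as in mechanics. Let $\mathbf{i}$, $\mathbf{j}$, and $\mathbf{k}$ denote unit
vectors along the $x$-, $y$-, and $z$-axes, respectively. As
this notation suggests, we will use boldface to denote
vectors in this subsection. A vector $\mathbf{x}$ in $\mathbb{R}^3$ can then
be expressed as $\mathbf{x} = x_1\mathbf{i} + x_2\mathbf{j} + x_3\mathbf{k}$. The scalar product
or dot product of two vectors $\mathbf{x}$ and $\mathbf{y}$ is $\mathbf{x} \cdot \mathbf{y} =
x_1y_1 + x_2y_2 + x_3y_3$, which is a special case of the Euclidean
inner product of vectors in $\mathbb{R}^n$. The cross product
or vector product does not have an $n$-dimensional
analogue; it is the vector

$$\mathbf{x} \times \mathbf{y} =
\begin{vmatrix} \mathbf{i} & \mathbf{j} & \mathbf{k} \\ x_1 & x_2 & x_3 \\ y_1 & y_2 & y_3 \end{vmatrix}
= (x_2y_3 - x_3y_2)\mathbf{i} + (x_3y_1 - x_1y_3)\mathbf{j} + (x_1y_2 - x_2y_1)\mathbf{k},$$

which is orthogonal to the plane in which $\mathbf{x}$ and $\mathbf{y}$ lie.
Note that $\mathbf{x} \times \mathbf{y} = -\mathbf{y} \times \mathbf{x}$. The vector triple product of
three vectors $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ is the vector $\mathbf{x} \times (\mathbf{y} \times \mathbf{z})$, which
can be expressed as

$$\mathbf{x} \times (\mathbf{y} \times \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})\mathbf{y} - (\mathbf{x} \cdot \mathbf{y})\mathbf{z}.$$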
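The dot product, the cross product, and the vector triple product identity can be checked numerically. The sketch below stores 3-vectors as Python lists and uses integer entries so that the comparison is exact; the function names and the sample vectors are chosen for this illustration.

```python
def dot(x, y):
    """Scalar (dot) product of two 3-vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def cross(x, y):
    """Vector (cross) product of two 3-vectors."""
    return [x[1] * y[2] - x[2] * y[1],
            x[2] * y[0] - x[0] * y[2],
            x[0] * y[1] - x[1] * y[0]]


x, y, z = [1, 2, 3], [4, 5, 6], [7, 8, 10]
c = cross(x, y)                       # [-3, 6, -3]

# Vector triple product computed directly and via (x . z) y - (x . y) z.
lhs = cross(x, cross(y, z))
rhs = [dot(x, z) * yi - dot(x, y) * zi for yi, zi in zip(y, z)]
```

The cross product c has zero dot product with both x and y, consistent with its being orthogonal to the plane containing them, and the two ways of computing the triple product give the same vector.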

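If $f$ is a scalar function of three variables, then its
gradient is

$$\nabla f = \frac{\partial f}{\partial x}\mathbf{i} + \frac{\partial f}{\partial y}\mathbf{j} + \frac{\partial f}{\partial z}\mathbf{k}.$$

We can think of

$$\nabla = \frac{\partial}{\partial x}\mathbf{i} + \frac{\partial}{\partial y}\mathbf{j} + \frac{\partial}{\partial z}\mathbf{k}$$

as an operator: the gradient operator. There is nothing
to stop us forming the dot product of the two vectors
$\nabla$ and $\nabla f$:

$$\nabla \cdot \nabla f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}.$$

This “del squared” operator is called the Laplacian, $\Delta =
\nabla^2 \equiv \nabla \cdot \nabla$.

Now let $\mathbf{F} = F_1\mathbf{i} + F_2\mathbf{j} + F_3\mathbf{k}$ be a vector function mapping
$\mathbb{R}^3$ to $\mathbb{R}^3$. The divergence of $\mathbf{F}$ is the dot product
of $\nabla$ and $\mathbf{F}$:

$$\mathrm{div}\,\mathbf{F} = \nabla \cdot \mathbf{F} = \frac{\partial F_1}{\partial x} + \frac{\partial F_2}{\partial y} + \frac{\partial F_3}{\partial z}.$$

Another operator on vector functions that is commonly
encountered is the curl: $\mathrm{curl}\,\mathbf{F} = \nabla \times \mathbf{F}$.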

The divergence theorem says that, if $V$ is a region
enclosed by a smooth surface $S$ oriented by an
outward-pointing unit normal $\mathbf{n}$ and $\mathbf{F}$ is a continuously
differentiable vector field over $V$, then

$$\iiint_V \mathrm{div}\,\mathbf{F} \, dV = \iint_S \mathbf{F} \cdot \mathbf{n} \, dS.$$

In other words, the triple integral of the divergence of
$\mathbf{F}$ over $V$ is equal to the surface integral of the normal
component, $\mathbf{F} \cdot \mathbf{n}$. Many equations of physical interest
can be derived using the divergence theorem. 

Another important theorem is Stokes’s theorem. It 
says that, for an oriented smooth surface $S$ with
outward-pointing unit normal $\mathbf{n}$, bounded by a smooth
simple closed curve $C$, if $\mathbf{F}$ is a continuously differentiable
vector field over $S$, then

$$\iint_S (\nabla \times \mathbf{F}) \cdot \mathbf{n} \, dS = \oint_C \mathbf{F} \cdot \mathbf{t} \, ds,$$

where $\mathbf{t} = \mathbf{t}(x, y, z)$ is a unit vector tangential to the
curve C. Stokes’s theorem says that the integral of the 
normal component of the curl of F over a surface S is 
equal to the integral of the tangential component of F 
along the boundary C of the surface. 


I.3 Methods of Solution

Nicholas J. Higham 


Problems in applied mathematics come in many shapes 
and forms, and a wide variety of methods and tech- 
niques are used to solve them. In this article we outline 
some key ideas that underlie many different solution 
approaches. 


1 Specifying the Problem 

Before we can set about choosing a method to solve 
a problem we need to be clear about our assump- 
tions. For example, if our problem is defined by a func- 
tion (which could be the right-hand side of a differen- 
tial equation), what can we assume about the smooth- 
ness of the function, that is, the number of continuous 
derivatives? If our problem is to find the eigenvalues 
of an $n \times n$ matrix $A$, are the elements of $A$ explicitly
stored and accessible or is $A$ given only in the form of
a “black box” that takes a vector $x$ as input and returns
the product Ax? 

The problem of finding a minimum or maximum of 
a scalar function f of n variables provides a good 
example of a wide range of possible scenarios. At one 
extreme, $f$ has derivatives of all orders and we can compute
$f$ and its first and second derivatives at any point
(most methods do not use derivatives of higher than
second order). At another extreme, $f$ may be discontinuous
and only function values may be available. It
may even be that we are not able to evaluate $f$ but only
to test whether, for a given $x$ and $y$, $f(x) < f(y)$
or vice versa. This is precisely the scenario for an 
optometrist formulating a prescription for a patient. 
The optometrist asks the patient to compare pairs of 
lenses and say which one gives the better vision. By 
suitably choosing the lenses the optometrist is able 
to home in on a prescription within a few minutes. 
In numerical optimization, derivative-free methods 
[IV.11 §4.3] use only function values and many of them
are based solely on comparisons of these values. 
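To illustrate how far comparisons alone can go, here is a minimal sketch of a golden-section search that locates the minimizer of a unimodal function of one variable using nothing but answers to the question "is f(x) < f(y)?". It is offered as one simple comparison-based method, not as a description of the methods in the article cited above; the names `comparison_search`, `is_better`, and `hidden_f` are hypothetical.

```python
import math


def comparison_search(is_better, lo, hi, tol=1e-6):
    """Locate the minimizer of a unimodal function on [lo, hi].

    The function itself is never evaluated here: is_better(x, y) only
    reports whether f(x) < f(y), like a patient comparing two lenses.
    """
    ratio = (math.sqrt(5) - 1) / 2          # golden-section ratio, about 0.618
    a, b = lo, hi
    c = b - ratio * (b - a)
    d = a + ratio * (b - a)
    while b - a > tol:
        if is_better(c, d):                 # minimizer lies in [a, d]
            b, d = d, c
            c = b - ratio * (b - a)
        else:                               # minimizer lies in [c, b]
            a, c = c, d
            d = a + ratio * (b - a)
    return (a + b) / 2


def hidden_f(x):
    """Stand-in for a function we can compare but not measure."""
    return (x - 1.25) ** 2


best = comparison_search(lambda x, y: hidden_f(x) < hidden_f(y), -4.0, 4.0)
```

Each iteration shrinks the interval by the same factor and needs only one new comparison, which is why a few dozen comparisons suffice to pin down the minimizer to the requested tolerance. The approach assumes the function is unimodal on the interval; without that assumption the search may settle on a local minimizer.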

Another fundamental question is what is meant by a 
solution. If the solution is a function, would we accept 
its representation as an infinite series in some basis 
functions, or as an integral, or would we accept values 
of the function on a finite grid of points? If an inexact 
representation is allowed, how accurate must it be and 
what measure of error is appropriate? 

2 Dimension Reduction 

A common theme in many contexts is that of approx- 
imating a problem by one of smaller dimension and 
using the solution of the smaller problem to approx- 
imate the solution of the original problem. The moti- 
vation is that the large problem may be too expen- 
sive to solve, but of course this approach is viable 
only if the smaller problem can be constructed at low 
cost. In some situations the smaller problem is solved 
repeatedly, perhaps as some parameter varies, thereby 
amortizing the cost of producing it. 

A ubiquitous example of this general approach con- 
cerns images displayed on Web pages. Modern dig- 
ital cameras (even smartphones) produce images of 
5 megapixels (million pixels) or more. Yet even a 27- 
inch monitor with a resolution of 2560 x 1440 pixels 
displays only about 3.7 megapixels. Since most images 
on Web pages are displayed at a small size within a 
page, it would be a great waste of storage and bandwidth
to deal with them at their original size. They
are therefore interpolated down to smaller dimensions
appropriate for the intended usage (e.g., with longest 
side 400 pixels for an image on a news site). Here, 
dimension reduction is relatively straightforward and 
error is not an issue. 

Often, though, an image is of intrinsic interest and 
we wish to keep it at its original dimensions and reduce 
the required storage, with minimal loss of quality. This 
is the more typical scenario for dimension reduction. 
The reason that dimension reduction is possible is that 
many images contain a high degree of redundancy. 
The singular value decomposition [II.32] (SVD) provides
a way of capturing the important information in
an image in a small number of vectors, at least for 
some images. A generally more effective reduction is 
produced by JPEG compression [VII.7 §5], which uses
two successive changes of basis in order to identify 
information that can be discarded. 

A dynamical system may have many parameters 
but the behavior of interest may take place in a low- 
dimensional subspace. In this case we can try to iden- 
tify that subspace and work within it, gaining a reduc- 
tion in computation and storage. The general term for 
reducing dimension in a dynamical system is model 
reduction [II.26]. Model reduction has been an area of
intensive research in the last thirty years, with applica- 
tions ranging from the design of very large scale inte- 
gration circuits to data assimilation in modeling the 
atmosphere. 

Dimension reduction is fundamental to data analysis
[IV.17 §4], where large data sets are transformed
via a change of basis into lower-dimensional spaces 
that capture the behavior of the original data. Clas- 
sic techniques are principal component analysis and 
application of the SVD. In the context of linear matrix 
equations such as the Lyapunov equation [III.28],
an approximation to a dominant invariant subspace 
of the solution (that is, an invariant subspace corre- 
sponding to the k eigenvalues of largest magnitude, for 
some k) can be as useful as an approximation to the 
whole solution, and such an approximation can often 
be computed at much lower cost. 

A term often used in the context of dimension reduc- 
tion is curse of dimensionality, which refers to the fact 
that many problems become much harder in higher 
dimensions and, more informally, that intuition gained 
from two and three dimensions does not necessarily 
translate to higher dimensions. A simple illustration is 
given by an $n$-sphere, or hypersphere, of radius $r$ in
$\mathbb{R}^n$, which comprises all $n$-vectors of 2-norm (Euclidean
norm) $r$. A hypersphere has volume

$$S_n = \frac{\pi^{n/2} r^n}{\Gamma(n/2 + 1)},$$

where $\Gamma$ is the gamma function [III.13]. Since
$\Gamma(x) \sim \sqrt{2\pi/x}\,(x/e)^x$ (Stirling’s approximation [IV.7 §3]),
for any fixed $r$ we find that $S_n$ tends to 0 as $n$ tends
to $\infty$, that is, the volume of the hypersphere tends to
zero, which is perhaps surprising. For a means of com- 
parison, consider the n-cube, or hypercube, with sides 
of length 2r. It has volume 

$$H_n = (2r)^n$$

and therefore $S_n/H_n \to 0$ as $n \to \infty$. In other words,
most of the volume of a hypercube lies away from
the enclosed hypersphere, and hence “in the corners.”
For $n = 2$, the ratio $S_2/H_2 = 0.785$ (see figure 11 in
the language of applied mathematics [I.2 §19.3]),
which is already substantially less than 1. The sequence
$S_n/H_n$ continues 0.524, 0.308, 0.164, 0.081, .... This
behavior is not too surprising when one realizes that 
any corner of the unit hypercube centered on the origin
has coordinates $[\pm 1, \pm 1, \dots, \pm 1]^T$, and so is at distance
$\sqrt{n}$ from the origin, whereas any point on the
unit hypersphere centered on the origin is at distance
1 from the origin, so the latter distance divided by the
former tends to 0 as $n \to \infty$. The term curse of dimensionality
was introduced by Richard Bellman in 1961,
with reference to the fact that sampling a function of 
n variables on a grid with a fixed spacing requires a 
number of points that grows exponentially with n. 
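
The ratio of hypersphere to hypercube volume quoted above can be reproduced in a few lines. The following is a minimal sketch in Python; the function name `sphere_to_cube_ratio` is an illustrative choice, and the radius cancels so it does not appear.

```python
import math

def sphere_to_cube_ratio(n: int) -> float:
    """Ratio S_n / H_n of the volume of the n-dimensional hypersphere of
    radius r to that of the enclosing hypercube of side 2r (r cancels)."""
    return math.pi ** (n / 2) / (math.gamma(n / 2 + 1) * 2 ** n)

# Ratios for n = 2, ..., 6, rounded to three decimal places.
ratios = [round(sphere_to_cube_ratio(n), 3) for n in range(2, 7)]
print(ratios)
```

Running the same function for larger n (say n = 20) gives a ratio of order 10^-8, which illustrates how quickly the volume concentrates in the corners.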

3 Approximation of Functions 

We consider the problem of approximating a scalar 
function f, which may be given either as an explicit
formula or implicitly, for example as the solution to an 
algebraic or differential equation. How the problem is 
solved depends on what is known about the function 
and what is required of the solution. We summarize 
some of the questions that must be answered before 
choosing a method. 

• What form do we want the approximation to take: 
power series, polynomial, rational, Fourier series, ...?

• Do we want an approximation that has a desired 
accuracy within a certain region? If so, what mea- 
sure of error should be used? 

• Do we want an approximation that has certain 
qualitative features, such as convexity, monotonic- 
ity, or nonnegativity? 


In this section we discuss a few examples of different 
types of approximation, touching on all the questions 
in this list. In the next three subsections f is assumed
to be real (its argument being written x), whereas in the 
fourth subsection it can be complex (so its argument is 
written z). We consider first approximations based on 
polynomials. 

3.1 Polynomials 

Perhaps the simplest class of approximating functions 
is the polynomials, $p_n(x) = a_0 + a_1 x + \cdots + a_n x^n$.
Polynomials are easy to add, multiply, differentiate, 
and integrate, and their roots can be found by stan- 
dard algorithms. Justification for the use of polyno- 
mials comes from Weierstrass’s theorem of 1885, which 
states that for any $f \in C[a, b]$ and any $\epsilon > 0$ there is
a polynomial $p_n(x)$ such that $\|f - p_n\|_\infty < \epsilon$, where
the norm is the $L_\infty$-norm [I.2 §19.3] given by
$\|f\|_\infty = \max_{x \in [a,b]} |f(x)|$. Weierstrass’s theorem assures us
that any desired degree of accuracy in the maximum 
norm can be obtained by polynomials, though it does 
not bound the degree n, which may have to be high. 
Here are some of the ways in which polynomial approx- 
imations are constructed. 

Truncated Taylor series. If $f$ is sufficiently smooth
that it has a Taylor series expansion and its derivatives
can be evaluated, then a polynomial approximation
can be obtained simply by truncating the Taylor
series. The Taylor series with remainder tells us that
we can write $f(x) = p_n(x) + E_n(x)$, where
$p_n(x) = f(0) + f'(0)x + \cdots + f^{(n)}(0)x^n/n!$ is a degree-$n$
polynomial and the remainder term has the form
$E_n(x) = f^{(n+1)}(\xi)x^{n+1}/(n+1)!$ for some $\xi$ on the interval with
endpoints 0 and $x$. The value of $n$ and the range
of $x$ for which the approximation $f(x) \approx p_n(x)$ is
applied will depend on $f$ and the desired accuracy. Figure 1
shows the degree-1, degree-3, and degree-5 Taylor
approximants to the sine function. 
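
As a small illustration of truncation, the sketch below evaluates the degree-1, degree-3, and degree-5 Taylor approximants to the sine function at a sample point and compares the errors with the remainder bound $|x|^{n+1}/(n+1)!$ (valid here because every derivative of sine is bounded by 1). The evaluation point x = 1 is an arbitrary choice made for the example.

```python
import math

def taylor_sin(x: float, degree: int) -> float:
    """Taylor polynomial of sin about 0, truncated at the given degree."""
    total = 0.0
    for k in range(1, degree + 1, 2):          # only odd powers appear
        total += (-1) ** ((k - 1) // 2) * x ** k / math.factorial(k)
    return total

x = 1.0
errors = {n: abs(math.sin(x) - taylor_sin(x, n)) for n in (1, 3, 5)}
bounds = {n: abs(x) ** (n + 1) / math.factorial(n + 1) for n in (1, 3, 5)}
for n in (1, 3, 5):
    print(n, errors[n], bounds[n])
```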

Figure 1 $\sin x$ and its Taylor approximants $p_1(x) = x$,
$p_3(x) = x - x^3/3!$, and $p_5(x) = x - x^3/3! + x^5/5!$.

Interpolation. We may require $p_n(x)$ to agree with
$f(x)$ at certain specified points $x_i \in [a, b]$. Since $p_n$
contains $n + 1$ coefficients and each condition $p_n(x_i) = f(x_i)$
provides one equation, we need $n + 1$ points in
order to specify $p_n$. It can be shown that the $n + 1$ interpolation
equations in $n + 1$ unknowns have a unique
solution provided that the interpolation points $\{x_i\}_{i=0}^n$
are distinct, in which case there is a unique interpolating
polynomial. There is a variety of ways of representing
$p_n$ (e.g., Lagrange form, barycentric form, and
divided difference form). An explicit formula is available
for the error: if $f$ has $n + 1$ continuous derivatives
on $[a, b]$ then for any $x \in [a, b]$

$$f(x) - p_n(x) = \frac{f^{(n+1)}(\xi_x)}{(n+1)!} \prod_{i=0}^{n} (x - x_i),$$

where $\xi_x$ is some unknown point in the interval determined
by $x_0, x_1, \dots, x_n$, and $x$. This error formula can
be used to obtain insight into how to choose the $x_i$. It
turns out that equally spaced points are poor, whereas
points derived by rescaling to $[a, b]$ the zeros or
extrema of the Chebyshev polynomial [IV.9 §2.2] of
degree $n + 1$ or $n$, respectively, are good.
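
The contrast between equally spaced and Chebyshev points can be seen numerically. The sketch below is only an illustration: it evaluates the interpolant in Lagrange form (simple but not the most efficient choice), and the test function $1/(1 + 25x^2)$ on $[-1, 1]$ is an assumption chosen for the example rather than one taken from the text.

```python
import math

def lagrange_eval(nodes, values, x):
    """Evaluate the interpolating polynomial in Lagrange form at x."""
    total = 0.0
    for i, xi in enumerate(nodes):
        term = values[i]
        for j, xj in enumerate(nodes):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def max_interp_error(f, nodes, samples=1001):
    """Largest |f(x) - p_n(x)| over an equally spaced sample of [-1, 1]."""
    values = [f(t) for t in nodes]
    grid = [-1.0 + 2.0 * s / (samples - 1) for s in range(samples)]
    return max(abs(f(x) - lagrange_eval(nodes, values, x)) for x in grid)

def runge(x):                       # assumed test function, not from the text
    return 1.0 / (1.0 + 25.0 * x * x)

n = 10
equispaced = [-1.0 + 2.0 * j / n for j in range(n + 1)]
chebyshev = [math.cos(j * math.pi / n) for j in range(n + 1)]  # extrema of T_n

err_equi = max_interp_error(runge, equispaced)
err_cheb = max_interp_error(runge, chebyshev)
print(err_equi, err_cheb)
```

For this function the equally spaced interpolant has a maximum error larger than 1 near the ends of the interval, while the Chebyshev interpolant of the same degree is far more accurate.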

Least-squares approximation. In least-squares approximation
we fix the degree $n$ and then choose the
polynomial $p_n$ to minimize the $L_2$-norm

$$\left( \int_a^b |f(x) - p_n(x)|^2 \, dx \right)^{1/2},$$

where $[a, b]$ is the interval of interest. It turns out
that there is a unique $p_n$ minimizing the error, and
its coefficients satisfy a linear system of equations
called the normal equations. The normal equations
tend to be ill-conditioned when $p_n$ is represented in
the monomial basis, $\{1, x, x^2, \dots\}$, so in this context
it is usual to write $p_n = \sum_{i=0}^{n} a_i \phi_i(x)$, where the
$\phi_i$ are orthogonal polynomials [II.29] on $[a, b]$. In
this case the normal equations are diagonal and there
is an explicit expression for the optimal coefficients:
$a_i = \int_a^b \phi_i(x) f(x) \, dx \big/ \int_a^b \phi_i(x)^2 \, dx$.

$L_\infty$ approximation. Instead of using the $L_2$-norm we
can use the $L_\infty$-norm and so minimize $\|f - p_n\|_\infty$. A
best $L_\infty$ approximation always exists and is unique,
and there is a beautiful theory that characterizes the
solution in terms of equioscillation, whereby the error
achieves its maximum magnitude at a certain number
of points with alternating sign. An algorithm called
the Remez algorithm is available for computing the
best $L_\infty$ approximation. One use of it is in evaluating
elementary functions [VI.11].

Figure 2 Error in polynomial approximations to $e^x$ on
$[-1, 1]$: solid line, $L_\infty$ approximation; dashed line, Chebyshev
interpolant; dotted line, least squares ($L_2$ approximation).

Figure 2 plots the absolute error $|f(x) - p_n(x)|$ in three
degree-10 polynomial approximations to $e^x$ on $[-1, 1]$:
the least-squares approximation; the $L_\infty$ approximation;
and a polynomial interpolant based on the Chebyshev
points, $\cos(j\pi/n)$, $j = 0 : n$. Note that the $L_\infty$
approximation has equioscillating error with maximum
error strictly less than that for the other two approximations,
and that the error of the Chebyshev interpolant
is zero at the eleven points where it interpolates,
which include the endpoints. It is also clear that the
Chebyshev approximation is not much worse than the
$L_\infty$ one, something that is true in general.
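
The explicit formula for least-squares coefficients in an orthogonal basis can be tried out directly. The following sketch assumes the interval $[-1, 1]$, takes the Legendre polynomials as the orthogonal basis, and replaces the integrals by a composite Simpson rule; all three choices are assumptions made for illustration, as are the function names.

```python
import math

def legendre(k, x):
    """Legendre polynomial P_k(x) via the three-term recurrence."""
    p_prev, p = 1.0, x
    if k == 0:
        return p_prev
    for j in range(1, k):
        p_prev, p = p, ((2 * j + 1) * x * p - j * p_prev) / (j + 1)
    return p

def simpson(g, a, b, m=2000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    total = g(a) + g(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3

def least_squares_coeffs(f, n):
    """Coefficients a_i = int(phi_i * f) / int(phi_i^2) on [-1, 1]."""
    coeffs = []
    for i in range(n + 1):
        num = simpson(lambda x: legendre(i, x) * f(x), -1.0, 1.0)
        den = simpson(lambda x: legendre(i, x) ** 2, -1.0, 1.0)
        coeffs.append(num / den)
    return coeffs

a = least_squares_coeffs(math.exp, 2)
print(a)
```

For $f(x) = e^x$ the first two coefficients can be checked by hand: $a_0 = \sinh 1$ and $a_1 = 3/e$.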

3.2 Piecewise Polynomials 

High-degree polynomials have a tendency to wiggle. A 
degree-100 polynomial p has up to 100 points at which 
it crosses the x-axis on a plot of y = p(x): the distinct 
real zeros of p. This can make high-degree polynomials 
unsatisfactory as approximating functions. Instead of 
using one polynomial of large degree it can be better 
to use many polynomials of low degree. This can be 
done by breaking the interval of interest into pieces 
and using a different low-degree polynomial on each 
piece, with the polynomials joined together smoothly 
to make up the complete approximating function. Such 
piecewise polynomials can produce functions with high 
approximating power while avoiding the oscillations 
possible with high-degree polynomials. 

A trivial example of a piecewise polynomial is the
absolute value function $|x|$, which is equal to $-x$ for
$x \le 0$ and $x$ for $x \ge 0$ (see figure 4 on p. 13 in the
language of applied mathematics [I.2]). More generally,
a piecewise polynomial $g$ defined on an interval
$[a, b] =: [x_0, x_n]$ that is the union of $n$ subintervals
$[x_0, x_1], [x_1, x_2], \dots, [x_{n-1}, x_n]$ is defined by the property
that $g(x) = p_i(x)$ for $x \in [x_i, x_{i+1}]$, where each
$p_i$ is a polynomial. Thus on each interval $g$ is a polynomial,
but each of these individual polynomials is
in general different and possibly of different degree.
Such a function is generally discontinuous, but we can
ensure continuity by insisting that $p_{i-1}(x_i) = p_i(x_i)$,
$i = 1 : n - 1$.

Figure 3 A piecewise-linear function (spline).

Figure 4 A cubic Bezier curve with four
control points $p_1$, $p_2$, $p_3$, $p_4$.

Important examples of piecewise polynomials are 
splines, which are piecewise polynomials g for which 
each individual polynomial has degree k or less and 
for which g has k - 1 continuous derivatives on the 
interval. A spline therefore has the maximum possible 
smoothness. The most commonly used splines are linear
splines and cubic splines, and an important application
is in the finite-element method [II.12]. Figure 3
shows an example of a linear spline. Splines are com- 
monly used in plotting data, where they provide a way 
of “joining up the dots,” e.g., by straight lines in the 
case of a linear spline. 

In computer-aided design the individual polynomials 
in a piecewise polynomial are often constructed as 
Bezier curves, which have the form 


$$B_n(x) = \sum_{i=0}^{n} \binom{n}{i} \frac{(b - x)^{n-i} (x - a)^i}{(b - a)^n} \, p_i$$

for an interval $[a, b]$. The $p_i$ are control points in the
plane that the user chooses via a graphical interface in 
order to achieve a desired form of curve. Figure 4 shows 
a cubic Bezier curve. The polynomials that multiply 
the pi are called Bernstein polynomials, and they were 
originally introduced by Bernstein in 1912 in order to 
give a constructive proof of Weierstrass’s theorem. The 
use of Bezier curves as a design tool to intuitively con- 
struct and manipulate complex shapes was initiated at 
the Citroen and Renault car companies in the 1960s. 
Today, cubic Bezier curves are widely used, e.g., in the 
design of fonts, in image manipulation programs such
as Adobe Photoshop, and in the ISO standard for the
Portable Document Format (PDF). 
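
A Bezier curve can be evaluated directly from its definition as a combination of control points weighted by Bernstein polynomials. The sketch below does this for planar control points; the particular control points are hypothetical values chosen for the example, and production software would more likely use de Casteljau's algorithm for numerical stability.

```python
from math import comb

def bezier(points, x, a=0.0, b=1.0):
    """Evaluate the Bezier curve with the given control points at x in [a, b],
    using the Bernstein polynomials as weights."""
    n = len(points) - 1
    px = py = 0.0
    for i, (cx, cy) in enumerate(points):
        weight = comb(n, i) * (b - x) ** (n - i) * (x - a) ** i / (b - a) ** n
        px += weight * cx
        py += weight * cy
    return (px, py)

# Four hypothetical control points defining a cubic curve.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
start, middle, end = (bezier(control, t) for t in (0.0, 0.5, 1.0))
print(start, middle, end)
```

The curve passes through the first and last control points, while the interior control points only pull the curve toward themselves; at the midpoint the weights are 1/8, 3/8, 3/8, 1/8.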

3.3 Wavelets 

Fourier analysis [I.2 §19.2] decomposes a function
into a linear combination of trigonometric functions 
(sines and cosines) with different frequencies and so is 
a natural way to deal with periodic functions. Wavelet 
analysis, which was first developed in the 1980s, is 
designed to handle nonperiodic functions and does 
so by using basis functions that are rough and local- 
ized. Rather than varying the frequency as with the 
Fourier basis, a wavelet basis is constructed by translation
($f(x) \to f(x - 1)$) and dilation ($f(x) \to f(2x)$).
Given a mother wavelet $\psi(x)$, which has compact support
(that is, it is zero outside a bounded interval),
translations and dilations are created as $\psi(2^n x - k)$
with integer n and k. This leads to many different res- 
olutions, and hence the term multiresolution analysis 
is used in this context. Larger n correspond to finer 
resolutions, and as k varies the support moves around. 

The localized nature of the wavelet basis functions 
makes wavelet representations of many functions and 
data relatively sparse, which makes wavelets particu- 
larly suitable for data compression, detection of fea- 
tures in images (such as edges and other discontinu- 
ities), and noise reduction. These are some of the rea- 
sons for the success of wavelets in (for example) imag- 
ing, where they are used in the JPEG 2000 standard 
[VII.7 §5].
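
The sparsity of wavelet representations can be illustrated with the simplest wavelet, the Haar wavelet, which is not discussed in the text and is used here purely as an assumed example. The sketch repeatedly replaces neighboring pairs of values by their average and half-difference; the signal length is assumed to be a power of 2.

```python
def haar_transform(signal):
    """Unnormalized Haar transform: overall average first, then details
    ordered from coarsest to finest resolution."""
    details = []
    current = list(signal)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        differences = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = differences + details     # finer details go to the right
        current = averages
    return current + details

# A piecewise-constant signal with a single jump (hypothetical data).
signal = [4, 4, 4, 4, 1, 1, 1, 1]
coeffs = haar_transform(signal)
nonzero = sum(1 for c in coeffs if c != 0)
print(coeffs, nonzero)
```

Only two of the eight coefficients are nonzero: the overall average and the single coarse detail that records the jump.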

3.4 Series Solution 

We now turn to the development of explicit series
representations of a function. As an example we take
the Airy function $w(z)$, which satisfies the differential
equation

$$w'' - zw = 0.$$

We can look for a solution $w(z) = \sum_{k=0}^{\infty} a_k z^k$, where
$a_0 = w(0)$ and $a_1 = w'(0)$ can be regarded as given.
For simplicity we will take $a_0 = 1$ and $a_1 = 0$. Differentiating
twice gives $w''(z) = \sum_{k=2}^{\infty} k(k - 1) a_k z^{k-2}$.
Substituting the power series for $w$ and $w''$ into the
differential equation we obtain
$\sum_{k=2}^{\infty} k(k - 1) a_k z^{k-2} - \sum_{k=0}^{\infty} a_k z^{k+1} = 0$.
Since this equation must hold for all
$z$ we can equate coefficients of $z^0, z, z^2, \dots$ on both
sides to obtain a sequence of equations that provide
recurrence relations for the $a_k$, specifically
$(k + 1)(k + 2) a_{k+2} = a_{k-1}$ along with $a_2 = 0$. We find that

$$w(z) = 1 + \frac{z^3}{6} + \frac{z^6}{180} + \frac{z^9}{12\,960} + \cdots.$$

The modulus of the ratio of successive nonzero terms 
tends to zero as the index of the terms tends to infin- 
ity, which ensures that the series is convergent for all 
z. Since a power series can be differentiated term by 
term within its radius of convergence, it follows that 
our series does indeed satisfy the Airy equation. 
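
The recurrence for the series coefficients is easy to run mechanically. The sketch below uses exact rational arithmetic to generate the coefficients for $a_0 = 1$, $a_1 = 0$; the function name is an illustrative choice.

```python
from fractions import Fraction

def airy_series_coeffs(m):
    """Coefficients a_0, ..., a_m of w(z) = sum a_k z^k with a_0 = 1, a_1 = 0,
    using (k + 1)(k + 2) a_{k+2} = a_{k-1} together with a_2 = 0."""
    a = [Fraction(0)] * (m + 1)
    a[0] = Fraction(1)
    for k in range(1, m - 1):
        a[k + 2] = a[k - 1] / ((k + 1) * (k + 2))
    return a

a = airy_series_coeffs(9)
print([str(c) for c in a])
```

Only every third coefficient is nonzero, and the nonzero ones are 1, 1/6, 1/180, and 1/12960, in agreement with the series for $w(z)$.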

Constructing a series expansion does not always 
lead to a convergent series. Consider the exponential 
integral 

$$E_1(z) = \int_z^\infty \frac{e^{-t}}{t} \, dt.$$


Integrating by parts repeatedly gives 

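$$\begin{aligned}
E_1(z) &= \frac{e^{-z}}{z} - \int_z^\infty \frac{e^{-t}}{t^2} \, dt \\
&= \frac{e^{-z}}{z} - \frac{e^{-z}}{z^2} + 2 \int_z^\infty \frac{e^{-t}}{t^3} \, dt \\
&= \cdots \\
&= \frac{e^{-z}}{z} \left( 1 - \frac{1}{z} + \frac{2!}{z^2} - \cdots + (-1)^{k-1} \frac{(k-1)!}{z^{k-1}} \right) + R_k.
\end{aligned}$$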


The remainder term, $R_k = (-1)^k k! \int_z^\infty (e^{-t}/t^{k+1}) \, dt$,
does not tend to zero as $k \to \infty$ for fixed $z$, so the series
is not convergent. Nevertheless, $|R_k|$ does decrease
with $k$ before it increases, and a reasonable approximation
to $E_1(z)$ can be obtained by choosing a suitable
value of $k$. For example, with $z = 10$ the remainder
starts increasing at $k = 11$, and taking $k = 10$ we
obtain the approximation $E_1(10) \approx 4.156 \times 10^{-6}$, which
is to be compared with $E_1(10) = 4.157 \times 10^{-6}$, where
both results have been rounded to four significant figures.
The series above is an example of an asymptotic
series. In general, we say that the series $\sum_{k=0}^{\infty} a_k z^{-k}$ is
an asymptotic expansion of $f$ as $z \to \infty$ if

$$\lim_{z \to \infty} z^n \left( f(z) - \sum_{k=0}^{n} a_k z^{-k} \right) = 0$$

for every $n$, and we write $f(z) \sim \sum_{k=0}^{\infty} a_k z^{-k}$, where the
symbol $\sim$ is read as “is asymptotic to.” This condition
can also be written as

$$f(z) = \sum_{k=0}^{n-1} a_k z^{-k} + O(z^{-n}).$$

For the series for $E_1$ we have



and the latter bound tends to zero as $|z| \to \infty$ if
$\arg z \in (-\pi/2, \pi/2)$, so the series is asymptotic under
this constraint on z. 

By summing an appropriate number of terms, asymp- 
totic series can deliver approximations of up to a cer- 
tain, possibly good, accuracy, for large enough |z|, but 
beyond a certain point the accuracy worsens. 
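
The behavior of the asymptotic series for the exponential integral at $z = 10$ can be checked with a short computation. The sketch below sums the first $k$ terms of the series obtained by repeated integration by parts; the reference value $4.157 \times 10^{-6}$ is the rounded figure quoted in the text, not an independently computed one.

```python
import math

def e1_asymptotic(z: float, k: int) -> float:
    """Sum of the first k terms of the asymptotic series for E_1(z):
    (e^{-z}/z) * sum_{i=0}^{k-1} (-1)^i i! / z^i."""
    total = 0.0
    term = 1.0                       # (-1)^i i! / z^i for i = 0
    for i in range(k):
        total += term
        term *= -(i + 1) / z
    return math.exp(-z) / z * total

reference = 4.157e-6                 # E_1(10) rounded, as quoted in the text
approx_10 = e1_asymptotic(10.0, 10)
approx_20 = e1_asymptotic(10.0, 20)
term_sizes = [math.factorial(i) / 10.0 ** i for i in range(20)]
print(f"{approx_10:.3e}", f"{approx_20:.3e}")
```

The magnitudes of the terms fall until about $i = 10$ and then grow, so taking 20 terms gives a worse approximation than taking 10.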

Suppose we have the quadratic $q_\epsilon(x) = x^2 - x + \epsilon = 0$,
where $\epsilon$ is a small parameter and we wish to
obtain a series expansion for $x$ as a function of $\epsilon$. This
can be done by substituting $x(\epsilon) = \sum_{k=0}^{\infty} a_k \epsilon^k$ into the
equation and setting the coefficients of each power of
$\epsilon$ to zero. This produces a system of equations that can
be used to express $a_1, a_2, \dots$ in terms of $a_0$. The two
solutions of $q_\epsilon(x) = 0$ for $\epsilon = 0$ are 0 and 1, so we take
$a_0 = 0, 1$ and obtain the series

$$x(\epsilon) = \begin{cases} \epsilon + \epsilon^2 + 2\epsilon^3 + \cdots, & a_0 = 0, \\ 1 - \epsilon - \epsilon^2 - 2\epsilon^3 + \cdots, & a_0 = 1, \end{cases} \qquad (1)$$


which describe how the roots 0 and 1 of $q_0(x)$ behave
for small $\epsilon$. Suppose now that it is the leading term
that is small and that we have the quadratic
$\tilde{q}_\epsilon(x) = \epsilon x^2 - x + 1 = 0$. If we repeat the process of looking for
an expansion of $x(\epsilon)$, we obtain
$x(\epsilon) = 1 + \epsilon + 2\epsilon^2 + 5\epsilon^3 + \cdots$
describing the behavior of the root 1 of $\tilde{q}_0(x)$.
But $\tilde{q}_\epsilon$ is a quadratic and so has two roots. What has happened
to the other one? There is a change of degree as
we go from $\epsilon = 0$ to $\epsilon \ne 0$, and this takes us into singular
perturbation theory [IV.5 §3.2]. In this simple
case we can use the transformation $y = 1/x$ to write
$\tilde{q}_\epsilon(x) = q_\epsilon(y)/y^2$, where $q_\epsilon(y) = y^2 - y + \epsilon$,
and so we obtain expansions for
$x(\epsilon)$ by inverting those in (1). Indeed, inverting the second
expression in (1) and expanding in a power series
recovers the expansion we just derived.
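
Equating coefficients of powers of $\epsilon$ can be automated. The sketch below treats the regular case $x^2 - x + \epsilon = 0$ with $a_0 = 0$: the coefficient of $\epsilon^k$ gives $a_k$ in terms of earlier coefficients. The function name and the sample value $\epsilon = 0.01$ are illustrative choices.

```python
import math

def small_root_coeffs(m):
    """Coefficients a_0, ..., a_m of x(eps) = sum a_k eps^k for the root of
    x^2 - x + eps = 0 that tends to 0 as eps -> 0 (so a_0 = 0)."""
    a = [0] * (m + 1)
    for k in range(1, m + 1):
        # Coefficient of eps^k: sum_{i+j=k} a_i a_j - a_k + [k == 1] = 0,
        # and terms with i = 0 or j = 0 vanish because a_0 = 0.
        convolution = sum(a[i] * a[k - i] for i in range(1, k))
        a[k] = convolution + (1 if k == 1 else 0)
    return a

a = small_root_coeffs(4)
eps = 0.01
series_value = sum(a[k] * eps ** k for k in range(len(a)))
exact_root = (1.0 - math.sqrt(1.0 - 4.0 * eps)) / 2.0
print(a, series_value, exact_root)
```

The computed coefficients begin 0, 1, 1, 2, 5, and for $\epsilon = 0.01$ the truncated series agrees with the exact small root to about eight decimal places.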


4 Symbolic Solution 

Sometimes a useful representation of a solution can 
be obtained using a computer symbolic manipulation 
package. Such packages are, for example, very good at 
determining closed forms for indefinite integrals that 
would be too tedious to derive by hand and for finding
series expansions. We give four examples, each with a 
different package. 

The MATLAB Symbolic Math Toolbox shows that
the terms in $x^0, x^1, \dots, x^6$ of the Taylor series of
$\tan(\sin x) - \sin(\tan x)$ are zero:

    >> syms x
    >> taylor(tan(sin(x)) - sin(tan(x)), ...
              'order', 8) % Terms up to x^7
    ans =
    x^7/30

The SymPy package in Python provides a concise
form for an indefinite integral (here, the backslash \
denotes that a continuation line follows):

    >>> from sympy import *
    >>> x = Symbol('x')
    >>> integrate((sin(x) + cos(x) + 1) / \
    ...     (sin(x) + cos(x) + 1 + sin(x)*cos(x)))
    2*log(tan(x/2) + 1)

Maple finds an explicit form for a definite integral: 

    > int(x^2/sin(x)^2, x = 0..Pi/2)
    Pi*ln(2)

Mathematica solves the first-order PDE initial-value
problem $u_t + 9u_x = u^2$, $u(x, 0) = \sin x$, via the input

    DSolve[{D[u[x, t], t] + 9 D[u[x, t], x]
            == u[x, t]^2,
            u[x, 0] == Sin[x]}, u[x, t], {x, t}]

which yields the output

    {{u[x, t] -> -1/(t + Csc[9 t - x])}}

where Csc denotes the cosecant, $\csc x = 1/\sin x$.

One should never take such results at face value 
and should always run some kind of check, e.g., by 
substituting the claimed solution into the equation 
or comparing a symbolic solution with a numerically 
computed one. 
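
One inexpensive check of this kind is to substitute the claimed solution into the equation numerically. The sketch below does so for the Mathematica output above, approximating the partial derivatives by central differences; the sample points and step size are arbitrary choices made for illustration, and a small residual is evidence of correctness rather than a proof.

```python
import math

def u(x, t):
    """Claimed solution u(x, t) = -1 / (t + csc(9t - x))."""
    return -1.0 / (t + 1.0 / math.sin(9.0 * t - x))

def residual(x, t, h=1e-5):
    """u_t + 9 u_x - u^2, with derivatives from central differences."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    return u_t + 9.0 * u_x - u(x, t) ** 2

pde_residual = residual(0.5, 0.2)
initial_error = abs(u(0.7, 0.0) - math.sin(0.7))
print(pde_residual, initial_error)
```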

5 Working from First Principles 

Many mathematical problems can successfully be at- 
tacked from first principles, or with the use of heu- 
ristics, and applied mathematicians often take this 
approach rather than have recourse to general theory. 
Indeed, for unsolved research problems there may
be no other way forward. For easier problems these
approaches can provide useful insight and experience. 
The techniques in question include 

• looking for a solution of a particular form, which 
may be constructed from experience or intuition 
or just by making an educated guess, and then 
showing that a solution of that form exists, and 

• deducing what the general form of a solution must 
be and then determining the solution. 

A standard example taught to undergraduate stu- 
dents is finding the general solution of the second- 
order ordinary differential equation (ODE) 

y" + ay' + by = 0 (2) 

for the unknown function y = y(t). The starting point 
is to look for a solution of the form $y(t) = e^{\lambda t}$. Substituting
this purported solution into the equation gives
$(\lambda^2 + a\lambda + b)e^{\lambda t} = 0$, which implies $\lambda^2 + a\lambda + b = 0$.
If this quadratic has distinct roots $\lambda_1$ and $\lambda_2$, then two
linearly independent solutions of the differential equation,
$e^{\lambda_1 t}$ and $e^{\lambda_2 t}$, have been determined. The general
solution can be built up, in all cases, from this starting 
point. 

A guess guided by intuition or experience is called 
an ansatz. The assumed form of solution $y(t) = e^{\lambda t}$
to (2) is an ansatz. The corresponding ansatz for the
difference equation $y_n + a y_{n-1} + b y_{n-2} = 0$ is $y_n = \lambda^n$,
which again yields the quadratic $\lambda^2 + a\lambda + b = 0$, and
continuing with this line of investigation leads to the 
theory of difference equations. 

An example of working from first principles is to 
determine a formula for the Vandermonde determinant 

$$D(x_1, x_2, x_3) = \det \begin{bmatrix} 1 & 1 & 1 \\ x_1 & x_2 & x_3 \\ x_1^2 & x_2^2 & x_3^2 \end{bmatrix}.$$

Instead of laboriously expanding the determinant one
can observe that, since it is a sum of products of terms
with one taken from each row, the result must be a
multivariate polynomial of the form $\sum x_1^i x_2^j x_3^k$, with $i$,
$j$, and $k$ distinct nonnegative integers summing to 3. If
$x_1 = x_2$ then the first two columns are linearly dependent
and the determinant is zero; hence $x_1 - x_2$ is a
factor of the determinant. Continuing to argue in this
fashion, $(x_1 - x_2)(x_2 - x_3)(x_3 - x_1)$ must be a factor.
Since this product has degree 3, it must be that
$D(x_1, x_2, x_3) = c(x_1 - x_2)(x_2 - x_3)(x_3 - x_1)$ for some
constant $c$, and by considering the terms in $x_2 x_3^2$ on
both sides of this equation it is clear that $c = 1$. The
same argument extends straightforwardly to an $n \times n$
Vandermonde determinant. An advantage of this kind 
of reasoning is that it is adaptable: if the first row of the 
matrix is replaced by $[x_1^3 \;\; x_2^3 \;\; x_3^3]$, then the same form
of argument can be used. 
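
A formula obtained by this kind of reasoning is easy to sanity-check numerically. The sketch below expands the $3 \times 3$ determinant by cofactors and compares it with the product formula at a few integer points, which are arbitrary sample values.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along row 1."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def vandermonde_det(x1, x2, x3):
    return det3([[1, 1, 1], [x1, x2, x3], [x1 ** 2, x2 ** 2, x3 ** 2]])

def product_formula(x1, x2, x3):
    return (x1 - x2) * (x2 - x3) * (x3 - x1)

samples = [(2, 3, 5), (-1, 4, 7), (0, 1, 2)]
results = [(vandermonde_det(*s), product_formula(*s)) for s in samples]
print(results)
```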


6 Iteration 


Suppose we wish to solve the three equations in three 
unknowns 

$$\begin{aligned}
4x - y &= a, \\
-x + 4y - z &= b, \\
-y + 4z &= c
\end{aligned}$$

for some given $a$, $b$, and $c$. Because the coefficients on
the diagonal (the 4s) are the largest coefficients in each
row, we might expect $x \approx a/4$, $y \approx b/4$,
$z \approx c/4$ to be reasonable approximations. Note that this
corresponds to rewriting the equations as

$$\begin{aligned}
x &= \frac{a}{4} + \frac{y}{4}, \\
y &= \frac{b}{4} + \frac{x + z}{4}, \\
z &= \frac{c}{4} + \frac{y}{4},
\end{aligned} \qquad (3)$$

and setting $x = y = z = 0$ on the right-hand side.
If we want to improve our approximation we can try
plugging it into the right-hand side of (3) and reading
off the new values from the left-hand side. By repeating
this process we obtain the iteration

$$\begin{aligned}
x_{k+1} &= \frac{a}{4} + \frac{y_k}{4}, \\
y_{k+1} &= \frac{b}{4} + \frac{x_k + z_k}{4}, \\
z_{k+1} &= \frac{c}{4} + \frac{y_k}{4},
\end{aligned} \qquad k = 0, 1, 2, \dots,$$

with $x_0 = y_0 = z_0 = 0$. This notation means that we
are defining infinite sequences $\{x_k\}$, $\{y_k\}$, and $\{z_k\}$ by
the given formulas.

If the iteration converges, that is, if $\lim_{k\to\infty} x_k$,
$\lim_{k\to\infty} y_k$, and $\lim_{k\to\infty} z_k$ all exist, then the limits must
satisfy (3), which is equivalent to the original system, so 
the only possible limits are the required solution com- 
ponents. This iteration is known as the Jacobi iteration. 
It is defined in an analogous way for any linear system 
of equations Ax = b for which the diagonal elements of 
A are nonzero. Convergence can be proved if the matrix 
A has a large diagonal in the sense of being strictly 
diagonally dominant by rows [IV.10 §1].
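
The Jacobi iteration for the three-equation system takes only a few lines of code. In the sketch below the right-hand side values $a = 1$, $b = 2$, $c = 3$ and the number of steps are arbitrary choices; for these values the exact solution is $x = 13/28$, $y = 6/7$, $z = 27/28$.

```python
def jacobi(a, b, c, steps):
    """Jacobi iteration for 4x - y = a, -x + 4y - z = b, -y + 4z = c,
    started from x = y = z = 0."""
    x = y = z = 0.0
    for _ in range(steps):
        # All three updates use only values from the previous iterate.
        x, y, z = a / 4 + y / 4, b / 4 + (x + z) / 4, c / 4 + y / 4
    return x, y, z

first = jacobi(1.0, 2.0, 3.0, 1)      # equals (a/4, b/4, c/4)
x, y, z = jacobi(1.0, 2.0, 3.0, 50)
print(first, (x, y, z))
```

The simultaneous tuple assignment matters: updating x first and then using the new x in the formula for y would give a different method (Gauss-Seidel) rather than Jacobi.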

The Jacobi iteration is a special case of a powerful 
technique known as fixed-point iteration (or functional 
iteration) for solving a nonlinear system of equations. 


Consider an equation $x = f(x)$, where $f : \mathbb{R} \to \mathbb{R}$. Any
scalar nonlinear equation can be put in this form. We
can set up the iteration $x_{k+1} = f(x_k)$, with some choice
of $x_0 \in \mathbb{R}$. The iteration may or may not converge,
as can be seen by considering the case $f(x) = x^2$,
for which we have convergence to 0 for $|x_0| < 1$ and
divergence for $|x_0| > 1$. But if it does converge, to $x$,
say, then $x$ must be a solution of the equation because
taking limits in the iteration gives $x = f(x)$ when $f$ is continuous.

To analyze the convergence of fixed-point iteration 
we note that if $x_*$ is a solution then
$x_* - x_{k+1} = f(x_*) - f(x_k) = f'(\theta_k)(x_* - x_k)$ for some $\theta_k$ lying between
$x_*$ and $x_k$, by the mean-value theorem. Hence
$|x_* - x_{k+1}| < |x_* - x_k|$ if $|f'(\theta_k)| < 1$. This observation can
be made into a proof that $x_k$ converges to $x_*$ if $x_0$ lies
in a neighborhood of $x_*$ in which $|f'(x)|$ is less than 1.
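
The convergence and divergence behavior described above is easy to observe. The sketch below runs fixed-point iteration on $f(x) = x^2$ from two starting points and, as an additional assumed example not taken from the text, on $f(x) = \cos x$, whose fixed point lies in a region where $|f'(x)| = |\sin x| < 1$.

```python
import math

def fixed_point(f, x0, steps):
    """Apply x_{k+1} = f(x_k) for the given number of steps."""
    x = x0
    for _ in range(steps):
        x = f(x)
    return x

x_small = fixed_point(lambda x: x * x, 0.5, 10)   # |x0| < 1: tends to 0
x_large = fixed_point(lambda x: x * x, 1.5, 10)   # |x0| > 1: grows without bound
x_cos = fixed_point(math.cos, 1.0, 100)           # converges to x = cos(x)
print(x_small, x_large, x_cos)
```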

The most widely used iteration for solving nonlinear 
equations is Newton’s method [II.28] and its variants.
Newton’s method for $f(x) = 0$ is

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)},$$

where $x_0$ is given. In order to apply the method we need
to be able to compute the derivative $f'(x)$. Newton’s
method has a tendency to wander away from the root 
we are trying to compute if not started close enough to 
it. It is therefore common to start the method with a 
good approximation, possibly one computed by some 
other method, or to modify Newton’s method in some 
way that encourages convergence. 

In comparing different classes of iteration, the rate 
of convergence is an important concept. Let $\{x_n\}$ be a
sequence of scalars converging to $x_*$ and denote the
error in $x_n$ by $e_n = x_* - x_n$. If

$$\lim_{n \to \infty} \frac{|e_{n+1}|}{|e_n|^p} = C \ne 0,$$

where $C$ is a constant, the sequence is said to converge
with rate $p$ (or order $p$). This definition generalizes to
vectors by replacing the absolute values with a vector
norm. Fixed-point iteration has linear convergence in
general, for which $p = 1$. Newton’s method has (local)
quadratic convergence ($p = 2$) to a simple root $x_*$, that
is, one for which $f'(x_*) \ne 0$. Quadratic convergence
is very desirable as it roughly doubles the number of 
correct figures on each step, while linear convergence 
merely reduces the error by a fixed percentage each 
time. Iterations of arbitrarily high order can be derived 
(e.g., by combining several Newton steps into one). 
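
Quadratic convergence can be observed by monitoring the ratio $|e_{n+1}|/|e_n|^2$ along a Newton iteration. The sketch below applies Newton's method to $f(x) = x^2 - 2$ from $x_0 = 1$; the test equation and starting value are illustrative choices. For this problem the ratios should approach $|f''(x_*)/(2f'(x_*))| = 1/(2\sqrt{2}) \approx 0.354$.

```python
import math

def newton(f, fprime, x0, steps):
    """Return the Newton iterates x_0, x_1, ..., x_steps for f(x) = 0."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs

root = math.sqrt(2.0)
xs = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0, 4)
errors = [abs(root - x) for x in xs]
# Ratios |e_{n+1}| / |e_n|^2, which settle down when p = 2.
ratios = [errors[k + 1] / errors[k] ** 2 for k in range(len(errors) - 1)]
print(errors)
print(ratios)
```

The number of correct digits roughly doubles at each step, and after four steps the error is close to the limit of double precision arithmetic, so further ratios would be dominated by rounding error.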

Fixed-point iteration can be used to iterate on func- 
tions as well as numbers. Consider the ODE initial-value 
problem

$$y'(x) = f(x, y), \qquad y(0) = y_0.$$

Integrating between 0 and $x$ leads to the equivalent
problem

$$y(x) = y_0 + \int_0^x f(x, y(x)) \, dx,$$

which is a type of equation known as an integral
equation [IV.4] because the unknown function occurs
within an integral. Applying the fixed-point iteration
idea we can make a guess $\phi_0$ for $y$, plug it into the right-hand
side of the integral equation, and call the result
$\phi_1$. The process can be iterated to produce a sequence
of functions $\phi_k$ defined by

$$\phi_{k+1}(x) = y_0 + \int_0^x f(x, \phi_k(x)) \, dx, \qquad k \ge 1.$$

In general, none of the $\phi_k$ will satisfy the differential
equation, but we might hope that the sequence has a
limit that does. Let us try out this idea on the problem

$$y' = 2x(1 + y), \qquad y(0) = 0$$

using first guess $\phi_0(x) = 0$. Then $\phi_1(x) = \int_0^x 2x \, dx = x^2$
and $\phi_2(x) = \int_0^x 2x(1 + x^2) \, dx = x^2 + x^4/2$. Continuing
in this fashion yields
$\phi_k(x) = x^2 + x^4/2! + x^6/3! + \cdots + x^{2k}/k!$.
The limit as $k \to \infty$ exists and is $e^{x^2} - 1$,
which is the required solution.
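
For this particular problem every iterate is a polynomial, so the iteration can be carried out exactly on lists of coefficients. The sketch below is specific to $y' = 2x(1 + y)$, $y(0) = 0$; the coefficient-list representation (entry $k$ multiplies $x^k$) is an implementation choice made for the example.

```python
from fractions import Fraction

def picard_step(phi):
    """One step of the iteration for y' = 2x(1 + y), y(0) = 0.
    phi[k] is the coefficient of x^k in the current iterate."""
    one_plus_phi = [phi[0] + 1] + list(phi[1:])
    # Multiply by 2x: shift coefficients up by one power and double them.
    integrand = [Fraction(0)] + [2 * c for c in one_plus_phi]
    # Integrate term by term from 0 to x; the constant term is y(0) = 0.
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(integrand)]

phi = [Fraction(0)]                  # first guess phi_0(x) = 0
for _ in range(3):
    phi = picard_step(phi)
print([str(c) for c in phi])         # coefficients of phi_3
```

Three steps give $x^2 + x^4/2 + x^6/6$, the first three terms of the series for $e^{x^2} - 1$.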

The procedure we have just carried out is known as 
Picard iteration, or the method of successive approxi- 
mation. Of course, in most cases it will not be possible 
to evaluate the integrals in closed form, and so Picard 
iteration is not a practical means for computing a solu- 
tion. However, Picard iteration is the basis of the proof 
of the standard result on existence and uniqueness of 
solutions for ODEs. The result says that, if f(x,y) is 
continuous for $x \in [a, b]$ and for all $y$ and satisfies a
Lipschitz condition

$$|f(x, u) - f(x, v)| \le L|u - v| \quad \forall x \in [a, b], \ \forall u, v,$$

with Lipschitz constant $L$, then for any $y_0$ there is
a unique continuously differentiable function $y(x)$
defined on $[a, b]$ that satisfies $y' = f(x, y)$ and
$y(a) = y_0$.

7 Conversion to Another Problem 

When we cannot solve a problem it can be useful to 
convert it to a different problem that is more amenable 
to attack. In this section we give several examples of 
such conversions. 


We note first that it is not always obvious what is 
meant by a solution to a problem. Consider the ODE 
problem 

$$\frac{dy}{dx} = 1 - 2xy, \qquad y(0) = 0.$$

The solution $y$ can be written as

$$y(x) = e^{-x^2} \int_0^x e^{t^2} \, dt,$$

which is known as Dawson’s integral or Dawson’s func- 
tion. Which representation of y is better? If we need 
to obtain higher derivatives $d^k y/dx^k$, the differential
equation is more convenient. To evaluate $y(x)$
for a given x, numerical methods can be applied to 
either representation. Both representations therefore 
have their uses. 

7.1 Uncoupling 

When we are solving equations, of whatever type, a par- 
ticularly favorable circumstance is when the first equa- 
tion involves only one unknown and each successive 
equation introduces only one new unknown. We can 
then solve the equations from first to last. The simplest 
example is a triangular system of linear equations, such 
as 

\[ \begin{aligned}
a_{11}x_1 &= b_1, \\
a_{21}x_1 + a_{22}x_2 &= b_2, \\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 &= b_3,
\end{aligned} \]

which can be solved by finding $x_1$ from the first equation,
then $x_2$ from the second, and finally $x_3$ from the
third. This is the process known as substitution.
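
The substitution process can be sketched in a few lines of Python. The function name `forward_substitution` and the numerical values of the system are made up for illustration; the sketch assumes that the diagonal entries are nonzero.

```python
def forward_substitution(a, b):
    """Solve a lower triangular system a x = b, one new unknown per equation."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # Equation i involves x[0..i]; everything except x[i] is already known.
        known = sum(a[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - known) / a[i][i]
    return x


# Illustrative 3-by-3 system (the numbers are invented for this example).
a = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, -1.0, 5.0]]
b = [4.0, 11.0, 20.0]
x = forward_substitution(a, b)
print(x)  # [2.0, 3.0, 3.0]
```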

Most linear equation problems do not have this triangular
structure, but the process of Gaussian elimination
[IV.10 §2] converts an arbitrary linear system
into triangular form.

More generally we might have n nonlinear equations 
in n unknowns, and a natural way to solve them is to try 
to manipulate them into an analogous triangular form. 
In computer algebra a way of doing this for polynomial
equations is provided by Buchberger’s algorithm
for computing a Gröbner basis [IV.39 §2.1].

A triangular problem is partially uncoupled. In a fully 
uncoupled system each equation contains only one 
unknown. A linear system of ODEs $y' = Ay$ with an
$n \times n$ coefficient matrix $A$ can be uncoupled if $A$ is
diagonalizable. Indeed, if $A = XDX^{-1}$ with $X$ nonsingular
and $D = \mathrm{diag}(\lambda_i)$, then the transformation $z = X^{-1}y$
gives $z' = Dz$, which represents $n$ uncoupled scalar
equations $z_i' = \lambda_i z_i$, $i = 1\colon n$. The behavior of the vector
$y$ can now be understood by looking at the behavior
of the $n$ independent scalars $z_i$.


7.2 Polynomial Roots and Matrix Eigenvalues 


Consider the problem of finding the roots (zeros) of a
polynomial $p_n(x) = a_n x^n + a_{n-1}x^{n-1} + \cdots + a_0$ with
$a_n \ne 0$, that is, the values of $x$ for which $p_n(x) = 0$.
It is known from Galois theory that there is no explicit
formula for the roots when $n \ge 5$. Many methods are
available for computing polynomial roots, but not all
are able to compute all $n$ roots reliably and software
might not be readily available. Consider the $n \times n$ matrix



\[ C = \begin{bmatrix}
-a_{n-1}/a_n & -a_{n-2}/a_n & \cdots & -a_0/a_n \\
1 & 0 & \cdots & 0 \\
 & \ddots & \ddots & \vdots \\
0 & & 1 & 0
\end{bmatrix}. \]


Let $\lambda$ be a root of $p_n$. For the vector defined by
$y = [\lambda^{n-1}\ \lambda^{n-2}\ \cdots\ 1]^T$ we have $Cy = \lambda y$, so $\lambda$ is an eigenvalue
of $C$ with eigenvector $y$. In fact, the set of roots
of $p_n$ is the set of eigenvalues of $C$, so the polynomial
root problem has been converted into an eigenvalue
problem, albeit a specially structured one. The matrix
$C$ is called a companion matrix. Of course, one can go
in the opposite direction: to find the eigenvalues of $C$
one might look for solutions of $\det(C - \lambda I) = 0$, and
the determinant is precisely $(-1)^n p_n(\lambda)/a_n$.
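
The relation $Cy = \lambda y$ is easy to check numerically. The sketch below builds the companion matrix for a cubic with known roots, chosen purely for illustration, and verifies the eigenvector property for each root; the helper names `companion` and `matvec` are hypothetical.

```python
def companion(coeffs):
    """Companion matrix of a_n x^n + ... + a_0, given coeffs = [a_n, ..., a_0]."""
    a_n = coeffs[0]
    n = len(coeffs) - 1
    c = [[0.0] * n for _ in range(n)]
    c[0] = [-a / a_n for a in coeffs[1:]]   # first row: -a_{n-1}/a_n, ..., -a_0/a_n
    for i in range(1, n):
        c[i][i - 1] = 1.0                   # ones on the subdiagonal
    return c


def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


# p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3), chosen for illustration.
C = companion([1.0, -6.0, 11.0, -6.0])
for lam in (1.0, 2.0, 3.0):
    y = [lam ** 2, lam, 1.0]                # [lambda^{n-1}, ..., lambda, 1]
    print(lam, matvec(C, y), [lam * yi for yi in y])
```

In practice one would pass $C$ to an eigenvalue routine from a numerical library; here only the algebraic identity is checked.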

The eigenvector problem $Ax = \lambda x$ can be converted
into a nonlinear system of equations $F(v) = 0$, where

\[ F(v) = \begin{bmatrix} (A - \lambda I)x \\ e_s^T x - 1 \end{bmatrix}, \quad v = \begin{bmatrix} x \\ \lambda \end{bmatrix}. \]

The last component of $F$ serves to normalize the eigenvector
and here $s$ is some fixed integer, with $e_s$ denoting
the $s$th column of the identity matrix. By solving
$F(v) = 0$ we obtain both an eigenvalue of $A$ and the
corresponding eigenvector.


7.3 Dubious Conversions 

Converting one problem to an apparently simpler one 
is not always a good idea. The problem of solving the
scalar nonlinear equation $f(x) = 0$ can be converted to
the problem of minimizing the function $g(x) = f(x)^2$.
Since the latter problem has a global minimum attained
when $f(x) = 0$, the conversion might look attractive.
However, it has a pitfall: since $g'(x) = 2f'(x)f(x)$, the
derivative of $g$ is zero whenever $f'(x) = 0$, and this
means that methods for minimizing $g$ might converge
to points that are stationary points of $g$ but not zeros
of $f$.
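
A small experiment makes the pitfall concrete. The cubic $f(x) = x^3 - 3x + 3$, the starting point, and the fixed step length below are all assumptions chosen for illustration; plain gradient descent stands in for "methods for minimizing $g$".

```python
def f(x):
    return x**3 - 3*x + 3        # single real zero near x = -2.1


def fprime(x):
    return 3*x**2 - 3            # vanishes at x = 1 and x = -1


def g_prime(x):
    return 2 * fprime(x) * f(x)  # derivative of g(x) = f(x)**2


x = 0.5                          # starting point (an arbitrary choice)
for _ in range(500):
    x -= 0.01 * g_prime(x)       # plain gradient descent on g

print(x, f(x), g_prime(x))
```

The iteration settles at $x = 1$, where $f'(x) = 0$ and $g$ has a local minimum, but $f(1) = 1$, so the limit is not a zero of $f$.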

For another example, consider the generalized eigenproblem
in $n \times n$ matrices $A$ and $B$, $Ax = \lambda Bx$, which
arises in problems in engineering and physics. It is natural
to attempt to convert it to the standard eigenproblem
$B^{-1}Ax = \lambda x$ and then apply standard theory and
algorithms. However, if $B$ is singular this transformation
is not possible, and when $B$ is nonsingular but
ill-conditioned [I.2 §22] the transformation is inadvisable
in floating-point arithmetic as it will be numerically
unstable. A further drawback is that if $B$ is sparse
[IV.10 §6] (has many zeros) then $B^{-1}A$ can have many
more nonzeros than $A$ or $B$.

7.4 High-Order Differential Equations 

Methods of solution of differential equations have been 
more extensively developed for first-order equations 
than for higher-order ones, where order refers to the 
highest derivative in the equation. Fortunately, higher- 
order equations can always be converted to first-order 
ones. Consider the $q$th-order ODE

\[ y^{(q)} = f(t, y, y', \dots, y^{(q-1)}) \]

with $y, y', \dots, y^{(q-1)}$ given at $t = t_0$. Define new
variables

\[ z_1 = y, \quad z_2 = y', \quad \dots, \quad z_q = y^{(q-1)}. \]

Then we have the first-order system of equations

\[ \begin{aligned}
z_1' &= z_2, \\
z_2' &= z_3, \\
&\;\;\vdots \\
z_{q-1}' &= z_q, \\
z_q' &= f(t, z_1, z_2, \dots, z_q),
\end{aligned} \]

with $z_1, z_2, \dots, z_q$ given at $t = t_0$. We can write this
system in vector form:

\[ z' = f(t, z), \quad z = [z_1, z_2, \dots, z_q]^T. \qquad (4) \]

So we have traded high order for high dimension. Fortunately,
the theory and the numerical methods developed
for scalar first-order ODEs generally carry over
straightforwardly to vector ODEs. We can go further
and remove the explicit time dependence from (4) to
put the system in autonomous form: with $w = [t, z^T]^T$,
we have

\[ w' = \begin{bmatrix} 1 \\ f(t, z) \end{bmatrix} = \begin{bmatrix} 1 \\ f(w_1, w_2, \dots, w_{q+1}) \end{bmatrix} =: g(w). \]

7.5 Continuation

Suppose we have a hard problem “solve $f(x) = 0$”
and another problem “solve $g(x) = 0$” that is trivial
to solve. Consider the parametrized problem “solve
$h(x,t) = tf(x) + (1 - t)g(x) = 0$.” We know the solution
for $t = 0$ and wish to find it for $t = 1$. The idea
of continuation (also called homotopy, or incremental
loading in elasticity) is to traverse the interval from 0
to 1 in several steps: $0 < t_1 < t_2 < \cdots < t_n = 1$. On
the $k$th step we use the solution $x_{k-1}$ of the problem
$h(x, t_{k-1}) = 0$ as the starting point for an iteration for
solving $h(x, t_k) = 0$. We are therefore solving the original
problem by approaching it gradually from a trivial
problem. Continuation cannot be expected to work well
in all cases. It is particularly well suited to cases where
$f$ already depends on a parameter and the problem is
simpler for some value of that parameter.
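
A minimal sketch of continuation for a scalar equation follows. The "hard" function $f(x) = x^3 + x - 10$ (with zero $x = 2$), the trivial function $g(x) = x$, the ten equally spaced values of $t$, and the use of Newton's method as the inner iteration are all assumptions chosen for illustration.

```python
def f(x):            # the "hard" problem f(x) = 0 (zero at x = 2)
    return x**3 + x - 10.0


def fprime(x):
    return 3.0 * x**2 + 1.0


def g(x):            # the trivial problem g(x) = 0 (zero at x = 0)
    return x


def gprime(x):
    return 1.0


def newton(h, hprime, x, tol=1e-12, maxit=50):
    """Newton's method for a scalar equation h(x) = 0 started from x."""
    for _ in range(maxit):
        step = h(x) / hprime(x)
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x


n = 10
x = 0.0                                    # solution of h(x, 0) = g(x) = 0
for k in range(1, n + 1):
    t = k / n
    h = lambda x, t=t: t * f(x) + (1 - t) * g(x)
    hp = lambda x, t=t: t * fprime(x) + (1 - t) * gprime(x)
    x = newton(h, hp, x)                   # start from the previous solution
    print(f"t = {t:.1f}, x = {x:.12f}")
```

Each inner solve starts close to its solution, which is the point of the technique; for this easy example Newton's method would also converge without continuation.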

Continuation is a very general technique and has 
close connections with bifurcation theory [IV.21]. 
A special case of it is the idea of shrinking [V.10 §2.2], 
whereby a convex combination is taken of a given 
object with another having more desirable properties. 

8 Linearization 

A huge body of mathematics is concerned with prob- 
lems that are linear in the variables of interest, such as 
a system Ax = b of n linear equations in n unknowns 
or a system of ODEs dy/dt = A(t)y. For linear prob- 
lems it is usually easy to analyze the existence of solu- 
tions, to obtain an explicit formula for a solution, and 
to derive practical methods of solution that exploit the 
linearity. Unfortunately, many real-world processes are 
inherently nonlinear. This means, first of all, that it 
may not be easy to determine whether or not there is 
a solution at all or, if a solution exists, whether it is 
unique. Secondly, finding a solution is in general dif- 
ficult. A general technique for solving nonlinear prob- 
lems is to transform them into linear ones, thereby con- 
verting a problem that we cannot solve into one that 
we can. The transformation can rarely be done exactly, 
so what is usually done is to approximate the nonlinear
problem by a linear one (the process of linearization)
and carry out some sort of iteration or refinement
process.

To illustrate the idea of linear approximations we 
consider the quadratic equation 

\[ x^2 - 10x + 1 = 0. \qquad (5) \]

Because the coefficient of the linear term, 10, is large
compared with that of the quadratic term, 1, we can
think of (5) as a linear equation with a small quadratic
perturbation:

\[ x = \frac{1}{10} + \frac{1}{10}x^2. \qquad (6) \]

Indeed, if we solve the linear part we obtain $x = 1/10$,
which leaves a residual of just $1/100$ when substituted
into the left-hand side of (5). We can therefore say that
$x \approx 1/10$ is a reasonable approximation to a root (in
fact, to the smallest root, since the product of the roots
must be 1). Note that this approximation is obtained by
putting $x = 0$ in the right-hand side of (6). To obtain
a better approximation we might try putting $x = 1/10$
into the right-hand side. Repeating this process leads
to the fixed-point iteration

\[ x_{k+1} = \frac{1}{10} + \frac{1}{10}x_k^2, \quad x_0 = 0, \]

which yields $0, 0.10, 0.101, \dots$. After ten iterations we
have $x_{10} = 0.101020514433644$, which is correct to
the fifteen significant digits shown. Of course we could
have obtained this solution as $x = 5 - \sqrt{24}$ using the
quadratic formula, but the linearization approach gives
an instant approximation and provides insight. For the
equation $x^7 - 10x + 1 = 0$, for which there is no explicit
formula for the roots, $1/10$ is an even better approximation
to the smallest root and the analogue of the
iteration above converges even more quickly.
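
The fixed-point iteration for the quadratic $x^2 - 10x + 1 = 0$ takes only a few lines of Python. As a reference value the sketch uses $1/(5 + \sqrt{24})$, which equals $5 - \sqrt{24}$ because the product of the roots is 1 and which avoids the subtraction of nearly equal numbers; this choice of reference is ours.

```python
import math

x = 0.0                                  # x_0 = 0
iterates = [x]
for k in range(10):
    x = (x * x + 1.0) / 10.0             # x_{k+1} = 1/10 + x_k^2/10
    iterates.append(x)

root = 1.0 / (5.0 + math.sqrt(24.0))     # smallest root of x^2 - 10x + 1
print(iterates[:3])
print(f"x_10 = {x:.15f}, reference = {root:.15f}")
```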

Linearization is the key concept underlying Newton’s
method [II.28], which we discussed in section 6.
Suppose we wish to solve a nonlinear system $f(x) = 0$,
where $f\colon \mathbb{R}^n \to \mathbb{R}^n$, and let $x$ be an approximation to a
solution $x_*$. Writing $x_* = x + h$, for sufficiently smooth
$f$ we have $0 = f(x_*) = f(x) + J(x)h + O(\|h\|^2)$, where
$J(x) = (\partial f_i/\partial x_j) \in \mathbb{R}^{n \times n}$ is the Jacobian matrix and
the big-oh term includes the second- and higher-order
terms from a multidimensional Taylor series. Newton’s
method approximates $f$ by the linear part of the series
and so solves the linear system $J(x)h = -f(x)$ in
order to produce a new approximation $x + h$. The process
is iterated, yielding $x_{k+1} = x_k - J(x_k)^{-1} f(x_k)$.
Theorems are available that guarantee when the lin- 
ear approximations of the Newton method are good 
enough to ensure convergence to a solution. Indeed 
the Newton-Kantorovich theorem even uses Newton’s 
method itself to prove the existence of a solution under 
certain conditions. 
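
The following sketch applies Newton's method to a system of two equations, $x^2 + y^2 = 4$ and $xy = 1$, solving the linearized system at each step. The example system, the starting point, the fixed number of iterations, and the use of Cramer's rule for the 2-by-2 solve are assumptions made to keep the code short.

```python
import math


def f(v):
    x, y = v
    return [x * x + y * y - 4.0, x * y - 1.0]


def jacobian(v):
    x, y = v
    return [[2.0 * x, 2.0 * y],
            [y, x]]


def solve2(a, b):
    """Solve a 2-by-2 linear system a h = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - b[0] * a[1][0]) / det]


v = [2.0, 0.5]                                   # initial approximation
for k in range(8):
    fv = f(v)
    h = solve2(jacobian(v), [-fv[0], -fv[1]])    # linearized system J h = -f
    v = [v[0] + h[0], v[1] + h[1]]               # new approximation x + h

print(v, f(v))
```

The solution near the starting point is $x = (\sqrt{6} + \sqrt{2})/2$, $y = (\sqrt{6} - \sqrt{2})/2$, which follows from $(x + y)^2 = 6$ and $(x - y)^2 = 2$.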

An equilibrium point (or critical point) of a nonlinear
autonomous system of ODEs $y'(t) = f(y)$, where
$f\colon \mathbb{R}^n \to \mathbb{R}^n$, is a vector $y_0$ such that $f(y_0) = 0$. For
such a point, $y(t) = y_0$ is a constant solution to the
differential equations. Linear stability analysis determines
the effect of small perturbations away from the
equilibrium point. Let $y(t) = y_0 + h(t)$ with $h(0) = h_0$
small. We wish to determine the behavior of $h(t)$
as $t \to \infty$. A linear approximation to $f$ at $y_0$ yields
$h'(t) = y'(t) = f(y_0) + J(y_0)h = J(y_0)h$. The solution
to this first-order system is $h(t) = e^{J(y_0)t}h_0$, and so the
behavior of $h$ depends on the behavior of the matrix
exponential [II.14] $e^{J(y_0)t}$. In particular, whether
$h(t)$ grows or decays as $t \to \infty$ depends on the real
parts of the eigenvalues of $J(y_0)$. For the case where
$y$ has two components ($n = 2$), it is possible to give
detailed classifications and plots (called phase-plane
portraits) of the different qualitative behaviors that can
occur. For more on the stability of ODEs see ordinary
differential equations [IV.2 §§8, 9].

An example of a nonlinear problem that can be 
linearized exactly, without any approximation, is the 
quadratic eigenvalue problem [IV.10 §5.8].

Many other uses of linearization can be found 
throughout this book. 


9 Recurrence Relations 


A useful tactic for solving a problem whose solution is 
a number or function depending on a parameter is to 
try to derive a recurrence. For example, consider the 
integral 

\[ x_n = \int_0^1 \frac{t^n}{t + 5}\,dt. \]

It is easy to verify that $x_n$ satisfies the recurrence
$x_n + 5x_{n-1} = 1/n$ and $x_0 = \log(6/5)$, so values of $x_n$
can easily be generated from the recurrence. However,
when evaluating a recurrence numerically, one always
needs to be aware of possible instability. Evaluating the
recurrence in IEEE double-precision arithmetic (corresponding
to about sixteen significant decimal digits) we
find that $\hat{x}_{21} = -0.0159\ldots$, where the hat denotes the
computed result. But

\[ \frac{1}{6(n+1)} = \int_0^1 \frac{t^n}{6}\,dt < x_n < \int_0^1 \frac{t^n}{5}\,dt = \frac{1}{5(n+1)} \]

for all $n$, so this result is clearly not even of the right
sign. The cause of the inaccuracy can be seen by considering
the ideal case in which the only error, $\epsilon$ say,
occurs in evaluating $x_0$. That error is multiplied by
$-5$ in computing $x_1$ and by a further factor of $-5$ on
each step of the recurrence; overall, $x_n$ will be contaminated
by an error of $(-5)^n \epsilon$. This is an example
of numerical instability and it is something that recurrences
are prone to. We can obtain a more accurate
result by using the recurrence in the backward direction,
which will result in errors being divided by $-5$, so
that they are damped out. From the inequalities above
we see that for large $n$, $x_n \approx 1/(5(n+1))$. Let us simply
set $x_{20} = 1/105$. Then, using the recurrence backward
in the form $x_{n-1} = (1/n - x_n)/5$, we find that $x_0$ is computed
with a relative error of order $10^{-16}$. For similar
reasons, the recurrence relation in the language of
applied mathematics [I.2 §13] for the Bessel functions
is also used in the backward direction for $x < n$.
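
Both directions of the recurrence $x_n + 5x_{n-1} = 1/n$ for $x_n = \int_0^1 t^n/(t+5)\,dt$ are easy to try in double precision. The function names `forward` and `backward` below are illustrative; the exact digits produced by the forward recurrence may vary slightly between systems, so none are quoted here.

```python
import math


def forward(n):
    """x_n from x_0 = log(6/5) via x_k = 1/k - 5 x_{k-1} (unstable direction)."""
    x = math.log(6.0 / 5.0)
    for k in range(1, n + 1):
        x = 1.0 / k - 5.0 * x
    return x


def backward(n_start, x_start):
    """x_0 from a rough value of x_{n_start} via x_{k-1} = (1/k - x_k)/5."""
    x = x_start
    for k in range(n_start, 0, -1):
        x = (1.0 / k - x) / 5.0
    return x


n = 21
lower, upper = 1.0 / (6 * (n + 1)), 1.0 / (5 * (n + 1))
print(f"forward x_21 = {forward(n):.4f}; true value lies in ({lower:.4f}, {upper:.4f})")

x0 = backward(20, 1.0 / 105.0)
exact = math.log(6.0 / 5.0)
print(f"backward x_0 = {x0:.16f}, relative error = {abs(x0 - exact) / exact:.1e}")
```

The crude starting value $1/105$ is in error by roughly $10^{-3}$, yet twenty divisions by 5 reduce its effect on $x_0$ to below the rounding level.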

10 Lagrange Multipliers 

Optimization problems abound in applied mathemat- 
ics because in many practical situations one wishes 
to maximize a desirable attribute (e.g., profit, or the 
strength of a structure) or minimize something that is 
desired to be small (such as cost or energy). More often 
than not, constraints impose limits on the variables and 
help to balance conflicting requirements. For example, 
in designing a tripod for cameras we may wish to mini- 
mize the weight of the tripod subject to it being able to 
support cameras up to a certain maximal weight, and 
a constraint might be a lower bound on the maximal 
height of the tripod. 

Calculus enables us to characterize and find max- 
ima and minima of functions. In the presence of con- 
straints, though, the standard results are not so helpful. 
Consider the problem in three variables 

\[ \text{minimize } f(x_1, x_2, x_3) \quad \text{subject to } c(x_1, x_2, x_3) = 0, \qquad (7) \]

where the objective function $f$ and constraint function
$c$ are scalars. We know that any minimizer of the unconstrained
problem $\min f(x_1, x_2, x_3)$ has to have a zero
gradient; that is, $\nabla f(x) = [\partial f/\partial x_1, \partial f/\partial x_2, \partial f/\partial x_3]^T$
must be the zero vector. How can we take account of
the constraint $c(x_1, x_2, x_3) = 0$?

Let $x_* \in \mathbb{R}^3$ be a feasible point, that is, a point satisfying
the constraint $c(x_*) = 0$. Consider a smooth
curve $z(t)$ with $z(0) = x_*$ that remains on the constraint,
that is, $c(z(t)) = 0$ for all sufficiently small $t$.
Differentiating the latter equation and using the chain
rule gives $(dz(t)/dt)^T \nabla c(z(t)) = 0$. Setting $t = 0$ and
putting $p_* = dz/dt|_{t=0}$ gives

\[ p_*^T \nabla c(x_*) = 0. \qquad (8) \]

For $x_*$ to be optimal, the rate of change of $f$ along $z$
must be zero at $x_*$, so, using the chain rule again,

\[ 0 = \frac{d}{dt} f(z(t)) \Big|_{t=0} = \sum_{i=1}^{3} \frac{\partial f}{\partial z_i} \frac{dz_i}{dt} \Big|_{t=0} = \nabla f(x_*)^T p_*. \qquad (9) \]


Now assume that $\nabla c(x_*) \ne 0$, which is known as a
constraint qualification. This assumption ensures that
every vector $p_*$ satisfying (8) is the tangent at $t = 0$ to
some curve $z(t)$. Since (9) then holds for every $p_*$
satisfying (8), the gradient $\nabla f(x_*)$ is orthogonal to every
vector that is orthogonal to $\nabla c(x_*)$, and it follows that

\[ \nabla f(x_*) = \lambda_* \nabla c(x_*) \qquad (10) \]

for some scalar $\lambda_*$. The scalar $\lambda_*$ is called a Lagrange
multiplier. The constraint equation $c(x) = 0$ and (10)
together constitute four equations in four unknowns,
$x_1$, $x_2$, $x_3$, and $\lambda$. We have therefore reduced the original
constrained minimization problem to a nonlinear
system of equations. The latter system can be solved 
by any means at our disposal, though being nonlinear 
it is not necessarily an easy problem. 

Another way to express our findings is in terms of
the Lagrangian function $L(x, \lambda) = f(x) - \lambda c(x)$. Since
$\nabla_x L(x, \lambda) = \nabla f(x) - \lambda \nabla c(x)$, the Lagrange multiplier
condition (10) says that the solution $x_*$ is a stationary
point of $L$ with respect to $x$ when $\lambda = \lambda_*$. Moreover,
$\nabla_\lambda L(x, \lambda) = -c(x)$, so stationarity of $L$ with respect to
$\lambda$ expresses the constraint $c(x) = 0$.

The development above was presented for a problem 
with three variables and one constraint, but it generalizes
in a straightforward way to $n$ variables and $m$
constraints, with $\lambda$ becoming an $m$-vector of Lagrange
multipliers.

Let us see how Lagrange multipliers help us to solve 
the problem 


\[ \text{maximize } 8xyz \quad \text{subject to } \frac{x^2}{a^2} + \frac{y^2}{b^2} + \frac{z^2}{c^2} = 1, \]

which defines the maximum rectangular block that fits
inside the specified ellipsoid. Although our original
problem (7) is a minimization problem, there is nothing
in the development of (10) that is specific to minimization,
and in fact the latter equation must be satisfied
at any stationary point, so we can use it here. Rescaling
the variables by replacing $x$, $y$, $z$ with $x/a$, $y/b$, $z/c$,
the problem simplifies to

\[ \text{maximize } 8abcxyz \quad \text{subject to } x^2 + y^2 + z^2 = 1. \]


The Lagrange multiplier condition is 



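\[ 8abc \begin{bmatrix} yz \\ xz \\ xy \end{bmatrix} = \lambda \begin{bmatrix} 2x \\ 2y \\ 2z \end{bmatrix}. \]

It is easily seen that these equations yield $x = y = z = 1/\sqrt{3}$
(and $\lambda_* = 4abc/\sqrt{3}$) and that the corresponding
volume is $8abc/(3\sqrt{3})$. It is intuitively clear that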
this is a maximum, though in general checking for opti- 
mality requires further analysis involving inspection of 
second derivatives. 

Lagrange multipliers and the Lagrangian function are 
widely used in applied mathematics in a variety of settings,
including the calculus of variations [IV.6] and
linear and nonlinear optimization [IV.11]. One of
the reasons for the importance of Lagrange multipliers
is that they quantify the sensitivity of the optimal
value to perturbations in the constraints. We can check
this for our problem. If we perturb the constraint to
$x^2/a^2 + y^2/b^2 + z^2/c^2 = 1 + \epsilon$, then it is easy to see that
the solution is $V(\epsilon) = 8abc((1 + \epsilon)/3)^{3/2}$, and hence
$V'(0) = 4abc/\sqrt{3} = \lambda_*$.
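
These claims can be checked numerically. The sketch below works in the scaled variables, takes arbitrary semi-axes $a, b, c$, verifies the multiplier condition at $x = y = z = 1/\sqrt{3}$, and compares a central-difference estimate of $V'(0)$ with the multiplier. It assumes the perturbed optimal volume $V(\epsilon) = 8abc((1 + \epsilon)/3)^{3/2}$, obtained by scaling the optimal block by $\sqrt{1 + \epsilon}$ in each direction.

```python
import math

a, b, c = 3.0, 2.0, 1.0        # semi-axes, chosen arbitrarily for illustration
abc = a * b * c

# Candidate stationary point in the scaled variables and its multiplier.
x = y = z = 1.0 / math.sqrt(3.0)
lam = 4.0 * abc / math.sqrt(3.0)

grad_f = [8.0 * abc * y * z, 8.0 * abc * x * z, 8.0 * abc * x * y]
grad_c = [2.0 * x, 2.0 * y, 2.0 * z]
residual = max(abs(gf - lam * gc) for gf, gc in zip(grad_f, grad_c))


def volume(eps):
    """Optimal volume when the constraint right-hand side is 1 + eps."""
    return 8.0 * abc * ((1.0 + eps) / 3.0) ** 1.5


d = 1e-6
sensitivity = (volume(d) - volume(-d)) / (2.0 * d)   # central difference for V'(0)
print(residual, sensitivity, lam)
```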

11 Tricks and Techniques 

As well as the general ideas and principles described in 
this article, applied mathematicians have at their dis- 
posal their own bags of tricks and techniques, which 
they bring into play when experience suggests they 
might be useful. Some will work only on very specific 
problems. Others might be nonrigorous but able to give 
useful insight. George Pólya is quoted as saying, “A
trick used three times becomes a standard technique.” 
Here are a few examples of tricks and techniques that 
prove useful on many different occasions, along with a 
very simple example in each case. 

Use symmetry. When a problem has certain symme- 
tries one can often argue that these must carry over into 
the solution. For example, the maximization problem at 
the end of the previous section is symmetric in $x$, $y$,
and $z$, so one can argue that we must have $x = y = z$
at the solution.

Add and subtract a term, or multiply and divide by a
term. As a very simple example, if $A$ and $B$ are $n \times n$
matrices with $A$ nonsingular, then $AB = AB \cdot AA^{-1} = A(BA)A^{-1}$,
which shows that $AB$ and $BA$ are similar and
that they therefore have the same eigenvalues. A common
scenario is that $\hat{x}$ is an approximation to $x$ whose
error cannot be directly estimated, but one can find
another approximation $\tilde{x}$ whose relation to $x$ and $\hat{x}$ is
understood. One then writes $x - \hat{x} = (x - \tilde{x}) + (\tilde{x} - \hat{x})$
and thereby obtains, using the triangle inequality, the
bound $\|x - \hat{x}\| \le \|x - \tilde{x}\| + \|\tilde{x} - \hat{x}\|$. For example, $x$
might be the solution to a PDE, $\tilde{x}$ the solution to a
discretization of the PDE, and $\hat{x}$ an approximate solution
to the discretized problem.

Consider special cases. Insight is often obtained by 
looking at special cases, such as a polynomial in place 
of a general function, or a diagonal matrix instead of a 
full matrix, or n = 2 or 3 in a problem in n dimensions. 

Transform the problem. It is always worth consider- 
ing whether the problem is stated in the best way. As 
a simple example, suppose we are asked to find solutions
of an equation of the form $1/f(x) = a$. If $f$ is a
nearly linear function the problem is better written as
$f(x) = 1/a$, and for computing a numerical solution
Newton’s method should work very well. 

Proof by contradiction. A classic technique is to 
prove a result by assuming that the result is false 
and then obtaining a contradiction. Sometimes this 
approach can be used to show that an equation has 
no solution. For example, if the particular Sylvester
equation [III.28] in $n \times n$ matrices $AX - XA = I$ has a
solution, then
$0 = \mathrm{trace}(AX) - \mathrm{trace}(XA) = \mathrm{trace}(AX - XA) = \mathrm{trace}(I) = n$,
which is a contradiction; we
conclude that the equation has no solution.

Going into the complex plane. It is sometimes possi- 
ble to solve a problem posed on the real line by mak- 
ing an excursion into the complex plane. This tactic 
is often used to evaluate real integrals by using the
Cauchy integral formula or the residue theorem
[IV.1 §15]. For another example, let $f$ be an analytic
function that is real for real arguments. Then, for real
$x$,

\[ f'(x) \approx \frac{\mathrm{Im}\, f(x + \mathrm{i}h)}{h}, \]

where $\mathrm{i}$ is the imaginary unit and $h$ is a small real
parameter. This complex step approximation has error
$O(h^2)$. Here is an example with $f(x) = \cos x$ in
MATLAB:

>> x = pi/6; h = 1e-8;
>> fdash_cs = imag( cos(x + i*h) )/h;
>> error = fdash_cs - (-sin(x))
error =
  -5.5511e-17
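
For comparison, here is a Python sketch of the same computation, set beside an ordinary forward difference with the same $h$. The forward difference is our addition; it is included only to show why the complex step is attractive, since it involves no subtraction of nearly equal function values.

```python
import cmath
import math

x, h = math.pi / 6, 1e-8
exact = -math.sin(x)                                   # derivative of cos at x

complex_step = cmath.cos(complex(x, h)).imag / h       # Im f(x + ih) / h
forward_diff = (math.cos(x + h) - math.cos(x)) / h     # (f(x + h) - f(x)) / h

print(f"complex step error:       {abs(complex_step - exact):.1e}")
print(f"forward difference error: {abs(forward_diff - exact):.1e}")
```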

Finally, there are various basic techniques that are
learned in elementary courses and are always useful,
such as integration by parts, use of the Cauchy-Schwarz
inequality, and interchange of the order of
integration in a double integral or of summation in a
double sum.


I.4 Algorithms

Nicholas J. Higham 


An algorithm is a procedure for accomplishing a certain 
task in a finite number of steps. 

The terms “algorithm” and “method” are often used 
interchangeably, and there is no clear consensus on the 
distinction between them. However, many authors use 
“method” very generally and reserve “algorithm” for a 
procedure in which every step is precisely defined. For 
example, Newton’s method solves a system of $n$ nonlinear
equations in $n$ variables, $F(x) = 0$, using the iteration
$x_{k+1} = x_k - J(x_k)^{-1}F(x_k)$, where $J$ is the $n \times n$
Jacobian matrix of $F$. An algorithm implementing Newton’s
method has to define how to compute the correction
terms $J(x_k)^{-1}F(x_k)$ (e.g., by Gaussian elimination
with partial pivoting), what to do if $J(x_k)$ is singular,
and when to terminate the iteration (e.g., when $\|F(x_k)\|$
is less than some convergence tolerance).

Developing algorithms has always been an important 
activity in applied mathematics. George Forsythe put it 
well when he said: 

A useful algorithm is a substantial contribution to 

knowledge. Its publication constitutes an important 

piece of scholarship. 

This statement was made in the 1960s, when the 
emergence of standards for programming languages 
[VII.11] had started to make it feasible to publish and
share computer implementations of algorithms. 

1 Summation 

Let us begin with the problem of forming the sum of $n$
numbers, $x_1 + x_2 + \cdots + x_n$, on a computer. This problem
is deceptively simple, partly because the problem
statement is so suggestive of an algorithm. But note 
that the numbers can be added in any order and that 
parentheses can be inserted in many ways to break up 
the overall sum into smaller sums. So there are many 
ways in which the sum can be computed, and if we 
simply choose one of them without thinking, we may 
miss a better alternative. Thinking about the summa- 
tion process leads to the notion of repeatedly adding 
two numbers from the set then putting the sum back 
into the set. This process can be expressed formally in 
the following algorithm.

Algorithm 1. Given numbers $x_1, \dots, x_n$ this algorithm
computes $S_n = \sum_{i=1}^n x_i$.

1 Let $X = \{x_1, \dots, x_n\}$.
2 while $X$ contains more than one element
3     Remove two numbers $y$ and $z$ from $X$
      and put their sum $y + z$ back in $X$.
4 end
5 Assign the remaining element of $X$ to $S_n$.

This algorithm is written in pseudocode, which refers 
to any informal syntax that resembles a programming 
language. The purpose of pseudocode is to specify an 
algorithm in a concise, readable way without having to 
worry about the details of any particular programming 
language. 

It is clear that after execution of line 3, the number 
of elements of X has decreased by one. The algorithm 
therefore terminates after exactly n - 1 iterations of 
the while loop, at which point it has carried out n - 1 
additions. 

Clearly, different ways of choosing $y$ and $z$ on line 3
lead to different summation algorithms. The most obvious
choice is to take $y$ as the sum computed at the
previous step and $z$ as the element with smallest index
(and $y = x_1$ and $z = x_2$ at the first step). The resulting
algorithm is known as recursive summation and can be 
expressed as 

1 $s = x_1$
2 for $i = 2\colon n$ % $i$ goes from 2 to $n$ in steps of 1
3     $s = s + x_i$
4 end
5 $S_n = s$

In this pseudocode everything from a % sign to the end 
of a line is a comment. 

An alternative is to add the elements pairwise, then 
add their sums pairwise, and so on, repeatedly. For n = 
8 this corresponds to the formula 

\[ S_8 = [(x_1 + x_2) + (x_3 + x_4)] + [(x_5 + x_6) + (x_7 + x_8)]. \]

This summation algorithm is forming a binary tree
[II.16], as illustrated in figure 1. An attractive feature
of pairwise summation is that on a parallel computer
the summations on each level of the tree can be done
in parallel, so the computation time is proportional to
$\log_2 n$ instead of $n$.

A third summation algorithm, called the insertion 
algorithm, takes the summands y and z as those with 
the smallest absolute value at each stage. 


Figure 1 A binary tree for pairwise summation,
with partial sums $P_i$.


One reason for considering these different summa- 
tion algorithms as special cases of algorithm 1 is that a 
common analysis can be done of the effects of rounding 
errors in floating-point arithmetic. Express the $i$th execution
of the while loop in algorithm 1 as $P_i = y_i + z_i$.
The standard model of floating-point arithmetic (equation
(2) in floating-point arithmetic [II.13]) says
that the computed $\hat{P}_i$ satisfies

\[ \hat{P}_i = \frac{y_i + z_i}{1 + \delta_i}, \quad |\delta_i| \le u, \quad i = 1\colon n-1, \]

where $u$ is the unit roundoff. (The model actually says
we should multiply by $1 + \delta_i$, but it can be shown to be
equally valid to divide by $1 + \delta_i$, which is more convenient
here.) The error introduced in forming $\hat{P}_i$, namely
$y_i + z_i - \hat{P}_i$, is therefore $\delta_i \hat{P}_i$. By summing each of these
individual errors we obtain the overall error,

\[ E_n := S_n - \hat{S}_n = \sum_{i=1}^{n-1} \delta_i \hat{P}_i, \]

which is bounded by

\[ |E_n| \le u \sum_{i=1}^{n-1} |\hat{P}_i|. \]

This bound provides the insight that to maximize the 
accuracy of the sum it should be helpful to minimize 
the size of the intermediate partial sums. The insertion 
algorithm can be seen as a heuristic way of doing that. 

When the $x_i$ are all nonnegative and we use recursive
summation, $\hat{P}_i = \sum_{j=1}^{i+1} x_j + O(u)$, so the bound for $|E_n|$ is
minimized if the $x_i$ are arranged in increasing order.
Moreover, since $\hat{P}_i \le S_n + O(u)$, the bound implies
$|E_n| \le (n-1)uS_n + O(u^2)$ for any ordering, which guarantees
a relative error of order $nu$.
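
The three summation algorithms can be sketched in Python as follows. The test data $x_k = 1/k^2$ are an arbitrary nonnegative example, `math.fsum` (which returns a correctly rounded sum) serves as the reference, and the recursive halving in `pairwise_sum` agrees with the formula given for $n = 8$ but is only one of several ways to pair terms for other $n$.

```python
import heapq
import math


def recursive_sum(xs):
    """Recursive summation: add the terms in the order given."""
    s = xs[0]
    for xi in xs[1:]:
        s = s + xi
    return s


def pairwise_sum(xs):
    """Pairwise summation by recursive halving of the list."""
    if len(xs) == 1:
        return xs[0]
    mid = (len(xs) + 1) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])


def insertion_sum(xs):
    """Insertion algorithm: always add the two numbers of smallest magnitude."""
    heap = [(abs(xi), xi) for xi in xs]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, y = heapq.heappop(heap)
        _, z = heapq.heappop(heap)
        heapq.heappush(heap, (abs(y + z), y + z))
    return heap[0][1]


xs = [1.0 / k**2 for k in range(1, 10001)]   # nonnegative test data
exact = math.fsum(xs)
results = {
    "recursive, given order": recursive_sum(xs),
    "recursive, increasing order": recursive_sum(sorted(xs)),
    "pairwise": pairwise_sum(xs),
    "insertion": insertion_sum(xs),
}
for name, s in results.items():
    print(f"{name:28s} error = {abs(s - exact):.2e}")
```

All four are instances of algorithm 1, so for nonnegative data each error should respect the bound $(n-1)uS_n + O(u^2)$, whatever the relative ranking turns out to be on a particular data set.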

This summation example shows that taking a bird’s 
eye view of a computational problem can be benefi- 
cial, in that it can reveal algorithmic possibilities that 
might have been missed and can help a single analysis
to be developed that applies to a wide range of
problems. Of course, algorithm 1 does not cover all
the possibilities. Another way to compute the sum is
as $S_n = \log \prod_{i=1}^n e^{x_i}$. This formula has little to recommend
it, but it is not so different from the expression
$\exp(n^{-1} \sum_{i=1}^n \log X_i)$, which is a log-Euclidean mean of
the $X_i$ that has applications when the $X_i$ are structured
matrices or operators.

2 Bisection 

The summation problem is unusual in that there is no 
difficulty in seeing the correctness of algorithm 1 or 
its computational cost. A slightly trickier algorithm is 
the bisection algorithm for finding a zero of a continuous
function $f(x)$. The bisection algorithm takes as
input an interval $[a,b]$ such that $f(a)f(b) < 0$; the
intermediate-value theorem tells us that there must be
a zero of $f$ on this interval. The bisection algorithm
repeatedly halves the interval and retains the half on
which $f$ has different signs at the endpoints, that is,
the interval on which we can be sure there is a zero. To
make the algorithm finite we need a stopping criterion. 
The following algorithm terminates once the interval is 
of length at most tol, a given tolerance. 

Algorithm 2 (bisection algorithm). This algorithm
finds a zero of a continuous function $f(x)$ given an
interval $[a,b]$ such that $f(a)f(b) < 0$ and an error
tolerance tol.

1 while $b - a >$ tol
2     $c = (a + b)/2$
3     if $f(c) = 0$, quit, end
4     if $f(c)f(b) < 0$
5         $a = c$
6     else
7         $b = c$
8     end
9 end
10 $x = (a + b)/2$

To show the correctness of this algorithm note first
that at the end of the while loop $f(a)f(b) < 0$ still
holds; in other words, this inequality is an invariant
of the loop. Therefore we have a sequence of intervals
each of length half the previous interval and all containing
a zero. This means that after $k$ steps we have an
interval of length $(b - a)/2^k$ containing a zero. The algorithm
therefore terminates after $\lceil \log_2(|b - a|/\mathrm{tol}) \rceil$
steps. Here, we are using the ceiling function $\lceil x \rceil$, which
is the smallest integer greater than or equal to $x$. In the
next section we will also need the floor function $\lfloor x \rfloor$,
which is the largest integer less than or equal to $x$.

The algorithm returns as the approximate zero the 
midpoint of the final interval, which has length at most 
tol; since a zero lies in this interval, the absolute error 
is at most tol/2. 

Algorithm 2 needs a number of refinements to make 
it more reliable and efficient for practical use. First, 
testing whether $f(c)$ and $f(b)$ have opposite signs
should not be done by multiplying them, as the product
could overflow or underflow in floating-point arithmetic.
Instead, the signs should be directly compared.
Second, $f(c)$ should not be computed twice, on lines 3
and 4, but rather computed once and its value reused.
Finally, the convergence test is an absolute one, so
is scale dependent. A better alternative is $|b - a| > \mathrm{tol}(|a| + |b|)$,
which is unaffected by scalings $a \leftarrow \theta a$,
$b \leftarrow \theta b$.
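
A Python sketch of bisection with these three refinements is given below. The function name `bisect`, the default tolerance, and the extra safeguard that stops when the midpoint coincides with an endpoint are our own choices and are not part of algorithm 2.

```python
import math


def bisect(f, a, b, tol=1e-12):
    """Bisection with sign comparison, one f evaluation per step, relative test.

    Assumes f is continuous and f(a), f(b) have opposite signs.
    """
    fa, fb = f(a), f(b)
    if fa == 0.0:
        return a
    if fb == 0.0:
        return b
    if (fa < 0) == (fb < 0):
        raise ValueError("f(a) and f(b) must have opposite signs")
    while abs(b - a) > tol * (abs(a) + abs(b)):   # relative stopping test
        c = a + (b - a) / 2
        if c == a or c == b:                      # interval cannot be halved further
            break
        fc = f(c)                                 # evaluated once per step
        if fc == 0.0:
            return c
        if (fc < 0) != (fb < 0):                  # compare signs, do not multiply
            a = c                                 # zero lies in [c, b]
        else:
            b, fb = c, fc                         # zero lies in [a, c]
    return a + (b - a) / 2


root = bisect(lambda x: x * x - 2.0, 1.0, 2.0)
print(root, math.sqrt(2.0))
```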

Bisection is a widely applicable technique. For exam- 
ple, it can be used to search an ordered list to see if a 
given element is contained in the list; here it is known as 
binary search. It is also used for debugging. If the LaTeX
source for this article fails to compile and I cannot spot
the error, I will move the \end{document} command
to the middle of the file and try again; I can thereby
determine in which half of the file the error lies and
can repeat the process to narrow the error down.

3 Divide and Conquer 

The divide and conquer principle breaks a problem 
down into two (or more) equally sized subproblems and 
solves each subproblem recursively. 

An example of how divide and conquer can be 
exploited is in the computation of a large integer power 
of a number. Computing $x^n$ in the obvious way takes
$n - 1$ multiplications. But $x^{13}$, for example, can be
written $x^8 x^4 x$, which can be evaluated in just five
multiplications instead of twelve by first forming $x^2$,
$x^4 = (x^2)^2$, and $x^8 = (x^4)^2$. Notice that $13 = (1101)_2$
in base 2, and in general the base 2 representation of
$n$ tells us exactly how to break down the computation
of $x^n$ into products of terms $x^{2^k}$. However, by expressing
the computation using divide and conquer we can
avoid the need to compute the binary representation
of $n$. The idea is to write $x^n = (x^{n/2})^2$ if $n$ is even and
$x^n = x(x^{\lfloor n/2 \rfloor})^2$ if $n$ is odd. In either case the problem
is reduced to one of half the size. The resulting algo- 
rithm is most elegantly expressed in recursive form, as 
an algorithm that calls itself. 

Algorithm 3. This algorithm computes x^n for a positive
integer n.

function y = power(x, n)
  if n = 1, y = x, return
  if n is odd
    y = x * power(x, (n - 1)/2)^2   % Recursive call
  else
    y = power(x, n/2)^2             % Recursive call
  end


The number of multiplications required by algorithm 3
is bounded above by 2⌊log_2 n⌋.
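
A minimal Python sketch of the recursive power algorithm, instrumented to count multiplications so that the bound can be checked for small n. The tuple return value is an assumption made for illustration only.

```python
def power(x, n):
    """Return (x**n, multiplications used) for a positive integer n."""
    if n == 1:
        return x, 0
    # n // 2 equals n/2 for even n and (n - 1)/2 for odd n.
    half, count = power(x, n // 2)
    y = half * half          # one squaring
    count += 1
    if n % 2 == 1:
        y = x * y            # one extra multiplication when n is odd
        count += 1
    return y, count


value, mults = power(2, 13)
```

For n = 13 the recursion visits n = 6, 3, 1, which gives three squarings and two extra multiplications.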

Another example of how divide and conquer can
be used is for computing the inverse of a nonsingular
upper triangular matrix, T ∈ C^(n×n). Write T in
partitioned form as

\[
T = \begin{bmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{bmatrix},
\tag{1}
\]

where T_11 has dimension ⌈n/2⌉. It is easy to check that

\[
T^{-1} = \begin{bmatrix}
  T_{11}^{-1} & -T_{11}^{-1} T_{12} T_{22}^{-1} \\
  0 & T_{22}^{-1}
\end{bmatrix}.
\]

This formula reduces the problem to the computation
of the inverses of two smaller matrices, namely, the
diagonal blocks T_11 and T_22, and their inverses can be
expressed in the same way. The process can be repeated
until scalars are reached and the inversion is trivial.


Algorithm 4. This algorithm computes the inverse of
a nonsingular upper triangular matrix T by divide and
conquer.

function U = inv(T)
  n = dimension of T
  if n = 1, u_11 = 1/t_11, return
  Partition T according to (1), where T_11 has
    dimension ⌈n/2⌉.
  U_11 = inv(T_11)   % Recursive call
  U_22 = inv(T_22)   % Recursive call
  U_12 = -U_11 T_12 U_22


Let us now work out the computational cost of this
algorithm, in flops, where a flop is a multiplication,
addition, subtraction, or division. Denote the cost of
calling inv for an n × n matrix by c_n and assume for
simplicity that n = 2^k. We then have c_n = 2c_(n/2) + 2(n/2)^3,
where the second term is the cost of forming the
triangular-full-triangular product U_11 T_12 U_22 of
matrices of dimension n/2. Solving this recurrence gives
c_n = n^3/3 + O(n^2), which is the same as the cost of
inverting a triangular matrix by standard techniques
such as solving TX = I by substitution.
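
The divide and conquer inversion can be sketched in plain Python with matrices stored as lists of rows. The helper mat_mul and the name tri_inv are hypothetical, and the product is formed as a general matrix product rather than the cheaper triangular-full-triangular product assumed in the flop count, so this illustrates the recursion only.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def tri_inv(T):
    """Inverse of a nonsingular upper triangular matrix T."""
    n = len(T)
    if n == 1:
        return [[1.0 / T[0][0]]]
    m = (n + 1) // 2                      # leading block has dimension ceil(n/2)
    T11 = [row[:m] for row in T[:m]]
    T12 = [row[m:] for row in T[:m]]
    T22 = [row[m:] for row in T[m:]]
    U11 = tri_inv(T11)                    # recursive call
    U22 = tri_inv(T22)                    # recursive call
    U12 = [[-v for v in row] for row in mat_mul(mat_mul(U11, T12), U22)]
    top = [U11[i] + U12[i] for i in range(m)]
    bottom = [[0.0] * m + U22[i] for i in range(n - m)]
    return top + bottom


U = tri_inv([[2.0, 1.0], [0.0, 4.0]])
```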

As these examples show, recursion is a powerful way 
to express algorithms. But it is not always the right tool. 
To illustrate, consider the Fibonacci numbers, 1, 1, 2, 3,
5, ..., which satisfy the recurrence f_n = f_(n-1) + f_(n-2) for
n ≥ 2, with f_0 = f_1 = 1. The obvious way to express
the computation of the f_i is as a loop:

f_0 = 1, f_1 = 1
for i = 2 : n
  f_i = f_(i-1) + f_(i-2)
end

If just f_n is required then an alternative is the recursive
function

function f = fib(n)
  if n ≤ 1
    f = 1
  else
    f = fib(n - 1) + fib(n - 2)
  end

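The problem with this recursion is that it computes
fib(n - 1) and fib(n - 2) independently instead of
obtaining fib(n - 1) from fib(n - 2) with one addition
as in the previous algorithm. In fact, the evaluation of
fib(n) requires f_n ≈ 1.6^n operations, so the recursive
algorithm is exponential in cost versus the linear cost
of the first algorithm. It is possible to compute f_n with
only logarithmic cost. The idea is to write

\[
\begin{bmatrix} f_n \\ f_{n-1} \end{bmatrix}
= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}
  \begin{bmatrix} f_{n-1} \\ f_{n-2} \end{bmatrix}
= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{2}
  \begin{bmatrix} f_{n-2} \\ f_{n-3} \end{bmatrix}
= \cdots
= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}^{n-1}
  \begin{bmatrix} f_1 \\ f_0 \end{bmatrix}.
\]

The matrix [1 1; 1 0]^(n-1) can be computed in O(log_2 n)
operations using the analogue for matrices of algorithm 3.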
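
The loop and the logarithmic-cost matrix formulation can be compared in a short Python sketch. The names fib_loop, fib_matrix, mat_mul, and mat_power are hypothetical, and the convention f_0 = f_1 = 1 from the text is assumed.

```python
def mat_mul(A, B):
    """Product of two 2 x 2 matrices stored as nested lists."""
    return [[A[0][0] * B[0][0] + A[0][1] * B[1][0],
             A[0][0] * B[0][1] + A[0][1] * B[1][1]],
            [A[1][0] * B[0][0] + A[1][1] * B[1][0],
             A[1][0] * B[0][1] + A[1][1] * B[1][1]]]


def mat_power(A, n):
    """A**n for n >= 1 by the matrix analogue of recursive powering."""
    if n == 1:
        return A
    half = mat_power(A, n // 2)
    square = mat_mul(half, half)
    return mat_mul(A, square) if n % 2 == 1 else square


def fib_loop(n):
    """f_n by the linear-cost loop."""
    prev, cur = 1, 1                      # f_0, f_1
    for _ in range(2, n + 1):
        prev, cur = cur, prev + cur
    return cur


def fib_matrix(n):
    """f_n from [1 1; 1 0]**(n-1) applied to the vector [f_1, f_0]."""
    if n <= 1:
        return 1
    P = mat_power([[1, 1], [1, 0]], n - 1)
    return P[0][0] + P[0][1]              # first row times [1, 1]
```

Python integers have arbitrary precision, so both functions return exact values; the cost of each big-integer multiplication still grows with n, which the operation counts in the text do not account for.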

A divide and conquer algorithm can break the problem
into more than two subproblems. An example is the
Karatsuba algorithm for multiplying two n-digit integers
x and y. Suppose n is a power of 2 and write
x = x_1 10^(n/2) + x_2, y = y_1 10^(n/2) + y_2, where x_1, x_2,
y_1, and y_2 are n/2-digit integers. Then

\[
xy = x_1 y_1 10^{n} + (x_1 y_2 + x_2 y_1) 10^{n/2} + x_2 y_2.
\]

Computing xy has been reduced to computing three
half-sized products because x_1 y_2 + x_2 y_1 = (x_1 +
x_2)(y_1 + y_2) - x_1 y_1 - x_2 y_2. This procedure can be
applied recursively. Denoting by C_n the number of
arithmetic operations (on single-digit numbers) to form
the product of two n-digit integers by this algorithm,
we have C_n = 3C_(n/2) + kn and C_1 = 1, where kn is the
cost of the additions. Then

\[
\begin{aligned}
C_n &= 3(3C_{n/4} + kn/2) + kn \\
    &= 3(3(3C_{n/8} + kn/4) + kn/2) + kn \\
    &= kn\bigl(1 + 3/2 + (3/2)^2 + \cdots + (3/2)^{\log_2 n}\bigr) \\
    &\approx 3kn^{\log_2 3} \approx 3kn^{1.58},
\end{aligned}
\]

where the approximation is obtained by assuming that
n is a power of 2. The cost is asymptotically less
than the O(n^2) cost of forming xy by the usual long
multiplication method taught in school.
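
A minimal Python sketch of the Karatsuba recursion for nonnegative integers. Splitting at half the larger digit count, rather than requiring n to be a power of 2, is an assumption made so that the function accepts arbitrary inputs; the name karatsuba is illustrative.

```python
def karatsuba(x, y):
    """Product of nonnegative integers using three half-sized products."""
    if x < 10 or y < 10:
        return x * y                       # single-digit base case
    m = max(len(str(x)), len(str(y))) // 2
    base = 10 ** m
    x1, x2 = divmod(x, base)               # x = x1 * 10**m + x2
    y1, y2 = divmod(y, base)               # y = y1 * 10**m + y2
    high = karatsuba(x1, y1)
    low = karatsuba(x2, y2)
    # x1*y2 + x2*y1 obtained from one further product.
    cross = karatsuba(x1 + x2, y1 + y2) - high - low
    return high * base * base + cross * base + low


product = karatsuba(1234, 5678)
```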

4 Computational Complexity 

The computational cost of an algorithm is usually 
defined as the total number of arithmetic operations 
it requires, though it can also be defined as the execu- 
tion time, under some assumption on the time required 
for each arithmetic operation. The cost is usually a 
function of the problem size, n say, and since the 
growth with n is of particular interest, the cost is 
usually approximated by the highest-order term, with 
lower-order terms ignored. 

The algorithms considered so far all have the prop- 
erty that their computational cost is straightforward 
to evaluate and essentially independent of the data. 
For many algorithms the cost can vary greatly with the 
data. For example, an algorithm to sort a list of num- 
bers might run more quickly when the list is nearly 
sorted. In this case it is desirable to find a bound that 
applies in all cases (a worst-case bound)— preferably 
one that is attainable for some set of data. It is also use- 
ful to have estimates of cost under certain assumptions 
on the distribution of the data. In average-case analy- 
sis, a probability distribution is assumed for the data 
and the expected cost is determined. Smoothed analy- 
sis, developed since 2000, interpolates between worst- 
case analysis and average-case analysis by measuring 
the expected performance of algorithms under small 
random perturbations of worst-case inputs. A number 
of algorithms are known for which the worst-case cost 
is exponential in the problem dimension n whereas the 
smoothed cost is polynomial in n, a prominent example
being the simplex method [IV.11 §3.1] for linear
programming.

A good example of a problem for which different 
algorithms can have widely varying cost is the solution 
of a linear system Ax = b, where A is an n × n matrix.
Cramer’s rule states that x_i = det(A_i(b))/det(A),
where A_i(b) denotes A with its ith column replaced by
b. If the determinant is evaluated from the usual textbook
formula involving expansion by minors [I.2 §18],
the cost of computing x is about (n + 1)! operations,
making this method impractical unless n is very small.
By contrast, Gaussian elimination solves the system
in 2n^3/3 + O(n^2) operations, with mere polynomial
growth of the operation count with n. However, Gaussian
elimination is by no means of optimal complexity,
as we now explain. 

The complexity of matrix inversion can be shown 
to be the same as that of matrix multiplication, so it 
suffices to consider the matrix multiplication problem 
C = AB for n × n matrices A and B. The usual formula
for matrix multiplication yields C in 2n^3 operations. In
a 1969 paper Volker Strassen showed that when n = 2
the product can be computed from the formulas

\[
\begin{gathered}
p_1 = (a_{11} + a_{22})(b_{11} + b_{22}), \\
p_2 = (a_{21} + a_{22}) b_{11}, \qquad p_3 = a_{11}(b_{12} - b_{22}), \\
p_4 = a_{22}(b_{21} - b_{11}), \qquad p_5 = (a_{11} + a_{12}) b_{22}, \\
p_6 = (a_{21} - a_{11})(b_{11} + b_{12}), \\
p_7 = (a_{12} - a_{22})(b_{21} + b_{22}), \\
C = \begin{bmatrix}
  p_1 + p_4 - p_5 + p_7 & p_3 + p_5 \\
  p_2 + p_4 & p_1 + p_3 - p_2 + p_6
\end{bmatrix}.
\end{gathered}
\]
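
Strassen's seven products for the 2 × 2 case can be checked numerically against the usual formula with a few lines of Python. The function names strassen_2x2 and standard_2x2 are hypothetical, and scalar entries are assumed; in the recursive algorithm the entries would be matrix blocks.

```python
def strassen_2x2(A, B):
    """2 x 2 product using seven multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    p1 = (a11 + a22) * (b11 + b22)
    p2 = (a21 + a22) * b11
    p3 = a11 * (b12 - b22)
    p4 = a22 * (b21 - b11)
    p5 = (a11 + a12) * b22
    p6 = (a21 - a11) * (b11 + b12)
    p7 = (a12 - a22) * (b21 + b22)
    return [[p1 + p4 - p5 + p7, p3 + p5],
            [p2 + p4, p1 + p3 - p2 + p6]]


def standard_2x2(A, B):
    """2 x 2 product from the usual formula (eight multiplications)."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


C = strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

The multiplications always have an entry of A on the left and an entry of B on the right, which is why the formulas remain valid for noncommuting blocks.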


The evaluation requires seven multiplications and eighteen
additions instead of eight multiplications and
four additions for the usual formulas. At first sight,
this does not appear to be an improvement. However,
these formulas do not rely on commutativity so are
valid when the a_ij and b_ij are matrices, in which case
for large dimensions the saving of one multiplication
greatly outweighs the extra fourteen additions. Assuming
n is a power of 2, we can partition A and B into
four blocks of size n/2, apply Strassen’s formulas
for the multiplication, and then apply the same formulas
recursively on the half-sized matrix products.
The resulting algorithm requires O(n^(log_2 7)) = O(n^2.81)
operations. Strassen's work sparked interest in finding
matrix multiplication algorithms of even lower complexity.
Since there are O(n^2) elements of data, which
must each participate in at least one operation, the
exponent of n must be at least 2. The current world
record upper bound on the exponent is 2.3728639,
proved by François Le Gall in 2014. However, all
existing algorithms with exponent less than that of
Strassen’s algorithm are extremely complicated and not
of practical interest.

Table 1 The cost of solving an n × n linear system obtained
by discretizing the two-dimensional Poisson equation.

Year  Method                   Cost      Type
1948  Banded Cholesky          n^2       Direct
1948  Jacobi, Gauss-Seidel     n^2       Iterative
1950  SOR (optimal parameter)  n^(3/2)   Iterative
1952  Conjugate gradients      n^(3/2)   Iterative
1965  Fast Fourier transform   n log n   Direct
1965  Block cyclic reduction   n log n   Direct
1977  Multigrid                n         Iterative

An area that has undergone many important algo- 
rithmic developments over the years is the solution of 
linear systems arising from the discretization of partial
differential equations (PDEs). Consider the Poisson
equation [III.18] on a square with the unknown function
specified on the boundary. When discretized on
an N × N grid by centered differences, a system of
n = N^2 equations in n unknowns is obtained with a
banded, symmetric positive-definite coefficient matrix
containing O(n) nonzeros. Table 1 gives the dominant
term in the operation count (ignoring the multiplicative
constant) for different methods, some of
which are described in numerical linear algebra
and matrix analysis [IV.10]. For the iterative algorithms
it is assumed that the iteration is terminated
when the error is of order 10^-6. The year is the year
of first publication, or, for the first two methods, the 
year that the first stored-program computer was opera- 
tional. Since there are n elements in the solution vector 
and at least one operation is required to compute each 
element, a lower bound on the cost is O(n), and this 
is achieved by the multigrid method. The algorithmic 
speedups shown in the table are of a similar magni- 
tude to the speedups in computer hardware over the 
same period. 

4.1 Complexity Classes

The algorithms we have described so far all have a cost
that is bounded by a polynomial in the problem dimension,
n. For some problems the existence of algorithms
with polynomial complexity is unclear. In analyzing this
question mathematicians and computer scientists use
a classification of problems that makes a distinction
finer than whether there is or is not an algorithm of
polynomial run time. This classification is phrased in
terms of decision problems: ones that have a yes or no
answer. The problem class P comprises those problems
that can be solved in polynomial time in the problem
dimension. The class NP comprises those problems for
which a yes answer can be verified in polynomial time.
An example of a problem in NP is a jigsaw puzzle: it
is easy to check that a claimed solution is a correctly
assembled puzzle, but solving the puzzle in the first
place appears to be much harder.

Figure 2 Complexity classes. It is not known
whether the classes P and NP are equal.

A problem is NP-complete if it is in NP and it is pos- 
sible to reduce any other NP problem to it in polyno- 
mial time. Hence if a polynomial-time algorithm exists 
for an NP-complete problem then all NP problems can 
be solved in polynomial time. Many NP-complete prob- 
lems are known, including Boolean satisfiability, graph 
coloring, choosing optimal page breaks in a document, 
and the Battleship game or puzzle. 

A problem (not necessarily a decision problem) is NP-hard
if it is at least as hard as any NP problem, in
the sense that there is an NP-complete problem that
is reducible to it in polynomial time. Thus the NP-hard
problems are at least as hard as the NP-complete
problems. Examples of NP-hard problems are the traveling
salesman problem [VI.18], sparse approximation
[VII.10], and nonconvex quadratic programming
[IV.11 §1.3]. Figure 2 shows the relation among
the classes.

An excellent example of the subtleties of computa- 
tional complexity is provided by the determinant and 
the permanent of a matrix. The permanent of an n × n
matrix A is

\[
\operatorname{perm}(A) = \sum_{\sigma} \prod_{i=1}^{n} a_{i,\sigma_i},
\]

where the vector σ ranges over all permutations of
the set of integers {1, 2, ..., n}. The determinant has
a similar expression differing only in that the product
term is multiplied by the sign (±1) of the permutation.
Yet while the determinant can be computed in O(n^3)
operations, by Gaussian elimination, no polynomial-time
algorithm has ever been discovered for computing
the permanent. Leslie Valiant gave insight into this
disparity when he showed in 1979 that the problem of 
computing the permanent is complete for a complexity 
class of counting problems called #P that extends NP. 
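
The similarity of the two definitions is easy to see in code. The following Python sketch evaluates both sums directly over all n! permutations, so it is practical only for very small n; the function name perm_and_det is hypothetical.

```python
from itertools import permutations


def perm_and_det(A):
    """Permanent and determinant of a square matrix from their
    permutation expansions (exponential cost, illustration only)."""
    n = len(A)
    perm_total = 0
    det_total = 0
    for sigma in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][sigma[i]]
        # The sign of sigma is (-1) to the number of inversions.
        inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                         if sigma[i] > sigma[j])
        sign = -1 if inversions % 2 else 1
        perm_total += prod
        det_total += sign * prod
    return perm_total, det_total


p, d = perm_and_det([[1, 2], [3, 4]])
```

The only difference between the two accumulations is the factor sign, yet that factor is what allows elimination to compute the determinant in polynomial time.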

The most famous open problem in computer science 
is “is P equal to NP?” It was posed by Stephen Cook 
in 1971 and is one of the seven Clay Institute Millen- 
nium Problems, for each of which a $1 million prize 
is available for a solution. Informally, the question is 
whether the “easy to solve” problems are equal to the 
“easy to check” problems. It is known that P ⊆ NP, so
the question is whether or not the inclusion is strict. 


5 Trade-off between Speed and Accuracy

In designing algorithms that run in floating-point arithmetic
it frequently happens that an increase in speed
is accompanied by a decrease in accuracy. A classic
example is the computation of the sample variance of
n numbers x_1, ..., x_n, which is defined as

\[
s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2,
\tag{2}
\]

where the sample mean

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.
\]

Computing s_n^2 from this formula requires two passes
through the data, one to compute x̄ and the other
to accumulate the sum of squares. A two-pass computation
is undesirable for large data sets or when
the sample variance is to be computed as the data is
generated. An alternative formula, found in statistics
textbooks (and implemented on many pocket calculators
and spreadsheets over the years), uses about the
same number of operations but requires only one pass
through the data:

\[
s_n^2 = \frac{1}{n-1} \Biggl( \sum_{i=1}^{n} x_i^2
  - \frac{1}{n} \Bigl( \sum_{i=1}^{n} x_i \Bigr)^{2} \Biggr).
\tag{3}
\]

However, this formula behaves badly in floating-point
arithmetic. For example, if n = 3 and x_1 = 10 000,
x_2 = 10 001, and x_3 = 10 002, then, in IEEE single-precision
arithmetic (with unit roundoff u ≈ 6 × 10^-8),
the sample variance is computed as 1.0 by the two-pass
formula (relative error 0) but 0.0 by the one-pass formula
(relative error 1). The reason for the poor accuracy
of the one-pass formula is that there is massive
subtractive cancelation [II.13] in (3). The original
formula (2) always yields a computed result with error
O(nu). Is there a way of combining the speed of the
one-pass formula with the accuracy of the two-pass
one? Yes: the recurrence

\[
\begin{aligned}
M_1 &= x_1, & Q_1 &= 0, \\
M_k &= M_{k-1} + \frac{x_k - M_{k-1}}{k}, &
Q_k &= Q_{k-1} + \frac{(k-1)(x_k - M_{k-1})^2}{k},
\qquad k = 2\colon n,
\end{aligned}
\]

calculates Q_n, which yields s_n^2 = Q_n/(n - 1) and
produces an accurate result in floating-point arithmetic.

6 Choice of Algorithm


Much research in numerical analysis and scientific com- 
puting is about finding the best algorithm for solving a 
given problem, and for classic problems such as solv- 
ing a PDE or finding the eigenvalues of a matrix there 
are many possibilities, with improvements continually 
being developed. However, even for some quite elemen- 
tary problems there are several possible algorithms, 
some of which are far from obvious. 
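
The sample variance of section 5 is itself an elementary problem with several competing algorithms. The following Python sketch compares the two-pass formula, the one-pass textbook formula, and the updating recurrence for M_k and Q_k. Python floats are IEEE double precision rather than the single precision of the example in section 5, so the data are shifted to values near 10^9 to provoke the same cancelation; the function names and the test data are assumptions made for illustration.

```python
def two_pass(xs):
    """Sample variance: first the mean, then the sum of squared deviations."""
    n = len(xs)
    mean = 0.0
    for x in xs:
        mean += x
    mean /= n
    ss = 0.0
    for x in xs:
        ss += (x - mean) ** 2
    return ss / (n - 1)


def one_pass(xs):
    """Textbook one-pass formula; prone to subtractive cancelation."""
    n = len(xs)
    s = 0.0
    sq = 0.0
    for x in xs:
        s += x
        sq += x * x
    return (sq - s * s / n) / (n - 1)


def updating(xs):
    """One-pass recurrence for the running mean M_k and the sum Q_k."""
    m = xs[0]
    q = 0.0
    for k in range(2, len(xs) + 1):
        d = xs[k - 1] - m          # x_k - M_{k-1}
        q += (k - 1) * d * d / k
        m += d / k
    return q / (len(xs) - 1)


data = [1e9, 1e9 + 1.0, 1e9 + 2.0]   # exact sample variance is 1
results = (two_pass(data), one_pass(data), updating(data))
```

For these data the squares are of order 10^18, beyond the range in which doubles represent integers exactly, so the one-pass result is not expected to be accurate, whereas the other two formulas work with small differences.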

A first example is the evaluation of a polynomial
p(x) = a_0 + a_1 x + ··· + a_n x^n. The most obvious way
to evaluate the polynomial is by directly forming the
powers of x.

p = a_0 + a_1 x, w = x
for i = 2 : n
  w = wx
  p = p + a_i w
end

This algorithm requires 2n multiplications and n additions
(ignoring the constant term in the operation
count).

An alternative method is Horner's method (nested
multiplication). It is derived by writing the polynomial
in nested form:

\[
p(x) = (\cdots((a_n x + a_{n-1})x + a_{n-2})x + \cdots + a_1)x + a_0.
\]

Horner’s method is

p = a_n
for i = n - 1 : -1 : 0   % i goes down in steps of -1
  p = px + a_i
end


Horner’s method requires n multiplications and n 
additions, so is significantly less expensive than the 
first method. 
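
Both evaluation schemes can be sketched in Python with a counter for the multiplications. The coefficient list a is assumed to be ordered so that a[i] multiplies x^i, and the function names are hypothetical.

```python
def eval_powers(a, x):
    """Evaluate the polynomial by forming powers of x.
    Returns (value, number of multiplications)."""
    n = len(a) - 1
    if n == 0:
        return a[0], 0
    p = a[0] + a[1] * x
    w = x
    mults = 1
    for i in range(2, n + 1):
        w = w * x
        p = p + a[i] * w
        mults += 2
    return p, mults


def horner(a, x):
    """Evaluate the polynomial by nested multiplication.
    Returns (value, number of multiplications)."""
    n = len(a) - 1
    p = a[n]
    mults = 0
    for i in range(n - 1, -1, -1):
        p = p * x + a[i]
        mults += 1
    return p, mults


coeffs = [1, 2, 3]                 # 1 + 2x + 3x^2
by_powers = eval_powers(coeffs, 2)
by_horner = horner(coeffs, 2)
```

The exact count for the first method is 2n - 1, which the text rounds to 2n by ignoring the constant term.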

However, even Horner’s method is not optimal. For
example, it requires eight multiplications for p(x) =
x^8, but this polynomial can be evaluated in just three
multiplications using algorithm 3 in section 3. Moreover,
for general polynomials of degree n > 4 there
exist evaluation schemes that require strictly fewer than
the 2n total additions and multiplications required by
Horner’s rule; the catch is that the operation count 
excludes some precomputation of coefficients that, 
once computed, can be reused for every subsequent 
polynomial evaluation. This latter example emphasizes 
that if one wishes to investigate optimality of schemes 
for polynomial evaluation it is important to be precise 
about what is included in the operation count. 

A second example concerns the continued fraction

\[
r_n(x) = b_0 + \cfrac{a_1 x}{b_1 + \cfrac{a_2 x}{b_2 + \cfrac{a_3 x}{b_3 + \cdots
  + \cfrac{a_{n-1} x}{b_{n-1} + \cfrac{a_n x}{b_n}}}}}.
\tag{4}
\]

This continued fraction represents a rational function
r_n(x) = p_n(x)/q_n(x), where p_n and q_n are
polynomials of degrees ⌈n/2⌉ and ⌊n/2⌋, respectively.
Such continued fractions arise in the approximation
of transcendental functions by Padé approximants
[IV.9 §2.4].

Probably the most obvious way to evaluate the con- 
tinued fraction is by the following bottom-up proce- 
dure. 


Algorithm 5 (continued fraction, bottom-up). This
algorithm evaluates the continued fraction (4) in
bottom-up fashion.

y_n = (a_n/b_n)x
for j = n - 1 : -1 : 1
  y_j = a_j x/(b_j + y_(j+1))
end
r_n = b_0 + y_1


Cost. The total number of operations is n(D + M + A), 
where A, D, and M denote an addition, a division, and 
a multiplication, respectively. 

The bottom-up evaluation requires us to know the
value of n in advance and is not well suited to evaluating
the sequence r_1(x), r_2(x), ..., since it needs to
start afresh each time. Top-down evaluation is better in
this case. The following recurrence dates back to John
Wallis in 1655.

Algorithm 6 (continued fraction, top-down). This
algorithm evaluates the continued fraction (4) in
top-down fashion.

p_(-1) = 1, q_(-1) = 0, p_0 = b_0, q_0 = 1
for j = 1 : n
  p_j = b_j p_(j-1) + a_j x p_(j-2)
  q_j = b_j q_(j-1) + a_j x q_(j-2)
end
r_n = p_n q_n^(-1)

Cost. 5nM + 2nA + D.

It is not obvious that algorithm 6 does actually evaluate
r_n(x), but this can be proved inductively. Note that
top-down evaluation is substantially more expensive
than bottom-up evaluation.
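
The agreement of the two procedures can be checked with exact rational arithmetic. In the following Python sketch the lists a and b hold the coefficients a_j and b_j (a[0] is unused), the function names are hypothetical, and the top-down routine returns every convergent r_1, ..., r_n, which is the situation in which it is preferable.

```python
from fractions import Fraction


def cf_bottom_up(a, b, x):
    """Evaluate r_n(x) from the innermost term outward (n >= 1)."""
    n = len(b) - 1
    y = a[n] * x / b[n]
    for j in range(n - 1, 0, -1):
        y = a[j] * x / (b[j] + y)
    return b[0] + y


def cf_top_down(a, b, x):
    """Return [r_1(x), ..., r_n(x)] using the Wallis recurrence."""
    n = len(b) - 1
    p_prev, q_prev = 1, 0          # p_{-1}, q_{-1}
    p, q = b[0], 1                 # p_0, q_0
    convergents = []
    for j in range(1, n + 1):
        ax = a[j] * x
        p, p_prev = b[j] * p + ax * p_prev, p
        q, q_prev = b[j] * q + ax * q_prev, q
        convergents.append(p / q)
    return convergents


a = [None, Fraction(1), Fraction(2)]
b = [Fraction(1), Fraction(2), Fraction(3)]
x = Fraction(1, 2)
r = cf_top_down(a, b, x)
```

With floating-point inputs the two routines would agree only up to rounding error, which is the practical consideration raised in the text.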

These are not the only possibilities. For example,
when n = 2m one can write r_n in partial fraction form
r_n(x) = Σ_(j=1)^m α_j x/(x - β_j), where the β_j (assumed
to be distinct) are the roots of q_n, which permits
evaluation at a cost of m(2M + A + D).

In practice, one needs to consider the effect of rounding
errors on the evaluation, and this is in general
dependent on the particular coefficients a_j and b_j and
on the range of x of interest.

When there are several algorithms for solving a prob- 
lem one may need a polyalgorithm, which is a set of 
algorithms with rules for choosing between them. A 
good example of a polyalgorithm is the MATLAB back- 
slash operator, which enables a solution to a linear sys- 
tem Ax = b to be computed using the syntax A\b. 
The matrix A may be square or rectangular; diago- 
nal, triangular, full, or sparse; and symmetric, symmet- 
ric positive-definite, or completely general. Underlying 
the backslash operator is an algorithm for identify- 
ing which of these properties hold and choosing the 
appropriate matrix factorization and solution process 
to apply. 
7 Randomized Algorithms

All the algorithms described so far in this article are 
deterministic: if they are run repeatedly on the same 
data, they produce the same result every time. Some 
algorithms make random choices and so generally produce
a different result every time they are run. For
example, we might approximate

\[
\int_0^1 f(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i),
\]

where the x_i are independent random numbers uniformly
distributed in [0, 1]. The standard deviation of
the error in this approximation is of order n^(-1/2). This
is an example of a Monte Carlo algorithm; such algorithms
have a deterministic run time and produce an
output that is correct (or has a given accuracy) with 
a certain probability. Of course, there are much more 
efficient ways to estimate a one-dimensional integral, 
but Monte Carlo algorithms come into their own for 
multidimensional integrals over complicated domains. 
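
A minimal Python sketch of the one-dimensional Monte Carlo estimate. The function name and the optional seed argument are assumptions; the seed is there only to make a run reproducible.

```python
import random


def monte_carlo_integral(f, n, seed=None):
    """Estimate the integral of f over [0, 1] from n uniform samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += f(rng.random())
    return total / n


# The exact value of the integral of x^2 over [0, 1] is 1/3.
estimate = monte_carlo_integral(lambda x: x * x, 20000, seed=1)
```

Because the error decays like n^(-1/2), gaining one more correct decimal digit requires roughly one hundred times as many samples.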

A Las Vegas algorithm always produces a correct 
result, but its run time is nondeterministic. A classic 
example is the quicksort algorithm for sorting a list of 
numbers, for which a randomized choice of the partition
element makes the algorithm much faster on average
than in the worst case (O(n log n) running time
versus O(n^2), for n numbers).
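
A short Python sketch of quicksort with a randomly chosen partition element. This list-building version favors clarity over the in-place partitioning used in practice, and the name quicksort and the rng parameter are illustrative assumptions.

```python
import random


def quicksort(xs, rng=random):
    """Return a sorted copy of the list xs using random pivots."""
    if len(xs) <= 1:
        return list(xs)
    pivot = rng.choice(xs)         # the random choice affects run time only
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return quicksort(less, rng) + equal + quicksort(greater, rng)


ordered = quicksort([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])
```

Whatever pivots are drawn, the output is the same sorted list; only the amount of work varies from run to run, which is the defining property of a Las Vegas algorithm.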

Randomized algorithms can be much simpler than 
deterministic alternatives, they may be more able to 
exploit modern computing architectures, and they may 
be better suited to large data sets. There is a wide vari- 
ety of randomized algorithms, and they are studied in 
mathematics, computer science, statistics, and other 
areas. 

One active area of research is randomized algorithms 
for numerical linear algebra problems, based on ran- 
dom sampling and random projections. For example, 
fast algorithms exist for computing low-rank approx- 
imations to a given matrix. The general framework is 
that random sampling is used to identify a subspace 
that captures most of the action of the matrix, the 
matrix is then compressed to this subspace, and a 
low-rank factorization is computed from the reduced 
matrix. 

Examples of randomized algorithms mentioned in
this book are the Google PageRank algorithm [VI.9],
with its use of a random surfer, the k-means algorithm
[IV.17 §5.3] for clustering, and Markov chain
Monte Carlo algorithms [V.11 §3].


8 Some Key Algorithms 
in Applied Mathematics 

Table 2 lists a selection of algorithms mentioned in 
this book. Very general methods such as preconditioning
and the finite-element method, which require much
more information to produce a particular algorithm,
are omitted. The table illustrates the wide variety of 
important algorithms in applied mathematics, ranging 
from the old to the relatively new. 

A notable feature of some of the algorithms is that 
they are iterative algorithms, which in principle take an 
infinite number of steps, for solving problems that can 
be solved directly, that is, in a finite number of opera- 
tions. The conjugate gradient and multigrid methods 
are iterative methods for solving a linear system of 
equations, and for suitably structured systems they 
can provide a given level of accuracy much faster than 
Gaussian elimination, which is a direct method. Sim- 
ilarly, interior point methods are iterative methods 
for linear programming, competing with the simplex 
method, which is a direct method. 

Further Reading 

A classic reference for algorithms and their analy- 
sis is Donald Knuth’s The Art of Computer Program- 
ming. The first volume appeared in 1968 and the devel- 
opment is ongoing. Current volumes are Fundamen- 
tal Algorithms (volume 1), Seminumerical Algorithms 
(volume 2), Sorting and Searching (volume 3), and 
Combinatorial Algorithms (volume 4), all published by 
Addison-Wesley (Reading, MA). 

Bentley, J. L. 1986. Programming Pearls. Reading, MA: 
Addison-Wesley. 

Brassard, G., and P. Bratley. 1996. Fundamentals of Algorithmics.
Englewood Cliffs, NJ: Prentice-Hall.

Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein. 2009. 
Introduction to Algorithms, 3rd edn. Cambridge, MA: MIT 
Press. 

Higham, N. J. 2002. Accuracy and Stability of Numerical 
Algorithms, 2nd edn. Philadelphia, PA: SIAM. 


Table 2 Some algorithms mentioned in this book.

Algorithm                            Reference                  Key early figures
Gaussian elimination                 IV.10 §2                   Ancient Chinese (ca. 1 C.E.), Gauss (1809);
                                                                formulated as LU factorization by various
                                                                authors from 1940s
Newton’s method                      II.28                      Newton (1669), Raphson (1690)
Fast Fourier transform               II.10                      Gauss (1805), Cooley and Tukey (1965)
Cholesky factorization               IV.10 §2                   Cholesky (1910)
Remez algorithm                      IV.9 §3.5, VI.11 §2        Remez (1934)
Simplex method (linear programming)  IV.11 §3.1                 Dantzig (1947)
Conjugate gradient and               IV.10 §9                   Hestenes and Stiefel (1952), Lanczos (1952)
  Lanczos methods
Ford-Fulkerson algorithm             IV.37 §7                   Ford and Fulkerson (1956)
k-means algorithm                    IV.17 §5.3                 Lloyd (1957), Steinhaus (1957)
QR factorization                     IV.10 §2                   Givens (1958), Householder (1958)
Dijkstra’s algorithm                 VI.10                      Dijkstra (1959)
Quasi-Newton methods                 IV.11 §4.2                 Davidon (1959), Broyden, Fletcher, Goldfarb,
                                                                Powell, Shanno (early 1960s)
QR algorithm                         IV.10 §5.5                 Francis (1961), Kublanovskaya (1962)
QZ algorithm                         IV.10 §5.8                 Moler and Stewart (1973)
Singular value decomposition         II.32                      Golub and Kahan (1965), Golub and Reinsch (1970)
Strassen’s method                    I.4 §4                     Strassen (1968)
Multigrid                            IV.10 §9, IV.13 §3, IV.16  Fedorenko (1964), Brandt (1973), Hackbusch (1977)
Interior point methods               IV.11 §3.2                 Karmarkar (1984)
Generalized minimal residual method  IV.10 §9                   Saad and Schultz (1986)
Fast multipole method                VI.17                      Greengard and Rokhlin (1987)
JPEG                                 VII.7 §5, VII.8            Members of the Joint Photographic Experts
                                                                Group (1992)
PageRank                             VI.9                       Brin and Page (1998)
HITS                                 I.1                        Kleinberg (1999)


I.5 Goals of Applied Mathematical
Research

Nicholas J. Higham

A large body of existing mathematical knowledge is
encapsulated in theorems, methods, and algorithms,
some of which have been known for centuries. But
applied mathematics is not simply the application of
existing mathematical ideas to practical problems: new
results are continually being developed, usually building
on old ones. Applied mathematicians are always
innovating, and the constant arrival of new or modified
problems provides direction and motivation for their
research.

In this article we describe some goals of research 
in applied mathematics from the perspectives of the 
ancient problem of solving equations, the more con- 
temporary theme of exploiting structure, and the prac- 
tically important tasks of modeling and prediction. We 
also discuss the strategy behind research. 

1 Solving Equations 

A large proportion of applied mathematics research 
papers are about analyzing or solving equations. The
equations may be algebraic, such as linear or nonlinear
equations in one or more variables. They may be
ordinary differential equations (ODEs), partial differen- 
tial equations (PDEs), integral equations, or differential- 
algebraic equations. 

The wide variety of equations reflects the many dif- 
ferent ways in which one can attempt to capture the 
behavior of the system being modeled. Whatever the 
equation, an applied mathematician is interested in 
answering a number of questions. 

1.1 Does the Equation Have a Solution? 

We are interested in whether there is a unique solu- 
tion and, if there is more than one solution, how many 
there are and how they are characterized. Existence of 
solutions may not be obvious, and one occasionally 
hears tales of mathematicians who have solved equations
for which a proof is later given that no solution
exists. Such a circumstance may sound puzzling: is it 
not easy to check that a putative solution actually is 
a solution? Unfortunately, checking satisfaction of the 
equation may not be easy, especially if one is working 
in a function space. Moreover, the problem specifica- 
tion may require the solution to have certain proper- 
ties, such as existence of a certain number of deriva- 
tives, and the claimed solution might satisfy the equa- 
tion but fail to have some of the required properties. 
Instead of analyzing the problem in the precise form in 
which it is given, it may be better to investigate what 
additional properties must be imposed for an equation 
to have a unique solution. 

1.2 Is the Equation Well-Posed? 

A problem is well-posed if it has a unique solution 
and the solution changes continuously with the data 
that define the problem. A problem that is not well- 
posed is ill-posed. For an ill-posed problem an arbitrar- 
ily small perturbation of the data can produce an arbi- 
trarily large change in the solution, which is clearly an 
unsatisfactory situation. 

An example of a well-posed problem is to determine 
the weight supported by each leg of a three-legged 
table. Assuming that the table and its legs are perfectly 
symmetric and the ground is flat, the answer is that 
each leg carries one-third of the total weight. For a table 
with four legs each leg supports one-quarter of the total 
weight, but if one leg is shortened by a tiny amount then 
it leaves the ground and the other three legs support 
the weight of the table (a phenomenon many of us have 
experienced in restaurants). For four-legged tables the 
problem is therefore ill-posed. 

For finite-dimensional problems, uniqueness of the 
solution implies well-posedness. For example, a linear 
system Ax = b of n equations in n unknowns with 
a nonsingular coefficient matrix A is well-posed. Even 
so, if A is nearly singular then a small perturbation of 
A can produce a large change in the solution, albeit 
not arbitrarily large: the condition number [I.2 §22]
κ(A) = ||A|| ||A^-1|| bounds the relative change. But
for infinite-dimensional problems the existence of a 
unique solution does not imply that the problem is well- 
posed; examples are given in the article on integral 
equations [IV.4 §6]. 

The notion of well-posedness was introduced by
Jacques Hadamard at the beginning of the twentieth
century. He believed that physically meaningful problems
should be well-posed. Today it is recognized that
many problems are ill-posed, and they are routinely 
solved by reformulating them so that they are well- 
posed, typically by a process called regularization 
[IV.15 §2.6] (see also integral equations [IV.4 §7]). 

An important source of ill-posed problems is inverse 
problems [IV.15]. Consider a mathematical model in 
which the inputs are physical variables that can be 
adjusted and the output variables are the result of an 
experiment. The forward problem is to predict the out- 
puts from a given set of inputs. The inverse problem is 
to make deductions about the inputs that could have 
produced a given set of outputs. In practice, the mea- 
surements of the outputs may be subject to noise and 
the model may be imperfect, so uncertainty quantification
[II.34] needs to be carried out in order
to estimate the uncertainty in the predictions and 
deductions. 

1.3 What Qualitative Properties Does a Solution 
Have? 

It may be of more interest to know the behavior of a 
solution than to know the solution itself. One may be 
interested in whether the solution, $f(t)$ say, decays as
$t \to \infty$, whether it is monotonic in $t$, or whether it
oscillates and, if so, with what fixed or time-varying
frequency. If the problem depends on parameters, it may
be possible to answer these questions for a range of 
values of the parameters. 

1.4 Does an Iteration Converge? 

As we saw in methods of solution [I.3], solutions
are often computed from iterative processes, and we 
therefore need to understand these processes. Various 
facets of convergence may be of interest. 

• Is the iteration always defined, or can it break 
down (e.g., because of division by zero)? 

• For what starting values, and for what class of 
problems, does the iteration converge? 

• To what does the iteration converge, and how does 
this depend on the starting value (if it does at all)? 

• How fast does the iteration converge? 

• How are errors (in the initial data, or rounding
errors introduced during the iteration) propagated?
In particular, are they bounded?

To illustrate some of these points we consider the
iteration 

$$x_{k+1} = \frac{1}{p}\bigl[(p-1)x_k + x_k^{1-p}a\bigr], \tag{1}$$

with $p$ a positive integer and $a \in \mathbb{C}$, which is Newton’s
method for computing a $p$th root of $a$. We ask for which
$a$ and which starting values $x_0$ the iteration converges
and to what root it converges. The analysis is simplified
by defining $y_k = \theta^{-1}x_k$, where $\theta$ is a $p$th root of $a$, as
the iteration can then be rewritten

$$y_{k+1} = \frac{1}{p}\bigl[(p-1)y_k + y_k^{1-p}\bigr], \qquad y_0 = \theta^{-1}x_0, \tag{2}$$

which is Newton’s method for computing a $p$th root
of unity. The original parameters $a$ and $x_0$ have been
combined into the starting value $y_0$.

Figure 1 illustrates the convergence of the iteration 
for $p = 2, 3, 5$. For $y_0$ ranging over a $400 \times 400$ grid with
$\operatorname{Re} y_0, \operatorname{Im} y_0 \in [-2.5, 2.5]$, it plots the root to which $y_k$
from (2) converges, with each root denoted by a different
grayscale from white (the principal root, 1) to
black. Convergence is declared if after fifty iterations
the iterate is within relative distance $10^{-13}$ of a root;
the relatively small number of points for which convergence
was not observed are plotted white. For $p = 2$
the figure suggests that the iteration converges to 1 if
started in the open right half-plane and $-1$ if started in
the open left half-plane, and this can be proved to be
true. But for $p = 3, 5$ the regions of convergence have
a much more complicated structure, involving sectors 
with petal-like boundaries. 

The complexity of the convergence for $p \ge 3$ was
first noticed by Arthur Cayley in 1879, and an analysis
of convergence for all starting values requires the
theory of Julia sets of rational maps. However, for practical
purposes it is usually principal roots that need to
be computed, so from a practical viewpoint the main
implication to be drawn from the figure is that for
$p = 3, 5$ Newton’s method converges to 1 for $y_0$ sufficiently
close to the positive real axis, and it can be
proved that this is true.

We see from this example that the convergence analy- 
sis depends very much on the precise question that 
is being asked. The iteration (1) generalizes in a nat- 
ural way to matrices and operators, for which the 
convergence results for the scalar case can be exploited. 
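
The experiment described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the code used to produce the figure: the function names and the coarse default grid are hypothetical, while the iteration count (fifty) and the tolerance ($10^{-13}$) follow the description in the text.

```python
import cmath


def newton_unity_root(y0, p, iterations=50, tol=1e-13):
    """Run the Newton iteration for a pth root of unity from y0.

    Returns the index j of the root exp(2*pi*i*j/p) that is reached
    (j = 0 is the principal root, 1), or None if convergence is not
    observed or the iteration breaks down.
    """
    y = complex(y0)
    for _ in range(iterations):
        if y == 0:                      # y**(1 - p) is undefined here
            return None
        try:
            y = ((p - 1) * y + y ** (1 - p)) / p
        except (ZeroDivisionError, OverflowError):
            return None
    for j in range(p):
        root = cmath.exp(2j * cmath.pi * j / p)
        if abs(y - root) <= tol:        # |root| = 1, so this is a relative test
            return j
    return None


def basin_grid(p, n=41, half_width=2.5):
    """Classify an n-by-n grid of starting values covering the square
    with real and imaginary parts in [-half_width, half_width]."""
    step = 2 * half_width / (n - 1)
    return [[newton_unity_root(complex(-half_width + c * step,
                                       half_width - r * step), p)
             for c in range(n)]
            for r in range(n)]


if __name__ == "__main__":
    # Crude text rendering for p = 3: one character per starting value.
    symbols = {0: ".", 1: "o", 2: "#", None: "?"}
    for row in basin_grid(3):
        print("".join(symbols[j] for j in row))
```

For p = 2 the grid splits cleanly into the two half-planes, whereas for p = 3 the printed picture already hints at the intricate boundaries between the regions of convergence.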

2 Preserving Structure 

Many mathematical problems have some kind of structure.
An example with explicit structure is a linear system
$Ax = b$ in which the $n \times n$ matrix $A$ is a Toeplitz
matrix [I.2 §18]. This system has $n^2 + n$ numbers in $A$
and $b$ but only $3n - 1$ independent parameters. On the
other hand, if for the vector ODE $y' = f(t,y)$ there is
a vector $v$ such that $v^T f(t,y) = 0$ for all $t$ and $y$, then
$(d/dt)\,v^T y(t) = v^T f(t,y) = 0$, so $v^T y(t)$ is constant
for all t. This conservation or invariance property is a 
form of structure, though one more implicit than for 
the Toeplitz system. 

An example of a nonlinear conservation property is 
provided by the system of ODEs 

u'(t) = v(t), 
v’(t) = -u(t). 


For this system, 

$$\frac{d}{dt}(u^2 + v^2) = 2(u'u + v'v) = 2(vu - uv) = 0,$$

so there is a quadratic invariant. In particular, for the 
initial values $u(0) = 1$ and $v(0) = 0$ the solution is
$u(t) = \cos t$ and $v(t) = -\sin t$, which lies on the unit
circle centered at the origin in the $uv$-plane. If we solve
the system using a numerical method, we would like the 
numerical solution also to lie on the circle. In fact, one 
potential use of this differential equation is to provide 
a method for plotting circles that avoids the relatively 
expensive evaluation of sines and cosines. Consider the 
following four standard numerical methods applied to 
our ODE system. Here, $u_k \approx u(kh)$ and $v_k \approx v(kh)$,
where $h$ is a given step size, and $u_0 = 1$ and $v_0 = 0$:


$$\begin{array}{ll}
\text{Forward Euler} & u_{k+1} = u_k + h v_k, \quad v_{k+1} = v_k - h u_k, \\
\text{Backward Euler} & u_{k+1} = u_k + h v_{k+1}, \quad v_{k+1} = v_k - h u_{k+1}, \\
\text{Trapezium method} & u_{k+1} = u_k + h(v_k + v_{k+1})/2, \quad v_{k+1} = v_k - h(u_k + u_{k+1})/2, \\
\text{Leapfrog method} & u_{k+1} = u_k + h v_k, \quad v_{k+1} = v_k - h u_{k+1}.
\end{array}$$


Figure 2 plots the numerical solutions computed with 
32 steps of length $h = 2\pi/32$. We see that the forward
Euler solution spirals outward while the backward
Euler solution spirals inward. The trapezium method 
solution stays nicely on the unit circle. The leapfrog 
method solution traces an ellipse. This behavior is easy 
to explain if we write each method in the form 


$$z_{k+1} = G z_k, \qquad z_k = \begin{bmatrix} u_k \\ v_k \end{bmatrix},$$

where $G = \begin{bmatrix} 1 & h \\ -h & 1 \end{bmatrix}$ for the forward Euler method, for example.
Then the behavior of the sequence $z_k$ depends on the
eigenvalues of the matrix $G$. It turns out that the spectral
radius of $G$ is greater than 1 for forward Euler
and less than 1 for backward Euler, which explains
the spiraling. For the trapezium rule $G$ is orthogonal,
so $\|z_{k+1}\|_2 = \|z_k\|_2$ and the trapezium solutions stay
exactly on the unit circle. For the leapfrog method the
determinant of $G$ is 1, which means that areas are preserved,
but $G$ is not orthogonal so the leapfrog solution
drifts slightly off the circle.

Figure 1 Newton iteration for a $p$th root of unity. Each point $y_0$ in the region is shaded according
to the root to which the iteration converges, with white denoting the principal root, 1.

The subject of geometric integration [IV.12 §5] is
concerned more generally with methods for integrating
nonlinear initial-value ODEs and PDEs in a way that
preserves the invariants of the system, while also providing
good accuracy in the usual sense. This includes,
in particular, symplectic integrators [IV.12 §1.3] for
Hamiltonian systems. 
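
The behavior of the four methods on the circle problem is easy to reproduce. The following Python sketch writes each method as $z_{k+1} = Gz_k$ and records the distance of each iterate from the origin. The matrices for the two implicit methods were obtained here by solving the implicit equations for $(u_{k+1}, v_{k+1})$, so they and the function names should be read as assumptions of this sketch rather than as formulas quoted from the text.

```python
import math


def step_matrices(h):
    """Return the 2x2 matrix G with z_{k+1} = G z_k for each method."""
    d = 1.0 + h * h          # determinant of [[1, -h], [h, 1]]
    q = h * h / 4.0
    return {
        "forward Euler": ((1.0, h), (-h, 1.0)),
        "backward Euler": ((1.0 / d, h / d), (-h / d, 1.0 / d)),
        "trapezium": (((1.0 - q) / (1.0 + q), h / (1.0 + q)),
                      (-h / (1.0 + q), (1.0 - q) / (1.0 + q))),
        "leapfrog": ((1.0, h), (-h, 1.0 - h * h)),
    }


def radii(G, steps, z0=(1.0, 0.0)):
    """Apply z -> G z repeatedly, returning the distance from the
    origin after each step."""
    u, v = z0
    out = []
    for _ in range(steps):
        u, v = G[0][0] * u + G[0][1] * v, G[1][0] * u + G[1][1] * v
        out.append(math.hypot(u, v))
    return out


steps = 32
h = 2.0 * math.pi / steps
results = {name: radii(G, steps) for name, G in step_matrices(h).items()}
for name, r in results.items():
    print(f"{name:15s} final radius {r[-1]:.4f}  "
          f"min {min(r):.4f}  max {max(r):.4f}")
```

The forward and backward Euler radii grow and shrink steadily, the trapezium radii stay at 1 to within rounding error, and the leapfrog radii oscillate in a narrow band around 1, which is consistent with the iterates lying on an ellipse close to the circle.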

3 Modeling and Prediction 

As WHAT IS APPLIED MATHEMATICS? [I.1 §1] explains,
modeling is the first step in solving a physical prob- 
lem. Models are necessarily simplifications because it is 
impractical to incorporate every detail. But simple mod- 
els can still be useful as tools to explore the broad con- 
sequences of physical laws. Moreover, the more com- 
plex a model is the more parameters it has (all of which 
need estimating) and the harder it is to analyze. 

In their 1987 book Empirical Model-Building and 
Response Surfaces, Box and Draper ask us to 

Remember that all models are wrong; the practical 
question is how wrong do they have to be to not be 
useful. 

Road maps illustrate this statement. They are always a 
simplified representation of reality due to representing 
a three-dimensional world in two dimensions and displaying
wiggly roads as straight lines. But road maps
are very useful. Moreover, there is no single “correct”
map but rather many possibilities depending on resolution
and purpose. Another example is the approximation
of $\pi$. The approximation $\pi \approx 3.14$ is a model for
$\pi$ that is wrong in that it is not exact, but it is good
enough for many purposes.

It is difficult to give examples of the modeling process 
because knowledge of the problem domain is usually 
required and derivations can be lengthy. We will use for 
illustration a very simple model of population growth, 
based on the logistic equation

$$\frac{dN}{dt} = rN\left(1 - \frac{N}{K}\right).$$

Here, N(t) is a representation in a continuous variable 
of the number of individuals in a population at time t, 
r > 0 is the growth rate of the population, and K > 0 
is the carrying capacity. For $K = \infty$, the model says
that the rate of change of the population, $dN/dt$, is
$rN$; that is, it is proportional to the size of the population
through the constant $r$, so the population grows
exponentially. For finite $K$, the model attenuates this
rate of growth by a subtractive term $rN^2/K$, which can
be interpreted as representing the increasing effects 
of competition for food as the population grows. The 
logistic equation can be solved exactly for N(t) (see 
ordinary differential equations [IV.2 §2]). Labora- 
tory experiments have shown that the model can pre- 
dict reasonably well the growth of protozoa feeding on 
bacteria. However, for some organisms the basic logis- 
tic equation is not a good model because it assumes 
instant responses to changes in population size and so 
does not account for gestation periods, the time taken 
for young to reach maturity, and other delays. A more 
realistic model may therefore be

$$\frac{dN(t)}{dt} = rN(t)\left(1 - \frac{N(t-\tau)}{K}\right),$$

where $\tau > 0$ is a delay parameter. At time $t$, part of
the quadratic term is now evaluated at an earlier time,
$t - \tau$. This delay differential equation has oscillatory
solutions and has been found to model well the population
of lemmings in the Arctic. Note that in contrast to
the predator-prey model [I.2 §10], the delayed logistic
model can produce oscillations in a population without
the need for a second species acting as predator.
There is no suggestion that either of these logistic models
is perfect, but with appropriate fitting of parameters
they can provide useful approximations to actual
populations and can be used to predict future behavior.

Figure 2 Approximations to the unit circle computed by four different numerical integrators with
step size $h = 2\pi/32$. The dotted line is the unit circle; asterisks denote numerical approximations.
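
The contrast between the two logistic models can be explored numerically. The Python sketch below gives the well-known closed-form solution of the basic logistic equation and integrates the delayed equation with forward Euler. The parameter values, the constant history $N(t) = N_0$ for $t \le 0$, the step size, and the function names are all assumptions made for illustration; forward Euler is chosen only for simplicity.

```python
import math


def logistic_exact(t, r, K, N0):
    """Closed-form solution of dN/dt = r N (1 - N/K) with N(0) = N0 > 0."""
    return K / (1.0 + (K / N0 - 1.0) * math.exp(-r * t))


def delayed_logistic(r, K, tau, N0, t_end, dt=0.01):
    """Forward Euler for dN/dt = r N(t) (1 - N(t - tau)/K).

    Assumes the constant history N(t) = N0 for t <= 0 and that tau is
    (close to) a multiple of dt. Returns the values N(0), N(dt), ...
    """
    lag = int(round(tau / dt))
    steps = int(round(t_end / dt))
    N = [N0]
    for k in range(steps):
        delayed = N[k - lag] if k >= lag else N0
        N.append(N[k] + dt * r * N[k] * (1.0 - delayed / K))
    return N


def crossings(N, level):
    """Count how many times the sequence N crosses the given level."""
    return sum(1 for a, b in zip(N, N[1:]) if (a - level) * (b - level) < 0)


r, K, N0 = 1.0, 1.0, 0.5
no_delay = delayed_logistic(r, K, tau=0.0, N0=N0, t_end=60.0)
long_delay = delayed_logistic(r, K, tau=2.0, N0=N0, t_end=60.0)

print("no delay:   final N =", round(no_delay[-1], 6),
      " crossings of K:", crossings(no_delay, K))
print("delay tau=2: min N =", round(min(long_delay), 4),
      " max N =", round(max(long_delay), 4),
      " crossings of K:", crossings(long_delay, K))
```

Without delay the computed population rises monotonically to the carrying capacity, in agreement with the exact solution. With the longer delay it repeatedly overshoots and undershoots $K$, which is the oscillatory behavior described above.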

3.1 Errors 

A lot of research is devoted to understanding the 
errors that arise at the different stages of the modeling 
process. These can broadly be categorized as follows. 

Errors in the mathematical model. Setting up the 
model introduces errors, since the model is never exact. 
These are the hardest errors to estimate. 

Approximation errors. These are the errors incurred 
when infinite-dimensional equations are replaced by a 
finite-dimensional system (that is, a continuous prob- 
lem is replaced by a discrete one: the process of dis- 
cretization), or when simpler approximations to the 
equations are developed (e.g., by model reduction 
[II.26]). These errors include errors in replacing one
approximating space by another (e.g., replacing continuous
functions by polynomials), errors in finite-difference
[II.11] approximations, and errors in truncating
power series and other expansions.

Rounding errors. Once the problem has been put in a 
form that can be solved by an algorithm implemented 
in a computer program, the effects of the rounding 
errors introduced by working in finite-precision arith- 
metic need to be determined. 
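
The interplay between approximation errors and rounding errors can be seen in a very small experiment. The sketch below, in which the test function (the exponential), the evaluation point, and the step sizes are arbitrary choices made for illustration, approximates a derivative by a forward difference in double-precision arithmetic and reports the error for a range of step sizes $h$.

```python
import math


def forward_difference_error(h, x=1.0):
    """Absolute error of (f(x + h) - f(x)) / h as an approximation to
    f'(x) for f = exp, whose derivative is known exactly."""
    approx = (math.exp(x + h) - math.exp(x)) / h
    return abs(approx - math.exp(x))


errors = {h: forward_difference_error(h)
          for h in (1e-1, 1e-4, 1e-8, 1e-12, 1e-15)}
for h, err in errors.items():
    print(f"h = {h:.0e}   error = {err:.2e}")
```

For the larger values of $h$ the error is dominated by the approximation (truncation) error, which shrinks as $h$ decreases. For the smallest values of $h$ the rounding errors in the subtraction are magnified by the division by $h$ and the error grows again, so an intermediate step size gives the best accuracy.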


Analysis of errors may include looking at the effects 
of uncertainties in the model data, including in any 
parameters in the model that must be estimated. This 
might be tackled in a statistical sense using techniques 
from uncertainty quantification [II.34]; indeed, if
the model has incompletely known data then probabilistic
techniques may already be in use to estimate
the missing data. Sensitivity of the solution of the
model may also be analyzed by obtaining worst-case
error bounds with the aid of condition numbers
[I.2 §22].

3.2 Multiphysics and Multiscale Modeling 

Scientists are increasingly tackling problems with one 
or both of the following characteristics: (a) the sys- 
tem has multiple components, each governed by its 
own physical principles; and (b) the relevant processes 
develop over widely different time and space scales. 
These are called multiphysics and multiscale problems, 
respectively. An example of both is the problem of mod- 
eling how space weather affects the Earth, and in partic- 
ular modeling the interaction of the solar wind (the flow 
of charged particles emitted by the sun) with the Earth’s 
magnetic field. Different physical models describe the 
statistical distribution of the plasma, which consists 
of charged particles, and the evolution of the electric 
and magnetic fields, and these form a coupled non- 
linear system of PDEs. The length scales range from 
millions of kilometers (the Earth-sun distance) to hun- 
dreds of meters, and the timescales range from hours 
down to $10^{-5}$ seconds. Problems such as this pose challenges
both for modeling and for computational solution
of the models. The computations require high-performance
computers [VII.12], and a particular
task is to present the vast quantities of data generated
in such a way that users, such as forecasters of space
weather, can explore and interpret them. More on the
issues of this section can be found in the articles on
COMPUTATIONAL SCIENCE [IV.16] and VISUALIZATION
[VII.13].

3.3 Computational Experiments 

The step in the problem solution when computational 
experiments are carried out might seem to be the eas- 
iest, but it can in fact be one of the hardest and most 
time-consuming, for several reasons. It can be hard to 
decide what experiments to carry out, and it may be 
necessary to refine the experiments many times until 
useful or satisfactory results are obtained. The compu- 
tations may have a long run time, even if executed on 
a high-performance computer. 

Many pitfalls can be avoided by working to modern 
standards of reproducible research [VIII.5], which 
require that programs, data, and results be recorded, 
documented, and made available in such a way that 
the results can be reproduced by an independent 
researcher and, just as importantly, by the original 
author. 

3.4 Validation 

The process of validation involves asking the ques- 
tion, “Have we solved the right equations?” This is not 
to be confused with verification, which asks whether 
the equations in the model have been solved correctly. 
Whereas verification is a purely mathematical question, 
validation intimately involves the underlying physical 
problem. A classic way to validate results from a model 
is to compare them with experimental results. However, 
this is not always feasible, as we may be modeling a 
device or structure that is still in the design phase or 
on which experiments are not possible (e.g., the Earth’s 
climate). 

Validation may not produce a yes or no answer but 
may instead indicate a range of parameters for which 
the model is a good predictor of actual behavior. 

Validation may be the first step of an iterative refine- 
ment procedure in which the steps in figure 1 are 
repeated, with the second and subsequent invocations 
of the first step now comprising adjustments to the cur- 
rent model. Assuming it is feasible to carry out refine- 
ment, there is much to be said for starting with the sim- 
plest possible model and building gradually toward an 
effective model of minimal complexity. 


4 Strategies for Research and Publishing 

Analysis of the research literature in applied mathe- 
matics reveals some common features that can be built 
into a list of strategies for doing research. 

(i) Solve an open problem or prove the truth or falsity 
of a conjecture that has previously been stated in 
the literature. 

(ii) Derive a method for solving a problem that occurs 
in practice and has not been effectively solved 
previously. Problems of very large dimension, for 
which existing techniques might be impractical, 
are good hunting grounds. 

(iii) Prove convergence of a method for which the 
existing convergence theory is incomplete. 

(iv) Spot some previously unnoticed phenomenon and 
explain it. 

(v) Generalize a result or algorithm to a wider class of 
problems, obtaining new insight in doing so. 

(vi) Provide a new derivation of an existing result or 
algorithm that yields new insight. 

(vii) Develop a new measure of cost or error for a prob- 
lem and then derive a new algorithm that is better 
than existing algorithms with respect to that met- 
ric. For example, instead of measuring computa- 
tional cost in the traditional way by the number 
of elementary arithmetic operations, also include 
the cost of data movement when the algorithm is 
implemented on a parallel computer. 

(viii) Find hidden assumptions in an existing method 
and remove them. For example, it may seem obvious
that multiplying two $2 \times 2$ matrices requires
eight multiplications, but Strassen showed that
only seven multiplications are needed (see [I.4 §4]
for the relevant formulas), thereby deriving an
asymptotically fast method for matrix multiplication.

(ix) Rehabilitate an out-of-favor method by showing 
that it can be made competitive again by exploit- 
ing new research results, problem requirements, 
or hardware developments. 

(x) Use mathematical models to gain new insight into 
complex physical processes. 

(xi) Use mathematical models to make quantitative 
predictions about physical phenomena that can 
lead to new procedures, standards, etc., in the tar- 
get field. Here it will probably be necessary to work 
with researchers from other disciplines. 
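
Strategy (viii) can be made concrete with a short Python sketch of the seven-multiplication scheme for 2 × 2 matrices. The formulas below are the ones usually attributed to Strassen; the arrangement given in [I.4 §4] may differ in presentation, and the function names are hypothetical.

```python
def strassen_2x2(A, B):
    """Multiply two 2x2 matrices using seven multiplications."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def classical_2x2(A, B):
    """Multiply two 2x2 matrices in the usual way (eight multiplications)."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(strassen_2x2(A, B))
print(classical_2x2(A, B))
```

The saving of one multiplication matters because the same formulas hold when the entries are themselves matrix blocks, so the scheme can be applied recursively to large matrices.
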

Nowadays, publishing the results of one’s research is
more important than ever. Funding bodies expect to see 
publications, as they provide evidence that the research 
has been successful and they help to disseminate the 
work. Assessment of researchers and their institutions 
increasingly makes use of metrics, some of which relate 
to publications, such as the number of citations a paper 
receives and the ranking of the journal in which it is 
published. There is therefore a tension between publishing
prolifically (which, taken to the extreme, leads
to breaking research up into “least publishable units”) 
and publishing fewer, longer, more-considered papers. 

In addition to the traditional journals and confer- 
ence proceedings that publish (usually) refereed arti- 
cles, nowadays there are many outlets for unrefer- 
eed manuscripts, including institutional eprint servers, 
the global arXiv eprint service, and personal Web 
pages. Blogs provide yet another venue for publish- 
ing research, usually in the form of shorter articles 
presented in a more accessible form than in a regu- 
lar paper. The nonjournal outlets provide for instant 
publication but have varying degrees of visibility and 
permanence. 

The balance between publication of journals and 
books in print only, in both print and electronic form, 
and purely in electronic form has been changing for 
the past decade or more, and the advent of handheld 
devices such as smartphones and tablets has accel- 
erated developments. Equally disruptive is the move- 
ment toward open-access publishing. Traditionally, the 
publishers of mathematics journals did not charge an 
author to publish an article but did charge institutions 
to subscribe to the journal. In recent years a new model 
has been introduced in which the author pays to pub- 
lish an article in a journal and the article is freely 
available to all. 

While we can be sure that there will always be outlets 
for publishing research, it is difficult to predict how the 
forms that these outlets take will evolve in the future.


I.6 The History of Applied
Mathematics

June Barrow-Green and Reinhard 
Siegmund-Schultze 

And as for the Mixt Mathematikes I may only make this 
prediction, that there cannot faile to bee more kinds of 
them, as Nature growes furder disclosed. 

Francis Bacon: Of the Proficience and 
Advancement of Learning (1605) 



Figure 1 Detail from a pictorial representation of J. le R.
d’Alembert’s “Système des Connoissances Humaines” in
the supplement to volume 2 of l’Encyclopédie (1769), mentioning
“mixed mathematics,” for which Francis Bacon had
predicted a great future.

1 Introduction 

What is applied mathematics? This is a difficult question,
one to which there is no simple answer. The massive
growth in applications of mathematics within and
outside the sciences, especially since World War II, has 
made this question even more problematic, the increas- 
ing overlap with other disciplines and their methods 
adding further to the difficulties, creating problems 
that border on the philosophical. 

Given the fact that almost every part of mathematics 
is potentially applicable, there are mathematicians and 
historians who consider the term “applied mathemat- 
ics” primarily as a term of social distinction or a matter 
of attitude. One such was William Bonnor, the mathe- 
matician and gravitational physicist, who in 1962 in a 
lecture on “The future of applied mathematics” said the 
following: 

Applied mathematics, as I should like it to be under- 
stood, means the application of mathematics to any 
subject, physical or otherwise; with the proviso that 
the mathematics shall be interesting and the results 
nontrivial. An applied mathematician, on this view, is 
somebody who has been trained to make such applica- 
tions, and who is always prepared to look for situations 
where fruitful application is possible. As such, he is not
a physicist manqué. I therefore see applied mathematics
as an activity, or attitude of mind, rather than as a
body of knowledge.

Not everyone will agree with Bonnor. Some take
a more methodological approach and almost equate 
applied mathematics with “mathematical modeling.” 
Others are of a more concrete, mathematical mind and 
insist that there are parts of mathematics that are 
per se more or less applicable than others. We find 
Bonnor’s definition appealing because it stresses the 
social dimension of the mathematical working process 
and allows a historical understanding of the notion of 
applied mathematics. 

The importance of “attitudes” notwithstanding, by 
any definition applied mathematics has to be “gen- 
uine” mathematics in the sense that it aims at and/or 
uses general statements (theorems) even if the piece of 
mathematics in question has not yet been fully logically 
established. In fact, the applicability of mathematics is 
mainly based on its “generality,” which in relation to 
fields of application often appears as “abstractness.” 
This applies even to relatively elementary applications 
such as the use of positional number systems. 

Applications of mathematics, even on a nonelemen- 
tary level, have been possible because certain prac- 
tices and properties, such as algorithms for approxima- 
tions or geometrical constructions, have always existed 
within mathematics itself and have led to spontaneous 
or deliberate applications. While, as the universal math- 
ematician John von Neumann observed in 1947, in pure 
mathematics many problems and methods are selected 
for aesthetic reasons, in applied mathematics, prob- 
lems considered at the time as urgent have priority, and 
the choice of methods often has to be subordinated to 
the goals in question. However, attitudes and values, 
which often had and continue to have strong politi- 
cal and economic overtones, have always been instru- 
mental in deciding exactly which parts of mathematics 
should be emphasized and developed. Since attitudes 
have to be promoted through education, this puts a 
great responsibility on teaching and training and makes 
developments in that area an important topic for a 
history of applied mathematics. 

Of course, many modern and recent applications rely 
on older mathematical ideas in differential equations, 
topology, and discrete mathematics and on estab- 
lished notation and symbolism (matrices, quaternions, 
Laplace transforms, etc.), while important new develop- 
ments in integral equations [IV.4], measure theory, 
vector and tensor analysis, etc., at the turn of the 
twentieth century have added to these ideas. 

However, in the twentieth century, three major scientific
and technical innovations both changed and
enlarged the notion of applied mathematics. In rough
chronological order, these are mathematical modeling 
in a broad, modern sense, stochastics (modern proba- 
bility and statistics), and the digital computer. These 
three innovations have, through their interactions, 
restructured applied mathematics. They were princi- 
pally established after World War II, and it was also 
only then that the term “mathematical modeling” came 
to be more frequently used for activities that had hith- 
erto usually been expressed by less concise words such 
as “problem formulation and evaluation.” In addition 
to these innovations, which are essentially concerned 
with methodology, several totally new fields of applica- 
tion, such as electrical engineering, economics, biology, 
meteorology, etc., emerged in the twentieth century. 

While in 1914 one of the pioneers of modern applied 
mathematics, Carl Runge, still doubted whether “the 
name of ‘applied mathematics’ was chosen appropri- 
ately, because when applied to empirical sciences it 
still remains pure mathematics,” the three major inno- 
vations listed above would radically alter and extend 
the notion of mathematics and, in particular, that of 
applied mathematics. Due to these innovations, the 
modern disciplines at the interface of mathematics and 
engineering, such as cybernetics, control theory, com- 
puter science, and optimization, were all able to emerge 
in the 1940s and 1950s in the United States (Wiener, 
Shannon, Dantzig) and the Soviet Union (Andronov, 
Kolmogorov, Pontryagin, Kantorovich) independently, 
and to a somewhat lesser degree in England (Turing,
Southwell, Wilkinson), France (Couffignal), and
elsewhere. These innovations also gradually changed 
“hybrid disciplines,” such as electrical engineering and 
aerodynamics, that had originated at the turn of the 
century. In the case of aerodynamics, not only were 
statistical explanations of turbulence [V.21] increasingly
proposed after World War II, but also conformal
mappings [II.5] gradually lost importance in favor of
computational fluid dynamics [IV.28]. Within opera- 
tions research, with its various approaches and tech- 
niques (linear programming, optimization methods, 
statistical quality control, inventory control, queuing 
analysis, network flow analysis), mathematical con- 
cepts, especially mathematical models, acquired an 
even stronger foothold than in the more traditional 
industrial engineering. 

One typical modern mathematical discipline that inti- 
mately combines pure and applied aspects of the sub- 
ject and that is intertwined with various other scientific 
(physical and biological) and engineering disciplines
is the theory of dynamical systems [IV.20]. After 
initial work in the field by Poincaré, Lyapunov, and
Birkhoff, the theory fell into oblivion until the 1960s. 
This falling away can be explained by fashion (such 
as the trend toward the mathematics of Bourbaki), by 
new demands in applications connected to dissipative 
systems, and by the partial invisibility of the Russian 
school in the West. With the advent of modern com- 
puting devices, the shape of the discipline changed 
dramatically. Mathematicians were empowered com- 
putationally and graphically, the visualization of new 
objects such as fractals was made possible, and applications
in fields such as control theory [IV.34] and
meteorology (quantitative applications as well as qualitative
ones) began to proliferate. The philosophical
discussion about mathematics and applications has 
also been enriched by this discipline, with the public 
being confronted by catchwords such as chaos [II.3], 
catastrophe, and self-organization. However, the pro- 
cess whereby the various streams of problems con- 
verged and led to the subject’s modern incarnation is 
complex: 

In the 1930s, for example, what could the socio- 
professional worlds of the mathematician Birkhoff 
(professor at Harvard), the “grand old man of radio” 
van der Pol (at the Philips Research Lab), and the 
Soviet “physico [engineer] mathematician” Andronov 
at Gorki have had in common? What, in the 1950s, 
had Kolmogorov’s school in common with Lefschetz’s? 

It is precisely this manifold character of social and 
epistemic landscapes that poses problem[s] in this 
history. 

Aubin and Dahan (2002) 

The role played by the three major innovations 
continues largely unabated today, as is evident, for 
instance, from a 2012 report from the Society for Indus- 
trial and Applied Mathematics (SIAM) on industrial 
mathematics: 

Roughly half of all mathematical scientists hired into 
business and industry are statisticians. The second- 
largest group by academic specialty is applied mathe- 
matics. Compared to the 1996 survey, fewer graduates 
reported “modeling and simulation” as an important 
academic specialty for their jobs, and more reported 
“statistics.” Programming and computer skills con- 
tinue to be the most important technical skill that new 
hires bring to their jobs. 

By separating statistics from applied mathematics, 
the SIAM report follows a certain tradition, caused in
part by institutional boundaries, such as the existence
of separate statistics departments in universities. This 
distinction is also partly followed in the present volume 
and in this article, although there is no doubt about the 
crucial role of probability and statistics in applications. 
For example, one need only consider the Monte Carlo 
method (notably developed at Los Alamos in the 1940s
by Stanislaw Ulam and von Neumann, and continued by
Nicholas Metropolis), which is now used in a wide variety
of different contexts including numerical integration,
optimization, and inverse problems. In addition,
the combination of stochastics and modeling in biolog- 
ical and physical applications has had a philosophical 
dimension, contributing to the abandonment of rigid 
causality in science, e.g., through Karl Pearson’s corre- 
lation coefficient and Werner Heisenberg's uncertainty 
principle. However, by the end of the 1960s the limi- 
tations of stochastics in helping us to understand the 
nature of disorder had become apparent, particularly 
in connection with the study of complex (“chaotic”) 
dynamical systems. Nevertheless, stochastics contin- 
ues to play an important role in the development of 
big theories with relevance for applications, including 
statistical models for weather forecasting. 

1.1 Further Themes and Some Limitations 

Putting stochastics on the sidelines is but one of several
limitations of this article, limitations that are
the result of a lack of space, a lack of distance, and 
more general methodological considerations. A further
thematic restriction concerns industrial mathematics,
which figures separately from applied mathematics
in the very name of SIAM, although there are
obvious connections between the two, in particular 
with respect to training and in developing attitudes 
toward applications. Knowing that industrial mathe- 
matics has changed, and above all expanded, from its 
origins in the early twentieth century to move beyond 
its purely industrial context, these connections become 
even clearer. Industrial mathematics is, today, an estab- 
lished subdiscipline, loosely described as the modeling 
of problems of direct and immediate interest to indus- 
try, performed partly in industrial surroundings and 
partly in academic ones. 

The history of mathematical instruments, includ- 
ing both numerical and geometrical devices, and their 
underlying mathematical principles is another topic we 
have had to leave out almost completely. Some dis- 
cussion of the history of mathematical table projects 
is the farthest we reach in this respect. This limita-
tion also applies to the technological basis of mod-
ern computing and the development of software tech-
nology (covered by Cortada’s excellent bibliography),
which has provided, and continues to provide, an 
important stimulus for the development (and fund- 
ing) of applied mathematics. In 2000, in the Journal 
of Computational and Applied Mathematics, it was esti- 
mated that of the increase in computational power, half 
should be attributed to improved algorithms and half 
should be attributed to the increase in computational 
hardware speeds. Computing technology has contin- 
ued to advance rapidly, and companies are making 
more and more aggressive use of high-performance 
computing [VII.12].

A detailed discussion of the fields of application 
of mathematics themselves— be it in (pure) mathemat- 
ics, the sciences, engineering, economics and finance, 
industry, or the military — is absent from this article for 
a number of reasons, both practical and methodologi- 
cal, above all the huge variation of specific conditions 
in these fields. 

This particularly affects the role of mathematics in 
the military, to which we will devote only scattered 
remarks and no systematic discussion. While there are 
still considerable lacunae in the literature on mathe- 
matics during World War I (although some of these have 
been filled by publications prepared for the centenary), 
there is more to be found in print about mathematics 
in World War II, not least because of the increased role 
of that discipline in it. (We recommend Booß-Bavnbek
and Høyrup (2003) as a good place to start to find out
more than is covered in our article.) 

Another topic that deserves broader coverage than 
is possible here is the history of philosophical reflec- 
tion about mathematical applications. This is particu- 
larly true for the notion of “mathematical modeling” 
taken in the sense of problem formulation. Accord- 
ing to the Oxford Encyclopedic Dictionary (1996), the 
new notion of a mathematical model was used first 
in a statistical context in 1901. At about the same 
time, the French physicist and philosopher Pierre 
Duhem accused British physicists of still using the term 
“model” only in the older and narrower sense of mate- 
rial, mechanical, or visualizable models. Duhem there- 
fore preferred the word “analogy” for expressing the 
relationship between a theory and some other set of 
statements. Particularly with the upswing of “math- 
ematical modeling” since the 1980s, a broad litera- 
ture, often with a philosophical bent, has discussed 
the specificity of mathematics as a language, as an 
abstract unifier and a source of concepts and prin- 
ciples for various scientific and societal domains of 
application. Another (though not unrelated) develop- 
ment in the philosophy of applied mathematics con- 
cerns the growing importance of algorithmic aspects 
within mathematics as a whole. It was no coincidence 
that in the 1980s, with the rise of scientific comput- 
ing, several “maverick” philosophers of mathematics, 
such as Philip Kitcher and Thomas Tymoczko, entered
the scene. They introduced the notion of “mathematical 
practice,” by which they meant more than simply appli- 
cations. One of the features of the maverick tradition 
was the polemic against the ambitions of mathematical 
logic to be a canon for the philosophy of mathematics, 
ambitions that have dominated much of the philosophy 
of mathematics in the twentieth century. The change 
was inspired by the work of both those mathemati- 
cians (such as Philip Davis and Reuben Hersh) and those 
philosophers (including Imre Lakatos and David Cor-
field) who were primarily interested in the actual work-
ing process of mathematicians, or what they sometimes
called “real mathematics.” Meanwhile, the philosophi- 
cal discussion of mathematical practice has been pro- 
fessionalized and reconnected to the foundationalist 
tradition. It usually avoids premature discussion of “big 
questions” such as “Why is mathematics applicable?” 
or “Is the growth of mathematics rational?” restrict- 
ing its efforts to themes of mathematical practice in 
a broader sense, like visualization, explanation, purity 
of methods, philosophical aspects of the uses of com- 
puter science in mathematics, and so on. An overview 
of the more recent developments in the philosophy of 
mathematical practice is given in the introduction to 
Mancosu (2008). 

Unfortunately, there is also little space for biograph- 
ical detail in this article, and thus no bow can be 
given to the great historical heroes of applied math- 
ematics, such as Archimedes, Ptolemy, Newton, Euler, 
Laplace, and Gauss. Nor is there room to report on 
the conversions of pure mathematicians into applied 
mathematicians, such as those undergone by Alexan- 
der Ostrowski, John von Neumann, Solomon Lefschetz, 
Ralph Fowler, Garrett Birkhoff, and David Mumford, all 
personal trajectories that paralleled the global devel- 
opment of mathematics. In any case, any systematic 
inclusion of biographies could not be restricted to 
mathematicians, considering the term in its narrow- 
est sense. In an influential report on industrial math- 
ematics in the American Mathematical Monthly of 
1941, Thornton Fry spoke about a “contrast between
the ubiquity of mathematics and the fewness of the 
mathematicians.” Indeed, historically, engineers such 
as Theodore von Kármán, Richard von Mises, Lud-
wig Prandtl, and Oliver Heaviside; physicists such as
Walter Ritz, Aleksandr Andronov, Cornelius Lanczos, 
and Werner Romberg; and industrial mathematicians 
such as Balthasar van der Pol have, by any measure, 
made significant contributions to applied mathemat- 
ics. In addition, several pioneers of applied mathemat- 
ics, such as Gaspard Monge, Felix Klein, Mauro Picone, 
Vladimir Steklov, Vannevar Bush, and John von Neu- 
mann, actively used political connections. The actions 
of nonscientists, and particularly politicians, have also 
therefore played a part in the development of the sub- 
ject. For a full history of applied mathematics the 
concrete interplay of the interests of mathematicians, 
physicists, engineers, the military, industrialists, politi- 
cians, and other appliers of mathematics would have to 
be analyzed, but this is a task that goes well beyond the 
scope of this article. 

In general, the historical origin of individual notions 
or methods of applied mathematics, which often have 
a history spanning several centuries, will not be traced 
here; pertinent historical information is often included 
in the specialized articles elsewhere in this volume. By 
and large, then, this article will focus on the broader 
methodological trends and the institutional advances 
that have occurred in applied mathematics since the 
early nineteenth century. 

1.2 Periodization 

From the point of view of applications, the history 
of mathematics can be roughly divided into five main 
periods that reveal five qualitatively different levels 
of applied mathematics, the first two of which can 
be considered as belonging to the prehistory of the 
subject. 

(1) ca. 4000 B.C.E.-1400 C.E. Emergence of mathemat-
ical thinking, and establishment of theoretical math-
ematics with spontaneous applications.

(2) ca. 1400-1800. Period of “mixed mathematics” cen- 
tered on the Scientific Revolution of the seventeenth 
century and including “rational mechanics” of the 
eighteenth century (dominated by Euler). 

(3) 1800-1890. Applied mathematics between the In- 
dustrial Revolution and the start of what is often 
called the second industrial (or scientific-technical) 
revolution. Gradual establishment of both the term 
and the notion of “applied mathematics.” France 
and Britain dominate applied mathematics, while 
Germany focuses more on pure. 

(4) 1890-1945. The so-called resurgence of applica- 
tions and increasing internationalization of mathe- 
matics. The rise of new fields of application (elec- 
trical communication, aviation, economics, biology, 
psychology), and the development of new methods, 
particularly those related to mathematical modeling 
and statistics. 

(5) 1945-2000. Modern internationalized applied math-
ematics after World War II, inextricably linked with
industrial mathematics and high-speed digital com- 
puting, led largely by the United States and the Soviet 
Union, the new mathematical superpowers. 

Arguably, one could single out at least two additional 
subperiods of applied mathematics: the eighteenth cen- 
tury, with Euler's “rational mechanics,” and the tech- 
nological revolution of the present age accompanied 
by the rise of computer science since the 1980s. How- 
ever, in the first of these subperiods, which will be 
described in some detail below, mathematics as a dis-
cipline was not yet fully established, either institu-
tionally or with respect to its goals and values, so dis-
tinguishing between its pure and applied aspects is not
straightforward. As to the second of the two subperi- 
ods, we believe that these events are so recent that they 
escape an adequate historical description. Moreover, 
World War II had such strong repercussions on math- 
ematics as a whole— particularly on institutionaliza- 
tion (journals, institutes, professionalization), on mate- 
rial underpinning (state funding, computers, industry), 
and not least on the massive migration of mathemati- 
cians to the United States— that it can be considered 
a watershed in the worldwide development of both 
pure and applied mathematics. However, the dramatic 
prediction by James C. Frauenthal— in an editorial of 
SIAM News in 1980 on what he considered the “rev- 
olutionary” change in applied mathematics brought 
about by the invention of the computer— that by 2025 
“in only a few places will there remain centers for 
research in pure mathematics as we know it today” 
seems premature. 

2 Mathematics before the Industrial Revolution

Since the emergence of mathematical thinking around 
4000 B.C.E., through antiquity and up to the start of the 
Renaissance (ca. 1400 C.E.), and embracing the cultures 
of Mesopotamia, Egypt, and ancient Greece, as well as
those of China, India, the Americas, and the Islamic- 
Arabic world, applications arose as a result of various 
societal, technical, philosophical, and religious needs. 
Well before the emergence of the Greek notion of a 
mathematical proof around 500 B.C.E., areas of appli- 
cation of mathematics in various cultures included 
accountancy, agricultural surveying, teaching at scribal 
schools, religious ceremonies, and (somewhat later) 
astronomy. Among the methods used were practical 
arithmetic, basic geometry, elementary combinatorics, 
approximations (e.g., π), and solving quadratic equa-
tions. Instruments included simple measuring and cal-
culation devices: measuring rods, compasses, scales,
knotted ropes, counting rods, abaci, etc. 

The six classical sciences — geometry, arithmetic, as- 
tronomy, music, statics, and optics— existed from the 
time of Greek antiquity and were based on math- 
ematical theory, with the Greek word “mathemata” 
broadly referring to anything teachable and learnable. 
The first four of the classical sciences constituted the 
quadrivium within the Pythagorean-Platonic tradition. 
The theories of musical harmony (as applied arith- 
metic) and astronomy (as applied geometry) can thus 
be considered the two historically earliest branches 
of applied mathematics. The two outstanding applied 
mathematicians of Greek antiquity were Archimedes 
(statics, hydrostatics, mechanics) and Ptolemy (astron- 
omy, optics, geography). Since the early Middle Ages, 
the Computus (Latin for computation)— the calculation 
of the date of Easter in terms of first the Julian calendar 
and later the Gregorian calendar— was considered to be 
the most important computation in Europe. In medieval 
times, particularly from the seventh century, the devel- 
opment of algebraic and calculative techniques and 
of trigonometry in the hands of Islamic and Indian 
mathematicians constituted considerable theoretical 
progress and a basis for further applications, with sig- 
nificant consequences for European mathematics. Par- 
ticularly notable was the Liber Abaci (1202) of Leonardo 
of Pisa (Fibonacci), which heralded the gradual intro- 
duction of the decimal positional system into Europe, 
one of the broadest and most important applications 
of mathematics during the period. Chinese mathemat- 
ics remained more isolated from other cultures at the 
time and is in need of further historical investigation, 
as are some developments within Christian scholastics. 
In spite of their relative fewness and their thematic 
restrictions, we consider the early applications to be a 
deep and historically important root for the emergence 
of theoretical mathematics and not as a mere follow-up 
of the latter. 
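
The Computus mentioned above lends itself to a short illustration. The sketch below uses a widely reproduced modern arithmetic formulation for the Gregorian calendar (often called the anonymous Gregorian algorithm); it is not the medieval procedure, and the function name gregorian_easter is an assumption made for the example.

```python
def gregorian_easter(year):
    """Return (month, day) of Easter Sunday in the Gregorian calendar for the given year."""
    a = year % 19                  # position of the year in the 19-year lunar cycle
    b, c = divmod(year, 100)       # century and year within the century
    d, e = divmod(b, 4)            # leap-century bookkeeping
    f = (b + 8) // 25
    g = (b - f + 1) // 3           # lunar correction
    h = (19 * a + b - d - g + 15) % 30   # determines the date of the paschal full moon
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7  # days from the full moon to the following Sunday
    m = (a + 11 * h + 22 * l) // 451
    n = h + l - 7 * m + 114
    month = n // 31                # 3 = March, 4 = April
    day = n % 31 + 1
    return month, day


if __name__ == "__main__":
    for y in (2000, 2019, 2024):
        print(y, gregorian_easter(y))
```

The whole calculation uses only integer division and remainders, which gives some feeling for why it could be carried out by hand, with tables, long before mechanical computation.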

From the beginning of the fifteenth century to the 
end of the eighteenth century, applications of mathe- 
matics were successively based on the dissemination 
of the decimal system, the rise of symbolic algebra, 
the theory of perspective, functional thinking (Des- 
cartes’s coordinates), the calculus, and natural philos- 
ophy (physics). The teaching of practical arithmetic, 
including the decimal system, by professional “reck- 
oning masters,” such as the German Adam Ries in the 
sixteenth century, remained on the agenda for several 
centuries. Meanwhile, the first systematic discussion 
of decimal fractions appeared in a book by the Dutch 
engineer Simon Stevin in 1585. 

During this period, and connected to the new de- 
mands of society, there emerged various hybrid disci- 
plines combining elements of mathematics and engi- 
neering: architecture, ballistics, navigation, dynamics, 
hydraulics, and so on. Their origins can be traced back, 
at least in part, to medieval times. For example, partly 
as a result of fourteenth-century scholastic analysis, 
the subject of local motion was separated from the tra- 
ditional philosophical problem of general qualitative 
change, thus becoming a subject of study in its own 
right. 

The term “mixed mathematics” as a catch-all for the 
various hybrid disciplines seems to have been intro- 
duced by the Italian Marsilio Ficino during the fifteenth 
century in his commentary on Plato’s Republic. It was 
first used in English by Francis Bacon in 1605. In his 
Mathematicall Praeface to the first English translation
of Euclid’s Elements (1570), John Dee set out a “ground- 
plat” or plan of the “sciences and artes mathematicall,” 
which included astronomy and astrology. Due to the 
broad meaning of the original Greek word, the Latin 
name “mathematicus” was used for almost every Euro- 
pean practitioner or artisan within one of these hybrid 
disciplines. As late as 1716, the loose use of “math- 
ematicus” was deplored by the philosopher Christian 
Wolff (a follower of Leibniz) in his Mathematisches Lex- 
icon, an influential dictionary of mathematics, because 
in his opinion it diminished the role of mathematics. 

The emergence of the new Baconian sciences (magne- 
tism, electricity, chemistry, etc.)— which went beyond 
mixed mathematics and were even partially opposed 
to the mathematical spirit of the classical sciences 
(Bacon’s acknowledgment of the future of mixed math- 
ematics, as expressed in the epigraph, was coupled 
with a certain distrust of pure mathematics) — signaled 
the rise of systematic experimental methods. The lat-
ter reached a symbiosis with mixed mathematics in
the hands of Galileo, Descartes, Huygens, Newton, and
other pioneers of the Scientific Revolution. In their
work, the guiding thought was of a fundamental con-
nection between mathematical and mechanical exact-
ness. However, a certain division between the tradition
of mathematics and of the Baconian sciences remained
palpable until the nineteenth century.

Figure 2 J. H. Zedler, Universallexicon (1731-1754).

While talk about “applications of mathematics” was 
common in the English language from at least the sev- 
enteenth century (in, for example, the work of Isaac 
Barrow and others), the term “applied mathematics” 
(with “applied” as an adjective), as the successor to 
“mixed mathematics,” was apparently not introduced 
prior to the eighteenth century. The name seems to 
have appeared first in German as “angebrachte Mathe- 
matik” (“angebracht” having roughly the meaning of 
“applied,” if a bit more in the sense of “attached”) 
in Wolff’s dictionary of 1716. In Latin it appeared 
two years later, in Johann Friedrich Weidler’s textbook 
Institutiones mathematicae (1718), as “Mathesis appli- 
cata quam nonnulli mixta appellant.” In his Univer- 
sallexicon (1731-54), Johann Heinrich Zedler followed 
Wolff and gave a detailed classification of the parts of
“angebrachte Mathematik” (figure 2).

Figure 3 A. G. Kästner, Anfangsgründe der
Angewandten Mathematik (1759).

Finally, “applied mathematics” figures for the first 
time on the title page of a book as “Angewandte Mathe-
matik” in the second volume of Abraham Gotthelf Käst-
ner’s mathematical textbook Mathematische Anfangs-
gründe (“mathematical elements”) of 1759 (figure 3).
The German philosopher Kant used the term “applied 
mathematics” in his Berlin Preisschrift of 1763/64. 
The latter was translated into English in 1798 and, 
according to the Oxford English Dictionary, it is in this 
translation that the term makes its first appearance in 
English. Meanwhile, “mixed mathematics” (“mathema- 
tiques mixtes”) was still in use in French, appearing 
in the famous Encyclopédie of Diderot and d’Alembert
(1750) (figure 1), and even in the second edition of J.-E. 
Montucla’s Histoire des Mathématiques (1798-1802).

There is no doubt that the period of the Enlight- 
enment of the eighteenth century deserves a special 
place in a history of applied mathematics, particu- 
larly as a bridge— via so-called rational mechanics, 
which uses the invention of the calculus (Newton/
Leibniz)— between the natural philosophy of the sev- 
enteenth century and the mathematical engineering 
and physics of the nineteenth and twentieth centuries. 
There are six principal actors here: three from the 
famous Swiss Bernoulli family (the brothers Jakob 
(James) and Johann (John), and Daniel, the son of the 
latter), Alexis-Claude Clairaut, Jean d’Alembert, and, by 
far the most prolific of all, Leonhard Euler. The domi- 
nating role of applied mathematics at the time is per- 
haps most convincingly illustrated in the themes of the 
prize competitions run by the prominent academies of 
science, especially the Paris Academy. The topics set in 
Paris included the optimum arrangement of ship masts 
(1727), the motion of the moon (1764/68), the motion 
of the satellites of Jupiter (1766), the three-body prob- 
lem (1770/72), the secular perturbations of the moon 
(1774), the perturbations of comet orbits (1776/78/80), 
the perturbation of the orbit of Pallas (in the begin- 
ning of the nineteenth century), the question of heat 
conduction (1810/12), and the propagation of sound 
waves in liquids (1816). Only occasionally, and mostly 
in a later period, would questions of pure mathematics 
be posed in these competitions, examples being ques- 
tions on the theory of polyhedra (1810) and the Fer- 
mat problem (1816). Euler himself, who won the Paris 
prize twelve times, subscribed to the utilitarian mood 
of the Enlightenment when in 1741 he wrote his essay
“On the utility of higher mathematics,” first published 
only in 1847. According to Goldstine (a pioneer and 
historian of numerical analysis): 

Euler did at least the ground work on virtually ev- 
ery topic in modern numerical analysis. This work 
included the basic notions for the numerical integra- 
tion of differential equations. Moreover, his develop- 
ment of lunar theory made possible the accurate cal- 
culation of the moon’s position and the founding of 
the Nautical Almanac in Great Britain. 

Goldstine (1977) 

A typical example of Euler’s influence can be found 
in Carl Runge’s first two articles on the numerical solu- 
tion of differential equations, which appeared in the 
Mathematische Annalen in 1894 and 1895. The sec- 
ond article connects explicitly to Euler’s Introductio in 
Analysin Infinitorum (1748), in which one finds the first 
example of an approximation by a polygonal chain. Of 
Euler’s more than 800 publications, two-thirds belong 
to mechanics of a varying degree of abstractness. The 
noted and controversially discussed continuum mech- 
anist and historian Clifford Truesdell has traced and 
described the innovations that Euler, d’Alembert, and 
the Bernoullis brought into Newtonian mechanics, in 
terms of both modeling and mathematical methods, 
relying on experience but not undertaking systematic 
experiments. In an apparent allusion to the philosoph- 
ical age of reason, he called these innovations “rational 
mechanics,” a term that was occasionally used at the 
time in a broader sense but by which Truesdell meant 
more specifically mathematical mechanics (Truesdell 
1960). Rational mechanics, as Newton had used the 
term in the preface to his Principia (1687), was one of 
two traditions of mechanics already known in Greek 
antiquity, namely the one that “proceeds accurately by 
demonstration” (the other being practical mechanics). 
For Newton, rational mechanics in this sense was the 
core of natural philosophy. Truesdell opposed the view 
(propounded, for instance, by the positivist philoso-
pher Ernst Mach) that Euler and like-minded mathe-
maticians simply systematically applied the new calcu-
lus of Newton and Leibniz to mechanics and thus did not
contribute anything substantial to theoretical mechan- 
ics itself. Today, historians usually stress the concep- 
tual progress both in mathematics (e.g., by removing 
certain geometric elements from the Leibnizian calcu- 
lus) and physics (e.g., exploring in detail the relation 
between force and motion) accomplished in the work of 
Euler. It is probably this role of Euler as a “mathematical 
physicist,” combined with the lack of immediate useful- 
ness of Euler’s rational mechanics, that caused Trues- 
dell in 1960 to declare that the latter was not applied 
mathematics. 
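
The approximation by a polygonal chain mentioned earlier in this paragraph is the idea behind what is now usually called Euler's method for an initial-value problem y' = f(t, y), y(t0) = y0. The following is a minimal modern sketch in Python, not a transcription of Euler's or Runge's own presentation; the function name euler_polygon and its parameters are assumptions made for the example.

```python
def euler_polygon(f, t0, y0, t_end, n):
    """Return the vertices (t, y) of the Euler polygon for y' = f(t, y), y(t0) = y0.

    The interval [t0, t_end] is divided into n equal steps; on each step the
    solution curve is replaced by a straight segment whose slope is f evaluated
    at the left endpoint.
    """
    h = (t_end - t0) / n
    t, y = t0, y0
    points = [(t, y)]
    for _ in range(n):
        y = y + h * f(t, y)  # follow the tangent direction for one step
        t = t + h
        points.append((t, y))
    return points


if __name__ == "__main__":
    # y' = y, y(0) = 1 has the exact solution exp(t); compare the value at t = 1.
    approx = euler_polygon(lambda t, y: y, 0.0, 1.0, 1.0, 1000)[-1][1]
    print(approx)
```

Refining the subdivision brings the polygon closer to the true solution curve, at the cost of more evaluations of f.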

Among the new fields of mathematics was, for 
instance, partial differential equations (of which the 
first example appears in a work by Euler of 1734), which
were successfully applied by Euler and d'Alembert in 
the second half of the 1740s in the analysis of the 
vibrating string. In his Methodus inveniendi of 1744, 
Euler picked up on the tradition of solving the prob- 
lem of the brachistochrone [IV. 6 §1] in the work of 
the Bernoulli brothers. He set standards in the calculus 
of variations developed later by Joseph Louis Lagrange 
and others. As an extension of Daniel Bernoulli's Hydro-
dynamica of 1738, and partly influenced by d’Alembert,
Euler’s equations of fluid dynamics (1755), which did 
not yet have a term for viscosity, remained a challenge 
for generations of mathematicians to come, not least 
due to their nonlinearity. Euler, like d’Alembert before 
him, was unable to produce from his equations of fluid 
dynamics a single new result fit for comparison with 
experiment. 

In our opinion (deviating slightly from Truesdell),
rational mechanics, in hindsight, bears almost all the 
characteristics of applied mathematics in the mod- 
ern sense. At the time, however, in a predominantly 
utilitarian environment, it was the pinnacle of math- 
ematics. It was then rarely counted as mixed mathe- 
matics, notwithstanding some occasional remarks by 
d’Alembert. The term mixed mathematics was more 
frequently used for the mathematically less advanced 
engineering mechanics (Bernard Forest de Belidor and 
Charles-Augustin de Coulomb, etc.) of the time and for 
other fields of application. 

Toward the end of the eighteenth century rational 
mechanics was somewhat narrowed down, both the- 
matically and with respect to possible applications 
(although still including continuum mechanics), by fur- 
ther mathematical formalization, particularly at the 
hands of Lagrange, Euler’s successor in Berlin, whose 
Méchanique Analitique first appeared in 1788. The tow-
ering figure of Pierre Simon Laplace in Paris, with
his pioneering work since the late 1770s in celestial
and terrestrial mechanics and in probability theory,
foreshadowed much of the important French work
in applied mathematics, such as that done by Pois- 
son, Fourier, Cauchy, and others in the century that 
followed. To Laplace (generating functions, difference 
equations) and to his great younger contemporary Carl 
Friedrich Gauss in Göttingen (numerical integration,
elimination, least-squares method) we owe much of 
the foundations of future numerical analysis. Parts 
of their work overlapped (interpolation), while parts 
were supplemented by Adrien-Marie Legendre (least- 
squares method), details of which can be traced from 
Goldstine’s A History of Numerical Analysis. 

3 Applied Mathematics in the Nineteenth Century

Around 1800, in the age of the Industrial Revolution 
and of continued nation building, state funding and 
political and ideological support (revolution in France, 
Neo-Humanism in Germany) led (mainly through teach- 
ing and journals) to a new level of recognition for 
mathematics as a discipline. The older bifurcation of 
pure/mixed mathematics was replaced in France and 
Germany (although not yet in England) by that of 
pure/applied. The difference was mainly that before 
1800 only mixed mathematics together with rational 
mechanics had the support of patrons, while now, 
around 1800, the whole of mathematics was beginning 
to be supported and recognized. Somewhat paradox- 
ically, then, in spite of the general importance of the 
Industrial Revolution as a historical background, it is 
pure mathematics that increasingly gets systematic 
public support for the first time. Indeed, for most 
of the nineteenth century, mathematics would not be 
strongly represented in either engineering or industrial 
environments. 

The foundation of the École Polytechnique (EP) in
Paris in 1794 is a good point of reference for the begin-
ning of our third period. The EP, where military and civil
engineers were trained, became the leading and “most
mathematical” institution within a system of techni-
cal education. This included several “schools of appli-
cations,” such as the École des Mines and the École
Nationale des Ponts et Chaussées, to which the stu-
dents of the EP proceeded. The EP became an exam-
ple to be emulated by many technical colleges, par-
ticularly in German-speaking regions, throughout the
nineteenth century. The most influential mathemati- 
cian in the early history of the EP was Gaspard Monge, 
and it was in accordance with his ideas that mathemat- 
ics became one of the bases of the EP curriculum. In 
1795, in the introduction to his lectures on descriptive 
geometry, the theory that became the “language of the 
engineer” for more than a century, Monge wrote: 

In order to reduce the dependence of the French nation
on foreign industry one has to direct public education
to those subjects which require precision.

Monge’s aspirations for a use of higher mathematics 
in industrial production remained largely unfulfilled at 
the time, except for the use of descriptive geometry. 
However, developments in industry and in educational 
systems led to a stronger focus on the criteria for pre- 
cision and exactitude in the sciences (most notably in 
academic physics) and in engineering, preparing the 
ground for an increased use of mathematics in these 
fields of application at the beginning of the twenti- 
eth century. In fact, it could be argued that it required 
a logical consolidation and a more theoretical phase 
of the development of mathematical analysis before a 
new phase of more sophisticated applied mathematics 
could set in. 

The first concrete institutional confirmation of the 
notion of “applied mathematics” was the appearance 
of the term in the names of journals. Again, the Ger- 
mans were quicker than the French here. Two short- 
lived journals cofounded by the influential combina- 
torialist Carl Friedrich Hindenburg were the Leipziger 
Magazin für reine und angewandte Mathematik (1786-
89) and the Archiv für reine und angewandte Mathe-
matik (1795-99). A somewhat longer career was had
by Annales de Mathématiques Pures et Appliquées,
founded by Joseph Diaz Gergonne in 1810 (figure 4).
While this journal survived only until 1832, the Ger-
man Journal für die reine und angewandte Mathe-
matik (which, according to the preface by its founder
August Leopold Crelle in 1826, was largely modeled
after Gergonne’s journal) is still extant today. This is
true too of the French Journal de Mathématiques Pures
et Appliquées, founded by Joseph Liouville in 1836,
and of the Italian Annali di Matematica Pura ed Appli-
cata, launched by Francesco Brioschi and Barnaba Tor-
tolini in Italy in 1858 as an immediate successor to the
Annali di Scienze Matematiche e Fisiche. On the other
hand, James Joseph Sylvester’s Quarterly Journal of
Pure and Applied Mathematics, which was founded in
1855, survived only until 1927.

Figure 4 Gergonne’s Annales de Mathématiques
Pures et Appliquées (1810-11).

The inclusion of “applied mathematics” in the names 
of these nineteenth-century journals did not necessar- 
ily guarantee a strong representation of applied topics, 
however, either in the journals themselves or in the 
mathematical culture at large. But neither were these 
journals the only outlets for articles on applied topics. 
Journals associated with national academies, such as 
the Philosophical Transactions of the Royal Society, car- 
ried articles on applied topics, while the Philosophical 
Magazine (launched in 1798) was the journal of choice 
for several leading nineteenth-century British applied 
mathematicians.

This was also the period in which positions explic- 
itly devoted to applications were created at universities. 
In Norway, which had just introduced a constitution 
and was emancipating itself from Danish rule, Christo- 
pher Hansteen's position as “lecturer for applied math- 
ematics” (“Lector i den anvendte Mathematik”) at the 
newly founded university in Christiania was expressly 
justified in May 1814 by “the broad scope of applied 
mathematics and its importance for Norway.” In 1815 
Hansteen was promoted to “Professor Matheseos appli- 
catae.” 

Throughout the nineteenth century, the mathema- 
tization of mechanics continued largely in the tradi- 
tion of Lagrange’s analytical mechanics, with a division 
of labor between physicists and mathematicians such 
as William Rowan Hamilton and Carl Jacobi, arguably 
neglecting some of the topics and insights of Euler’s 
rational mechanics, particularly in continuum mechan- 
ics. However, from the 1820s, although the EP still gave 
preference to analytical mechanics in its courses, there 
were efforts among the professors there, and at the 
more practically oriented French engineering schools 
(“ecoles d’application”), to develop a mechanics for 
the special needs of engineers, a discipline that would 
today be called technical mechanics. The latter drew 
strongly on traditions in mixed mathematics, such as 
the work of de Belidor in hydraulics from the 1730s to 
the 1750s and that of de Coulomb in mechanics and 
electromagnetism from the 1780s onward. It found its 
first energetic proponents in Claude Navier, Jean Victor 
Poncelet, and Gaspard Gustave de Coriolis. 

Around 1820, Poncelet separately developed his pro- 
jective geometry, which became part of the mathemati- 
cally rather sophisticated engineering education at sev- 
eral continental technical colleges. It led to methods 
such as graphical statics, founded by the German-Swiss
Carl Culmann in the middle of the century, with appli- 
cations in crystallography and civil engineering, the lat- 
ter exemplified by the construction of the Eiffel Tower 
in 1889. Also in the 1820s, influenced by Euler’s hydro- 
dynamics and possibly by Navier's work in engineering, 
Augustin-Louis Cauchy was the first to base the theory
of elasticity on a general definition of internal stresses. 
The work of Cauchy, who was at the same time known 
for his efforts to introduce rigor into analysis, under- 
scores the dominance of the French in both pure and 
applied mathematics during the early nineteenth cen- 
tury, with the singular work of Gauss in Gottingen being 
the only notable exception. 

In England the development of both pure and applied 
mathematics during the nineteenth century showed 
marked differences from that in Continental Europe. 
One of the goals of the short-lived Analytical Society 
(1812-19), founded in Cambridge by Charles Babbage 
and others, was to promote Leibnizian calculus over 
Newtonian calculus, or, in Babbage’s words, to promote 
“The Principles of pure D-ism in opposition to the Dot- 
age of the University.” The members of the Analytical 
Society were impressed by the new rigor in analysis 
achieved in France, especially in the work of Lagrange, 
and lobbied for a change in teaching and research in 
Cambridge mathematics, and in particular in the exam- 
inations of the Mathematical Tripos, which were very 
much based on traditional mixed and physical math- 
ematics, as well as on Euclid’s Elements. If anything, 
though, this aspect of the French influence led away 
from applications and toward a gradual purification of 
British mathematics. 

Babbage was impressed by the French mathematical 
tables project directed by Gaspard de Prony at the end 
of the eighteenth century. In a similar vein to Monge 
before him, Babbage pointed to increased competition 
between nations in the age of industrialization, and 
he stressed the need for the development of calculat- 
ing techniques. In On the Economy of Machinery and 
Manufactures (1832) he wrote: 

It is the science of calculation,—which becomes continually
more necessary at each step of our progress,
and which must ultimately govern the whole of the
applications of science to the arts of life.

Another (at least indirect) impact of the Industrial 
Revolution on mathematics was the Russian Pafnuty 
Lvovich Chebyshev’s study of James Watt’s steam 
engine, in particular of the “governor,” the theory of
which proved to be a stimulus for the notion of feed- 
back in control theory, and the modern theory of ser- 
vomechanisms. Chebyshev’s interest in the technical 
mechanics of links was also one of the stimuli for 
his studies concerning mathematical approximation 
theory in the 1850s. In addition, he was impressed with
Poncelet’s technical mechanics. As a result, and due
to Chebyshev’s great influence within Russian mathe- 
matics, applied mathematics remained much more part 
of mainstream mathematics in Russia during the latter 
half of the nineteenth century than it was in other parts 
of Europe, especially in Germany. 

In the middle of the nineteenth century, the French 
engineering schools, in particular the EP, lost their 
predominant position in mathematics, due to slow 
industrial development in France and problems with 
the overcentralized and elitist educational system. The 
lead was taken by the German-speaking technical col- 
leges (“Technische Hochschulen”) in Prague, Vienna, 
Karlsruhe, and Zurich, in particular with respect to 
the mathematization of the engineering sciences. This 
was true for their emulation of the general axiomatic 
spirit of mathematics even more than for their con- 
cern for the actual mathematical details. Ferdinand 
Redtenbacher (in his analytical machine theory (1852)) 
and Franz Reuleaux (in his kinematics (1875)) aimed at 
“designing invention and construction deductively.” So 
convinced of the important future role of mathematics 
were leading engineers at the Technische Hochschulen 
that they supported the appointment of academi- 
cally trained mathematicians from the classical uni- 
versities. In this way pure mathematicians, such as 
Richard Dedekind, Alfred Clebsch, and later Felix Klein, 
assumed positions at Technische Hochschulen in which 
they were responsible for the education of engineers. 

In parallel, and also from the middle of the nine- 
teenth century, mathematics at the leading German uni- 
versities that did not have engineering departments 
increasingly developed into a pure science, detached 
from practical applications. Supported by the ideol- 
ogy of “Neo-Humanism” within a politically unmod- 
ernized environment, the discipline's educational goal 
(and its legitimation in society) was the training of 
high school teachers, who during their studies were 
often introduced to the frontiers of recent (pure) math- 
ematical research. The result of this was that profes- 
sors at the Technische Hochschulen who were hired 
from the traditional universities were not really pre- 
pared for training engineers. In the long run, the strat- 
egy of appointing university mathematicians backfired 
and this, together with general controversies about the 
social status of technical schools, led to the so-called 
anti-mathematical movement of engineers in Germany 
in the 1890s. 

British and Irish applied mathematics, in the sense 
of mathematical physics, remained strong through the 
nineteenth century, with work by George Green, George
Stokes, William Rowan Hamilton, James Clerk Maxwell, 
William Thomson (Lord Kelvin), Lord Rayleigh, William 
Rankine, Oliver Heaviside, Karl Pearson, and others, 
while in the works of James Joseph Sylvester and 
Arthur Cayley in the 1850s, the foundations of mod- 
ern matrix theory were laid. That being said, there was 
no systematic, state-supported technical or engineer- 
ing education in the British system until late in the 
nineteenth century. What was taught in this respect 
at schools and traditional universities was increasingly 
questioned by engineers such as John Perry (see below), 
particularly with respect to the mathematics involved. 
In England the term “mixed mathematics” was occa- 
sionally used interchangeably with “applied mathemat- 
ics” up until the end of the century, a prominent exam- 
ple of this being the tribute by Richard Walker (then 
president of the London Mathematical Society) to Lord 
Rayleigh on winning the society's De Morgan Medal in 
1890. 

Applications of mathematics also featured among 
the activities of the British Association for the Advance- 
ment of Science, which, in 1871, formed a Mathematical 
Tables Committee for both cataloguing and producing 
numerical tables; the committee lasted, with varying 
levels of intensity, until 1948, when the Royal Society 
took over. A prime example of joint enterprise between 
pure and applied mathematicians (the original committee
consisted of Cayley, Henry Smith, Stokes, and Thomson),
the project catered for all tastes, its products
including both factor tables and Bessel function
tables, among others. As J. W. L. Glaisher, the project 
secretary, wrote in 1873: “one of the most valuable uses 
of numerical tables is that they connect mathematics 
and physics, and enable the extension of the former 
to bear fruit practically in aiding the advance of the 
latter.” The project was finally dissolved in 1965, with 
some of the greatest British mathematicians, both pure 
and applied, having been active in its work. 

With the upswing of electrical engineering, more 
sophisticated mathematics (operational calculus, com- 
plex numbers, vectors) finally entered industry in 
around 1890, e.g., through the work of Heaviside in 
England and of the German immigrant Charles P. Stein- 
metz in the United States. Mechanical engineering, on 
the other hand, e.g., in the construction of turbines, 
remained free of advanced mathematics until well into 
the twentieth century. 

It was also not until the end of the nineteenth century
that applied mathematics finally began to lose its
almost exclusive bond to mathematical physics and
mechanics; new fields of application, new methods 
such as statistics, and new professions such as actuar- 
ial and industrial mathematicians were largely matters 
for the twentieth century. 

This change is nicely captured through the exam- 
ple of the English applied mathematician Karl Pearson. 
In his philosophical book Grammar of Science (1892), 
Pearson, who at the time was mainly known for his work 
on elasticity, defined as the “topic of Applied Mathe- 
matics . . . the process of analyzing inorganic phenom- 
ena by aid of ideal elementary motions.” At the time 
Pearson was already working on biometrics, the sub- 
ject that would lead him to found, together with Fran- 
cis Galton, the journal Biometrika in 1901. Therefore, 
although Pearson was effectively extending the realm 
of applications of mathematics to the statistical analy- 
sis of biological (i.e., organic) phenomena, he appar- 
ently did not consider what he was doing to be applied 
mathematics. 

4 The “Resurgence of Applications” and 
New Developments up to World War II 

From the 1890s, the University of Gottingen (pure) 
mathematician Felix Klein saw the importance of tak- 
ing the diverging interests of the engineering profes- 
sors at technical colleges and those of German univer- 
sity mathematicians into account. Not only did differ- 
ent professions (teaching and engineering) require dif- 
ferent education, but the gradual emergence of indus- 
trial mathematics had to be considered as well. Klein 
recognized the need for reform, including in teach- 
ing at high school level, and he developed Gottingen 
into a center of mathematics and the exact sciences 
(figure 5). Chairs for applied mathematics and applied 
mechanics were created there in 1904, with Carl Runge 
and Ludwig Prandtl being the first appointees. Mean- 
while, from 1901, and under the editorship of Runge, 
the transformation of Zeitschrift fur Mathematik und 
Physik into a journal exclusively for applied mathe- 
matics had begun. These events in Germany, contrast- 
ing with those of the period before, led to talk about 
a “resurgence of applications” (“Wiederhervorkommen 
der Anwendungen”). 

From 1898 and for several decades afterward, the 
famous German multivolume Encyclopedia of the Math- 
ematical Sciences including Their Applications was 
edited by Klein together with Walther von Dyck and 
Arnold Sommerfeld, both from Munich, and others. 

Figure 5 F. Klein, Präzisions- und Approximationsmathematik
(1928). Posthumous publication of Klein’s 1901 lectures
in Göttingen where he tried to differentiate between
a mathematics of precision and one of approximation.

The articles in it, all written in German but includ- 
ing authors from France and Britain, such as Paul 
Painleve and Edmund Taylor Whittaker, contain valu- 
able historical references that are still worth consult- 
ing today. “Applications” as emphasized in the title 
and in the program of the Encyclopedia meant areas 
of application, such as mechanics, electricity, optics, 
and geodesy. The articles were assigned to volumes IV-VI,
which were in themselves divided into several
voluminous books each. There were also articles on 
mechanical engineering, such as those by von Mises and 
von Karman. However, topics that would today be clas- 
sified as core subjects of applied mathematics — such 
as numerical calculation (Rudolph Mehmke), differ- 
ence equations (Dmitri Seliwanoff), and interpolation 
and error compensation (both by Julius Bauschinger) — 
appeared as appendices within volume I, which was
devoted to pure mathematics (arithmetic, algebra, and
probability). Runge’s contribution on “separation and 
approximation of roots” (1899) was subsumed under 
“algebra.” 

Klein also succeeded in introducing a state examina- 
tion in applied mathematics for mathematics teachers, 
which focused on numerical methods, geodesy, statis- 
tics, and astronomy. In addition, he inspired educa- 
tional reform of mathematics in high schools that he 
designed around the notions of “functional thinking” 
and “intuition,” thereby trying to counteract the overly 
logical and arithmetical tendencies that had until then 
permeated mathematics education. Klein and his allies 
insisted on taking into account international develop- 
ments in teaching and research, for instance by ini- 
tiating a series of comparative international reports 
on mathematical education; these reports in turn led 
to the creation of what has now become the Interna- 
tional Commission on Mathematical Instruction (ICMI). 
The Encyclopedia also provided evidence of the increas- 
ing significance of the international dimension. In his 
“introductory report” in 1904, von Dyck stressed the 
importance for the project of securing foreign authors 
in applied mathematics. Later, a French translation 
of the Encyclopedia began to appear in a consider- 
ably enlarged version, although the project was never 
completed due to the outbreak of World War I. 

Around 1900, reform movements reacting to prob- 
lems in mathematics education similar to those in 
Germany existed in almost all industrialized nations. 
In England, the engineer John Perry had initiated a 
reform of engineering education in the 1890s, and this 
reform played into the ongoing critical discussions of 
the antiquated Cambridge Mathematical Tripos exam- 
inations and their traditional reliance on Euclid. The 
“Perry Movement” was noticed in Germany and in the 
United States. On the pages of Science in 1903, the 
founding father of modern American mathematics, Eli- 
akim Hastings Moore, himself very much a pure math- 
ematician, declared himself to be in “agreement with 
Perry” and proposed a “laboratory method of instruc- 
tion in mathematics and physics.” At about the same 
time (1905), similar ideas “de creer de vrais labora- 
toires de Mathematiques” were proposed by Emile Borel 
in France. In Edinburgh, Whittaker instituted a “Math- 
ematical Laboratory” in 1913 and later, together with 
George Robinson, published the influential The Calcu- 
lus of Observations: A Treatise on Numerical Mathe- 
matics (1924), which derived from Whittaker’s lectures 
given in the Mathematical Laboratory. In Germany, the 
term “Mathematisches Praktikum” was most often used
to describe laboratory work in mathematics. Common 
to these efforts was the aim to foster among students 
practical abilities in calculation and drawing, extending 
beyond mere theoretical mathematical knowledge. 

Along with the German Encyclopedia and Whittaker 
and Robinson’s Calculus of Observations, several influ- 
ential textbooks on applied mathematics, on both 
numerical and geometrical methods, appeared in var- 
ious countries at the end of the nineteenth century 
and into the first decades of the new century, prepar- 
ing the ground for future advances. Horace Lamb’s 
Hydrodynamics (1895), an expanded version of his 
1879 text deriving from his Cambridge course, and 
Augustus Love’s Elasticity (1892) became staple fare 
for mathematicians in Britain, each running into sev- 
eral editions. In France Maurice d’Ocagne published 
his Traite de Nomographie (1899), which was initially 
developed for civil engineering (particularly railway 
and bridge construction) but was also used in the 
study of ballistics during World War I and thereafter.
Carl Runge’s lectures at Columbia University in New 
York (1909/10) were published as Graphical Methods 
in the United States in 1912, and appeared in a German
translation in 1914. Runge’s joint book with Hermann
König, Vorlesungen über numerisches Rechnen
(1924), part of Springer’s Grundlehren der mathematischen
Wissenschaften series, became a classic. Published
that same year, and in the same series, Differenzenrechnung
by the Dane Niels Erik Nørlund became
“the standard treatise on new aspects of difference 
equations.” This was stated in a report to the Amer- 
ican National Research Council on Numerical Integra- 
tion of Differential Equations (1933), authored by Albert 
Bennett, William Milne, and Harry Bateman, which contains
many valuable references. Cyrus Colton MacDuffee’s
The Theory of Matrices (1933), also published by
Springer in Germany, aimed, according to its preface, 
“to unify certain parts of the theory” because “many 
of the fundamental properties of matrices were first 
discovered in the notation of a particular application.” 
Some influential Russian textbooks such as L. V. Kan- 
torovich’s and V. I. Krylov’s Approximate Methods of 
Higher Analysis (1941) became known in the West only 
much later through translations. 

While it was certainly more a chemists’ and physicists’
war than one of mathematicians, World War I
involved mathematicians on all sides. But politicians 
were slow to recognize the specific contribution that 
mathematics could offer to the war effort. France did
little to protect her most promising young mathematicians
from the front, and about half of the science
cohort studying at the Ecole Normale Superieure at 
the outbreak of the war was killed. Leading mathe- 
maticians such as Borel, however, gradually became 
involved as scientific advisors in warfare, while ballis- 
tics at the Gavre proving ground was developed under 
Prosper Charbonnier, using graphical and numerical 
methods. In England Ralph Fowler, Edward Milne, and 
Herbert Richmond worked on the mathematics of anti- 
aircraft gunnery, while J. E. Littlewood applied his tal- 
ents to more general ballistics and Karl Pearson inter- 
rupted his statistical research in biomathematics to 
oversee the production of the corresponding range 
tables. In Italy the experiences of Mauro Picone as 
an artillery officer working on ballistics research were 
influential with respect to the formation of the insti- 
tute that was later set up under his lead (see below). 
Meanwhile, a key event in Germany was the founda- 
tion of the Aerodynamische Versuchsanstalt (“Aerody- 
namic Proving Ground”) in Gottingen in 1917, mainly 
financed by the military. This Proving Ground, together 
with the more theoretical Institute for Fluid Mechan- 
ics established in 1925 under the leadership of Lud- 
wig Prandtl, would become the most advanced aerody- 
namic research installation in the world. In the United 
States systematic ballistics research was done in Wash- 
ington under celestial mechanist Forest Ray Moulton 
and at the newly founded Aberdeen Proving Ground 
in Maryland under the geometer Oswald Veblen. In this 
context, finite-difference methods were introduced into 
ballistic computation by Moulton, as were higher meth- 
ods of the calculus of variations by Gilbert Bliss in 
Aberdeen, which even in 1927 were described by Bliss 
as being “on the farther boundary of the explored math- 
ematical domain of today.” From 1917 the new National 
Advisory Committee for Aeronautics, the predecessor 
of NASA, had its Langley Laboratory in Virginia, where 
Prandtl’s former student Max Munk was particularly 
influential in the 1920s when he explained the math- 
ematics of aerodynamics to the laboratory’s engineers 
and built a variable-density wind tunnel. 

More important for applied mathematics than the 
war itself, though, were its consequences. 

The war led to the gradual recognition of the impor- 
tance of the fundamental sciences (including mathe- 
matics), beyond mere technical inventions, for indus- 
trial and military applications. Science systems that 
were under close political control and received their 
main funding from government, such as those in 
Germany, Italy, and Russia, had the potential to quickly
recognize and respond to the importance of applied 
mathematics. There is of course nothing guaranteed 
about this relationship, as the example of the rather 
slow development in France shows. In the United States, 
new modes of industrial mass production during the 
war included scientific management (Taylorism). The 
superior strength of material resources, particularly in 
the electrical industry and in the development of calculating
devices (IBM was founded in 1925), pointed
to the future role of the United States in applied math- 
ematics. Nevertheless, the predominantly private, but 
nonindustrial, sponsorship of the American university 
system before World War II slowed down the arrival of 
academic applied mathematics in the United States. 

The foundation of the institute for applied math- 
ematics at the University of Berlin (in 1920, under 
von Mises) can be partly explained against this back- 
drop of war and international competition. From 1921 
von Mises edited the Zeitschrift für angewandte Mathematik
und Mechanik (ZAMM), which had a clear engineering
context, unlike the older Zeitschrift für Mathematik
und Physik. In his programmatic article “On the
tasks and goals of applied mathematics,” with which 
he started his new journal, von Mises gave his own 
definition of applied mathematics and added: 

As a matter of course we put ourselves on the basis
of the present, particularly on the standpoint of the
scientifically minded engineer.

The foundation of the ZAMM was significantly ahead 
of the first Russian journal that expressly (and exclu- 
sively) referred to applications in its title, Prikladnaya 
Matematika i Mekhanika, which began in 1933; it is
today regularly translated into English as the Journal 
of Applied Mathematics and Mechanics. The first Amer- 
ican journal with “applied mathematics” in its title was 
the Quarterly of Applied Mathematics, edited at Brown
University from 1943, which was partly modeled on 
von Mises’s ZAMM. However, the editors made a point 
of dropping “mechanics” from the title both in order
to stress autonomy from the German example and to 
express the new level of independence of the field. 

In spite of, or rather because of, increased eco- 
nomic competition after the war, internationalization 
remained the order of the day even in the applied 
sciences, as promoted by Felix Klein before the war. 
Indeed, the applied mathematician and fluid dynam- 
ics engineer von Karman, who was closely connected 
to Göttingen, initiated the international congresses for
applied mechanics in 1924 (after organizing a successful
conference on hydrodynamics and aerodynamics in
Innsbruck in 1922) because he felt that the critical mass
for discussion was not big enough on a national level 
and because postwar policies on each side threatened 
communication. Being Hungarian by origin, he was less 
skeptical about restoring international collaboration 
than von Mises and others who had stronger national- 
ist feelings. As von Karman had hoped, the congresses 
attracted participants from across Europe, and by 1930, 
when the congress was held in Stockholm, there was 
also a strong delegation from the United States. Russian
participation fared less well, however: politics did get
in the way, and Russian mathematicians barely made an
appearance.

Applied mathematics exhibited particular challenges 
for internationalization, not least because of its engi- 
neering context, which often bore traits of national 
idiosyncrasies, as is clear from the international efforts 
to unify terminology in vector calculus in around 1900. 
As late as 1935, at the Volta congress “High Veloc- 
ities in Aviation” in Rome, terminology in ballistics 
and fluid mechanics was considered to be far from 
internationally standardized. 

On the French side, the foundation of the Institut 
Henri Poincare (1926), on the initiative of Borel and 
funded with American Rockefeller money, was a clear 
token of internationalization. Although the institute 
was less instrumental in the development of applied 
mathematics than its founder, with his broad view of 
“Borelian mathematics” (which stressed links between 
pure mathematics and applications, mainly based on 
mathematical physics and the theory of probability), 
had anticipated and wished for, it did provide the seeds 
for the development of French applied mathematics 
after the war. 

In other European countries, such as Italy and Soviet 
Russia, new efforts to further work on applications 
were launched in the 1920s, partly motivated by expe- 
riences during the war and by political revolutions. 
The foundation in 1921 of the Institute of Physics 
and Mathematics of the Soviet Academy of Sciences by 
Vladimir Steklov (a former student of Aleksandr Lya- 
punov), which was supported by Lenin, was crucial for 
the further development of applied mathematics in the 
Soviet Union; the history of the institute has yet to be 
written. In Italy, under Mussolini, the central event con- 
cerning applications of mathematics in industry and 
the military was the creation in 1927-31 of the Italian 
Research Council’s National Institute for the Application
of the Calculus under Mauro Picone in Naples and
later in Rome. 

In contrast, Britain created no new institutes for 
mathematics. Even G. H. Hardy, who had left Cambridge 
for Oxford shortly after the war, was unable to per- 
suade his new university to build one. Nevertheless, 
the war left a tangible legacy for applied mathematics. 
Imperial College received a substantial grant to finance 
its Department of Aeronautics, while Cambridge estab- 
lished a new chair in aeronautical engineering. As a 
result of increased funding after the war, establish- 
ments such as the Royal Aircraft Establishment and the 
National Physical Laboratory were able to retain a num- 
ber of their wartime staff, several of whom were math- 
ematicians. Notable inclusions were Hermann Glauert, 
who made a career in aerodynamics at the Royal Air- 
craft Establishment, and Robert Frazer, who worked on 
wing flutter at the National Physical Laboratory. In the 
1930s Frazer and his colleagues W. J. Duncan and A. R. 
Collar were “the first to use matrices in applied math- 
ematics.” In addition, theoreticians and practitioners 
who were brought together because of the war worked 
together afterward. Sometimes, as in the case of the 
Cambridge mathematician Arthur Berry and the aeronautical
engineer Leonard Bairstow, the end of hostilities
meant only the end of working in the same location;
it did not mean the end of collaboration.

In the interwar period the degree of industrialization 
in a particular country was without doubt one of the 
defining factors in that country’s support of applied 
mathematics. This is well exemplified by the solid 
development of applied mathematics in industrialized 
Czechoslovakia compared with the strong tradition in 
pure mathematics in less industrialized Poland. 

Indeed, it became increasingly obvious after the war 
that engineering mathematics and insurance mathe- 
matics, both of which corresponded to the develop- 
ing needs of the new professions and industries, had 
become legitimate parts of applied mathematics. Not 
only were they the most promising areas of the sub- 
ject, but they were economically the most rewarding. 
Students trained at von Mises’s institute in Berlin and 
at Prandtl's institute in Gottingen found jobs in vari- 
ous aerodynamic laboratories and proving grounds, as 
well as in industry. Von Mises himself both undertook 
governmental assignments and acted as an advisor for 
industry. At Siemens, AEG, and Zeiss (all in Germany), 
the General Electric Company (in Britain), Philips (in the 
Netherlands), and General Electric and Bell Laboratories
(in the United States) (Millman 1984), industrial laboratories
(mainly in electrical engineering but also, for
instance, in the optical and aviation industries) devel- 
oped a demand for trained mathematicians. It was the 
study of the propagation of radio waves and of the elec- 
trical devices required to generate them that led in 1920 
to the Dutchman van der Pol working out the equation 
that is to this day considered as the prototype of the 
nonlinear feedback oscillator. Van der Pol's equation
[IV.2 §10] and his modeling approach have repeatedly
been cited as exemplars for modern applied mathematics.
Van der Pol's contribution, together with theoretical
work by Henri Poincare on limit cycles, strongly
influenced Russian work on nonlinear mechanics. Its 
mathematical depth gained the approval (albeit some- 
what reluctant approval) even of Andre Weil, a foremost 
member of the Bourbaki group of French mathemati- 
cians, who in 1950 called it “one of the few interesting 
problems which contemporary physics has suggested 
to mathematics.” Van der Pol, who worked at the Philips 
Laboratories in Eindhoven from 1922, also contributed 
to the justification of the Heaviside operational cal- 
culus in electrical engineering. Around 1929 he used 
integral transformation methods similar to those devel- 
oped before him by the English mathematician Thomas 
Bromwich and the American engineer John Carson at 
Bell Laboratories, who in 1926 wrote the influential 
book Electric Circuit Theory and the Operational Cal- 
culus. Somewhat later, the German Gustav Doetsch pro- 
vided a more systematic justification of Heaviside’s cal- 
culus based on the theory of the Laplace transform 
in his well-received book Theorie und Anwendung der 
Laplace-Transformation (1937). In another influential 
book, Economic Control of Quality of Manufactured 
Product (1931), the physicist Walter Shewhart, a colleague
of Carson’s at Bell Laboratories, was one of the
first to promote statistics for industrial quality control 
using so-called control charts. 

However, many of these developments in applied 
and industrial mathematics, both in Europe and Amer- 
ica, occurred outside their national academic institu- 
tions, notwithstanding the beginnings of systematic 
academic training in applied mathematics in new insti- 
tutes such as the one led by von Mises. A number of aca- 
demically trained mathematicians and physicists were 
impressed by the spectacular and revolutionary ideas 
of relativity theory and quantum theory, but they were 
slow to recognize the importance of those new appli- 
cations, often in engineering, that relied on classical 
mechanics. 
This aloofness of academic scientists prevailed in
the United States. The undisputed leader of American 
mathematicians, George Birkhoff, was aware of that 
when he addressed the American Mathematical Society 
at its semicentennial in 1938 with the following words: 

The field of applied mathematics always will remain of 
the first order of importance inasmuch as it indicates 
those directions of mathematical effort to which nature 
herself has given approval. 

Unfortunately, American mathematicians have shown 
in the last fifty years a disregard for this most authen- 
tically justified field of all. 

There were exceptions, such as Norbert Wiener at the 
Massachusetts Institute of Technology, who was based 
in the mathematics department but interacted with 
the electrical engineering department run by Vannevar 
Bush (the inventor of the “differential analyzer,” an ana- 
logue computer) and through it with Bell Laboratories, 
and there were also the individual efforts of a number 
of mathematicians with a European background. One of 
the most successful of the latter was Harry Bateman, 
Professor of Aeronautical Research and Mathematical 
Physics at Caltech in Pasadena, who became a champion 
of special functions during the 1930s and 1940s, and 
who had earlier (immediately prior to his emigration 
from England in 1910) discussed the Laplace transform 
and applications to differential equations. However, 
American academia was late in recognizing applied 
mathematics, as exemplified by the abovementioned 
report on Numerical Integration of Differential Equa- 
tions (1933), in which the authors write that the report 
was produced “without special grant for relief from 
teaching from any of the institutions represented.” 
Mathematical physicist Warren Weaver (who later, in 
World War II, would lead the Applied Mathematics Panel 
within the American war effort) was surprised, as late 
as 1930, “at the emphasis given, in the discussion [on 
a planned journal for applied mathematics], to the 
field between mathematics and engineering.” During 
the 1920s and 1930s, Rockefeller money had primarily 
been geared toward supporting pure academic mathe- 
matical and physical research, leaving applied research 
in the hands of industry. It was left to the clever nego- 
tiations of Richard Courant (Göttingen’s adherent to
applied mathematics) to win Rockefeller fellowships 
for the applied candidates under his tutelage, such as 
Wilhelm Cauer and Alwin Walther. 

The 1920s and 1930s were also a time in which math- 
ematical modeling came to the fore, although the term
“mathematical modeling” was rarely used before World
War II. In a 1993 article on the emergence of biomath- 
ematics in Science in Context, the author Giorgio Israel 
emphasizes the increasing role of mathematical mod- 
eling in the nonphysical sciences: 

Another important characteristic of the new trends of 
mathematical modeling and applied mathematics is 
interest in the mathematization of the nonphysical sci- 
ences. The 1920s offer in fact an extraordinary concen- 
tration of new research in these fields, which is devel- 
oped from points of view more or less reflecting the 
modeling approach. So the systematic use of math- 
ematics in economics (both in the context of micro- 
economics and game theory) is found in the work of K. 
Menger, J. Von Neumann, O. Morgenstern, and A. Wald, 
starting from 1928. The basic mathematical model of 
the spread of an epidemic (following the research of 
R. Ross on malaria) was published in 1927 [by W. O. 
Kermack and A. G. McKendrick]; the first papers by S. 
Wright, R. A. Fisher and J. B. S. Haldane on mathemati- 
cal theory of population genetics appeared in the early 
twenties; the first contributions of Volterra and Lotka 
to population dynamics and the mathematical theory 
of the struggle for existence were published in 1925 
and 1926; and many isolated contributions (such as 
van der Pol’s model) also appeared in these years. 

Moreover, during the twentieth century there was a 
certain tendency for mathematicians to be less inspired 
by physics and to resort instead to less rigorous or less 
complete models from other sciences, including engi- 
neering. In 1977 Garrett Birkhoff, George Birkhoff’s son,
wrote: 

Engineers and physicists create and adopt mathemat- 
ical models for very different purposes. Physicists are 
looking for universal laws (of “natural philosophy”), 
and want their models to be exact, universally valid, 
and philosophically consistent. Engineers, whose com- 
plex artifacts are usually designed for a limited range 
of operating conditions, are satisfied if their models 
are reasonably realistic under those conditions. On the 
other hand, since their artifacts do not operate in ster- 
ilized laboratories, they must be “robust” with respect 
to changes in many variables. This tends to make engi- 
neering models somewhat fuzzy yet kaleidoscopic. In 
fluid mechanics, Prandtl’s “mixing length” theory and 
von Kármán’s theory of “vortex streets” are good exam-
ples; the “jet streams” and “fronts” of meteorologists 
are others. 

The same author, himself a convert from abstract 
algebra to hydrodynamics, explains resistance to math- 
ematical models in economics, pointing to the fact 
that they did not fit well into Bourbaki’s “conventional
framework of pure mathematics.” The latter has often
been described as in some respects being inimical to 
applications and (since the “New Math” of the 1960s) 
as being pedagogically disastrous. However, as detailed 
by Israel in the paper quoted above, the relationship 
between Bourbaki and the new practices in modeling 
has not necessarily been negative. Some mathemati- 
cians considered Bourbaki’s notion of mathematics as 
an “abstract scheme of possible realities” to be the right 
way to liberate mathematics from the classical reduc- 
tionist mechanistic approach that had often relied on 
linearization methods. There have even been efforts, 
for instance by logicians, to introduce “planned arti- 
ficial language” into the sciences, as exemplified in 
J. H. Woodger’s The Axiomatic Method in Biology (1937). 
However, these efforts seem to have had limited suc- 
cess. It took another step in the development of com- 
puters in the 1980s before necessarily simplified mod- 
els of biological processes could be abandoned, and 
investigations of cellular automata, membrane com- 
puting, simulation of ecological systems, and simi- 
lar tasks from modern mathematical biology could be 
undertaken. 

During the 1920s and 1930s, many further results 
in different fields of application were obtained. Well- 
known examples include Alan Turing’s work during 
the 1930s on the theory of algorithms and computabil- 
ity, and the Russian Leonid Kantorovich's work on lin- 
ear programming within an economic context (1939), 
which escaped the attention of Western scholars for 
several decades. 

This was also a period in which some of the foun- 
dations were laid for what would, from the late 1940s 
on, be called numerical analysis. In 1928 Courant and 
his students Kurt Friedrichs and Hans Lewy, all three of 
whom eventually emigrated to the United States, pub- 
lished “On the partial difference equations of mathe- 
matical physics” in Mathematische Annalen. The paper 
was translated in the IBM Journal of Research and 
Development as late as 1967 on the grounds that it was 
“one of the most prophetically stimulating develop- 
ments in numerical analysis . . . before the appearance 
of electronic digital computers.... The ideas exposed 
still prevail.” In the history of numerical analysis, 
the paper gained special importance because it con- 
tains the germ of the notion of numerical stability 
and involves the problem of well-posedness of par- 
tial differential equations (as proposed by Hadamard 
in 1902). 


5 Applied Mathematics during and 
after World War II 

World War II, like World War I, was not a mathemati- 
cians’ war. Indeed, in early 1942 the chemist, and Har- 
vard president, James Conant said: “The last was a war 
of chemistry but this one is a war of physics.” This of 
course partially reflected the increasing role of math- 
ematics in World War II, revealed by the use of ballis- 
tics, operations research, statistics, and cryptography 
throughout the conflict. In fact, the president of the 
American National Academy of Sciences, the physicist 
Frank B. Jewett, responded to Conant with the words: 
“It may be a war of physics but the physicists say it 
is a war of mathematics.” However, at the time, due 
to lingering tradition, mathematics was not given the 
same high priority as the other sciences either in the
preparation for warfare or in war-related research. In
the early 1940s within the leading research organiza- 
tions in the United States, in Germany, and in other 
countries, mathematics was still subordinate to other 
fields, such as engineering and physics. In addition, 
the mathematicians themselves were not prepared for 
a new and broader social role, e.g., as professionals in 
industry, such as might be demanded by the war. When 
considering the future of their field during and after 
the war, many pure mathematicians were worried that 
mathematics would suffer from a too utilitarian point 
of view. This is exemplified by the well-known essay A 
Mathematician’s Apology written by the leading English 
mathematician G. H. Hardy in 1940. 

But not long after Hardy’s essay was written, another 
Cambridge mathematician, Alan Turing, demonstrated 
the potential of sophisticated mathematics— a mix of 
logic, number theory, and Bayesian statistics — for war- 
fare, when he and his collaborators at Bletchley Park 
broke the code of the German Enigma machine. 

In Germany, the Diplommathematiker (mathematics 
degree with diploma), which was designed for careers in 
industry and the civil service, was officially introduced 
in 1942, and teaching as a career for mathematicians 
began to lose its monopoly. 

The entry of the United States into World War II in 
December 1941 brought with it deep changes in the 
way mathematicians worked together with industry, 
the military, and government. In the American Math- 
ematical Monthly, rich memoirs on the state of indus- 
trial mathematics and (academic) applied mathematics 
in the United States by Thornton Fry (1941) and Roland 
Richardson (1943), respectively, were published. Prob- 
ably the most spectacular development in communica- 
tions mathematics took place in the 1940s at Bell Labo- 
ratories with the formulation of information theory by 
Claude Shannon. 

Based on their European experiences, immigrants 
to the United States such as von Kármán, Jerzy Ney-
man, von Neumann, and Courant contributed substan- 
tially to a new kind of collaboration between math- 
ematicians and users of mathematics. In the math- 
ematical war work organized by the Applied Mathe- 
matics Panel, where the leading positions were occu- 
pied by Americans, with Warren Weaver at the head, 
the applied mathematicians cooperated with mathe- 
maticians of an originally purer persuasion, natives of 
the United States (Oswald Veblen, Marston Morse) and 
immigrants (von Neumann) alike. 

As well as their political and administrative expe- 
rience, the immigrants brought to their new environ- 
ment European research traditions from engineering 
mathematics, classical analysis, and discrete mathe- 
matics. Ideas, such as those of von Neumann in theoret- 
ical computing, could gradually mature and material- 
ize within the industrial infrastructure of the United 
States (Bell Laboratories, etc.), aided during the war by 
seemingly unlimited public money (Los Alamos, etc.). 
In March 1945, while the war was still on, von Neu- 
mann sent a famous memo on the “Use of variational 
methods in hydrodynamics” to Veblen. Von Neumann 
recommended the “great virtue of Ritz’s method” and 
deplored that before, and even during, the war mathe- 
matical work had not been sufficiently centralized for a 
systematic attack on the nonlinear equations occurring 
in fluid mechanics and related fields. In the same memo 
von Neumann pointed to the “increasing availability 
of high-power computing devices,” a development to 
which he had of course contributed substantially. As 
mentioned in the introduction to the SIAM “History
of numerical analysis and scientific computing” Web 
pages: 

Modern numerical analysis can be credibly said to 
begin with the 1947 paper by John von Neumann and 
Herman Goldstine, “Numerical inverting of matrices of 
high order” (Bulletin of the AMS, Nov. 1947). It is one 
of the first papers to study rounding error and include 
discussion of what today is called scientific computing. 

Von Neumann and Goldstine’s results were soon fol- 
lowed up and critically discussed by English mathe- 
maticians (Leslie Fox and James Wilkinson, as well as
Alan Turing) at the National Physical Laboratory at
Teddington. 

After the war, the increased level of U.S. federal fund- 
ing for mathematics was maintained. Although partly 
fueled by the beginning of the Cold War, it was nev- 
ertheless no longer restricted to applications. Much of 
it was channeled through the Department of Defense
(e.g., by the Office of Naval Research) and the new 
National Science Foundation (NSF), which was founded 
in 1950. The NSF was initiated by the electrical engineer 
Vannevar Bush, who had led the Office for Scientific 
Research and Development, the American war-research 
organization. The concerns about exaggerated utilitar- 
ianism that were harbored by pure mathematicians 
before the war therefore turned out to be groundless. 

As George Dantzig, the creator of the simplex method
[IV.11 §3.1], observed, the outpouring of papers
in linear programming between 1947 and 1950 coin- 
cided with the building of the first digital computers, 
which made applications in the field possible. Mathe- 
matical approaches to logistics, warehousing, and facil- 
ity location were practiced from at least the 1950s, 
with early results in optimization by Dantzig, William 
Karush, Harold Kuhn, and Albert Tucker being enthu- 
siastically received by (and utilized in the logistics pro- 
grams of) the United States Air Force and the Office of 
Naval Research. These optimization techniques are still 
highly relevant to industry today. 

The papers of the 1950 Symposium on Electromag- 
netic Waves, sponsored by the United States Air Force 
and published in the new journal Communications on 
Pure and Applied Mathematics, summarized the war 
effort in the field. Richard Courant, by then at New 
York University, pointed to the importance of a new 
approach to classical electromagnetism, where “a great 
number of new problems were suggested by engineers.” 
This strengthened the feeling, already evident before 
the war, that the predominance of academic mathe- 
matical physics as the main source of inspiration for 
mathematical applications had begun to wane. Ironi- 
cally, Courant’s paper of 1943 on variational methods 
for the solution of problems of equilibrium and vibra- 
tions, which would later be widely considered to be one 
of the starting points for the finite-element method 
[II.12] (the name being coined by R. W. Clough in 1960),
lay in obscurity for many years because Courant, not 
being an engineer, did not link the idea to networks of 
discrete elements. Another reason for the later break- 
through was a development of variational methods of 
approximation in the theory of partial differential equa- 
tions, which would ultimately prove to be vital to the 
development of finite-element methods in the 1960s 
(see J. T. Oden’s chapter in Nash (1990)). A thought- 
ful historical look at the different ways mathematicians 
and engineers use finite-element methods is given in 
Babuška (1994).

New institutions for mathematical research, both 
pure and applied, were created after the war. Among 
them the institute under Richard Courant at New York 
University developed the most strongly. In a 1954 gov- 
ernment report it was stated that Courant’s institute 
had an 

enrollment of over 400 graduate students in mathe- 
matics of which about half have a physics or engineer- 
ing background.... The next largest figure, reported 
from Brown University, is a whole order of magni- 
tude smaller! In the way of a rough estimate this 
means that New York University alone provides about 
one third of this country’s annual output of applied 
mathematicians with graduate training. 

The figures are based on a questionnaire prepared 
in connection with a conference organized in 1953 at 
Columbia University in New York by F. Joachim Weyl, 
the son of Hermann Weyl, as part of a Survey of Train- 
ing and Research in Applied Mathematics sponsored by 
the American Mathematical Society and by the National 
Research Council under contract with the NSF. The con- 
ference proceedings and the report both included dis- 
cussions on not only the training of applied mathemati- 
cians (particularly for industry, and including inter- 
national comparisons) but also the increasing use of 
electronic computing; a summary was published in the 
Bulletin of the American Mathematical Society in 1954. 

Between 1947 and 1954 the Institute for Numerical 
Analysis at the University of California, Los Angeles, 
sponsored by the National Bureau of Standards, played 
a special role in training university staff in numerical 
analysis and computer operations. The institute was 
closed in 1954, a victim of McCarthyism. 

Brown University’s summer schools of applied me-
chanics, which were organized by Richardson from
1941 onward, had relied heavily on the contributions 
of immigrants. This is also partly true of the first 
American journal of applied mathematics, the Quar- 
terly of Applied Mathematics, which began in 1943, and 
of Mathematical Tables and Other Aids to Computation, 
another Brown journal, which started the same year 
under Raymond Archibald. 
Figure 6 M. Abramowitz and I. A. Stegun, eds, 
Handbook of Mathematical Functions (1964). 


Various projects on mathematical tables and special 
functions that had their origins early in the twentieth 
century in various countries (the United Kingdom, Ger- 
many, and the United States) received a boost from 
the war. At Caltech, Arthur Erdélyi, with financial sup-
port from the Office of Naval Research, oversaw the 
Bateman Manuscript Project— the collation and publi- 
cation of material collected by Harry Bateman, who had 
died in 1946— which led to the three-volume Higher 
Transcendental Functions (1953-55). 

The Mathematical Tables Project, which had been ini- 
tiated by the Works Progress Administration in New 
York in 1938, with Gertrude Blanch as its technical 
director, was disbanded after the war but many of its 
members moved to Washington in 1947 to become part 
of the new National Applied Mathematics Laboratories 
of the National Bureau of Standards. The latter’s confer- 
ence of 1952 resulted in one of the best-selling applied 
mathematics books of all time, Handbook of Mathe- 
matical Functions with Formulas, Graphs and Mathe- 
matical Tables (1964) by Milton Abramowitz and Irene 
Stegun, both of whom had worked under Blanch (fig- 
ure 6). Meanwhile, Blanch herself, rather than go to 
Washington, took a position at the Institute for Numer- 
ical Analysis at University of California, Los Angeles 
until its closure in 1954, where she worked with John
Todd and Olga Taussky. Taussky, a rare example of a 
female pure mathematician who made contributions 
to applied mathematics, had worked during the war 
in Frazer’s group at the National Physical Laboratory 
on flutter problems in supersonic aircraft and was an 
important figure in the development of matrix theory. 
Aware of certain resentments against applied math- 
ematics from some mathematical quarters, in 1988 
Taussky said: 

When people look down on matrices, remind them of 
great mathematicians such as Frobenius, Schur, C. L. 
Siegel, Ostrowski, Motzkin, Kac, etc., who made impor- 
tant contributions to the subject. I am proud to have 
been a torchbearer for matrix theory, and I am happy 
to see that there are many others to whom the torch 
can be passed. 

Computer developments rapidly changed the char- 
acter of the table-making aspect of applied mathemat- 
ics. As a result, Archibald’s journal had a rather short 
life in its original form, being renamed Mathematics 
of Computation in 1960 in order “to reflect the broad- 
ened scope of the journal,” which, as the editor Harry 
Polachek described, had “expanded to meet the need in 
[the United States] for a publication devoted to numeri- 
cal analysis and computation.” Private and government 
institutes such as the Research and Development Cor- 
poration (RAND), founded within the Douglas Aircraft 
Company in October 1945, and professional organiza- 
tions such as the Association for Computing Machin- 
ery (1947), the Operations Research Society of Amer- 
ica (1952), the Society for Industrial and Applied Math- 
ematics (1952), and the Institute of Management Sci- 
ences (1953) all testified to the broadening social base 
for applied mathematics. 

The technological development of computers and 
software had ramifications for the development of 
mathematical algorithms, particularly with respect to 
their speed and reliability. In his treatment of ordi- 
nary differential equations around 1910, Carl Runge 
had no need of systematic error estimation — it was not 
until after World War II that John Couch Adams's mul- 
tistep methods for the numerical solution of ordinary 
differential equations from the mid-nineteenth century 
were analyzed with respect to error estimation, after
they had been brought into a more practical form by
W. E. Milne in the United States (“Milne device”) in 1926. 
However, postwar increases in computing speed led to 
greater concern about numerical stability and to the 
abandonment of methods like those of Milne. Euro- 
peans were also strongly involved in these theoretical 
developments. For example, in the 1930s the English- 
man Richard Southwell had developed the so-called 
relaxation method, an iterative method connected to 
the numerical solution of partial differential equations 
in elasticity theory. The story goes that Garrett Birkhoff 
gave his doctoral student David M. Young the task of 
“automating relaxation,” and Young introduced suc- 
cessive overrelaxation in his 1950 thesis (see Young’s 
chapter in Nash (1990)). Young’s deep analysis showed 
how to choose the relaxation parameters automati- 
cally in some important cases and thus provided a 
method suitable for programming on digital comput- 
ers. This was an important advance on Southwell’s 
method, which had been designed for computation by 
hand. The Norwegian-German Werner Romberg's elim- 
ination approach (1955) for improving the accuracy 
of the trapezoidal rule relied on Richardson extrapo- 
lation, a technique developed by the Englishman Lewis 
Fry Richardson early in the century. Richardson intro-
duced finite-difference methods for the numerical solu-
tion of partial differential equations in 1910, of which 
extrapolation is just one part. And he was the first to 
apply mathematics, in particular the method of finite 
differences, to weather prediction [V.18]. His book 
Weather Prediction by Numerical Process (1922) was 
republished by Cambridge University Press in 2007 
with a new foreword. The Richardson-Romberg proce- 
dure became widely known after it had been subjected 
to a rigorous error analysis by the German Friedrich L. 
Bauer and the Swiss Eduard L. Stiefel in the 1960s. 
In 1956 the Swede Germund Dahlquist introduced, 
within his theory of numerical instability, what Nick 
Trefethen (in The Princeton Companion to Mathematics) 
called “the fundamental theorem of numerical analysis: 
consistency + stability = convergence.” 
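
The Richardson-Romberg procedure described above can be illustrated with a minimal sketch. It assumes a smooth integrand and uses the standard textbook recurrence, in which each new column of a triangular table cancels the next even power of the step size in the error of the trapezoidal rule. The function names `trapezoid` and `romberg` and the test integral of sin(x) over [0, pi] are illustrative choices made here and are not taken from the historical sources.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule on [a, b] with n equal subintervals."""
    h = (b - a) / n
    interior = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def romberg(f, a, b, levels):
    """Build the triangular Romberg table.

    R[k][0] is the trapezoidal rule with 2**k subintervals;
    R[k][j] applies Richardson extrapolation j times.
    """
    R = []
    for k in range(levels):
        row = [trapezoid(f, a, b, 2 ** k)]
        for j in range(1, k + 1):
            factor = 4 ** j  # error terms scale like h**(2*j)
            row.append((factor * row[j - 1] - R[k - 1][j - 1]) / (factor - 1))
        R.append(row)
    return R


# Illustrative test problem: the exact integral of sin on [0, pi] is 2.
table = romberg(math.sin, 0.0, math.pi, 5)
plain = table[-1][0]          # trapezoidal rule, 16 subintervals
extrapolated = table[-1][-1]  # same function values, fully extrapolated
print(abs(plain - 2.0), abs(extrapolated - 2.0))
```

Both printed errors come from the same 17 function values; the extrapolated entry is far closer to the exact value 2 than the plain trapezoidal estimate, which is the practical appeal of the procedure.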

A major step in understanding the effects of round- 
ing errors in linear algebra algorithms such as Gauss- 
ian elimination was the development, principally by 
the Englishman James Wilkinson, of backward error 
analysis [I.2 §23], as described in Wilkinson’s influ-
ential 1963 and 1965 research monographs. The all- 
important role of computer developments for the prac- 
tical realization of previously existing theory is also 
evident in the history of the fast Fourier transform 
[II.10] (FFT), made famous by Cooley and Tukey in 1965: 
the FFT was originally discovered by Gauss but it could 
not be effectively applied before the advent of digital 
computers. 

Many of the European mathematicians mentioned 
above worked in the United States for some of their 
careers. The “Brain Drain” gradually replaced the flight 
from Europe for political reasons, as observed in 1953 
by the Dutch mathematician A. van Wijngaarden when 
attending a Columbia University symposium: 

The one thing that continues to worry us is that we lose
many of our best people to that big sink of European
scientific talent, the U.S.A.

However, there were many European mathematicians 
who did not (or could not) follow the Brain Drain and 
who worked on an equal level with the Americans, 
among them many Russians. Nevertheless, there is no 
doubt about the superior technological and industrial 
infrastructure, particularly with respect to computing 
facilities and software development, that existed in 
the United States. Although this superiority was some- 
times met with resentment, including by Alan Tur- 
ing, it was admitted by Russians such as A. P. Ershov 
and M. R. Shura-Bura (Metropolis et al. 1980) and by 
Western European applied mathematicians such as the 
Frenchmen Louis Couffignal, the cybernetics pioneer, 
and Jacques-Louis Lions, the numerical analyst. For 
this reason, the latter cooperated strongly with Russian 
applied mathematicians, and he sometimes felt that the 
lack of access to cutting-edge technology increased the 
theoretical depth of their collaborative work. 

From the 1930s Russian mathematicians had begun 
publishing exclusively in Russian and no longer also 
published in French and German. This practice, which 
continued after the war, prompted the American Math- 
ematical Society, with funding from the Office of Naval 
Research, to begin a Russian translation project in 
1947. SIAM followed suit in 1956, with support from 
the NSF. The “Sputnik crisis” in 1957 caused American 
mathematicians to look even more closely at the work 
being done in the Soviet Union. The Russian-born Amer- 
ican topologist Solomon Lefschetz, who was prompted 
out of retirement by the event, persuaded the Martin 
Aircraft Company to set up a mathematics research 
center, at which he directed a large group working on 
nonlinear differential equations. 

And yet during the two decades following the war, 
despite the relatively favorable material conditions of 
the United States, applied mathematics lost some of
its reputation in comparison with pure mathematics.
In retrospect, of course, this period can be seen as one 
of consolidation of methods and of waiting for more 
powerful computational technology and for deeper 
theoretical foundations to arrive. 

For example, one could argue that parts of the reform 
movement in teaching (“New Math”) in the early 1960s, 
which was meant to correct obvious and perceived 
shortcomings in American secondary education, coun- 
teracted efforts to develop applied mathematics. This 
movement was strongly influenced by the “purist” ide- 
ology of Bourbaki and had worldwide ramifications, 
threatening to undermine the development of pupils’ 
attitudes toward applications. In 1961, at a symposium 
organized by SIAM in Washington, DC, on research 
and education in applied mathematics, the chairman 
deemed that 

applied mathematics is something of a stepchild; I 
might even say an out-of-step child, whose creations 
are looked upon with equal disinterest by mathemati- 
cians, physicists and engineers. 

SIAM Review (1961) 

On January 24 of the same year, the New York 
Times ran a column entitled, “Russia may be losing 
its traditional leadership in mathematics because of 
overemphasis on applied research,” which prompted 
the applied mathematician Harvey Greenspan from MIT 
to make a critical response in American Mathematical 
Monthly saying that, to the contrary, 

the increased Russian emphasis in this area is really 
cause for some serious concern on our part.... The 
present system does not produce adequate numbers 
of applied mathematicians and remedial steps must 
soon be taken. The great majority of the senior fac- 
ulty in applied mathematics are products of a European 
education. 

The real breakthrough for applied mathematics in 
the United States came, however, with the rise of com- 
puter technology and computer science in the 1970s 
and 1980s, when, at the same time, the ideology of 
Bourbaki was in retreat. The so-called Davis Report of 
1984 on “Renewing U.S. Mathematics” was an impor- 
tant event, leading to a near doubling of federal invest- 
ment in mathematical research and including a spe- 
cial mathematics of computing initiative. In 1989 the 
Hungarian-born Peter Lax— who emigrated with his par-
ents to the United States in 1941 and studied at New 
York University, and who in 2005 was a recipient of 
the Abel Prize in Mathematics “for his groundbreak- 
ing contributions to the theory and application of par- 
tial differential equations and to the computation of 
their solutions” — signaled a new level of acceptance for 
applied mathematics in his widely distributed paper in 
SIAM Review on “The flowering of applied mathematics 
in America”: 

Whereas in the not so distant past a mathematician 
asserting “applied mathematics is bad mathematics” 
or “the best applied mathematics is pure mathemat- 
ics” could count on a measure of assent and applause, 
today a person making such statements would be 
regarded as ignorant. 

The publication of articles by working applied math- 
ematicians in Metropolis et al. (1980) and Nash (1990), 
and the extensive seven-volume historical project un- 
dertaken by the Journal of Computational and Applied 
Mathematics (2000), which was republished in Brezin- 
ski and Wuytack (2001), seem to testify to a growing 
self-confidence of applied mathematics within math- 
ematics more widely. Some efforts, such as the pub- 
lication of the “Top ten algorithms” in Computing in 
Science and Engineering in 2000 (six of which are con- 
tained in [I.5, table 2]), provoked controversial discus-
sions. Many practitioners of applied mathematics in 
these and other publications reveal awareness of prob- 
lems regarding the rigor and reliability of their meth- 
ods, showing that the links between pure and applied 
mathematics exist and continue to stimulate the field. 
However, the standard philosophical approaches to 
mathematics — circling repetitively around formalism, 
logicism, and intuitionism, with no consideration of 
applications and doing no justice to the ever-increasing 
range of mathematical practice— are no longer satisfy- 
ing either to mathematicians or to the public. 

The unabated loyalty to pure mathematics as the 
mother discipline sometimes leads to overcautious 
reflection on the part of the applied mathematician. 
A nice example is provided by Trefethen (again in 
The Princeton Companion to Mathematics) in the con- 
text of rounding errors and the problem of numerical 
stability: 

These men, including von Neumann, Wilkinson, For- 
sythe, and Henrici, took great pains to publicize the 
risks of careless reliance on machine arithmetic. These 
risks are very real, but the message was communicated 
all too successfully, leading to the current widespread 
impression that the main business of numerical analy- 
sis is coping with rounding errors. In fact, the main 
business of numerical analysis is designing algorithms
that converge quickly; rounding-error analysis, while
often a part of the discussion, is rarely the central 
issue. If rounding errors vanished, 90% of numerical 
analysis would remain. 

But it is not only these methodological concerns 
that hold back applied mathematics. For example, Lax 
(1989) mentions persisting problems in education and 
the need to maintain training in classical analysis: 

The applied point of view is essential for the much- 
needed reform of the undergraduate curriculum, espe- 
cially its sorest spot, calculus. The teaching of calculus 
has been in the doldrums ever since research math- 
ematicians gave up responsibility for undergraduate 
courses. 

The education and training of applied mathematicians 
remains a central concern, and it is not even clear 
whether the situation has changed significantly since 
the Columbia University Conference of 1953. At that 
time the applied mathematician and statistician John 
Wilder Tukey, best known for the development of the 
FFT algorithm and the box plot, declared with reference 
to what is now called modeling: 

Formulation is the most important part of applied 
mathematics, yet no one has started to work on the 
theory of formulation— if we had one, perhaps we 
could teach applied mathematics. 

A 1998 report by the NSF states that: 

Careers in mathematics have become less attractive 
to U.S. students. [Several] . . . factors contribute to this 
change: (i) students mistakenly believe that the only 
jobs available are collegiate teaching jobs, a job mar- 
ket which is saturated (more than 1,100 new Ph.D.s 
compete for approximately 600 academic tenure-track 
openings each year); (ii) academic training in the 
mathematical sciences tends to be narrow and to 
leave students poorly prepared for careers outside 
academia; (iii) neither students nor faculty understand 
the kinds of positions available outside academia to 
those trained in the mathematical sciences. 

The same report underscores the undiminished depen- 
dence of American pure and applied mathematics on 
immigration from Europe and (now) from Asia, South 
America, and elsewhere: 

Although the United States is the strongest national 
community in the mathematical sciences, this strength 
is somewhat fragile. If one took into account only 
home-grown experts, the United States would be weaker
than Western Europe. Interest by native-born Americans
in the mathematical sciences has been steadily
declining. Many of the strongest U.S. mathematicians 
were trained outside the United States and even more 
are not native born. A very large number of them 
emigrated from the former Soviet Union following its 
collapse. (Russia’s strength in mathematics has been 
greatly weakened with the disappearance of research 
funding and the exodus of most of its leading mathe- 
maticians.) Western Europe is nearly as strong in math- 
ematics as the United States, and leads in important 
areas. It has also benefited by the presence of émigré
Soviet mathematical scientists. 

The Fields Medals for Pierre-Louis Lions (son of 
Jacques-Louis Lions) (1994), Jean-Christophe Yoccoz 
(1994), Stanislav Smirnov (2010), and Cédric Villani
(2010) testify to the growing strength of European 
applied mathematics and to the changed status of the 
field within mathematics. Likewise, the awarding of the 
Abel Prize of the Norwegian Academy of Science and 
Letters to Peter Lax (2005), Srinivasa Varadhan (2007), 
and Endre Szemerédi (2012) for predominantly applied
topics is a further indication of this shift. In addi- 
tion, prestigious prizes devoted specifically to applica- 
tions, with particular emphasis on connections to tech- 
nological developments, have been founded in recent 
decades. The fact that several of these prizes have been 
named for mathematicians of outstanding theoretical 
ability— the ACM A. M. Turing Award (starting in 1966), 
the IMU Rolf Nevanlinna Prize (1981), the DMV and 
IMU Carl Friedrich Gauss Prize (2006)— underscores the 
unity of mathematics in its pure and applied aspects. 

Meanwhile, problems remain in the academic-indus- 
trial relationship and, connected to it, in the profession- 
al image of the applied mathematician, as described in 
the two most recent reports on “Mathematics in Industry”
(1996 and 2012) published by SIAM. The report for
2012 summarizes the situation: 

Industrial mathematics is a specialty with a curious 
case of double invisibility. In the academic world, 
it is invisible because so few academic mathemati- 
cians actively engage in work on industrial problems. 
Research in industrial mathematics may not find its 
way into standard research journals, often because the 
companies where it is conducted do not want it to. 
(Some companies encourage publication and others do 
not; policies vary widely.) And advisors of graduates 
who go into industry may not keep track of them as 
closely as they keep track of their students who stay in 
academia. 


However, most of the problems mentioned in this 
article with respect to academic applied mathematics 
(research funding, the lack of applications in math- 
ematics education, the need for migration between 
national cultures) concern pure and applied math- 
ematics alike. On the purely cognitive and theoret- 
ical level, the difference between the two aspects of 
mathematics — for all its interesting and important his- 
torical and sociological dimensions — hardly exists, as 
the above-quoted NSF report of 1998 underscores: 

Nowadays all mathematics is being applied, so the term 
applied mathematics should be viewed as a different 
cross cut of the discipline. 
Part II
Concepts 


II.1 Asymptotics

P. A. Martin 


When sketching the graph of a function, $y = f(x)$,
we may notice (or look for) lines that the graph
approaches, often as $x \to \pm\infty$. For example, the graph
of $y = x^2/(x^2 + 1)$ approaches the straight line $y = 1$
as $x \to \infty$ (and as $x \to -\infty$). This line is called an asymptote.
Asymptotes need not be horizontal or straight,
and they may be approached as $x \to x_0$ for some
finite $x_0$. For example, $y = x^4/(x^2 + 1)$ approaches the
parabola $y = x^2$ as $x \to \pm\infty$, and $y = \log x$ approaches
the vertical line $x = 0$ as $x \to 0$ through positive values.
Another example is that $\sinh x = \frac{1}{2}(e^x - e^{-x})$
approaches $\frac{1}{2}e^x$ as $x \to \infty$: we say that $\sinh x$ grows
exponentially with $x$.

The qualitative notions exemplified above can be
made much more quantitative. One feature that we
want to retain when we say something like “$y = f(x)$
approaches $y = g(x)$ as $x \to \infty$” is that, to be useful,
$g(x)$ should be simpler than $f(x)$, where “simpler”
will depend on the context. This is a familiar idea; for
example, we can approximate a smooth curve near a
chosen point on the curve by the tangent line through
that point.

When $\lim_{x \to x_0} [f(x)/g(x)] = 1$, we write $f(x) \sim g(x)$
as $x \to x_0$, and we say that $g(x)$ is an asymptotic
approximation to $f(x)$ as $x \to x_0$. For example,
$\sinh x \sim x$ as $x \to 0$ and $\tanh x \sim 1$ as $x \to \infty$. A
famous asymptotic approximation of this kind is Stirling’s
formula from 1730: $n! \sim (2\pi n)^{1/2} (n/e)^n$ as
$n \to \infty$.
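As a quick numerical illustration of the definition (a sketch, not part of the original article), one can tabulate the ratio of $n!$ to Stirling's approximation and watch it approach 1. The function name `stirling_ratio` is ours, and logarithms are used only to avoid overflow for large $n$.

```python
import math


def stirling_ratio(n: int) -> float:
    """Return n! divided by Stirling's approximation (2*pi*n)**0.5 * (n/e)**n."""
    # Work with logarithms so that large n does not overflow a float.
    log_factorial = math.lgamma(n + 1)  # log(n!)
    log_stirling = 0.5 * math.log(2 * math.pi * n) + n * (math.log(n) - 1)
    return math.exp(log_factorial - log_stirling)


# The ratio tends to 1 as n grows, which is exactly what "~" asserts.
for n in (1, 10, 100, 1000):
    print(n, stirling_ratio(n))
```

The ratio tends to 1 even though the difference between $n!$ and its approximation grows without bound, which is why the definition is phrased in terms of a quotient.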

According to our definition, we have $e^x \sim 1$, $e^x \sim 1 + x$,
and $e^x \sim 1 + 2x$, all as $x \to 0$. On the other hand, we
have the Maclaurin expansion, $e^x = 1 + x + \frac{1}{2}x^2 + \cdots$,
which converges for all $x$; truncating this infinite series
gives good approximations to $e^x$ near $x = 0$, and these
approximations improve if we take more terms in the
series. This suggests that we should select $1 + x$ and not
$1 + 2x$, so our definition of $\sim$ is too crude. We want
asymptotic approximations to be approximations, and
we want to be able to improve them by taking more
terms, if possible. With this in mind, suppose we have
a sequence of functions, $\phi_n(x)$, $n = 0, 1, 2, \ldots$, with
the property that $\phi_{n+1}(x)/\phi_n(x) \to 0$ as $x \to x_0$.
Standard examples are $\phi_n(x) = x^n$ as $x \to 0$ and
$\phi_n(x) = x^{-n}$ as $x \to \infty$. Let $R_N(x) = \sum_{n=0}^{N} a_n \phi_n(x)$
for some coefficients $a_n$. We write

$$f(x) \sim \sum_{n=0}^{\infty} a_n \phi_n(x) \quad \text{as } x \to x_0,$$

and say that the series is an asymptotic expansion of
$f(x)$ as $x \to x_0$ when, for each $N = 0, 1, 2, \ldots$,

$$[f(x) - R_N(x)]/\phi_N(x) \to 0 \quad \text{as } x \to x_0. \tag{1}$$

In words, the “error” $f - R_N$ is comparable to the first
term omitted, the one with $n = N + 1$. Note that the
definition does not require the infinite series to be convergent
(so that $R_N(x)$ may not have a limit as $N \to \infty$
for fixed $x$). Instead, for each fixed $N$, we impose a
requirement on the error as $x \to x_0$, namely (1).

Asymptotic approximations may be convergent. For
example, we have $e^x \sim 1 + x + \frac{1}{2}x^2 + \cdots$ as $x \to 0$. However,
many interesting and useful asymptotic expansions
are divergent. As an example, the complementary
error function satisfies

$$\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt \sim \frac{e^{-x^2}}{x\sqrt{\pi}} \left[ 1 + \sum_{n=1}^{\infty} (-1)^n \frac{1 \cdot 3 \cdots (2n - 1)}{(2x^2)^n} \right]$$

as $x \to \infty$, where the series is obtained by repeated integration
by parts of the defining integral. The series is
divergent, but taking a few terms gives a good approximation
to $\operatorname{erfc} x$, an approximation that improves as $x$
becomes larger.
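The behavior of a divergent asymptotic series can be seen numerically. The following sketch (the function name `erfc_asymptotic` and the sample values of $x$ and $N$ are our own choices) sums the first $N$ terms of the series for $\operatorname{erfc}(x)$, taken in the standard form $e^{-x^2}/(x\sqrt{\pi})\,[1 + \sum_{n \ge 1} (-1)^n\, 1 \cdot 3 \cdots (2n-1)/(2x^2)^n]$, and compares the result with the library value `math.erfc`.

```python
import math


def erfc_asymptotic(x: float, num_terms: int) -> float:
    """Partial sum of the large-x series for erfc(x), keeping terms n = 0..num_terms."""
    total = 1.0  # the n = 0 term
    term = 1.0
    for n in range(1, num_terms + 1):
        # Consecutive terms differ by the factor -(2n - 1) / (2 x^2).
        term *= -(2 * n - 1) / (2 * x * x)
        total += term
    return math.exp(-x * x) / (x * math.sqrt(math.pi)) * total


x = 2.0
exact = math.erfc(x)
for N in (0, 1, 2, 4, 8, 16, 32):
    approx = erfc_asymptotic(x, N)
    # Relative error first shrinks, then grows without bound as N increases.
    print(N, approx, abs(approx - exact) / exact)
```

For fixed $x$ the error decreases only while the terms themselves decrease (roughly while $2n - 1 < 2x^2$) and then grows; for fixed $N$ the error decreases as $x$ increases, which is the sense in which the expansion is asymptotic.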

Many techniques have been devised for obtaining 
asymptotic expansions. Some are designed for func- 
tions defined by integrals (such as erfcx), others for 
functions that solve differential equations. Asymptotic 
methods can also be used to estimate the complexity 
of an algorithm or to predict properties of large combinatorial
structures and complex networks.

In many cases of interest, the function to be estimated
depends on parameters. Their presence may be
benign, but complications can arise. Suppose we are
interested in $f(x, \lambda)$ when both $x$ and the parameter
$\lambda$ are large. Using standard methods, we may be able
to show that $f(x, \lambda) \sim g(x, \lambda)$ as $x \to \infty$ for fixed
$\lambda$, and that $f(x, \lambda) \sim h(x, \lambda)$ as $\lambda \to \infty$ for fixed $x$.
Then, the natural question to ask is: if we estimate
$g(x, \lambda)$ for large $\lambda$ and $h(x, \lambda)$ for large $x$, do we get the
same answer? Simpler questions of this kind arise when
investigating the commutativity of limits: given that

$$L_1 = \lim_{x \to \infty} \lim_{y \to \infty} F(x, y), \qquad L_2 = \lim_{y \to \infty} \lim_{x \to \infty} F(x, y),$$

when does $L_1 = L_2$? In our context, with, for example,
$f(x, \lambda) = (x + \lambda)/(x + 2\lambda)$, $f \sim 1$ as $x \to \infty$ but $f \sim \frac{1}{2}$
as $\lambda \to \infty$, so standard asymptotic approximations
may fail when there are two or more variables. In
these situations, uniform asymptotic expansions are 
needed. Inevitably, these are more complicated, but 
they are often needed to gain a full understanding of 
certain physical phenomena. For example, consider the 
shadow of an illuminated sphere on a flat screen. It is 
a dark circular disk. However, close inspection shows 
that the shadow boundary is not sharp: the image 
changes rapidly but smoothly as we cross the circular 
boundary. Standard asymptotic techniques explain 
what is happening in the illuminated and shadow 
regions (the wavelength of light is much smaller than 
the radius of the sphere, so their small ratio can be 
exploited in building asymptotic approximations), but 
uniform asymptotic approximations are needed to 
explain the transition between the two. 
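The two-variable example $f(x, \lambda) = (x + \lambda)/(x + 2\lambda)$ can be checked directly. This small sketch (the names `f` and `lam` and the sample magnitudes are ours) evaluates $f$ with one variable large, with the other large, and with both large along the hypothetical path $x = \lambda$, where a third value appears.

```python
def f(x: float, lam: float) -> float:
    """The example function (x + lambda) / (x + 2*lambda)."""
    return (x + lam) / (x + 2 * lam)


big = 1e12
print(f(big, 1.0))   # x large, lambda fixed: close to 1
print(f(1.0, big))   # lambda large, x fixed: close to 1/2
print(f(big, big))   # both large with x = lambda: equal to 2/3
```

The answer depends on how $x$ and $\lambda$ become large relative to each other, which is the situation a uniform expansion is designed to handle.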

For more on asymptotic methods, see the articles on
perturbation theory and asymptotics [IV.5], special
functions [IV.7], and divergent series: taming
the tails [V.8]. Another good source is chapter 2 of
NIST Handbook of Mathematical Functions (Cambridge
University Press, Cambridge, 2010), edited by F. W. J.
Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark. (An
electronic version of the book is available at
http://dlmf.nist.gov.)


II.2 Boundary Layer 

The mathematical idea of a boundary layer emerged
from an analysis of a physical problem arising in fluid
dynamics [IV.28]: the flow of a viscous fluid past a rigid
body. Suppose that the body has diameter $L$ and $U$ is
the speed of the flow far from the body. The Reynolds
number for the flow is $\mathrm{Re} = UL/\nu$, where $\nu$ is the
kinematic viscosity coefficient. Assume that the fluid is
not very viscous (meaning that $\mathrm{Re} \gg 1$) so that the
flow can be well approximated by solving the governing
equations for a nonviscous fluid, the euler equations
[III.11]. This approximation is not good near the surface
of the body. There, viscous effects are important, so the
navier-stokes equations [III.23] must be used. Nevertheless,
good approximations can be constructed by
exploiting the fact that $\mathrm{Re} \gg 1$. This is done within
a thin viscous boundary layer. The two solutions, one 
in the boundary layer and one in the outer region, are 
matched in a region where both are assumed to be 
valid, and we therefore obtain a good approximation 
everywhere in the fluid. This whole scheme was first 
described and implemented by Ludwig Prandtl in 1904. 

The idea of joining two solutions together is part of 
an extensive suite of asymptotic techniques in which 
large (or small) dimensionless parameters are identi- 
fied and exploited. There can be more than two regions 
and more than one large parameter. Sophisticated 
matching procedures have been developed and applied 
to many complicated problems, not just those arising 
in fluid dynamics. See the article on perturbation
theory and asymptotics [IV.5].


II.3 Chaos and Ergodicity

Paul Glendinning 


The term chaos is used with greater or lesser degrees 
of precision to describe deterministic dynamics that 
is nonperiodic and unpredictable. The word was first 
used in this context by Li and Yorke in the title of 
their article “Period three implies chaos” in 1975. Li and 
Yorke were careful not to define the term too narrowly 
and described their mathematical results using clearly 
defined properties of solutions. In one of the early 
textbooks on the subject (published in 1989), Devaney 
defined chaos for discrete maps $T$ on a metric space
$X$ with $x_{n+1} = T(x_n)$, $x_n \in X$. Devaney’s definition is
that $T : X \to X$ is chaotic on $X$ if

(i) periodic orbits are dense in $X$,

(ii) the dynamics is transitive (so for any points $x, y \in X$
it is possible to find a point in $X$ that is arbitrarily
close to $x$ whose orbit passes arbitrarily close
to $y$), and

(iii) the system has sensitive dependence on initial
conditions (SDIC) on $X$.

SDIC means that there exists a precision $\epsilon > 0$ such
that for every $x \in X$ and $\delta > 0$ there exists a point
$y \in X$ within a distance $\delta$ of $x$ and a time $n$ such that
solutions started at $x$ and $y$ are at least a distance $\epsilon$
apart after time $n$. SDIC is often described as implying
a loss of predictability, or a loss of memory of initial
conditions, although it is a little more complicated
than this suggests. Devaney’s definition was intended 
to make it possible to ask sensible examination ques- 
tions about chaos at undergraduate level, and by 1992 
two groups had shown that (for appropriate systems) 
the first two conditions imply the third, suggesting it is 
not an ideal definition! 

Another commonly used definition of chaos uses the
idea of Lyapunov exponents. These measure the asymptotic
local expansion properties of solutions, so if $(x_n)$
is an orbit of the one-dimensional map $x_{n+1} = T(x_n)$
and $T$ is differentiable, then the Lyapunov exponent of
$x_0$ is

$$\lambda(x_0) = \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \log |T'(x_k)|,$$

provided the limit exists. The dynamics is regarded as 
chaotic if it is not “simple” (a union of periodic points, 
for example) and typical points have positive Lyapunov 
exponents. 
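A finite-$n$ version of the Lyapunov sum is easy to evaluate. The sketch below uses the logistic map $T(x) = r x (1 - x)$ purely as an illustrative choice of ours (it is not discussed in the text above); the function name, the starting point, and the number of iterates are likewise assumptions. For $r = 4$ the exponent of typical points is known to be $\log 2$, while for $r = 3.2$ the orbit settles onto a stable period-two cycle and the exponent is negative.

```python
import math


def lyapunov_exponent(r: float, x0: float, n: int) -> float:
    """Estimate (1/n) * sum_{k=0}^{n-1} log|T'(x_k)| for T(x) = r*x*(1 - x)."""
    x = x0
    total = 0.0
    for _ in range(n):
        # T'(x) = r * (1 - 2x); math.log raises ValueError if the orbit hits x = 1/2 exactly.
        total += math.log(abs(r * (1.0 - 2.0 * x)))
        x = r * x * (1.0 - x)
    return total / n


print(lyapunov_exponent(4.0, 0.2, 20000))  # positive: expanding on average
print(lyapunov_exponent(3.2, 0.2, 20000))  # negative: attracted to a periodic orbit
print(math.log(2))                          # reference value for r = 4
```

Because the limit is replaced by a finite average computed in floating-point arithmetic, the output is only an estimate of the exponent.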

Issues around definition notwithstanding, the idea of 
chaos has radically changed the choices that applied 
mathematicians make when modeling systems with 
complicated temporal behavior. In particular, chaos 
theory shows that apparently random behavior may 
have a deterministic origin (see also dynamical systems
[IV.20] and the lorenz equations [III.20]).

The analogy with probability theory is made mathematically
precise using ergodic theory, which connects
time averages of a quantity along an orbit with an 
expected value. This expected value is a spatial integral, 
and hence it is independent of the initial conditions. 

The definition of ergodicity uses two ideas: invariant
sets and invariant measures. An invariant set $X$ of
a map $T$ satisfies $X = T^{-1}(X)$, and a map $T$ with an
invariant set $X$ has an invariant probability measure if
there is a measure $\mu$ such that $\mu(S) = \mu(T^{-1}(S))$ for all
measurable sets $S$ in $X$ and $\mu(X) = 1$.

The map $T$ is ergodic with respect to an invariant
measure $\mu$ if $\mu(S)$ is either zero or one for all invariant
sets $S$. In other words, every invariant set is either
very small (measure zero) or it is essentially (up to
sets of measure zero) the whole set $X$. This means that
the dynamics is not decomposable into smaller sets,
a role played by the transitive property in Devaney’s
definition of chaos.

Ergodic maps have nice averaging properties. For all
integrable functions $f$ on $X$ the time average of $f$ along
typical orbits equals the spatial average obtained by
integrating $f$ over $X$ with respect to the invariant measure:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^k(x)) = \int_X f \, d\mu$$

for almost all $x \in X$ (with respect to the invariant measure
$\mu$). Thus $\mu$ can be interpreted as a probability measure
of the dynamics, and if $d\mu = h(x)\,dx$, then $h$ can
be interpreted as a sort of probability density function
for the iterates of points under $T$. The “typical” Lyapunov
exponent of a one-dimensional map can then be
interpreted as the integral of $f = \log |T'|$ with respect
to the invariant measure.
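The equality of time and space averages can be tested numerically. As an illustration of our own choosing, take $T(x) = 4x(1 - x)$ on $[0, 1]$, whose invariant density is known to be $h(x) = 1/(\pi\sqrt{x(1 - x)})$. The sketch below compares the time average of $f(x) = x^2$ along one orbit with the space average $\int f h \, dx$; the function names, starting point, and sample sizes are assumptions, and the substitution $x = \sin^2\theta$ is used only to remove the endpoint singularities of $h$.

```python
import math


def T(x: float) -> float:
    """Logistic map with parameter 4 (an illustrative choice, not from the text)."""
    return 4.0 * x * (1.0 - x)


def time_average(f, x0: float, n: int) -> float:
    """Average of f along the first n points of the orbit starting at x0."""
    x = x0
    total = 0.0
    for _ in range(n):
        total += f(x)
        x = T(x)
    return total / n


def space_average(f, m: int = 10000) -> float:
    """Integral of f against dmu = dx / (pi * sqrt(x(1-x))).

    With x = sin(theta)**2 the measure becomes (2/pi) dtheta on [0, pi/2],
    which is integrated here by the midpoint rule with m subintervals.
    """
    step = (math.pi / 2) / m
    total = sum(f(math.sin((j + 0.5) * step) ** 2) for j in range(m))
    return (2 / math.pi) * step * total


def square(x: float) -> float:
    return x * x


print(time_average(square, 0.2, 100000))
print(space_average(square))
```

The two printed values should agree to within the statistical error of a finite orbit; exact agreement is not expected, since the theorem concerns the limit $n \to \infty$ and the orbit is computed in floating-point arithmetic.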

Further Reading 

Devaney, R. L. 1989. An Introduction to Chaotic Dynamical 
Systems. Reading, MA: Addison-Wesley. 

Li, T.-Y., and J. A. Yorke. 1975. Period three implies chaos.
American Mathematical Monthly 82:985-92.

Walters, P. 1982. An Introduction to Ergodic Theory. New 
York: Springer. 


II.4 Complex Systems 

Paul Glendinning 


Complex systems are dynamical models that involve 
the interaction of many components. The study of 
these systems is called complexity theory. Complex 
systems are characterized by having large dimension 
and complicated interactions. 

There are many examples of complex systems across 
the sciences. The reactions between chemicals in a liv- 
ing cell can be considered as a complex system, where 
the variables are the concentrations of chemicals and 
the dynamics is defined by the chemical reactions that 
can take place. A chemical can react with only a small 
set of the chemicals present, so the connectivity is 
generally low. 

Other examples include the Internet, where individ- 
ual computers are linked together in complex ways, and 
electronic systems whose components are connected 
on a circuit board. Agent-based models of human 
behavior such as opinion formation within groups or 
the movement of crowds are also complex systems. 
It is often hard to do much more than numerical simulations
to determine the behavior of systems, leading
to a somewhat limited phenomenological description
of their behavior. There are two central methods
in the study of complex systems that go further
than this, though again with limited concrete predictive
power. These are graph theory to characterize
how components influence each other, and dimension 
reduction methods to capture (where applicable) any 
lower-dimensional approximations that determine the 
evolution of the system. 

If the system has real variables $x_i$, $i = 1, \ldots, N$, then
each variable can be identified with the node of a graph
labeled by $i$, with an edge from $i$ to $j$ if the dynamics
of $x_j$ is directly influenced by $x_i$ (see graph theory
[II.16]). For example, if the evolution is determined by a
differential equation then $\dot{x}_i = f_i(x_1, x_2, \ldots, x_N)$, but
not every variable need appear explicitly in the argument
of $f_i$, so there is an edge from $x_j$ to $x_i$ only if
$\partial f_i/\partial x_j$ is not identically zero. This graph can be represented
by an adjacency matrix $(a_{ij})$ with $a_{ij} = 1$ if
there is an edge from $i$ to $j$ and $a_{ij} = 0$ otherwise. The
degree of a node is the number of edges at the node
(this can be split into the in-degree (respectively, out-degree)
if only edges ending (respectively, starting) at
the node are counted). The proportion of nodes with
degree $k$ is the degree distribution of the network. Properties
of the degree distribution are often used to characterize
the network. For example, if the degree distribution
obeys a power law, the network is said to be
scale free (the Internet is supposedly of this type; see
network analysis [IV.18]).
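The construction of the adjacency matrix and degree distribution can be sketched for a small made-up system. Here `depends_on[i]` lists the indices $j$ of the variables that appear in $f_i$; this four-variable dependency table, and all function names, are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_adjacency(depends_on: dict[int, list[int]]) -> list[list[int]]:
    """Return (a_ij) with a_ij = 1 when there is an edge from i to j.

    An edge runs from j to i when x_j appears in f_i, i.e. when
    the partial derivative of f_i with respect to x_j is not identically zero.
    """
    n = len(depends_on)
    a = [[0] * n for _ in range(n)]
    for i, sources in depends_on.items():
        for j in sources:
            a[j][i] = 1  # x_j influences x_i: edge from j to i
    return a


def degree_distribution(degrees: list[int]) -> dict[int, float]:
    """Proportion of nodes having each degree k."""
    counts = Counter(degrees)
    return {k: c / len(degrees) for k, c in counts.items()}


# Hypothetical system: f_0 depends on x_1, x_2, x_3; f_1 and f_2 on x_0; f_3 on x_0, x_2.
depends_on = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 2]}
a = build_adjacency(depends_on)
n = len(a)
out_degree = [sum(a[i]) for i in range(n)]                       # edges starting at i
in_degree = [sum(a[j][i] for j in range(n)) for i in range(n)]   # edges ending at i

print(a)
print(in_degree, out_degree)
print(degree_distribution(in_degree))
```

For real networks the same bookkeeping is usually done with a sparse representation, since each node is typically connected to only a small fraction of the others.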

By analyzing subgraphs of biological models it was 
found that some subgraphs appear in examples much 
more often than would be expected on the basis of a 
statistical analysis. This has led to the conjecture that 
these motifs may have associated functional properties. 

In many complex systems the individual components 
of the system behave according to very simple, though 
often nonlinear, rules. For example, a bird in a flock may 
change its direction of flight as a function of the aver- 
age direction of flight of nearby birds. Although this is 
a local rule, the effect across the entire flock of birds is 
to produce coherent movement of the flock as a whole. 
This effect, whereby simple local rules lead to inter- 
esting global results, is called emergent behavior. The 
emergent behavior resulting from given local rules is 
often unclear until the system is simulated numerically. 

In some cases the dimension of the problem can be 
reduced, so fewer variables need to be considered, making
the system easier to simulate and more amenable
to analysis. The methods of dimension reduction often
rely on singular value decomposition [II.32] techniques
to identify the more dynamically active directions
in phase space, and then an attempt is made to
project the system onto these directions and analyze 
the resulting system. 

In some systems the mean-field theory of theoretical 
physics can be used to understand collective behavior. 

Since complexity theory encompasses so many dif- 
ferent models, the range of possible dynamic phenom- 
ena is vast, even before further complications such as 
stochastic effects or network evolution are included. 
Complex systems describing neuron interactions in 
the brain can model pattern recognition and memory 
(see mathematical neuroscience [VII.21]). Numeri- 
cal models of partial differential equations are com- 
plex systems, and the dynamical behavior can include 
synchronization, in which all components lock on to 
a similar pattern of behavior, and pattern forma- 
tion [IV.27]. Different parts of the system may behave 
in dynamically different ways, with regions of frus- 
tration (or fronts) separating them. Interactions may 
have different strengths, leading to different timescales 
in the problem. This is particularly true of many bio- 
logical models and adds to the difficulty of modeling 
phenomena accurately. 

Further Reading 

Ball, R., V. Kolokoltsov, and R. S. MacKay, eds. 2013. Complexity
Science. Cambridge: Cambridge University Press.

Estrada, E. 2011. The Structure of Complex Networks: Theory
and Applications. Oxford: Oxford University Press.

Watts, D. J. 1999. Small Worlds: The Dynamics of Networks 
Between Order and Randomness. Princeton, NJ: Princeton 
University Press. 


II.5 Conformal Mapping

Darren Crowdy 


1 What Is a Conformal Mapping? 

Conformal mapping is the name given to the idea of
interpreting an analytic function of a complex variable
in a geometric fashion. Let $z = x + iy$ and suppose that
another complex variable $w$ is defined by

$$w = f(z) = \phi(x, y) + i\psi(x, y),$$

where $\phi$ and $\psi$ are, respectively, the real and imaginary
parts of some function $f(z)$, an analytic function
of $z$. One can think of this relation as assigning a correspondence
between points in the complex $z$-plane
and points in the complex $w$-plane. Under this function
a designated region of the $z$-plane is transplanted,
or “mapped,” to some region in the $w$-plane, as illustrated
in figure 1. The shape of the image will depend
on $f$. The fact that $f$ is an analytic function implies
certain special properties of this mapping of regions. If
the mapping is to be one-to-one, then a necessary, but
not sufficient, condition is that the derivative $f'(z) = df/dz$
does not vanish in the $z$-region of interest.

A simple example is the Cayley mapping

$$w = f(z) = \frac{1 + z}{1 - z}.$$

This maps the interior of the unit disk $|z| < 1$ in the
$z$-plane to the right half $w$-plane $\operatorname{Re} w > 0$. The point
$z = 1$ maps to $w = \infty$, and $z = -1$ maps to $w = 0$.
The unit circle $|z| = 1$ maps to the imaginary $w$-axis.
Conformal mappings clearly preserve neither area nor 
perimeters; their principal geometrical feature is that 
they locally preserve angles. To see this, note that since 
$f(z)$ is analytic at a point $z_0$, it has a local Taylor
expansion there:

$$w = f(z_0) + f'(z_0)(z - z_0) + \cdots.$$

If $\delta z = z - z_0$ is an infinitesimal line element through $z_0$
in the $z$-plane, its image $\delta w$ under the mapping, defined
as $\delta w = w - w_0$, where $w_0 = f(z_0)$, is, to leading order,

$$\delta w \approx f'(z_0)\,\delta z.$$

But $f'(z_0)$ is just a nonzero complex number so, under
a conformal mapping, all infinitesimal line elements
through $z_0$ are transplanted to line elements through
$w_0$ in the $w$-plane that are simply rescaled by the modulus
of $f'(z_0)$ and rotated by its argument. In particular,
the angle between two given line elements through
$z_0$ is preserved by the mapping.
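Angle preservation can be checked numerically with short finite line elements. The sketch below assumes the Cayley mapping in the form $w = (1 + z)/(1 - z)$, which is consistent with the values quoted above ($z = 1 \mapsto \infty$, $z = -1 \mapsto 0$); the function names, the base point, and the two directions are our own choices.

```python
import cmath


def cayley(z: complex) -> complex:
    """Assumed form of the Cayley mapping: unit disk -> right half-plane."""
    return (1 + z) / (1 - z)


def image_angle(f, z0: complex, dir1: complex, dir2: complex, eps: float = 1e-6) -> float:
    """Angle between the images of two short line elements through z0."""
    dw1 = f(z0 + eps * dir1) - f(z0)
    dw2 = f(z0 + eps * dir2) - f(z0)
    return cmath.phase(dw2 / dw1)


z0 = 0.3 + 0.4j
d1 = cmath.exp(0.2j)   # unit direction at angle 0.2 rad
d2 = cmath.exp(1.0j)   # unit direction at angle 1.0 rad

print(cmath.phase(d2 / d1))              # angle between the elements in the z-plane
print(image_angle(cayley, z0, d1, d2))   # angle between their images in the w-plane
print(cayley(z0).real > 0)               # interior point lands in Re w > 0
```

The two angles agree up to an error proportional to the element length `eps`, reflecting the fact that the property is a local one.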

2 The Riemann Mapping Theorem 

The Riemann mapping theorem is considered by many 
to be the pinnacle of achievement of nineteenth-cen- 
tury mathematics. It is an existence theorem: it states 
that there exists a conformal mapping from the unit $z$-disk
to any given simply connected region (no holes) in
the $w$-plane, so long as it is not the entire plane.


3 Conformal Invariance 

One reason why conformal mappings are an important
tool in applied mathematics is the property of conformal
invariance of certain boundary-value problems
arising in applications. An example is the boundary-value
problem determining Green’s function $G(z; z_0)$
for the Laplace equation in a region $D$ in $\mathbb{R}^2$ with
boundary $\partial D$, which can be written as

$$\nabla^2 G = \delta^{(2)}(z - z_0) \text{ in } D \quad \text{with } G = 0 \text{ on } \partial D,$$

where $z_0$ is some point inside $D$ and $\delta^{(2)}$ is the two-dimensional
dirac delta function [III.7]. The Green
function for the unit disk $|z| < 1$ is known to be

$$G(z; z_0) = \frac{1}{2\pi} \log \left| \frac{z - z_0}{1 - \bar{z}_0 z} \right|,$$

where $\bar{z}_0$ is the complex conjugate of $z_0$. Now if $D$ is any
other simply connected region of a complex $w$-plane,
the corresponding Green function in $D$ is nothing other
than $G(f^{-1}(w); f^{-1}(w_0))$, where $f^{-1}(w)$ is the inverse
function of the conformal mapping taking the unit $z$-disk
to $D$. Geometrically, $f^{-1}(w)$ is just the inverse
conformal mapping transplanting $D$ to the unit disk
$|z| < 1$. The Green function in any simply connected
region $D$ is therefore known immediately provided the
conformal mapping between $D$ and the unit disk can
be found.

Figure 1 A conformal mapping from a region in a
complex $z$-plane to a region in a complex $w$-plane.
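As a concrete sketch of conformal invariance, the Green function of the right half-plane $\operatorname{Re} w > 0$ can be built from that of the unit disk. The code assumes the disk formula in the standard form $G(z; z_0) = (1/2\pi) \log |(z - z_0)/(1 - \bar{z}_0 z)|$ and the Cayley mapping in the form $w = (1 + z)/(1 - z)$, whose inverse is $z = (w - 1)/(w + 1)$. The function names are ours, and the classical method-of-images formula for the half-plane is included only as an independent cross-check.

```python
import math


def green_disk(z: complex, z0: complex) -> float:
    """Green's function for the unit disk (assumed standard form), zero on |z| = 1."""
    return math.log(abs((z - z0) / (1 - z0.conjugate() * z))) / (2 * math.pi)


def inverse_cayley(w: complex) -> complex:
    """Inverse of the assumed Cayley form w = (1 + z)/(1 - z)."""
    return (w - 1) / (w + 1)


def green_half_plane(w: complex, w0: complex) -> float:
    """Green's function for Re w > 0 obtained by composing with the inverse mapping."""
    return green_disk(inverse_cayley(w), inverse_cayley(w0))


def green_half_plane_images(w: complex, w0: complex) -> float:
    """Method-of-images formula for the half-plane, used here only as a cross-check."""
    return math.log(abs((w - w0) / (w + w0.conjugate()))) / (2 * math.pi)


w0 = 1.5 + 0.5j
for w in (0.2 + 1.0j, 3.0 - 2.0j, 0.0 + 0.7j):  # the last point lies on the boundary
    print(w, green_half_plane(w, w0), green_half_plane_images(w, w0))
```

The two columns should agree, and both vanish (up to rounding) at the boundary point on the imaginary axis.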

4 Schwarz-Christoffel Mappings 

The Riemann mapping theorem is nonconstructive and, 
while the existence of a conformal mapping between 
given simply connected regions is guaranteed, the prac- 
tical matter of actually constructing it is another story. 
One of the few general constructions often used in 
applications is the Schwarz-Christoffel mapping. This 
is a conformal mapping from a standard region such as
the unit $z$-disk $|z| < 1$ to the region interior or exterior
to an $N$-sided polygon. At the preimage of any vertex
of the polygon (a prevertex), the local argument outlined
earlier demonstrating the preservation of angles
between infinitesimal line elements must fail. Indeed, at
any such prevertex it can be argued that the derivative
$f'(z)$ of the conformal mapping must have a simple
zero, a simple pole, or a branch point singularity. The
general formula for a mapping from $|z| < 1$ to the
interior of a bounded polygon in a $w$-plane is

$$w = f(z) = A + B \int^{z} \prod_{k=1}^{N} \left(1 - \frac{z'}{z_k}\right)^{(\beta_k/\pi - 1)} dz', \tag{1}$$

while the formula for a mapping from $|z| < 1$ to the
exterior of a bounded polygon in a $w$-plane, with $z = 0$
mapping to $w = \infty$, is

$$w = f(z) = A + B \int^{z} \prod_{k=1}^{N} \left(1 - \frac{z'}{z_k}\right)^{(\beta_k/\pi - 1)} \frac{dz'}{z'^2}. \tag{2}$$

Figure 2 A Schwarz-Christoffel mapping from the unit
$z$-disk to the interior of a square in a $w$-plane. A function
of the form (1) with $N = 4$ identifies a point $w$ with a point
$z$. Here, $\beta_1 = \beta_2 = \beta_3 = \beta_4 = \pi/2$.

The parameters $\{\beta_k \mid k = 1, 2, \ldots, N\}$ are the turning
angles shown in figure 2; the points $\{z_k \mid k = 1, 2, \ldots, N\}$
are the prevertices. $A$ and $B$ are complex
constants. These so-called accessory parameters are
usually computed numerically by fixing geometrical
features such as ensuring that the sides of the polygon
have the required length. A famous mapping of
Schwarz-Christoffel type known for its use in aerodynamics
is the Joukowski mapping,

$$w = f(z) = \frac{1}{2}\left(z + \frac{1}{z}\right),$$

which maps the unit disk $|z| < 1$ to the infinite region
exterior to a flat plate, or airfoil, lying on the real $w$-axis
between $w = -1$ and $w = 1$. It is a simple matter
to derive it from (2) with the prevertices $z_1 = 1$,
$z_2 = -1$ and turning angles $\beta_1 = \beta_2 = 2\pi$. Since it is
natural, given any two-dimensional shape, to approxi- 
mate it by taking a set of points on the boundary and 
joining them with straight line segments to form a poly- 
gon, the Schwarz-Christoffel formula has found many 
uses in applied mathematics. Versatile numerical soft- 
ware to compute the accessory parameters has also 
been developed. 
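The derivation of the Joukowski mapping mentioned above can be sketched as follows. The sketch assumes that the exterior formula (2) reads $f(z) = A + B \int^{z} \prod_{k} (1 - z'/z_k)^{\beta_k/\pi - 1}\, dz'/z'^2$, and the particular values chosen for the constants $A$ and $B$ are ours.

```latex
% Sketch: Joukowski mapping from formula (2), assuming N = 2,
% z_1 = 1, z_2 = -1, and \beta_1 = \beta_2 = 2\pi, so both exponents equal 1.
\begin{align*}
f(z) &= A + B \int^{z} \Bigl(1 - \frac{z'}{1}\Bigr)\Bigl(1 - \frac{z'}{-1}\Bigr) \frac{dz'}{z'^2} \\
     &= A + B \int^{z} \frac{1 - z'^2}{z'^2}\, dz' \\
     &= A - B\Bigl(z + \frac{1}{z}\Bigr)
     && \text{(integration constant absorbed into } A\text{)} \\
     &= \frac{1}{2}\Bigl(z + \frac{1}{z}\Bigr)
     && \text{with } A = 0,\ B = -\tfrac{1}{2}. \\
\intertext{On the unit circle $z = e^{i\theta}$ this gives}
w &= \tfrac{1}{2}\bigl(e^{i\theta} + e^{-i\theta}\bigr) = \cos\theta \in [-1, 1].
\end{align*}
```

The last line confirms that the unit circle is sent to the flat plate between $w = -1$ and $w = 1$, traversed twice.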


Further Reading 

Courant, R. 1950. Dirichlet’s Principle, Conformal Mapping 
and Minimal Surfaces. New York: Interscience. 

Driscoll, T. A., and L. N. Trefethen. 2002. Schwarz-Christoffel
Mapping. Cambridge: Cambridge University Press.

Nehari, Z. 1975. Conformal Mapping. New York: Dover.


II.6 Conservation Laws 

Barbara Lee Keyfitz 


1 Quasilinear Hyperbolic Partial 
Differential Equations 

A system of first-order partial differential equations 
(PDEs) in the form 
$$u_t + \sum_{i=1}^{d} A_i(x, t, u)\, u_{x_i} + b(x, t, u) = 0, \tag{1}$$

where the $A_i$ are $n \times n$ matrices, and 
$u_t = \partial u/\partial t$ and $u_{x_i} = \partial u/\partial x_i$, is said to be quasilinear: 
the system is nonlinear as defined in the article partial 
differential equations [IV.3], but the terms containing 
derivatives of $u$ appear only in linear combination. 
Identifying $t$ as a time variable and $x = (x_1, \ldots, x_d)$ as a 
space variable, the Cauchy problem asks for a solution 
to (1) for t > 0 with the initial condition 

$$u(x, 0) = u_0(x). \tag{2}$$

By analogy with the theory of linear PDEs, one expects 
this problem to be well-posed only if the system is 
hyperbolic, which means that all the roots $\tau(\xi)$ (known 
as characteristics) of the polynomial equation 

$$\det\Bigl(\tau I + \sum_{i=1}^{d} A_i \xi_i\Bigr) = 0 \tag{3}$$

are real for all $\xi \in \mathbb{R}^d$ and, as eigenvalues of the matrix 
$\sum_i A_i \xi_i$, each has equal algebraic and geometric 
multiplicities [II.22]. 

In 1974 Fritz John showed that if $d = 1$, and the system 
is genuinely nonlinear (meaning that $\nabla \tau_i \cdot r_i \neq 0$ 
for each root $\tau_i$ of (3) and corresponding eigenvector 
$r_i$), then for smooth Cauchy data at least one component 
of $\nabla u$ tends to infinity in finite time, exactly 
as in the Burgers equation [III.4] (see also partial 
differential equations [IV.3 §3.6]). 

Characteristics in hyperbolic systems define the 
speed of propagation of signals in specific directions 
(normal to $\xi$), so genuine nonlinearity says that this 
speed is a nontrivial function of the state $u$. This has 
physical significance as a description of the phenomena 
modeled by conservation laws, and it has mathematical 
implications for the existence of smooth solutions. 
Specifically, the behavior seen in solutions of the Burg- 
ers equation typifies solutions of genuinely nonlinear 
hyperbolic systems. 
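
The finite-time gradient blow-up can be made concrete for the inviscid Burgers equation $u_t + u u_x = 0$. Along the characteristic starting at $x_0$, the solution satisfies $u_x = u_0'(x_0)/(1 + t\,u_0'(x_0))$, so the gradient first becomes infinite at $T = -1/\min u_0'$. The Python sketch below estimates $T$ by sampling; the initial data $u_0(x) = \sin x$ and the helper names are illustrative assumptions, not taken from the text.

```python
import math

def du0(x):
    """Derivative of the assumed initial data u0(x) = sin(x)."""
    return math.cos(x)

def breaking_time(du0, a, b, samples=10001):
    """Estimate T = -1 / min u0'(x) on [a, b]; infinite if u0' >= 0 everywhere."""
    m = min(du0(a + (b - a) * i / (samples - 1)) for i in range(samples))
    return -1.0 / m if m < 0 else math.inf

def ux_on_characteristic(x0, t):
    """u_x at time t on the characteristic through x0: u0' / (1 + t * u0')."""
    return du0(x0) / (1.0 + t * du0(x0))

T = breaking_time(du0, 0.0, 2 * math.pi)       # steepest descent is at x = pi
steep = ux_on_characteristic(math.pi, 0.999)   # gradient just before T
```

For this choice of data the steepest negative slope is $-1$, so the estimate of $T$ is close to 1 and the gradient on that characteristic grows without bound as $t$ approaches $T$.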

Furthermore, despite the fact that distribution 
solutions [IV.3 §5.2] are well defined for linear hyperbolic 
equations, the concept fails for quasilinear systems 
since, in the first place, $A_i$ and $A_i u_{x_i}$ are not 
defined if $u$ lacks sufficient smoothness, and, in the second, 
the standard procedure of creating the weak form 
of an equation (multiply by a smooth test function and 
integrate by parts) does not usually succeed in eliminating 
$\nabla u$ from the system when $A = A(u)$ depends in a 
nontrivial way on $u$. The exception is when each $A_i u_{x_i}$ 
is itself a derivative: $A_i u_{x_i} = \partial_{x_i} f_i(u)$. This happens 
if each row of each $A_i$ is a gradient, and that happens 
only if the requisite mixed partial derivatives are equal. 
In this case, we have a system of balance laws: 
$$u_t + \sum_{i=1}^{d} (f_i(u))_{x_i} + b(x, t, u) = 0. \tag{4}$$

In the important case in which $b = 0$, we have a system 
of conservation laws. The weak form of (4) is 

$$\iint \Bigl[ u \varphi_t + \sum_{i=1}^{d} f_i(u)\, \varphi_{x_i} - b(x, t, u)\, \varphi \Bigr]\, dx\, dt = 0. \tag{5}$$

Since this is the only case in which solutions to (1) can 
be unambiguously defined, the subject of quasilinear 
hyperbolic systems is often referred to as “conserva- 
tion laws.” 

A mathematical challenge in conservation laws is to 
find spaces of functions that are inclusive enough to 
admit weak solutions for general classes of conserva- 
tion laws but regular enough that solutions and their 
approximations can be analyzed. At this time, there is 
a satisfactory well-posedness theory only in a single 
space dimension. 


2 How Conservation Laws Arise 

Problems of importance in physics, engineering, and 
technology lead to systems of conservation laws; a 
sample selection of these problems follows. 

2.1 Compressible Flow 

The basic equations of compressible fluid flow, derived 
from the principles of conservation of mass, momen- 
tum, and energy, along with constitutive equations 


relating thermodynamic quantities, take the form 


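$$\begin{aligned} \rho_t + \operatorname{div}(\rho u) &= 0, \\ (\rho u)_t + \operatorname{div}(\rho u \otimes u) + \nabla p &= 0, \\ (\rho E)_t + \operatorname{div}(\rho u H) &= 0, \end{aligned} \tag{6}$$

where $\rho$ represents density, $u$ velocity, $p$ pressure, $E$ 
energy, and $H$ enthalpy, with 

$$E = \frac{1}{2}|u|^2 + \frac{1}{\gamma - 1}\frac{p}{\rho}, \qquad H = \gamma E,$$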

and $\gamma$ a constant that depends on the fluid ($\gamma = 1.4$ 
for air). To obtain the first equation in (6), one notes 
that the total amount of mass in an arbitrary control 
volume $D$ is the integral over $D$ of the density, and this 
changes in time if there is flux through the boundary $\Gamma$ 
of $D$. Furthermore, the flux is precisely the product of 
the density and the velocity normal to that boundary, 
from which we obtain 

$$\frac{d}{dt}\int_D \rho\, dV = -\int_\Gamma \rho\, u \cdot \nu\, dA. \tag{7}$$

(The negative sign will remind the reader of the convention 
that $\nu$ is the outward normal, and flow out of $D$ will 
decrease the mass contained in $D$.) Interchanging differentiation 
and integration on the left in (7), along with 
an application of the divergence theorem [I.2 §24] on 
the right, immediately yields 


$$\int_D \bigl(\rho_t + \operatorname{div}(\rho u)\bigr)\, dV = 0. \tag{8}$$


Finally, the observation that D is an arbitrary domain 
in the region allows one to pass to the infinitesimal 
version in (6). The integral version (8) also justifies the 
weak form (5), since if (8) holds on arbitrary domains 
then it is possible to form weighted averages with arbitrary 
differentiable functions $\varphi$ and to integrate by 
parts, which produces (5). 

In compressible flow, the speed of sound is finite; in 
(6) it is one of the characteristics. Steady flow at speeds 
that exceed the speed of sound also gives a hyperbolic 
system of conservation laws (6) with the time deriva- 
tives absent. In this case, the hyperbolic direction (the 
time-like variable) is given by the flow direction. 

Conservation principles also lead to equations for 
elasticity [IV.26 §3.3] and magnetohydrodynamics 
[IV.29]. Industrial applications include continuum mod- 
els for multiphase flow (e.g., water mixed with steam 
in nuclear reactor cooling systems, or multicomponent 
flows in oil reservoirs). 

The necessity of solving, or at least approximating, 
conservation laws for many of these applications has 
resulted in extensive techniques for numerical simulation 
of solutions, even when existence of solutions 
remains an open question. 


2.2 Chromatography 


Chromatography is a widely used industrial process for 
separating chemical components of a mixture by differ- 
ential adsorption on a substrate. Modeling a chromato- 
graphic column leads to a system of conservation laws 
in a single space variable that takes the form 

$$c_x + (f(c))_t = 0,$$

where $c = (c_1, \ldots, c_n)$ is a vector of component concentrations 
and $f$ is the equilibrium column isotherm. 
A common model for $f$ uses the Langmuir isotherm 
and gives, with positive parameters $a_i$ measuring the 
relative adsorption rates, 

$$f_i = c_i + \frac{a_i c_i}{1 + \sum_j c_j}, \qquad 1 \leq i \leq n.$$


2.3 Other Models 


Many other physical phenomena lead naturally to con- 
servation laws. For example, a continuum model for 
vehicular traffic on a one-way road is the scalar equa- 
tion 

$$u_t + q(u)_x = 0,$$

where u represents the linear density of traffic and 
q(u) = uv(u) the flux, where v is velocity. As in (7), 
this equation is a conservation law, the “law of con- 
servation of cars.” This model assumes that the veloc- 
ity at which traffic moves depends only on the traffic 
density. Although this model is too simple to be of 
much practical use, it is appealing as a pedagogical tool. 
Adaptations of it are of interest in current research. 
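
To see the "conservation of cars" numerically, the sketch below advances the traffic equation on a periodic road with a Lax-Friedrichs finite-volume step and checks that the total number of cars is unchanged. The velocity law $v(u) = v_{\max}(1 - u/u_{\max})$ (often called the Greenshields relation), the Lax-Friedrichs scheme, the periodic boundary, and all numerical values are assumptions made for illustration; none of them comes from the text.

```python
def flux(u, v_max=1.0, u_max=1.0):
    """q(u) = u * v(u) with the assumed law v(u) = v_max * (1 - u / u_max)."""
    return u * v_max * (1.0 - u / u_max)

def lax_friedrichs_step(u, dx, dt):
    """One conservative Lax-Friedrichs step on a periodic grid."""
    n = len(u)
    new = []
    for i in range(n):
        left, right = u[i - 1], u[(i + 1) % n]   # u[-1] wraps around
        new.append(0.5 * (left + right)
                   - 0.5 * dt / dx * (flux(right) - flux(left)))
    return new

n, dx = 100, 0.01
dt = 0.5 * dx            # CFL condition: |q'(u)| <= v_max = 1 on [0, u_max]
u = [0.8 if 0.3 <= (i + 0.5) * dx <= 0.5 else 0.2 for i in range(n)]

total_before = sum(u) * dx
for _ in range(50):
    u = lax_friedrichs_step(u, dx, dt)
total_after = sum(u) * dx
```

Because each flux difference telescopes over the periodic grid, `total_after` agrees with `total_before` up to rounding, which is the discrete counterpart of the integral statement (7).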
II.7 Control 


A system is a collection of objects that interact and pro- 
duce various outputs in response to different inputs. 
Systems arise in a wide variety of situations and include 
chemical plants, cars, the human body, and a coun- 
try’s economy. Control problems associated with these 
systems include the production of a chemical, control 
of self-driving cars, the regulation of bodily functions 
such as temperature and heartbeat, and the control of 
debt. In each case one wants to have a way of con- 
trolling these processes automatically without direct 
human intervention. 

A general control system is depicted in figure 1. The 
state of the system is described by $n$ state variables $x_i$, 
and these span the state space. In general, the $x_i$ cannot 
be observed or measured directly, but $p$ output variables 
$y_i$, which depend on the $x_i$, are known. The system 
is controlled by manipulating $m$ control variables $u_i$. 

The system might be expressed as a system of dif- 
ference equations (discrete time) or differential equa- 
tions (continuous time). In the latter case a linear, 
time-invariant control problem takes the form 

$$\frac{dx}{dt} = Ax(t) + Bu(t),$$

$$y(t) = Cx(t) + Du(t),$$

where $A$, $B$, $C$, and $D$ are $n \times n$, $n \times m$, $p \times n$, and 
$p \times m$ matrices, respectively. This is known as a state-space 
system. In some cases an additional $n \times n$ matrix 
$E$, which is usually singular, premultiplies the $dx/dt$ 
term; these so-called descriptor systems or generalized 
state-space systems lead to differential-algebraic 
equations [I.2 §12]. 

A natural question is whether, given a starting value 
x(0), the input u can be chosen so that x takes a given 
value at time t. Questions of this form are fundamental 
in classical control theory. 

If feedback occurs from the outputs or state variables 
to the controller, then the system is called a closed-loop 
system. In output feedback, illustrated in figure 1, u 
depends on y, while in state feedback u depends on x. 
Figure 1 Control system. 


For example, in state feedback we may have $u = Fx$, for 
some $m \times n$ matrix $F$. Then $dx/dt = (A + BF)x$, which 
leads to questions about what properties the matrix 
A + BF can be given by suitable choice of F. 
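
A small worked instance may help. In the Python sketch below the matrices $A$ and $B$ (a double integrator) and the gain $F$ are hypothetical values chosen for illustration, and the helpers `closed_loop` and `eig2` are our own names. With this $F$, the eigenvalues of $A + BF$ move from the origin to $-1$ and $-2$, so the closed-loop system is stable.

```python
import cmath

def closed_loop(A, B, F):
    """Return A + B F for 2x2 A, 2x1 B, and 1x2 F given as nested lists."""
    return [[A[i][j] + B[i][0] * F[0][j] for j in range(2)] for i in range(2)]

def eig2(M):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

A = [[0.0, 1.0], [0.0, 0.0]]   # double integrator (illustrative choice)
B = [[0.0], [1.0]]
F = [[-2.0, -3.0]]             # hypothetical state-feedback gain

M = closed_loop(A, B, F)       # [[0, 1], [-2, -3]]
open_loop_eigs = eig2(A)
closed_loop_eigs = eig2(M)
```

Choosing $F$ to place the eigenvalues of $A + BF$ in this way is one example of the kind of property the question above refers to.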

For more details, see control theory [IV.34]. 


II.8 Convexity 

Didier Henrion 


The notion of convexity is central in applied mathemat- 
ics. It is also used in everyday life in connection with 
the curvature properties of a surface. For example, an 
optical lens is said to be convex if it is bulging outward. 

Convexity appears in ancient Greek geometry, e.g., in 
the description of the five regular convex space poly- 
hedra (the platonic solids). Archimedes (ca. 250 b.c.e.) 
seems to have been the first to give a rigorous defini- 
tion of convexity, similar to the geometric definition we 
use today: a set is convex if it contains all line segments 
between each of its points. 

In his study of singularities of real algebraic curves, 
Newton (ca. 1720) introduced a convex polygon in the 
plane built from the exponents of the monomials of 
the polynomial defining the curve; this is known as 
the Newton polygon. Cauchy (ca. 1840) studied convex 
curves and remarked, for example, that, if a closed con- 
vex curve is contained in a circle, then its perimeter 
is smaller than that of the circle. Convex polyhedra 
were studied by Fourier (ca. 1825) in connection with 
the problem of the solvability of linear inequalities. 

A central figure in the modern development of convexity 
is Minkowski, who was motivated by problems 
from number theory. In 1891 Minkowski proved that, 
in Euclidean space $\mathbb{R}^n$, every compact convex set with 
center at the origin and volume greater than $2^n$ contains 
at least one point with integer coordinates different 
from the origin. From Minkowski's work follows 
the classical isoperimetric inequality, which states that 
among all convex sets with given volume, the ball is 
the one with minimal surface area. In 1896 Minkowski 
considered systems of the form $Ax \geq 0$, where $A$ 
is a real $m \times n$ matrix and $x \in \mathbb{R}^n$. Together with 
the above-mentioned contribution by Fourier, this laid 
the groundwork for linear programming [IV.11 §3], 
which emerged in the late 1940s, with key contributions 
by Kantorovich (1912-86) and Dantzig (1914-2005). 
In the second half of the twentieth century, 
convexity was developed further by Fenchel (1905-88), 
Moreau (1923-), and Rockafellar (1935-), among many 
others. Convexity is now a key notion in many branches 
of applied mathematics: it is essential in mathematical 
programming (to ensure convergence of optimization 
algorithms), functional analysis (to ensure existence 
and uniqueness of solutions of problems of calculus 
of variations and optimal control), geometry (to 
classify sets and their invariants, or to relate geometrical 
quantities), and probability and statistics (to derive 
inequalities). 

Figure 1 (a) A convex set. (b) A nonconvex set. 

Convex objects can be thought of as the opposite, 
geometrically speaking, to fractal objects. Indeed, frac- 
tal objects arise in maximization problems (sponges, 
lungs, batteries) and they have a rough boundary. In 
contrast, convex objects arise in minimization prob- 
lems (isoperimetric problems, smallest energy) and 
they have a smoother boundary. 

Mathematically, a set $X$ is convex if, for all $x, y \in X$ 
and for all $\lambda \in [0, 1]$, $\lambda x + (1 - \lambda) y \in X$ (see figure 1). 
Geometrically, this means that the line segment 
between any two points of the set belongs to 
the set. A real-valued function $f : X \to \mathbb{R}$ is convex if, 
for all $x, y \in X$ and for all $\lambda \in [0, 1]$, it holds that 
$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$. Geometrically, 
this means that the line segment between any two 
points on the graph of the function lies above the graph 
(see figure 2). This is the same as saying that the epigraph 
$\{(x, y) : x \in X,\ y \geq f(x)\}$ is a convex set. If 
a function is twice continuously differentiable, convexity 
of the function is equivalent to nonnegativity of the 
quadratic form of the matrix of second-order partial 
derivatives (the Hessian). 

Figure 2 (a) A convex function. (b) A nonconvex function. 

If the function $f$ is convex, then 
$f(\lambda_1 x_1 + \cdots + \lambda_m x_m) \leq \lambda_1 f(x_1) + \cdots + \lambda_m f(x_m)$ 
for all $x_1, \ldots, x_m$ in $X$ and $\lambda$ in the $m$-dimensional unit simplex 
$\{\lambda \in \mathbb{R}^m : \lambda_1 + \cdots + \lambda_m = 1,\ \lambda_1 \geq 0, \ldots, \lambda_m \geq 0\}$. This 
is called Jensen's inequality, and more generally it 
can be expressed as $f(\int x\, \mu(dx)) \leq \int f(x)\, \mu(dx)$ for 
every probability measure $\mu$ supported on $X$, or equivalently 
as $f(E[x]) \leq E[f(x)]$, where $E$ denotes the 
expectation of a random variable. 
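
Jensen's inequality is easy to probe numerically. The Python sketch below (the helper name `jensen_gap`, the random sample points, and the seed are our own illustrative choices) computes the gap $\sum_i \lambda_i f(x_i) - f(\sum_i \lambda_i x_i)$, which should be nonnegative for a convex $f$ and can be negative otherwise.

```python
import math
import random

def jensen_gap(f, xs, weights):
    """Return sum_i w_i f(x_i) - f(sum_i w_i x_i); nonnegative if f is convex."""
    mean = sum(w * x for w, x in zip(weights, xs))
    return sum(w * f(x) for w, x in zip(weights, xs)) - f(mean)

random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(5)]
raw = [random.random() for _ in range(5)]
weights = [r / sum(raw) for r in raw]     # a point of the unit simplex

gap_exp = jensen_gap(math.exp, xs, weights)          # exp is convex
gap_sq = jensen_gap(lambda x: x * x, xs, weights)    # x^2 is convex
# sin is concave (not convex) on [0, pi], so the gap is negative there.
gap_sin = jensen_gap(math.sin, [0.0, math.pi], [0.5, 0.5])
```

A check of this kind cannot prove convexity, since it only samples finitely many points, but a single negative gap is enough to show that a function is not convex.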

A function $f$ is concave whenever the function $-f$ 
is convex. If a function $f$ is both convex and concave, 
it is affine. For this reason, convexity can sometimes 
be interpreted as a one-sided linearity, and in some 
instances (e.g., in problems of calculus of variations 
and partial differential equations), nonlinear convex 
functions behave similarly to linear functions. 

A set $X$ is a cone if $x \in X$ implies $\lambda x \in X$ for 
all $\lambda \geq 0$. A convex cone is therefore a set that is 
closed under addition and under multiplication by positive 
scalars. Convex cones are central in optimization, 
and conic programming is the minimization of a linear 
function over an affine section of a convex cone. 
Important examples of convex cones include the linear 
cone (also called the positive orthant), the quadratic 
cone (also called the Lorentz cone), and the semidefinite 
cone (which is the set of nonnegative quadratic 
forms, or, equivalently, the set of positive-semidefinite 
matrices). 

The convex hull of a set $X$ is the smallest closed convex 
set containing $X$, which is sometimes denoted by 
$\operatorname{conv} X$. If $X$ is the union of a finite number of points, 
then $\operatorname{conv} X$ is the polytope with vertices among these 
points. A theorem by Carathéodory states that given a 
set $X \subset \mathbb{R}^{n-1}$, every point of $\operatorname{conv} X$ can be expressed 
as $\lambda_1 x_1 + \cdots + \lambda_n x_n$ for some choice of $x_1, \ldots, x_n$ in 
$X$ and $\lambda$ in the $n$-dimensional unit simplex. 
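
For a finite set of points in the plane, the vertices of the convex hull can be computed directly. The sketch below uses Andrew's monotone chain algorithm, a standard method that the text does not mention; the function names and the sample points are illustrative.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counterclockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Hull vertices in counterclockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build the lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build the upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # endpoints are shared, drop duplicates

pts = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = convex_hull(pts)
```

Here the interior point $(1, 1)$ and the boundary point $(1, 0)$ are discarded, leaving the four corners of the square as the vertices of the polytope.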

A theorem of Minkowski (generalized to infinite- 
dimensional spaces in 1940 by Krein and Milman) 
states that every compact convex set is the closure of 
the convex hull of its extreme points (a point $x \in X$ is 
extreme if $x = (x_1 + x_2)/2$ for some $x_1, x_2 \in X$ implies 
$x_1 = x_2$). Finally, we mention the Brunn-Minkowski 
theorem, which relates the volume of the sum of two 
compact convex sets (all points that can be obtained by 
adding a point of the first set to a point of the second 
set) to the respective volumes of the sets. 
II.9 Dimensional Analysis and Scaling 

Daniela Calvetti and Erkki Somersalo 


Dimensional analysis makes it possible to analyze in 
a systematic way dimensional relationships between 
physical quantities defining a model. In order to explain 
how this works, we need to introduce some definitions 
and establish the notation that will be used below. 
Consider a system that contains $n$ physical quantities, 
$q_1, q_2, \ldots, q_n$, that we believe to be relevant for 
describing the system’s behavior, the quantities being 
expressed using $r$ fundamental units, or dimensions, 
denoted by $d_1, d_2, \ldots, d_r$. The generally accepted SI 
unit system consists of $r = 7$ basic dimensions and 
numerous derived dimensions. More precisely, $d_1$ is the 
meter (m), $d_2$ the second (s), $d_3$ the kilogram (kg), $d_4$ 
the ampere (A), $d_5$ the mole (mol), $d_6$ the kelvin (K), and 
$d_7$ the candela (cd). The number $r$ can be smaller, since 
not all units are always needed. The physical dimension 
of a quantity q is denoted by [q]. 

A meaningful mathematical relation between the 
quantities qj should obey the principle of dimensional 
homogeneity, which can be summarized as follows: 
summing up quantities is meaningful only if all the 
terms have the same dimension. Furthermore, any func- 
tional relation of the type 

$$f(q_1, q_2, \ldots, q_n) = 0 \tag{1}$$

should remain valid if expressed in different units. 
In other words, since dimensional scaling must not 
change the equation, it is natural to seek to express 
the relations in terms of dimensionless quantities. It 
is therefore not a surprise that dimensionless quantities, 
known as $\Pi$-numbers, have a central role in dimensional 
analysis. A canonical example of a $\Pi$-number 
is $\pi$, the invariant ratio of the circumference and the 
diameter of circles of all sizes. 

Given a system described by the physical quantities 
$q_1, q_2, \ldots, q_n$, we will define a $\Pi$-number, or a 
dimensionless group, to be any combination of those 
quantities of the form 

$$R = q_1^{\mu_1} q_2^{\mu_2} \cdots q_n^{\mu_n}, \tag{2}$$

where the $\mu_j$ are rational numbers, not all equal to zero, 
and $R$ is dimensionless. If, in such a system, we are able 
to identify $k$ $\Pi$-numbers, $R_1, \ldots, R_k$, that characterize 
it, we can describe it with a dimensionless version of 
(1) of the form 

$$\varphi(R_1, R_2, \ldots, R_k) = 0.$$

The advantage of the latter formulation is that it automatically 
satisfies the dimensional homogeneity; moreover, 
it does not change with any scaling of the model 
that leaves the values of the $\Pi$-numbers invariant. 
These points are best clarified by a classical example 
of dimensional analysis. 


Consider steady fluid flow in a pipe of constant diameter 
$D$. The fluid is assumed to be incompressible, having 
density $\rho$ and viscosity $\mu$. By denoting the pressure 
drop across a distance $L$ by $\Delta p$ and the (average) 
velocity by $v$, we may assume that there is an algebraic 
relation between the quantities: 

$$f(L, D, \rho, \mu, v, \Delta p) = 0. \tag{3}$$


In SI units, the dimensions of the variables involved are 


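$$[L] = [D] = \mathrm{m}, \quad [\rho] = \frac{\mathrm{kg}}{\mathrm{m}^3}, \quad [\mu] = \frac{\mathrm{kg}}{\mathrm{m}\,\mathrm{s}}, \quad [v] = \frac{\mathrm{m}}{\mathrm{s}}, \quad [\Delta p] = \frac{\mathrm{kg}}{\mathrm{m}\,\mathrm{s}^2}. \tag{4}$$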


In his classic paper of 1883, Osborne Reynolds sug- 
gested a scaling law of the form 


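$$\Delta p = \rho v^2 \frac{L}{D}\, F\!\left(\frac{\rho v D}{\mu}\right), \tag{5}$$

where $F$ is some function; Reynolds himself considered 
the power law $F(R) = cR^{-n}$ with different values of 
$n$ and experimentally validated it. Equation (5) can be 
seen as a dimensionless version of (3), 

$$\varphi(R_1, R_2, R_3) = 0,$$

where 

$$R_1 = \frac{\rho v D}{\mu}, \qquad R_2 = \frac{\Delta p}{\rho v^2}, \qquad R_3 = \frac{L}{D}.$$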

The quantities $R_1$ and $R_2$ are known as the Reynolds 
number and the Euler number, respectively, and it is 
a straightforward matter to check that $R_1$, $R_2$, and $R_3$ 
are dimensionless. The scaling law (5) has been experimentally 
validated in a range of geometric settings. An 
example of its use is the design of miniature models. 
If the dimensions are scaled by a factor $\alpha$, $L \to \alpha L$, 
$D \to \alpha D$, we may assume that the flow in the miniature 
model gives a good prediction for the actual system 
if we scale the velocity and pressure as $v \to v/\alpha$ 
and $\Delta p \to \Delta p/\alpha^2$, leaving the dimensionless quantities 
intact. 
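
The invariance claimed for the miniature model can be checked with a few lines of Python. The sketch takes the three dimensionless groups to be the Reynolds number $\rho v D/\mu$, the Euler number $\Delta p/(\rho v^2)$, and the aspect ratio $L/D$; the numerical values and the scale factor are hypothetical and serve only as an illustration.

```python
def pi_numbers(L, D, rho, mu, v, dp):
    """Return (Reynolds number, Euler number, L/D) for pipe flow."""
    return rho * v * D / mu, dp / (rho * v ** 2), L / D

# Hypothetical full-scale values in SI units.
full = dict(L=10.0, D=0.1, rho=1000.0, mu=1.0e-3, v=2.0, dp=5.0e3)

alpha = 0.25   # geometric scale factor of the miniature model
model = dict(full,
             L=alpha * full["L"], D=alpha * full["D"],
             v=full["v"] / alpha, dp=full["dp"] / alpha ** 2)

R_full = pi_numbers(**full)
R_model = pi_numbers(**model)   # should agree with R_full up to rounding
```

The fluid properties are left unchanged, so only the velocity and the pressure drop have to be rescaled to keep all three groups fixed.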

In view of the above example it is natural to ask how 
many $\Pi$-numbers characterize a given system and if 
there is a systematic way of finding them. To address 
these questions it is important to identify possible 
redundancy among the physical quantities, on the one 
hand, and the dimensions, on the other. With this in 
mind we introduce the concepts of independency and 
relevance of the dimensions. 

The dimensions $d_1, \ldots, d_r$ are independent if none 
can be expressed as a rational product of the others, 
that is, 

$$d_1^{\alpha_1} d_2^{\alpha_2} \cdots d_r^{\alpha_r} = 1 \tag{6}$$

if and only if $\alpha_1 = \alpha_2 = \cdots = \alpha_r = 0$. The dimensions 
$d_j$ may be the fundamental dimensions of the SI unit 
system or derived dimensions, such as the newton 
($\mathrm{N} = \mathrm{kg} \cdot \mathrm{m}/\mathrm{s}^2$). 

It is not a coincidence that this definition strongly 
resembles that of linear independency in linear algebra, 
as will become evident later. 

Let a system be described by $n$ quantities, $q_1, \ldots, q_n$, 
and $r$ dimensions, $d_1, \ldots, d_r$, with the dimensional 
dependency 

$$[q_j] = d_1^{\mu_{j1}} d_2^{\mu_{j2}} \cdots d_r^{\mu_{jr}}, \quad 1 \leq j \leq n. \tag{7}$$

We say that the dimensions $d_k$, $1 \leq k \leq r$, are relevant 
if for each $d_k$ there are rational coefficients $\alpha_{kj}$ such 
that 

$$d_k = [q_1]^{\alpha_{k1}} [q_2]^{\alpha_{k2}} \cdots [q_n]^{\alpha_{kn}}. \tag{8}$$

In other words, the dimensions $d_k$ are relevant if they 
can be expressed in terms of the dimensions of the variables 
$q_j$. It follows immediately that, if the quantities $q_j$ 
can be measured, then there must exist an operational 
description of all units in terms of the measurements. 
Identifying relevant quantities may be more subtle than 
it seems. 

For the sake of definiteness, assume that we adhere 
to the SI system, and denote the seven basic SI units by 
$e_1, e_2, \ldots, e_7$, the ordering being unimportant. We now 
proceed to define an associated dimension space: to 
each $e_i$ we associate a vector $\mathbf{e}_i \in \mathbb{R}^7$, where $\mathbf{e}_i$ is the 
$i$th unit coordinate vector. Further, we define a group 
homomorphism between the $\mathbb{Q}$-moduli of dimensions 
and vectors; since any dimension $d$ can be represented 
in the SI system in terms of the seven basic units $e_i$ as 

$$d = e_1^{\nu_1} \cdots e_7^{\nu_7},$$

we associate $d$ with a vector $\mathbf{d}$, where 

$$\mathbf{d} = \nu_1 \mathbf{e}_1 + \cdots + \nu_7 \mathbf{e}_7.$$

Along these lines, we associate with a quantity $q$ with 
dimensions 

$$[q] = d_1^{\mu_1} \cdots d_r^{\mu_r}$$

the vector 

$$\mathbf{q} = \mu_1 \mathbf{d}_1 + \cdots + \mu_r \mathbf{d}_r.$$

It is straightforward to verify that the representation of 
$\mathbf{q}$ in terms of the basis vectors $\mathbf{e}_j$ is unambiguous. 

We are now ready to revisit independency of units in 
the light of the associated vectors. In linear algebraic 
terms, condition (6) is equivalent to saying that 

$$\alpha_1 \mathbf{d}_1 + \cdots + \alpha_r \mathbf{d}_r = 0,$$


and therefore the independency of dimensions is equiv- 
alent to the linear independency of the corresponding 
dimension vectors. 

Next we look for a connection with linear algebra 
to help us reinterpret the concept of relevance. In the 
dimension space, condition (7) can be expressed as 

$$\mathbf{q}_j = \mu_{j1} \mathbf{d}_1 + \cdots + \mu_{jr} \mathbf{d}_r = \sum_{k=1}^{r} \mu_{jk} \mathbf{d}_k,$$

which implies that every $\mathbf{q}_j$ is in the subspace spanned 
by the vectors $\mathbf{d}_k$, while the linear algebraic formulation 
of condition (8), 

$$\mathbf{d}_k = \alpha_{k1} \mathbf{q}_1 + \cdots + \alpha_{kn} \mathbf{q}_n = \sum_{\ell=1}^{n} \alpha_{k\ell} \mathbf{q}_\ell,$$

states that the vectors $\mathbf{d}_k$ are in the subspace spanned 
by the vectors $\mathbf{q}_\ell$. We therefore conclude that the relevance 
of dimensions is equivalent to the condition 
that 

$$\operatorname{span}\{\mathbf{q}_1, \ldots, \mathbf{q}_n\} = \operatorname{span}\{\mathbf{d}_1, \ldots, \mathbf{d}_r\}.$$

It is obvious that when $n > r$, there must be redundancy 
among the quantities because the subspace can 
be spanned by fewer than $n$ vectors. This redundancy 
is indeed the key to the theory of $\Pi$-numbers. 

Let us take a second look at the definition of $\Pi$-number, 
(2). In order for a quantity to be dimensionless, the 
coefficients of the dimension vectors must all vanish, 
which, in the new formalism, is equivalent to the corresponding 
dimension vector being the zero vector. In 
other words, equation (2) is equivalent to 

$$\mu_1 \mathbf{q}_1 + \mu_2 \mathbf{q}_2 + \cdots + \mu_n \mathbf{q}_n = \mathbf{R} = 0.$$

If we now define the dimension matrix of the quantities 
$q_1, \ldots, q_n$ to be 

$$Q = [\mathbf{q}_1 \ \mathbf{q}_2 \ \cdots \ \mathbf{q}_n] \in \mathbb{R}^{r \times n},$$

we can immediately verify that the vector $\mu \in \mathbb{R}^n$ with 
entries $\mu_j$ must satisfy $Q\mu = 0$, so $\mu$ must be in the null 
space of $Q$, $\mathcal{N}(Q)$. 

We can now restate the definition of $\Pi$-number in 
the language of linear algebra: $R = q_1^{\mu_1} \cdots q_n^{\mu_n}$ is a 
$\Pi$-number if and only if $\mu \in \mathcal{N}(Q)$. 

It is a central question in dimensional analysis how 
many essentially different $\Pi$-numbers can be found 
that correspond to a given system. If $R_1$ and $R_2$ are 
two $\Pi$-numbers, their product and ratio are also 
$\Pi$-numbers, yet they are not independent. To find out 
how to determine which $\Pi$-numbers are independent, 
assume that $R_1$ and $R_2$ correspond to vectors $\mu$ and $\nu$ 
in the null space of $Q$. From the observation that 

$$R_1 \times R_2 = q_1^{\mu_1} \cdots q_n^{\mu_n} \times q_1^{\nu_1} \cdots q_n^{\nu_n} = q_1^{\mu_1 + \nu_1} \cdots q_n^{\mu_n + \nu_n},$$

it follows that multiplication of two $\Pi$-numbers corresponds 
to addition of the corresponding vectors in 
the null space of the dimension matrix. This naturally 
leads to the definition that the $\Pi$-numbers $\{R_1, \ldots, R_k\}$ 
are essentially different if the corresponding coefficient 
vectors in $\mathcal{N}(Q)$ are linearly independent. In particular, 
the number of essentially different $\Pi$-numbers is 
equal to the dimension of $\mathcal{N}(Q)$, and a maximal set of 
essentially different $\Pi$-numbers corresponds to a basis 
for $\mathcal{N}(Q)$. 

It is now easy to state the following central theorem 
of dimensional analysis, which is a corollary of the theorem 
about the dimensions of the four fundamental 
subspaces [I.2 §21] of a dimension matrix. 

Buckingham’s $\Pi$ theorem. If a physical problem is 
described by $n$ variables, with every variable expressed 
in terms of $r$ independent and relevant dimensions, 
the number of essentially different $\Pi$-numbers (dimensionless 
groups whose numerical values depend on the 
properties of the system) is at most $n - r$. 

It is important to stress that the number of essentially 
different $\Pi$-numbers is “at most” $n - r$ because 
the system may actually admit fewer. It is a nice corollary 
that the $\Pi$-numbers of a system can be found by 
computing a basis for the null space of the dimension 
matrix by Gaussian elimination, which results in 
rational coefficients. 

Returning to our example from fluid dynamics, let a 
system be described by the five quantities length ($L$), 
a characteristic scalar velocity ($v_0$), density ($\rho$), viscosity 
($\mu$), and pressure ($p$), the dimensions of which were 
given in (4). We characterize the system with three SI 
units, m, s, and kg. The dimension matrix in this case, 
with rows corresponding to m, s, and kg and columns 
to $L$, $v_0$, $\rho$, $\mu$, and $p$, is 

$$Q = \begin{pmatrix} 1 & 1 & -3 & -1 & -1 \\ 0 & -1 & 0 & -1 & -2 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix}.$$

To find a basis of the null space we reduce the matrix 
to its row echelon form by Gauss-Jordan elimination, 
which shows that its rank is three. This implies that the 
null space is two dimensional, with a basis consisting 
of the two vectors 

$$(1, 1, 1, -1, 0)^{T}, \qquad (0, -2, -1, 0, 1)^{T},$$

corresponding to the Reynolds number and the Euler 
number, respectively: 

$$R_1 = L^{1} v_0^{1} \rho^{1} \mu^{-1} p^{0}, \qquad R_2 = L^{0} v_0^{-2} \rho^{-1} \mu^{0} p^{1}.$$
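
The null-space computation can be carried out in exact rational arithmetic. The Python sketch below performs Gauss-Jordan elimination with `fractions.Fraction`; the function name `null_space` is ours, and the matrix assumes the ordering of rows (m, s, kg) and columns ($L$, $v_0$, $\rho$, $\mu$, $p$) implied by the dimensions in (4).

```python
from fractions import Fraction

def null_space(rows):
    """Basis of the null space of a rational matrix via Gauss-Jordan elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    pivots, r = [], 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if pivot is None:
            continue                       # no pivot here: c is a free column
        m[r], m[pivot] = m[pivot], m[r]
        m[r] = [x / m[r][c] for x in m[r]]  # scale the pivot entry to 1
        for i in range(n_rows):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    free = [c for c in range(n_cols) if c not in pivots]
    basis = []
    for f in free:                         # one basis vector per free column
        vec = [Fraction(0)] * n_cols
        vec[f] = Fraction(1)
        for row, c in zip(m, pivots):
            vec[c] = -row[f]
        basis.append(vec)
    return basis

# Rows: m, s, kg.  Columns: L, v0, rho, mu, p.
Q = [[1, 1, -3, -1, -1],
     [0, -1, 0, -1, -2],
     [0, 0, 1, 1, 1]]
basis = null_space(Q)
```

With this ordering the routine returns the exponent vectors $(-1, -1, -1, 1, 0)$ and $(0, -2, -1, 0, 1)$. The first is the negative of the Reynolds exponents, that is, it represents $1/R_1$, and the second is the Euler number $R_2$; any basis of the null space gives an equivalent set of $\Pi$-numbers.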

To appreciate the usefulness of finding these $\Pi$-numbers, 
consider the nondimensionalization of the 
Navier-Stokes equation [III.23], 

$$\rho\left(\frac{\partial v}{\partial t} + v \cdot \nabla v\right) = -\nabla p + \mu \Delta v,$$

where $\Delta = \nabla \cdot \nabla$. Assuming that a characteristic 
speed $v_0$ (e.g., an asymptotic value) and a characteristic 
length scale $L$ are given, first we nondimensionalize 
the velocity and the spatial variable, writing 

$$v = v_0 \vartheta, \qquad x = L\xi,$$

and then we define a dimensionless pressure field 
based on the nondimensionality of the Euler number 
$R_2$, 

$$\pi(\xi) = \frac{1}{\rho v_0^2}\, p(L\xi),$$

arriving at the scaled version of the equation: 

$$\frac{\rho v_0^2}{L}\left(\frac{L}{v_0}\frac{\partial \vartheta}{\partial t} + \vartheta \cdot \nabla' \vartheta\right) = -\frac{\rho v_0^2}{L}\nabla' \pi + \frac{\mu v_0}{L^2}\Delta' \vartheta,$$

where $\nabla' = \nabla_\xi$ and $\Delta' = \nabla' \cdot \nabla'$. By going further 
and defining the time in terms of the characteristic 
timescale $L/v_0$, 

$$t = \frac{L}{v_0}\tau,$$

the nondimensional version of the Navier-Stokes equation 
ensues: 

$$\frac{\partial \vartheta}{\partial \tau} + \vartheta \cdot \nabla' \vartheta = -\nabla' \pi + \frac{1}{R_1}\Delta' \vartheta.$$

This form provides a natural justification for the different 
approximations corresponding to, for example, 
nonviscous fluid flow ($R_1$ large) or nonturbulent flow 
($R_1$ small). 
Further Reading 

Barenblatt, G. I. 1987. Dimensional Analysis. New York: 
Gordon & Breach. 

Barenblatt, G. I. 2003. Scaling. Cambridge: Cambridge University 
Press. 

Calvetti, D., and E. Somersalo. 2012. Computational Mathe- 
matical Modeling: An Integrated Approach across Scales. 
Philadelphia, PA: SIAM. 

Mattheij, R. M. M., S. W. Rienstra, and J. H. M. ten Thije 
Boonkkamp. 2005. Partial Differential Equations: Model- 
ing, Analysis and Computation. Philadelphia, PA: SIAM. 
II.10 The Fast Fourier Transform 

Daniel N. Rockmore 


In 1965 James Cooley and John Tukey wrote a brief 
article (a note, really) that laid out an efficient method 
for computing the various trigonometric sums neces- 
sary for computing or approximating the Fourier trans- 
form of a function on the real line. While theirs was 
not the first such article (it was later discovered that 
the algorithm’s fundamental step was first sketched in 
papers of Gauss), what was very different was the con- 
text. Newly invented analog-to-digital converters had 
now enabled the accumulation of (for the time) extraor- 
dinarily large data sets of sampled time series, whose 
analysis required the computation of the underlying 
signal’s Fourier transform. In this new world of 1960s 
“big data,” a clever reduction in computational com- 
plexity (a term not yet widely in use) could make a 
tremendous difference.¹ 

1. Many years later Cooley told me that he believed that the fast 
Fourier transform could be thought of as one of the inspirations for 
asymptotic algorithmic analysis and the study of computational complexity, 
as previous to the publication of his paper with Tukey very few 
people had considered data sets large enough to suggest the utility of 
an asymptotic analysis. 

While the Cooley-Tukey approach is what is usually 
associated with the phrase “fast Fourier transform” (or 
“FFT”), this term more correctly refers to a family of 
algorithms designed to accomplish the efficient calculation 
of the Fourier transform [II.19] (or an approximation 
thereof) of a real-valued function $f$ sampled at 
points $x_j$ (on either the real line, the unit interval, or 
the unit circle): samples go in and Fourier coefficients 
are returned. The discrete sums of interest 

$$\hat{f}(k) = \sum_{j=0}^{n-1} f(j)\, \omega_n^{jk} \tag{1}$$

computed for each $k = 0, \ldots, n - 1$, where $\omega_n = e^{2\pi i/n}$ 
is a primitive $n$th root of unity and $f(j) = f(x_j)$, 
make up what is usually called the “discrete Fourier 
transform” (DFT). This can be written succinctly as the 
outcome of the matrix-vector multiplication 

$$\hat{f} = \Omega f, \tag{2}$$

where the $(j, k)$ element of $\Omega$ is $\omega_n^{jk}$. 

1 The Cooley-Tukey FFT 

If computed directly, the DFT requires $n^2$ multiplications 
and $n(n - 1)$ additions, or $2n^2 - n$ arithmetic operations 
(assuming the $f(j)$ values and the powers of the 
root of unity have been precomputed and stored). Note 
that this is approximately $2n^2$ (and, asymptotically, 
$O(n^2)$) operations. The “classical” FFT (i.e., the Cooley-Tukey 
FFT) can be employed in the case in which $n$ can 
be factored, $n = pq$, whereupon we can take advantage 
of a concomitant factorization of the calculation 
(which, in turn, is a factorization of the matrix $\Omega$) that 
can be cast as a divide and conquer algorithm 
[I.4 §3], writing the DFT of order $n$ as $p$ DFTs of order 
$q$ (or $q$ DFTs of order $p$). More explicitly, in this case we 
can write 

j = j(a, b) = aq + b, 0 ^ a < p, 0 ^ b < q, 
k = k(c,d ) = cp + d, 0 ^ c < q, 0 ^ d < p, 
so that (1) can be rewritten as 

f{c,d) = Y con icp+d) Y f(cL,b)w“ d (3) 

b= 0 a= 0 

using the fact that a = Wp d . 

Computation of $\hat{f}$ is now performed in two steps.

First, compute for each $b$ the inner sums (for all $d$)

$$\tilde{f}(b,d) = \sum_{a=0}^{p-1} f(a,b)\,\omega_p^{ad}, \tag{4}$$

which have the form of DFTs of length $p$ equispaced
among multiples of $q$. In engineering language, (4)
would be called “a subsampled DFT of length $p$.”

Direct calculation of all the $\tilde{f}(b,d)$ requires
$pq[p + (p-1)]$ arithmetic operations. Step two is to then
compute an additional $pq$ transforms of length $q$,

$$\hat{f}(c,d) = \sum_{b=0}^{q-1} \omega_n^{b(cp+d)}\,\tilde{f}(b,d),$$

requiring at most an additional $pq[q + (q-1)]$ operations
to complete the calculation. Thus, instead of the
approximately $2n^2 = 2(pq)^2$ operations required by
direct computation, the above algorithm uses approximately
$2(pq)(p+q)$ operations. If $n$ can be factored
further, this approach works even better. When $n$ is a
power of two, the successive splittings of the calculation
give the well-known $O(n \log_2 n)$ complexity result
(in comparison to $O(n^2)$).
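
The two-step computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dft_direct` and `dft_two_step` and the sample data are hypothetical, and the code performs a single $n = pq$ splitting rather than the recursive splitting a production FFT would use.

```python
import cmath

def dft_direct(f):
    """Direct evaluation of (1): roughly 2n^2 arithmetic operations."""
    n = len(f)
    w = cmath.exp(2j * cmath.pi / n)              # omega_n
    return [sum(f[j] * w ** (j * k) for j in range(n)) for k in range(n)]

def dft_two_step(f, p, q):
    """One Cooley-Tukey splitting for n = p*q, with j = a*q + b and k = c*p + d."""
    n = p * q
    assert len(f) == n
    w_n = cmath.exp(2j * cmath.pi / n)
    w_p = cmath.exp(2j * cmath.pi / p)
    # Step 1, equation (4): for each b, a length-p DFT over the samples a*q + b.
    inner = [[sum(f[a * q + b] * w_p ** (a * d) for a in range(p))
              for d in range(p)] for b in range(q)]
    # Step 2: combine the inner sums over b for every output index k = c*p + d.
    out = [0j] * n
    for c in range(q):
        for d in range(p):
            k = c * p + d
            out[k] = sum(w_n ** (b * k) * inner[b][d] for b in range(q))
    return out

samples = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]         # n = 6 = 2 * 3
direct = dft_direct(samples)
split = dft_two_step(samples, p=2, q=3)
max_diff = max(abs(u - v) for u, v in zip(direct, split))
```

Both routines should agree to rounding error, so `max_diff` is expected to be tiny; the saving in operation count only becomes visible for larger $n$ and repeated splitting.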

Since $\Omega^*\Omega = nI$, from (2) we have $f = n^{-1}\Omega^*\hat{f}$,
so the discretized function $f = (f(0), \dots, f(n-1))$
(sample values) can be recovered from its Fourier coefficients via

$$f(m) = \frac{1}{n}\sum_{k} \hat{f}(k)\,\omega_n^{-mk},$$

a so-called inverse transform. The inverse transform
expresses $f$ as a superposition of (sampled) exponentials
or, equivalently, sines and cosines of frequencies
that are multiples of $2\pi/n$, so that if we think of $f$ as a
function of time, the DFT is a change of basis from the
“time domain” to the “frequency domain.”

In the case in which $n = 2N - 1$ and the $f(x_j)$ represent
equispaced samples of a bandlimited function
on the circle (or, equivalently, on the unit interval),
so that $x_j = j/n$, and of bandlimit $N$ (i.e., $\hat{f}(k) = 0$
for all $|k| \ge N$), the sums (up to a normalization)
exactly compute the Fourier coefficients of the function
$f$ (suitably indexed). The form of the inverse transform
can itself be restated as a DFT, so that an FFT
enables the efficient change of basis between the time
and frequency domains.

The utility of an efficient algorithm for computing
these sums cannot be overstated, occupying as it does
a central position in the world of signal processing
[IV.35], image processing [VII.8], and information
processing [IV.36], not only for the intrinsic interest
in the Fourier coefficients (say, in various forms of
spectral analysis, especially for time series) but also
for their use in effecting an efficient convolution of
data sequences via the relation (for two functions on
$n$ points)

$$\widehat{(f * g)}(k) = \hat{f}(k)\,\hat{g}(k),$$

where

$$(f * g)(k) = \sum_{m=0}^{n-1} f(k-m)\,g(m). \tag{5}$$

If computed directly for all $k$, (5) requires
$n[n + (n-1)] = O(n^2)$ operations. An efficient FFT-based
convolution is effected by first computing $\hat{f}$ and $\hat{g}$, then using
$n$ operations for pointwise multiplication of the transformed
sequences, and then using another FFT for the
efficient inverse transform back to the time domain.
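
The transform, multiply, invert recipe can be checked on a small example. The sketch below is an assumption-laden illustration: the helper names are hypothetical, indices in (5) are taken modulo $n$ (cyclic convolution), and a naive $O(n^2)$ DFT stands in for the FFT so that the block stays self-contained.

```python
import cmath

def dft(f, sign=1):
    """Naive DFT with kernel exp(sign * 2*pi*i*j*k/n); a stand-in for an FFT."""
    n = len(f)
    w = cmath.exp(sign * 2j * cmath.pi / n)
    return [sum(f[j] * w ** (j * k) for j in range(n)) for k in range(n)]

def inverse_dft(fhat):
    """Inverse transform: conjugate kernel and division by n."""
    n = len(fhat)
    return [v / n for v in dft(fhat, sign=-1)]

def cyclic_convolution_direct(f, g):
    """Equation (5) with the index k - m taken modulo n."""
    n = len(f)
    return [sum(f[(k - m) % n] * g[m] for m in range(n)) for k in range(n)]

def cyclic_convolution_via_dft(f, g):
    """Transform both sequences, multiply pointwise, transform back."""
    product = [a * b for a, b in zip(dft(f), dft(g))]
    return inverse_dft(product)

f = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 0.0, 0.0, 1.0]
direct = cyclic_convolution_direct(f, g)
via_dft = cyclic_convolution_via_dft(f, g)
```

Swapping the naive `dft` for a genuine FFT routine is what turns the second route into an $O(n \log_2 n)$ method.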

This relationship is the key to FFTs that work for
data streams of prime length $p$. The best-known ideas
make use of rewriting the DFT at nonzero frequencies
in terms of a convolution of length $p - 1$ and then computing
the DFT at the zero frequency directly. One well-known
example is Rader’s prime FFT, which uses the
fact that we can find a generator $g$ of $(\mathbb{Z}/p\mathbb{Z})^\times$, a cyclic
group (under multiplication) of order $p - 1$, to write
$\hat{f}(g^{-b})$ as

$$\hat{f}(g^{-b}) = f(0) + \sum_{a=0}^{p-2} f(g^a)\,e^{2\pi i g^{a-b}/p}. \tag{6}$$

The summation in (6) has the form of a convolution of
length $p - 1$ of the sequence $f'(a) = f(g^a)$ with the
function $z(a) = e^{2\pi i g^a/p}$.

Through the use of these kinds of reductions (contributions
by various members of the “FFT family”) we
achieve a general $O(n \log_2 n)$ algorithm.
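
The reindexing behind (6) can be illustrated for $p = 7$ with generator $g = 3$. This is a sketch only: `rader_dft` is a hypothetical name, and the length-$(p-1)$ sum is evaluated directly here, whereas the point of Rader's method is to evaluate it with an FFT-based convolution.

```python
import cmath

def dft_direct(f):
    """Direct evaluation of the DFT, used as a reference."""
    n = len(f)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(f[j] * w ** (j * k) for j in range(n)) for k in range(n)]

def rader_dft(f, g):
    """DFT of prime length p via (6); g must generate the units modulo p."""
    p = len(f)
    fhat = [0j] * p
    fhat[0] = sum(f)                                   # zero frequency, computed directly
    f_perm = [f[pow(g, a, p)] for a in range(p - 1)]   # f'(a) = f(g^a)
    z = [cmath.exp(2j * cmath.pi * pow(g, a, p) / p) for a in range(p - 1)]
    for b in range(p - 1):
        k = pow(g, (p - 1 - b) % (p - 1), p)           # g^(-b) modulo p
        # Length-(p-1) cyclic sum over a of f'(a) * z(a - b).
        fhat[k] = f[0] + sum(f_perm[a] * z[(a - b) % (p - 1)]
                             for a in range(p - 1))
    return fhat

data = [2.0, -1.0, 0.5, 3.0, 1.0, 0.0, -2.0]           # prime length p = 7
reference = dft_direct(data)
reindexed = rader_dft(data, g=3)                       # 3 generates (Z/7Z)^x
max_diff = max(abs(u - v) for u, v in zip(reference, reindexed))
```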

Further Reading 

Brigham, E. O. 1988. The Fast Fourier Transform and Its
Applications. Englewood Cliffs, NJ: Prentice-Hall.

Cooley, J. W. 1987. The re-discovery of the fast Fourier
transform algorithm. Mikrochimica Acta 3:33-45.

Heideman, M. T., D. H. Johnson, and C. S. Burrus. 1985.
Gauss and the history of the fast Fourier transform.
Archive for History of Exact Sciences 34(3):265-77.

Maslen, D. K., and D. N. Rockmore. 2001. The Cooley-Tukey
FFT and group theory. Notices of the American
Mathematical Society 48(10):1151-61.

van Loan, C. F. 1992. Computational Frameworks for the Fast
Fourier Transform. Philadelphia, PA: SIAM.


II.11 Finite Differences


In the definition of the derivative of a real function $f$
of a real variable, $f'(x) = \lim_{\varepsilon \to 0} (f(x+\varepsilon) - f(x))/\varepsilon$,
we can take a small positive $\varepsilon = h > 0$ and form the
approximation

$$f'(x) \approx \frac{f(x+h) - f(x)}{h}.$$

This process is called discretization and the approximation
is called a forward difference because we evaluate
$f$ at a point to the right of $x$. We could instead take a
small negative $\varepsilon$, so that with $h = -\varepsilon$ we have

$$f'(x) \approx \frac{f(x-h) - f(x)}{-h} = \frac{f(x) - f(x-h)}{h}.$$

The latter approximation is a backward difference.
Higher derivatives can be approximated in a similar
fashion. An example is the centered second difference
approximation

$$f''(x) \approx \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}.$$
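
As a quick numerical illustration, the three difference quotients can be applied to $f = \sin$, whose derivatives are known. The function names and the choices $x = 1$, $h = 10^{-4}$ are assumptions made for this sketch.

```python
import math

def forward_difference(f, x, h):
    """(f(x + h) - f(x)) / h, an approximation to f'(x)."""
    return (f(x + h) - f(x)) / h

def backward_difference(f, x, h):
    """(f(x) - f(x - h)) / h, an approximation to f'(x)."""
    return (f(x) - f(x - h)) / h

def centered_second_difference(f, x, h):
    """(f(x + h) - 2 f(x) + f(x - h)) / h^2, an approximation to f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

x0, h = 1.0, 1e-4
forward_error = abs(forward_difference(math.sin, x0, h) - math.cos(x0))
backward_error = abs(backward_difference(math.sin, x0, h) - math.cos(x0))
second_error = abs(centered_second_difference(math.sin, x0, h) + math.sin(x0))
```

With this step size the one-sided errors are of order $h$, while the centered second difference is considerably more accurate; making $h$ much smaller eventually lets rounding error dominate.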

The term finite differences is used to describe such
approximations to derivatives by linear combinations
of function values. One way to derive finite-difference
approximations, and also to analyze their accuracy, is
by manipulating Taylor series [I.2 §9] expansions. A
more systematic approach is through the calculus of
finite differences, which is based on operators such as
the forward difference operator $\Delta f(x) = f(x+h) - f(x)$
and its powers: $\Delta^2 f(x) = \Delta(\Delta f(x)) = f(x+2h) - 2f(x+h) + f(x)$
and so on. The term “calculus”
is used because there are many analogies between
these operators and the differentiation operator. Finite-difference
calculus is thoroughly developed in classical
numerical analysis texts of the last century, but it is
less commonly encountered nowadays.
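
The operator viewpoint translates directly into code if $\Delta$ is treated as a function that maps a function to a function. In this small sketch the name `forward_delta` and the test function are hypothetical; applying the operator twice should reproduce the explicit formula for $\Delta^2 f$.

```python
def forward_delta(f, h):
    """Return the function x -> f(x + h) - f(x)."""
    return lambda x: f(x + h) - f(x)

def cube(x):
    return x ** 3

h = 0.5
x0 = 2.0
delta2 = forward_delta(forward_delta(cube, h), h)      # the operator applied twice
applied_twice = delta2(x0)
explicit = cube(x0 + 2 * h) - 2 * cube(x0 + h) + cube(x0)
```

The values chosen here are exactly representable in binary, so the two results agree exactly rather than merely to rounding error.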

Finite differences can be used to approximate partial
derivatives in an analogous way. For example, if $f(x,y)$
is a function of two variables, then

$$\frac{\partial f}{\partial x}(x,y) \approx \frac{f(x+h,y) - f(x,y)}{h}.$$

Finite differences are widely used in numerical methods
for solving ordinary differential equations
[IV.12] and partial differential equations [IV.13].


II.12 The Finite-Element Method


The finite-element method is a method for approximat- 
ing the solution of a partial differential equation (PDE) 
with boundary conditions over a given domain using 
piecewise polynomial approximations to the unknown 
function. The domain is partitioned into elements, typ- 
ically triangles for a two-dimensional region or tetra- 
hedrons in three dimensions, and on each element 
the solution is approximated by a low-degree polyno- 
mial. The approximations are obtained by solving a 
variational form of the PDE within the corresponding 
finite-dimensional subspace, which reduces to solving 
a sparse linear system of equations for the coefficients. 

For more on the finite-element method, see numerical
solution of partial differential equations
[IV.13 §4] and mechanics of solids [IV.32 §4.2].


II.13 Floating-Point Arithmetic

Nicholas J. Higham


The real line, $\mathbb{R}$, contains infinitely many numbers, but
in many practical situations we must work with a finite
subset of the real line. For example, computers have
a finite number of storage locations in their random
access memory, so they can represent only a finite set of
numbers, whereas in a bank savings account amounts
of money are recorded to two decimal places (dollars
and cents, for example) and may be limited to some
maximum value. In the latter situation, assuming a representation
with $n$ base-10 digits, all possible numbers
can be expressed as $\pm d_1 d_2 \dots d_{n-2}.d_{n-1}d_n$, where
the $d_i$ are integers between 0 and 9. This is called a
fixed-point number system because the decimal point
is in a fixed position, here just before the $(n-1)$st
digit. A floating-point number system differs in that an
extra multiplicative term that is a variable power of the
base allows the decimal point (or its analogue for other
bases) to move around.

[Figure 1: The ticks denote the nonnegative numbers
in the simple floating-point number system (1). Axis
labels: 0, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0.]

A simple example of a floating-point number system
is the set of numbers, with base 2,

$$x = \pm 2^e \left( \frac{d_1}{2} + \frac{d_2}{4} + \frac{d_3}{8} \right) = \pm 2^e f, \tag{1}$$

where the exponent $e \in \{-1, 0, 1, 2, 3\}$, and each binary
digit $d_i$ is either 0 or 1. The number $f = (0.d_1 d_2 d_3)_2$ is
called the significand (or mantissa). If we assume that
$d_1 \ne 0$ (hence $d_1 = 1$) then each $x$ in the system has
a unique representation (1) and the system is called
normalized. The nonnegative numbers in this system
are

0, 0.25, 0.3125, 0.3750, 0.4375, 0.5, 0.625, 0.750,
0.875, 1.0, 1.25, 1.50, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0,
5.0, 6.0, 7.0,

and they are represented pictorially in figure 1.
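
The list above can be regenerated by enumerating the digits and exponents in (1). This is a small sketch using exact rational arithmetic; the function name `toy_system` is an assumption, and zero is added by hand since it has no normalized representation.

```python
from fractions import Fraction

def toy_system():
    """Nonnegative numbers 2^e * (d1/2 + d2/4 + d3/8) with d1 = 1 and e in {-1,...,3}."""
    numbers = {Fraction(0)}                    # zero is included separately
    for e in range(-1, 4):
        for d2 in (0, 1):
            for d3 in (0, 1):
                significand = Fraction(1, 2) + Fraction(d2, 4) + Fraction(d3, 8)
                numbers.add(Fraction(2) ** e * significand)
    return sorted(numbers)

toy = toy_system()
as_floats = [float(x) for x in toy]
# Gaps between consecutive positive members of the system.
spacings = [float(b - a) for a, b in zip(toy[1:], toy[2:])]
```

Inspecting `spacings` shows the gap between neighbours doubling each time a power of 2 is crossed.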

Notice that the spacing of the numbers in this exam- 
ple increases by a factor 2 at every power of 2. It is 
a very important feature of all floating-point number 
systems that the spacing of the numbers is not con- 
stant (whereas for fixed-point systems the spacing is 
constant). 

Other floating-point number systems are obtained by 
varying the range of values that the exponent e can 
take, by varying the number of digits $d_i$ in the significand,
and by varying the base. Historically, computers
have mainly used base 16 or base 2. Pocket calculators 
instead use base 10, in order to avoid users being con- 
fused by the effects of the errors in converting from 
one base to another. 

To use a floating-point number system, $F$, for practical
computations, we need to put our data into it. Some
real numbers can be exactly represented in $F$, while others
can only be approximated. How should the conversion
from $\mathbb{R}$ to $F$ be done, and how large an error is
committed? The mapping from $\mathbb{R}$ to $F$ is called rounding
and is denoted by “fl.” The usual definition of $\mathrm{fl}(x)$,
for $x \in \mathbb{R}$, is that it is the nearest number in $F$ to $x$. A
rule is needed to break ties when $x$ lies midway between
two members of $F$; the most common rule is to take
whichever number has an even last digit in the significand.
Following this rule in our toy system (1), we have
$\mathrm{fl}(1.1) = 1.0$ and $\mathrm{fl}(5.5) = 6.0 = 2^3 (0.110)_2$. If $x$ has
magnitude greater than the magnitude of every number
in $F$ then we say $\mathrm{fl}(x)$ overflows, while if $x \ne 0$ rounds
to zero we say that $\mathrm{fl}(x)$ underflows.

Let $F$ be a floating-point system with base $\beta$ and $t$
digits $d_i$. It is possible to prove that, if $|x|$ lies between
the smallest nonzero number in $F$ and the largest number
in $F$, then $\mathrm{fl}(x) = x(1 + \delta)$ with $|\delta| \le u$, where
$u = \frac{1}{2}\beta^{1-t}$ is called the unit roundoff. This means that
the relative error in representing $x$ in $F$ is at most $u$.
For example, in our toy system we can represent $\pi$ with
relative error at most $\frac{1}{2} 2^{1-3} = \frac{1}{8}$.
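
Rounding with ties to even, and the bound $|\delta| \le u$, can both be checked in the toy system. The sketch below makes some simplifying assumptions: the names `TOY` and `fl` are hypothetical, only positive arguments in the normalized range are handled, and underflow is not modeled.

```python
import math
from fractions import Fraction

# Positive members of the toy system, stored as (value, last significand digit d3).
TOY = sorted(
    (Fraction(2) ** e * Fraction(4 + 2 * d2 + d3, 8), d3)
    for e in range(-1, 4) for d2 in (0, 1) for d3 in (0, 1)
)

def fl(x):
    """Round x to the nearest toy number; ties go to the even last digit."""
    x = Fraction(x)
    if x > TOY[-1][0]:
        raise OverflowError("x exceeds the largest number in the toy system")
    if x < TOY[0][0]:
        raise ValueError("x is below the smallest positive toy number (not modeled)")
    # Sort key: distance first, then prefer last digit 0 over 1 on a tie.
    value, _ = min(TOY, key=lambda item: (abs(item[0] - x), item[1]))
    return value

u = Fraction(1, 2) * Fraction(2) ** (1 - 3)            # unit roundoff, 1/8
rounded_pi = fl(math.pi)
pi_error = abs(rounded_pi - Fraction(math.pi)) / Fraction(math.pi)

# Largest relative error over a grid covering the range [0.25, 7].
grid = [Fraction(k, 200) for k in range(50, 1401)]
worst = max(abs(fl(x) - x) / x for x in grid)
```

On this grid the worst case is attained at midpoints just above a power of 2, such as 1.125, where the relative error is 1/9, consistent with the bound of 1/8.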

When we multiply two $t$-digit numbers $x_1, x_2 \in F$
the product has $2t - 1$ or $2t$ significant digits, and thus
in general is not itself in $F$. The best we can do is to
round the result and take that as our approximation to
the product: $y = \mathrm{fl}(x_1 x_2)$. Similarly, addition, subtraction,
and division of numbers in $F$ also generally incur
errors. If again we take the rounded version of the exact
result then we can say that

$$\mathrm{fl}(x \mathbin{\mathrm{op}} y) = (x \mathbin{\mathrm{op}} y)(1 + \delta), \quad |\delta| \le u, \tag{2}$$

for $\mathrm{op} = +, -, *, /$.

Virtually all of today’s computers implement floating-point
arithmetic in conformance with a 1985 IEEE
standard. This standard defines two forms of base-2
floating-point arithmetic: one called single precision,
with $t = 24$, and one called double precision, with
$t = 53$. In IEEE arithmetic the elementary operations +,
-, *, and / are defined to be the rounded exact operations,
and so (2) is satisfied. This makes (2) a very useful
tool for analyzing the effects of rounding errors on
numerical algorithms.
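
Model (2) can be tested empirically in double precision by comparing each floating-point result with the exact rational result. The sketch assumes that Python's `float` is an IEEE double with round-to-nearest (true on essentially all current platforms, but an assumption nonetheless); the helper name `relative_error` is hypothetical.

```python
import operator
import random
from fractions import Fraction

u = Fraction(1, 2 ** 53)          # unit roundoff for t = 53: (1/2) * 2^(1 - 53)

def relative_error(op, x, y):
    """Relative error of the floating-point result of op(x, y)."""
    exact = op(Fraction(x), Fraction(y))       # exact rational arithmetic
    computed = Fraction(op(x, y))              # rounded floating-point result
    return abs(computed - exact) / abs(exact)

random.seed(0)
operations = [operator.add, operator.sub, operator.mul, operator.truediv]
worst = Fraction(0)
for _ in range(1000):
    x, y = random.uniform(1.0, 2.0), random.uniform(1.0, 2.0)
    for op in operations:
        if op is operator.sub and x == y:
            continue                           # exact result 0: relative error undefined
        worst = max(worst, relative_error(op, x, y))
```

The operands are drawn from $[1, 2]$ so that neither overflow nor underflow can occur; within that range `worst` should never exceed `u`.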

Historically, (2) has been just a model for floating-point
arithmetic: an assumption on which analysis
was based. Prior to the widespread adoption of the
IEEE standard, different computer manufacturers used
different forms of arithmetic, some of which did not
satisfy (2) or lacked certain other desirable properties.
This led to bizarre situations such as the expression
$x/\sqrt{x^2 + y^2}$ occasionally evaluating to a floating-point
number greater than 1. Fortunately, the IEEE standard
ensures that elementary floating-point computations
retain many of the properties of arithmetic. However,
special-purpose processors, such as graphics processing
units (GPUs), do not necessarily (fully) comply with
the IEEE standard.


It is often thought that subtraction of nearly equal
floating-point numbers is dangerous because of cancellation.
In fact, the subtraction is done exactly. Indeed, if
$x$ and $y$ are floating-point numbers with $y/2 \le x \le 2y$
then $\mathrm{fl}(x - y) = x - y$ (as long as $x - y$ does not
underflow). The danger of cancellation is that it causes
a loss of significant digits and can thereby bring into
prominence errors committed in earlier computations.
A classic example is the usual formula for solving a
quadratic equation such as $x^2 - 56x + 1 = 0$, which gives
$x_1 = 28 + \sqrt{783}$ and $x_2 = 28 - \sqrt{783}$. Working to four
significant decimal digits, $\sqrt{783} = 27.98$, and therefore
the computed results are $\hat{x}_1 = 55.98 = 5.598 \times 10^1$
and $\hat{x}_2 = 0.02 = 2.000 \times 10^{-2}$. However, the correct
results to four significant digits are $x_1 = 5.598 \times 10^1$
and $x_2 = 1.786 \times 10^{-2}$. So while the larger computed
root has all four digits correct, the smaller one is
very inaccurate due to cancellation. Fortunately, since
$x^2 - 56x + 1 = (x - x_1)(x - x_2)$, the product of the
roots is 1, and so we can obtain an accurate value for
the smaller root $x_2$ from $x_2 = 1/x_1$.
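
The four-digit calculation can be replayed with Python's `decimal` module, which supports a user-chosen working precision. The variable names are assumptions made for this sketch; the precision is set locally so that the rest of a program is unaffected.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 4                                # four significant decimal digits
    sqrt_783 = Decimal(783).sqrt()              # rounded square root
    x1 = Decimal(28) + sqrt_783                 # larger root, no cancellation
    x2_naive = Decimal(28) - sqrt_783           # smaller root, suffers cancellation
    x2_stable = Decimal(1) / x1                 # uses the fact that x1 * x2 = 1
```

The subtraction that produces `x2_naive` is itself exact; the damage comes from the rounding error already present in `sqrt_783`, which is what the text means by bringing earlier errors into prominence.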

Further Reading 

Higham, N. J. 2002. Accuracy and Stability of Numerical 
Algorithms, 2nd edn. Philadelphia, PA: SIAM. 

Muller, J.-M., N. Brisebarre, F. de Dinechin, C.-P. Jeannerod,
V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres.
2010. Handbook of Floating-Point Arithmetic. Boston, MA:
Birkhäuser.


II.14 Functions of Matrices

Nicholas J. Higham 


Given the scalar function $f(x) = (x-1)/(1+x^2)$ we can
define $f(A)$ for an $n \times n$ matrix $A$ by $f(A) = (A - I)(I + A^2)^{-1}$,
as long as $A$ does not have $\pm i$ as an eigenvalue,
so that $I + A^2$ is nonsingular. This notion of “replacing
$x$ by $A$” is very natural and produces useful results. For
example, we can define the exponential of $A$ by

$$e^A = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots.$$

The resulting function satisfies analogues of the properties
of the scalar exponential, such as $e^A e^{-A} = I$,
$e^A = \lim_{s \to \infty} (I + A/s)^s$, and $(d/dt)e^{At} = Ae^{At} = e^{At}A$.
Thanks to the latter relation, the matrix exponential
plays a fundamental role in linear differential equations.
In particular, the general solution to $dy/dt = Ay$
is $y(t) = e^{At}c$, for a constant vector $c$. However, not
every scalar relation generalizes. In particular, $e^{A+B} = e^A e^B$
holds if $A$ and $B$ commute, but it does not hold in
general.
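
These properties are easy to observe on $2 \times 2$ examples. The sketch below sums a truncated power series for the exponential, which is adequate for the small matrices used here but is only an illustration (it is not how numerical libraries compute the matrix exponential); all function names are hypothetical.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def expm_series(A, terms=30):
    """Truncated series I + A + A^2/2! + ...; suitable only for small norm(A)."""
    n = len(A)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result, term = identity, identity
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, A)]   # A^k / k!
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result

A = [[0.0, 1.0], [0.0, 0.0]]
B = [[0.0, 0.0], [1.0, 0.0]]                    # A and B do not commute
minus_A = [[-v for v in row] for row in A]
A_plus_B = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

inverse_check = matmul(expm_series(A), expm_series(minus_A))   # should be I
exp_of_sum = expm_series(A_plus_B)
product_of_exps = matmul(expm_series(A), expm_series(B))
```

Here `exp_of_sum` has $\cosh(1) \approx 1.543$ in its (1,1) entry while `product_of_exps` has 2 there, so the two differ, as expected for noncommuting matrices.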

More generally, for any function with a Taylor series
expansion the scalar argument can be replaced by a
square matrix as long as the eigenvalues of the matrix
are within the radius of convergence of the Taylor
series. Thus we have

$$\cos(A) = I - \frac{A^2}{2!} + \frac{A^4}{4!} - \frac{A^6}{6!} + \cdots,$$

$$\sin(A) = A - \frac{A^3}{3!} + \frac{A^5}{5!} - \frac{A^7}{7!} + \cdots,$$

$$\log(I + A) = A - \frac{A^2}{2} + \frac{A^3}{3} - \frac{A^4}{4} + \cdots, \quad \rho(A) < 1,$$

where $\rho$ denotes the spectral radius [I.2 §20]. The
series for log raises two questions: does $X = \log(I + A)$
satisfy $e^X = I + A$ and, if so, which of the many matrix
logarithms is produced (note that if $e^X = I + A$ then
$e^{X + 2k\pi i I} = e^X e^{2k\pi i I} = I + A$ for any integer $k$)? The
answer to the first question is yes. The answer to the
second question is that the logarithm produced is the
principal logarithm, which for a matrix with no eigenvalues
lying on the nonpositive real axis is the unique
logarithm all of whose eigenvalues have imaginary
parts lying in the interval $(-\pi, \pi)$.

Defining $f(A)$ via a power series may specify the
function only for a certain range of $A$, as for the logarithm,
and moreover, some functions do not have a
(convenient) power series. For more general functions
a different approach is needed.

If $f$ is analytic on and inside a closed contour $\Gamma$ that
encloses the spectrum of $A$, then we can define

$$f(A) := \frac{1}{2\pi i} \int_\Gamma f(z)(zI - A)^{-1}\,dz,$$

which is a generalization to matrices of the Cauchy
integral formula [IV.1 §7]. Another definition can be
given in terms of the Jordan canonical form [II.22]
$Z^{-1}AZ = J = \mathrm{diag}(J_1, J_2, \dots, J_p)$, where $J_k$ is an
$m_k \times m_k$ Jordan block with eigenvalue $\lambda_k$. The definition is

$$f(A) := Zf(J)Z^{-1} = Z\,\mathrm{diag}(f(J_k))\,Z^{-1},$$

where

$$f(J_k) := \begin{bmatrix}
f(\lambda_k) & f'(\lambda_k) & \cdots & \dfrac{f^{(m_k-1)}(\lambda_k)}{(m_k-1)!} \\
 & f(\lambda_k) & \ddots & \vdots \\
 & & \ddots & f'(\lambda_k) \\
 & & & f(\lambda_k)
\end{bmatrix}.$$

This definition does not require $f$ to be analytic
but merely requires the existence of the derivatives
$f^{(j)}(\lambda_k)$ for $j$ up to one less than the size of the largest
block in which $\lambda_k$ appears. Note that when $A$ is diagonalizable,
that is, $A = ZDZ^{-1}$ for $D = \mathrm{diag}(\lambda_i)$, the
definition is simply $f(A) = Zf(D)Z^{-1}$, where $f(D) = \mathrm{diag}(f(\lambda_i))$.

The Cauchy integral and Jordan canonical form definitions
are equivalent when $f$ is analytic.

Some key properties that follow from the definitions
are that $f(A)$ commutes with $A$, $f(X^{-1}AX) = X^{-1}f(A)X$
for any nonsingular $X$, and $f(A)$ is upper
(lower) triangular if $A$ is. It can also be shown that
certain forms of identity carry over from the scalar
case to the matrix case, under assumptions that ensure
that all the relevant matrices are defined. Examples are
$\exp(iA) = \cos(A) + i\sin(A)$ and $\cos^2(A) + \sin^2(A) = I$.
However, care is needed when dealing with multivalued
functions; for example, for the principal logarithm,
$\log(e^A)$ cannot be guaranteed to equal $A$ without
restrictions on the spectrum of $A$.

Another important class of functions is the $p$th roots:
the solutions of $X^p = A$, where $p$ is a positive integer.
For nonsingular $A$ there are many $p$th roots. The one
usually required in practice is the principal $p$th root,
defined for $A$ with no eigenvalues lying on the nonpositive
real axis as the unique $p$th root whose eigenvalues
lie strictly within the wedge making an angle $\pi/p$ with
the positive real axis, and denoted by $A^{1/p}$. Thus $A^{1/2}$
is the square root whose eigenvalues all lie in the open
right half-plane.

The function $\mathrm{sign}(A) = A(A^2)^{-1/2}$, defined for any
$A$ having no pure imaginary eigenvalues, is the matrix
sign function. It has applications in control theory, in
particular for solving algebraic Riccati equations
[III.25], and corresponds to the scalar function mapping
complex numbers in the open left and right half-planes
to $-1$ and $1$, respectively.

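Matrix functions provide one-line solutions to many
problems. For example, the second-order ordinary differential
equation initial-value problem

$$\frac{d^2 y}{dt^2} + Ay = 0, \quad y(0) = y_0, \quad y'(0) = y'_0,$$

with $y$ an $n$-vector and $A$ an $n \times n$ matrix, has solution

$$y(t) = \cos(\sqrt{A}\,t)\,y_0 + (\sqrt{A})^{-1}\sin(\sqrt{A}\,t)\,y'_0,$$

where $\sqrt{A}$ denotes any square root of $A$. Alternatively,
by writing $z = \begin{bmatrix} y' \\ y \end{bmatrix}$ we can convert the problem into
two first-order differential equations:

$$z' = \begin{bmatrix} y'' \\ y' \end{bmatrix}
= \begin{bmatrix} 0 & -A \\ I & 0 \end{bmatrix} \begin{bmatrix} y' \\ y \end{bmatrix}
= \begin{bmatrix} 0 & -A \\ I & 0 \end{bmatrix} z,$$

from which follows the formula

$$\begin{bmatrix} y'(t) \\ y(t) \end{bmatrix}
= \exp\left( \begin{bmatrix} 0 & -tA \\ tI & 0 \end{bmatrix} \right)
\begin{bmatrix} y'_0 \\ y_0 \end{bmatrix}.$$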
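
The cosine and sine solution formula can be evaluated without forming $\sqrt{A}$ explicitly, because $\cos(\sqrt{A}\,t) = \sum_k (-1)^k A^k t^{2k}/(2k)!$ and $(\sqrt{A})^{-1}\sin(\sqrt{A}\,t) = \sum_k (-1)^k A^k t^{2k+1}/(2k+1)!$ involve only integer powers of $A$. The following sketch uses that observation, which is a device chosen for this illustration rather than anything prescribed by the text; the function names, the matrix, and the initial data are all assumptions.

```python
import math

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def solve_second_order(A, y0, dy0, t, terms=40):
    """Evaluate y(t) = cos(sqrt(A) t) y0 + sqrt(A)^(-1) sin(sqrt(A) t) y0' by series."""
    y = [0.0] * len(y0)
    u, v = list(y0), list(dy0)           # hold (-A)^k y0 and (-A)^k y0'
    c_even, c_odd = 1.0, t               # t^(2k)/(2k)! and t^(2k+1)/(2k+1)!
    for k in range(terms):
        y = [yi + c_even * ui + c_odd * vi for yi, ui, vi in zip(y, u, v)]
        u = [-s for s in matvec(A, u)]
        v = [-s for s in matvec(A, v)]
        c_even *= t * t / ((2 * k + 1) * (2 * k + 2))
        c_odd *= t * t / ((2 * k + 2) * (2 * k + 3))
    return y

A = [[2.0, -1.0], [-1.0, 2.0]]           # eigenvalues 1 and 3
y_series = solve_second_order(A, y0=[1.0, 0.0], dy0=[0.0, 1.0], t=1.0)

# Independent check: decouple along the eigenvectors (1, 1) and (1, -1).
r3 = math.sqrt(3.0)
mode1 = 0.5 * math.cos(1.0) + 0.5 * math.sin(1.0)
mode2 = 0.5 * math.cos(r3) - 0.5 * math.sin(r3) / r3
y_modes = [mode1 + mode2, mode1 - mode2]
```

The series is practical only when $\|A\| t^2$ is modest; for larger problems one would turn to dedicated matrix function algorithms.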


There is an explicit formula for a function of a $2 \times 2$
triangular matrix:

$$f\left( \begin{bmatrix} \lambda_1 & \alpha \\ 0 & \lambda_2 \end{bmatrix} \right)
= \begin{bmatrix} f(\lambda_1) & \alpha f[\lambda_1, \lambda_2] \\ 0 & f(\lambda_2) \end{bmatrix},$$

where

$$f[\lambda_1, \lambda_2] = \begin{cases}
\dfrac{f(\lambda_2) - f(\lambda_1)}{\lambda_2 - \lambda_1}, & \lambda_1 \ne \lambda_2, \\[2ex]
f'(\lambda_2), & \lambda_1 = \lambda_2,
\end{cases}$$

is a first-order divided difference. This formula extends
to $n \times n$ triangular matrices $T$, although the formula
for the $(i,j)$ element contains up to $2^n$ terms and so is
not computationally useful unless $n$ is very small. It is
nevertheless possible to compute $F = f(T)$ for an $n \times n$
triangular matrix in $n^3/3$ operations using the Parlett
recurrence, which is obtained by equating elements in
the equation $TF = FT$.
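
The $2 \times 2$ formula is short enough to code directly. In the sketch below (function names hypothetical) it is applied with $f = \exp$ and checked in two ways that do not rely on any other matrix function routine: via the identity $e^T = (e^{T/2})^2$ and via the commutation relation $TF = FT$.

```python
import math

def f_of_triangular_2x2(f, fprime, lam1, lam2, alpha):
    """f([[lam1, alpha], [0, lam2]]) using a first-order divided difference."""
    if lam1 != lam2:
        divided_difference = (f(lam2) - f(lam1)) / (lam2 - lam1)
    else:
        divided_difference = fprime(lam2)
    return [[f(lam1), alpha * divided_difference], [0.0, f(lam2)]]

def matmul2(A, B):
    """Product of two 2-by-2 matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

T = [[1.0, 3.0], [0.0, 2.0]]
F = f_of_triangular_2x2(math.exp, math.exp, 1.0, 2.0, 3.0)        # exp(T)
half = f_of_triangular_2x2(math.exp, math.exp, 0.5, 1.0, 1.5)     # exp(T/2)
squared = matmul2(half, half)                                     # should equal exp(T)
TF, FT = matmul2(T, F), matmul2(F, T)                             # should agree

# Repeated eigenvalue: the divided difference becomes a derivative.
F_repeated = f_of_triangular_2x2(math.exp, math.exp, 1.0, 1.0, 3.0)
```

When $\lambda_1$ and $\lambda_2$ are close but unequal the quotient suffers from cancellation, so this direct formula should be treated as illustrative only.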


Further Reading 

Higham, N. J. 2008. Functions of Matrices: Theory and 
Computation. Philadelphia, PA: SIAM. 

Higham, N. J., and A. H. Al-Mohy. 2010. Computing matrix 
functions. Acta Numerica 19:159-208. 


II.15 Function Spaces

Hans G. Feichtinger 


While in the early days of mathematics each function
was treated individually, it became appreciated that it
was more appropriate to make collective statements
for all continuous functions, all integrable functions, or
all continuously differentiable ones. Fortunately, most
of these collections of functions $(f_k)$ are closed under
addition and allow the formation of linear combinations
$\sum_{k=1}^{K} c_k f_k$ for real or complex coefficients $c_k$,
$1 \le k \le K$. They are, therefore, vector spaces. In addition,
most of these spaces are endowed with a suitable
norm [I.2 §19.3] $f \mapsto \|f\|$, allowing one to measure
the size of their members and hence to introduce
concepts of closeness by looking at the distance
$d(f_1, f_2) := \|f_1 - f_2\|$. One can therefore say that a function
space is a normed space consisting of (generalized)
functions on some domain.

For the vector space $C_b(D)$ of bounded and continuous
functions on some domain $D \subseteq \mathbb{R}^d$, the sup-norm
$\|f\|_\infty := \sup_{z \in D} |f(z)|$ is the appropriate norm. With
this norm, $C_b(D)$ is a Banach space (that is, a complete
normed space), i.e., every Cauchy sequence [I.2 §19.4]
with respect to this norm is convergent to a unique
limit element in the space. Hence, such Banach spaces
of functions share many properties with the Euclidean
spaces of vectors in $\mathbb{R}^d$, with the important distinction
that they are not finite dimensional.

1 Lebesgue Spaces $L^p(\mathbb{R}^d)$

The completeness of the Lebesgue space $L^1(\mathbb{R}^d)$, consisting
of all (measurable) functions with
$\|f\|_1 := \int_{\mathbb{R}^d} |f(z)|\,dz < \infty$, is the reason why the Lebesgue integral
is preferred over the Riemann integral. Note that
in order to ensure the property that $\|f\|_1 = 0$ implies
$f = 0$ (the null function), one has to regard two functions
$f_1$ and $f_2$ as equal if they are equal up to a set
of measure zero, i.e., if the set $\{z \mid f_1(z) \ne f_2(z)\}$ has
Lebesgue measure zero.

Another norm that is important for many applications
is the $L^2$-norm, $\|f\|_2 := \left( \int_{\mathbb{R}^d} |f(z)|^2\,dz \right)^{1/2}$. It
is related to an inner product defined as
$\langle f, g \rangle := \int_{\mathbb{R}^d} f(z)\overline{g(z)}\,dz$ via the formula $\|f\|_2 := \langle f, f \rangle^{1/2}$.
$(L^2(\mathbb{R}^d), \|\cdot\|_2)$ is a Hilbert space, and one can talk about
orthogonality and unitary linear mappings, comparable
with the situation of the Euclidean space $\mathbb{R}^d$ with its
standard inner product.

Having these three norms, namely $\|\cdot\|_1$, $\|\cdot\|_2$, and
$\|\cdot\|_\infty$, it is natural to look for norms “in between.”
This leads to the $L^p$-spaces, defined by the finiteness
of $\|f\|_p^p := \int_{\mathbb{R}^d} |f(z)|^p\,dz$ for $1 \le p < \infty$. The limiting
case for $p \to \infty$ is the space $L^\infty(\mathbb{R}^d)$ of essentially
bounded functions.
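
The way the $p$-norms interpolate between $\|\cdot\|_1$ and $\|\cdot\|_\infty$ can be seen numerically. The sketch below makes simplifying assumptions not taken from the text: the domain is the interval $[0, 1]$ rather than $\mathbb{R}^d$, the integral is approximated by the midpoint rule, and the names `lp_norm` and `ramp` are hypothetical. For $f(z) = z$ the exact value is $\|f\|_p = (p+1)^{-1/p}$, which tends to the sup-norm 1 as $p \to \infty$.

```python
def lp_norm(f, p, a=0.0, b=1.0, n=20000):
    """Midpoint-rule approximation of (integral over [a, b] of |f|^p)^(1/p)."""
    h = (b - a) / n
    total = sum(abs(f(a + (i + 0.5) * h)) ** p for i in range(n)) * h
    return total ** (1.0 / p)

def ramp(z):
    return z

norm_1 = lp_norm(ramp, 1)        # exact value 1/2
norm_2 = lp_norm(ramp, 2)        # exact value 1/sqrt(3)
norm_50 = lp_norm(ramp, 50)      # exact value 51^(-1/50), already close to 1
```

On an interval of length 1 these norms increase with $p$; on $\mathbb{R}^d$ no such ordering holds in general.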

Since these spaces are not finite dimensional, it is
necessary to work with the set of all bounded linear
functionals, the so-called dual space, which is often, but
not always, a function space. For $1 \le p < \infty$ the dual
space to $L^p$ is $L^q$, with $1/p + 1/q = 1$, meaning that any
continuous linear functional on $L^p(\mathbb{R}^d)$ has the form
$f \mapsto \int_{\mathbb{R}^d} f(z)g(z)\,dz$ for a unique function $g \in L^q(\mathbb{R}^d)$.

$L^1(\mathbb{R}^d)$ also appears as the natural domain for the
Fourier transform, given for $s \in \mathbb{R}^d$ by

$$\mathcal{F}: f \mapsto \hat{f}(s) = \int_{\mathbb{R}^d} f(t) \exp\left( -2\pi i \sum_{j=1}^{d} s_j t_j \right) dt,$$

while $(L^2(\mathbb{R}^d), \|\cdot\|_2)$ allows us to describe $\mathcal{F}$ as a
unitary (and hence isometric) automorphism.

2 Related Function Spaces

The Lebesgue spaces are prototypical for a much larger
class of Banach function spaces or Banach lattices, a
class that also includes Lorentz spaces $L(p, q)$ or Orlicz
spaces $L^\Phi$, which are all rearrangement invariant. This
means that for any transformation $\alpha: \mathbb{R}^d \to \mathbb{R}^d$ that has
the property that it preserves the (Lebesgue) measure
$|M|$ of a set, i.e., with $|\alpha(M)| = |M|$, one has
$\|f\| = \|\alpha^*(f)\|$, where $\alpha^*(f)(z) := f(\alpha(z))$.

In contrast, weighted spaces such as $L^p_w(\mathbb{R}^d)$, characterized
by $\|f\|_{p,w} = \|fw\|_p < \infty$, allow us to capture
the decay of $f$ at infinity using some strictly positive
weight function $w$. For applications in the theory of partial
differential equations (PDEs), polynomial weights
such as $w_s(x) = (1 + |x|^2)^{s/2}$ are important. For $s \ge 0$,
Sobolev spaces $\mathcal{H}^s(\mathbb{R}^d)$ can be defined as inverse
images of $L^2_{w_s}(\mathbb{R}^d)$ under the Fourier transform. For
$s \in \mathbb{N}$ they consist of those functions that have (in a
distributional sense) $s$ derivatives in $L^2(\mathbb{R}^d)$. Mixed norm
$L^p$ spaces (using different $p$-norms in different directions)
are also not invariant in this sense, but they are
still very useful.

A large variety of function spaces arose out of
the attempt to characterize smoothness, including
fractional differentiability. Examples are Besov spaces
$B^s_{p,q}(\mathbb{R}^d)$ and Triebel-Lizorkin spaces $F^s_{p,q}(\mathbb{R}^d)$; the classical
Sobolev spaces are the only function spaces that
belong to both families. The origin of this theory is
in the theory of Lipschitz spaces $\mathrm{Lip}(\alpha)$, where the
range $\alpha \in (0, 1)$ allows us to express the degree of
smoothness (differentiability corresponds intuitively to
the case $\alpha = 1$).

3 Wavelets and Modulation Spaces 

Many of the spaces mentioned above are highly rel- 
evant for PDEs, e.g., the description of elliptic PDEs. 
Their characterization using Paley-Littlewood (dyadic 
Fourier) decompositions has ignited wavelet theory. For 
1 < p < oo they can be characterized via (weighted) 
summability conditions of their wavelet coefficients 
with respect to (sufficiently “good”) mother wavelets. 
In the limiting case, one obtains the real Hardy space
$H^1(\mathbb{R}^d)$ and its dual, the BMO-space, which consists
of functions of bounded mean oscillation. Both spaces
are important for the study of Calderón-Zygmund
operators or the Hardy-Littlewood maximal operator.
Wavelets provide unconditional bases for these spaces, 
including Besov and potential spaces. 


For the affine “ax + b”-group acting on the space
$L^2(\mathbb{R}^d)$, function spaces are defined using the continuous
wavelet transform, and atomic characterizations
(involving Banach frames) of the above smoothness
spaces are obtained. Alternatively, the Schrödinger
representation of the Heisenberg group, again on $L^2(\mathbb{R}^d)$
via time-frequency shifts, gives rise to the family
of modulation spaces $M^s_{p,q}$. They were introduced as
Wiener amalgam spaces on the Fourier transform side,
using uniform partitions of unity (instead of dyadic
ones).

Using engineering terminology, the now-classical
spaces $M^s_{p,q}(\mathbb{R}^d)$ are characterized by the behavior
of the short-time Fourier transform of their members
(replacing the continuous wavelet transform). They
play an important role in time-frequency analysis, and
their atomic characterizations use Gabor expansions.

A variety of Banach spaces of analytic or polyana- 
lytic functions play an important role in complex analy- 
sis. Again, integrability conditions over their domain 
are typically used to define these spaces. The corre- 
sponding L 2 -spaces are typically reproducing kernel 
Hilbert spaces, with good localization of these kernels 
allowing one to view them as continuous mappings on 
(weighted, mixed-norm) L p -spaces as well. We mention 
some of the spaces that are important in the context of 
complex analysis or Toeplitz operators: Fock spaces, 
Bergman spaces, and Segal-Bargmann spaces. 

4 Variations of the Theme 

One of the first important examples of a Banach space 
of functions was the space BV of functions of bounded 
variation. One simple characterization of functions of 
this type (the so-called Jordan decomposition) is that 
they are the difference of two bounded and nonde- 
creasing functions (the ascending part of the function 
minus the descending part of it). Via Fourier-Stieltjes 
integrals, F. Riesz showed that there is a one-to-one correspondence
between the dual space of $(C[0, 1], \|\cdot\|_\infty)$
and BV[0, 1] endowed with the variation norm. More
recently, total variation in a two-dimensional setting 
has been fundamental to image restoration algo- 
rithms [VII.8]. Another family of function spaces that 
captures variation at different scales are the Morrey- 
Campanato spaces. 

In addition to Banach spaces of functions there are
also topological vector spaces and Fréchet spaces of
functions, among them the spaces of test functions that
are used in distribution theory. Generalized functions
consist of the continuous linear functionals on such
spaces. The “theory of function spaces” as developed
by Hans Triebel includes a large variety of Banach
spaces of such generalized functions (or distributions).

Further Reading 

Ambrosio, L., N. Fusco, and D. Pallara. 2000. Functions 
of Bounded Variation and Free Discontinuity Problems. 
Oxford: Clarendon. 
sis. Boston, MA: Birkhauser.

Leoni, G. 2009. A First Course in Sobolev Spaces. Providence, 
RI: American Mathematical Society. 

Meyer, Y. 1992. Wavelets and Operators, translated by D. H.
Salinger. Cambridge: Cambridge University Press.

Stein, E. M. 1970. Singular Integrals and Differentiability 
Properties of Functions. Princeton, NJ: Princeton Univer- 
sity Press. 

Triebel, H. 1983. Theory of Function Spaces. Basel: Birk- 


II.16 Graph Theory

Timothy A. Davis and Yifan Hu 


At the back of most airline magazines you will find 
a map of airports and the airline routes that connect 
them. This is just one example of a graph, a widely 
used mathematical entity that represents relationships 
between discrete objects. More precisely, a graph $G = (V, E)$
consists of a set of nodes $V$ and a set of edges
$E \subseteq \{(i,j) \mid i, j \in V\}$ that connect them. A graph is not
a diagram but it can be drawn, as illustrated in figure 1.

Graphs arise in a vast array of applications, includ- 
ing social networks (a node is a person and an edge is a 
relationship between two people), computational fluid 
dynamics (a node is an unknown such as the pressure 
at a certain point and an edge is the physical connec- 
tion between two unknowns), finding things on the web 
(a node is a web page and an edge is a link), circuit sim- 
ulation (the wires are the edges), economics (a node is a 
financial entity and the edges represent trade between 
two entities), and many others. 

In some problems, an edge connects in both direc- 
tions, and in this case the graph is undirected. For exam- 
ple, friendship is mutual, so if Alice and Bob are friends, 
the edges (Alice, Bob) and (Bob, Alice) are the same. In 
other cases, the direction of the edge is important. If 
Alice follows Bob on Twitter, this does not mean that 
Bob follows Alice. In this directed graph, the edge (Alice, 
Bob) is not the same as the edge (Bob, Alice). 


[Figure 1: Two example graphs: (a) undirected and (b) directed.]

In a simple graph, an edge (i, j) can appear just once, 
but in a multigraph it can appear multiple times (E 
becomes a multiset). Simple graphs do not have self- 
edges (i, i), but a pseudograph can have multiple edges 
and self-edges. The airline route map in the back of 
the magazine is an example of a simple undirected 
graph. Representing each flight for a whole airline 
would require a directed multigraph: the flight from 
Philadelphia to New York is not the same as the flight 
in the opposite direction, and there are many flights 
each day between the two airports. If sightseeing tours 
are added (self-edges), then a pseudograph would be 
needed. 

The adjacency set of a node i, also called its neigh- 
bors, is the set of nodes j where edge (i,j) is in the 
graph. For a directed graph, this is the out-adjacency;
the in-adjacency of node i is the set $\{j \mid (j,i) \in E\}$. A
graph can be represented as a binary adjacency matrix,
with entries $a_{ij} = 1$ if $(i,j) \in E$, and $a_{ij} = 0$ otherwise.
The degree of a node is the size of its adjacency set. 
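
As a minimal sketch of these definitions, the Python snippet below builds the out-adjacency and in-adjacency sets and the binary adjacency matrix of a small directed graph. The edge list is a made-up example (it is not the graph of figure 1), and the function names are hypothetical helpers chosen for illustration.

```python
# Hypothetical directed graph on nodes 1..4 (not taken from figure 1).
nodes = [1, 2, 3, 4]
edges = {(1, 2), (2, 3), (2, 4), (4, 2)}

def out_adjacency(i, edges):
    """Nodes j such that the edge (i, j) is in the graph."""
    return {j for (u, j) in edges if u == i}

def in_adjacency(i, edges):
    """Nodes j such that the edge (j, i) is in the graph."""
    return {j for (j, v) in edges if v == i}

def adjacency_matrix(nodes, edges):
    """Binary matrix with a[r][c] = 1 if (nodes[r], nodes[c]) is an edge."""
    return [[1 if (u, v) in edges else 0 for v in nodes] for u in nodes]

A = adjacency_matrix(nodes, edges)
out_degree_2 = len(out_adjacency(2, edges))  # size of the out-adjacency set
in_degree_2 = len(in_adjacency(2, edges))    # size of the in-adjacency set
```

Each row sum of the matrix equals the out-degree of the corresponding node, and each column sum equals its in-degree.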

Graphs can contain infinite sets of nodes and edges. 
Consider the directed graph on the natural numbers
$\mathbb{N}$ with the edges (i,j), where j is an integer multiple
of i. A prime number j > 1 in this graph has in-adjacency
{1, j} and an in-degree of 2 (including the
self-edge (j, j)); a composite number j > 1 has a larger
in-degree.
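
Although the graph is infinite, the in-adjacency of any single node is finite and can be listed directly. The following sketch (with hypothetical function names) computes it by trial division and uses it to test primality exactly as described above.

```python
def in_adjacency(j):
    """In-adjacency of node j: all i with an edge (i, j), i.e., all divisors of j."""
    return {i for i in range(1, j + 1) if j % i == 0}

def is_prime(j):
    """For j > 1, j is prime exactly when its in-adjacency is {1, j}."""
    return j > 1 and in_adjacency(j) == {1, j}
```

For instance, node 12 has in-adjacency {1, 2, 3, 4, 6, 12}, so its in-degree is 6.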

Nodes i and j are incident on the edge (i,j) and, 
likewise, the edge (i,j) is incident on its two nodes. 
A subgraph of G consists of a subset of its nodes and
edges, $\bar{G} = (\bar{V}, \bar{E})$, where $\bar{V} \subseteq V$ and $\bar{E} \subseteq E$. If an edge
(i, j) appears in $\bar{E}$, then its two incident nodes must
also appear in $\bar{V}$, but the opposite need not hold. Two
special kinds of subgraphs are node-induced and edge-induced
subgraphs. A node-induced subgraph starts
with a subset of nodes $\bar{V}$; the edges $\bar{E}$ are all those
edges whose two incident nodes are both in $\bar{V}$. An edge-induced
subgraph starts with a subset of edges $\bar{E}$ and
then $\bar{V}$ consists of all nodes incident on those edges. A
graph is completely connected if it has an edge between 
every pair of nodes. A clique is a completely connected 
subgraph. 
A path from i to j is a list of nodes (i, ..., j) with
edges between adjacent pairs of nodes. The path can- 
not traverse a directed edge backward. The length of 
the path is the number of nodes in the list minus one. 
In a simple path, a node can appear only once. If there is 
a path from i to j, then node j is reachable from node 
i. The set of all nodes reachable from i is the reach of 
i. Among all paths from i to j, one with the shortest
length is a shortest path, and its length is the (geodesic) distance
from i to j. The diameter of a graph is the length
of the longest possible shortest path. In a small-world 
graph, each node is a small distance (logarithmic in the 
number of nodes) away from any other node. 
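
Distances, reach, and diameter in an unweighted graph can all be computed by breadth-first search. The sketch below assumes a small undirected graph stored as a dictionary of adjacency sets; the graph itself is a made-up example rather than the one in figure 1, and the function names are illustrative.

```python
from collections import deque

# Hypothetical undirected graph stored as adjacency sets (not figure 1).
graph = {
    1: {2},
    2: {1, 3, 4},
    3: {2, 4},
    4: {2, 3, 5},
    5: {4},
}

def distances_from(graph, source):
    """Breadth-first search: geodesic distance from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def reach(graph, source):
    """All nodes reachable from source (the source counts via the trivial path)."""
    return set(distances_from(graph, source))

def diameter(graph):
    """Length of the longest shortest path; assumes the graph is connected."""
    return max(max(distances_from(graph, s).values()) for s in graph)
```

In this example the longest shortest path joins nodes 1 and 5, so the diameter is 3.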

An undirected graph is connected if there is a path 
between each pair of nodes, but there are two kinds 
of connectivity in a directed graph. If a directed path exists
from each node to every other node, then the directed graph
is strongly connected. A directed graph is weakly connected
if its underlying undirected graph is connected;
to obtain such a graph, all edge directions are dropped. 

A cycle is a path that starts and ends at the same node 
i; the cycle is simple if no node is repeated (except for 
node i itself). There are no cycles in an acyclic graph. 
The acronym DAG is often used for a directed acyclic 
graph. 

The undirected graph in figure 1(a) is connected. 
Nodes {2, 3, 4} form a clique, as do {2, 4, 6}. The path
(1, 2, 4, 3, 2, 6) has length 5 and is not simple. A simple
path from 1 to 6 is (1, 2, 4, 6) of length 3, but the
shortest path is (1, 2, 6) of length 2, which traverses
the edges (1, 2) and (2, 6). The path (2, 3, 4, 2) is a
cycle of length 3. Node 2 has degree 4, with neighbors
{1, 3, 4, 6}. The diameter of the graph is 3. Since the
graph is connected, the reach of node 2 is the whole 
graph. This graph is the underlying undirected graph 
of the directed graph in part (b) of the figure. 

The largest clique in this directed graph has only two 
nodes: {2, 4}. The out-adjacency of node 2 is the set
{3, 4} and its in-adjacency is {1, 4, 6}. The reach of node
2 is {2, 3, 4, 6}. The graph is not strongly connected
since there is no path from 1 to 5, but it is weakly 
connected since its underlying undirected graph is 
connected. 

Figure 2 illustrates a bipartite graph. The nodes of 
a bipartite graph are partitioned into two sets, and 
no edge in the graph is incident on a pair of nodes 
in the same partition. Bipartite graphs arise naturally 
when modeling a relationship between two very dif- 
ferent sets. For example, in term/document analysis, 
a bipartite graph of m terms and n documents has an
edge (i,j) if term i appears in document j. No edge
connects two terms, nor two documents. An undirected
bipartite graph is often represented as a rectangular
$m \times n$ adjacency matrix, where $a_{ij} = 1$ if the edge (i,j)
appears in the graph and $a_{ij} = 0$ otherwise.

Figure 3 A tree of height 4, with node 1 as the root.

An undirected acyclic graph is a forest. An important 
special case is a tree, which is a connected forest. In a 
tree, there is a unique simple path between each pair 
of nodes. In a rooted tree, one node is designated as 
the root. The ancestors of node i are all the nodes on 
the path from i to the root (excluding i itself). The first 
node after i in this path is the parent of i, and node 
i is its child. The length of this path is the level of the 
node (the root has level zero). The height of a tree is 
the maximum level of its nodes. 

In a tree, all nodes except the root have a single par- 
ent. Nodes can have any number of children, and a node 
with no children is a leaf. Internal nodes have at least 
one child. In a binary tree, nodes have at most two chil- 
dren, and in a full binary tree, all internal nodes have 
exactly two children. Node i is a descendant of all nodes 
in the path from i to the root (excluding i itself). The 
subtree rooted at node i is the subgraph induced by 
node i and its descendants. 

In the example in figure 3, the parent of node 5 is 
2, its descendants are {8,9,10,11}, its ancestors are 
{1,2}, and its children are {8,9}. Since node 2 has three 
children, the tree is not binary. 

Sometimes a graph with its nodes and edges is not 
enough to fully represent a problem. Edges in a graph 
do not have a length, but this is useful for the airline 
route map, and thus nodes and edges are often aug- 
mented with additional data. Attaching a single numer- 
ical value to each node and/or edge is common; this 
results in a weighted graph. In this graph, the length of
a path is the sum of the weights of its edges. 

Drawing a graph requires node positions, so they 
must be computed for graphs without natural node 
positions. In the force-directed method that created 
the graphs in plates 1-4, nodes are given an electrical 
charge and edges become springs. A low-energy state 
is found, which often leads to a visually pleasing layout 
that reveals the graph’s large-scale structure. 

Be aware that there are many minor variations in the 
terminology of graph theory. For example, nodes can be 
called vertices, edges are also called arcs, and self-edges 
are sometimes called loops. Directed graphs are also 
called digraphs. Sometimes the term arc is restricted to 
directed edges. In a common alternative terminology, a 
path becomes a walk, a simple path becomes a path 
(conflicting with the definition here), a trail is a walk 
with no repeated edges, and a cycle becomes a closed 
walk. 

II.17 Homogenization


Homogenization is a method for obtaining an equation 
whose solution approximates the solution of a partial 
differential equation (PDE) with rapidly varying coef- 
ficients. In many cases the approximate equation is 
a PDE, but in some it may be an integro-differential 
equation. PDEs with rapidly varying coefficients arise 
in the study of the response of composite materials. 
The method is best illustrated by an example in heat 
conduction. Let $u(x_1, x_2)$ represent temperature in a
composite medium occupying a domain $\Omega$. The composite
is modeled by a conductivity $a > 0$ that is
rapidly oscillating in both $x_1$ and $x_2$. To represent this
mathematically, the conductivity is written as

\[ a(x_1/\varepsilon,\, x_2/\varepsilon), \]

where the function $a(y_1, y_2)$ is unit periodic, i.e.,
$a(y_1 + 1, y_2) = a(y_1, y_2)$ and $a(y_1, y_2 + 1) =
a(y_1, y_2)$. Therefore, one can see that $\varepsilon$ represents
the periodicity of the medium and is small when the
medium is rapidly oscillating. The governing equation
is

\[ \nabla \cdot (a \nabla u) = f, \]

where $f$ represents the heat source.

Homogenization, which is based on multiscale asymptotics,
allows one to obtain an approximate equation
for the heat conduction problem when $\varepsilon \ll 1$. The
resulting equation,

\[ \nabla \cdot (A \nabla u) = f, \]

is referred to as the homogenized equation and has a 
constant, albeit anisotropic (A may be a matrix), con- 
ductivity. The conductivity in the homogenized equa- 
tion in this particular case is easily identified as A and 
is referred to as the effective property of the medium. 

Periodicity is essential in deriving the homogenized 
equation. When the medium is not periodic, e.g., when 
it is random, the method of homogenization may 
still be applied, although its interpretation is differ- 
ent and more difficult. In this case there is no sim- 
ple way to obtain the effective property of the ran- 
dom medium, but a connection to effective medium 
theories [IV.31] can be made.

For periodic media, homogenization can be applied 
to PDEs that model other phenomena such as fluid 
flow in porous media, elasticity, wave propagation, 
vibration, electrostatics, and electromagnetics. Recent 
research in homogenization has explored its use in sit- 
uations where the medium is almost periodic, that is, in 
PDEs where the coefficients are oscillatory with almost 
constant period or where the coefficients are periodic 
with the exception of a few “defect” regions. 
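
A one-dimensional analogue of the heat conduction example can be checked numerically. The sketch below is an illustration under stated assumptions, not the method of the article: it takes the hypothetical unit-periodic conductivity a(y) = 2 + sin(2πy), solves -(a(x/ε)u')' = 1 on (0, 1) with u(0) = u(1) = 0 (that is, f = -1 in the notation above) by direct quadrature, and compares the result with the constant-coefficient problem. It relies on the standard fact that in one dimension the effective conductivity is the harmonic mean of a over one period, which here equals √3.

```python
import math

def conductivity(y):
    """Hypothetical unit-periodic conductivity a(y) > 0."""
    return 2.0 + math.sin(2.0 * math.pi * y)

def solve_oscillatory(x, eps, n=100_000):
    """u(x) for -(a(x/eps) u')' = 1 on (0, 1), u(0) = u(1) = 0.

    Integrating once gives a u' = c - s, so u(x) = int_0^x (c - s)/a(s/eps) ds,
    with c fixed by u(1) = 0. Integrals use the midpoint rule on n cells.
    """
    h = 1.0 / n
    mids = [(k + 0.5) * h for k in range(n)]
    inv_a = [1.0 / conductivity(s / eps) for s in mids]
    c = sum(s * w for s, w in zip(mids, inv_a)) / sum(inv_a)
    return h * sum((c - s) * w for s, w in zip(mids, inv_a) if s < x)

def solve_homogenized(x, A):
    """u(x) for -A u'' = 1 on (0, 1), u(0) = u(1) = 0."""
    return x * (1.0 - x) / (2.0 * A)

# Harmonic and arithmetic means of a(y) over one period (midpoint rule).
m = 10_000
A_eff = 1.0 / (sum(1.0 / conductivity((k + 0.5) / m) for k in range(m)) / m)
A_arith = sum(conductivity((k + 0.5) / m) for k in range(m)) / m

u_eps = solve_oscillatory(0.5, eps=0.01)
u_hom = solve_homogenized(0.5, A_eff)      # homogenized prediction
u_naive = solve_homogenized(0.5, A_arith)  # prediction from the arithmetic mean
```

The homogenized value agrees closely with the oscillatory solution, whereas simply averaging the conductivity does not, which is why the effective property has to be derived rather than guessed.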


II.18 Hybrid Systems

Paul Glendinning 


A hybrid system is a system that combines both contin- 
uous and discrete variables. A simple motivating exam- 
ple is the thermostat. A thermostat switches a heater 
to one of two discrete states, on or off, depending 
on the ambient temperature (a continuous variable), 
which evolves differently if the heater is in the on or 
off states. The specification of the thermostat includes 
transition rules. If the thermostat is in the off state 
and the temperature falls below a critical value, then 
the thermostat is switched to on, possibly with some 
(deterministic or random) delay in time. If the thermo- 
stat is in the on state and the temperature rises above 
a critical value, then the thermostat is switched to off,
possibly with some (deterministic or random) delay in 
time. 

This type of system, for which the value of a discrete
variable determines the rules of evolution of continuous
variables, and with jumps in the discrete variables
according to another rule, provides a formal description
of many systems in computer science, control
theory, and dynamics. The bouncing ball is another sim- 
ple example and the reader may like to think about how 
this fits into the formalism described below. 

There are many different ways a hybrid system could 
be defined, and the choice of definition will depend on 
the application area or the level of generality that is 
required. However, the basic structure is as follows. A 
hybrid system’s state is determined by two variables,
a continuous variable $x$ on some manifold $M$ (often
$\mathbb{R}^n$) and a discrete variable $d \in \{1, 2, \dots, D\}$. For each
$d$ there is a domain $M_d$ in $M$ and an evolution equation
that determines the time evolution of $x$. This may
be a differential equation or a stochastic differential
equation, for example. Therefore, given $d$ there is a
well-defined rule for the evolution of $x$ in a subset of $M$.

The next part of the specification determines how the 
discrete variable can change. For each pair $e = (d_0, d_1)$
there is a set $G_e \subseteq M_{d_0}$, possibly empty, such that
if $x \in G_e$ then there is a finite probability that the
dynamics will be instantaneously reset so that the discrete
variable becomes $d_1$ and the continuous variable
becomes $R(x, e)$ for some reset map $R : M \times D \times D \to M$.
After this, the dynamics continues in the new domain
$M_{d_1}$ according to the evolution rule for $d_1$ until the next
transition is made.
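
The thermostat can be written in this form with two modes, a guard set for each transition, and an identity reset map. The following sketch is a deterministic forward-Euler simulation; the cooling law, all numerical values, and the function names are assumptions made for illustration only.

```python
def simulate_thermostat(x0=18.0, d0=0, t_end=200.0, dt=0.01,
                        low=19.0, high=21.0):
    """Forward-Euler simulation of a thermostat as a hybrid system.

    State: continuous temperature x and discrete mode d (0 = off, 1 = on).
    All numerical values are illustrative assumptions.
    """
    ambient, rate, heating = 10.0, 0.1, 2.0

    def flow(x, d):
        # Evolution rule for x in mode d: Newton cooling, plus heating when on.
        return -rate * (x - ambient) + (heating if d == 1 else 0.0)

    def guard(x, d):
        # Guard sets: off -> on below `low`, on -> off above `high`.
        return (d == 0 and x < low) or (d == 1 and x > high)

    def reset(x, d):
        # Reset map: the temperature is unchanged, only the mode flips.
        return x, 1 - d

    x, d, t = x0, d0, 0.0
    history, switches = [(t, x, d)], 0
    while t < t_end:
        if guard(x, d):
            x, d = reset(x, d)
            switches += 1
        x += dt * flow(x, d)
        t += dt
        history.append((t, x, d))
    return history, switches

history, switches = simulate_thermostat()
```

With these values the temperature settles into an oscillation between the two thresholds, overshooting each by at most one Euler step before the mode changes.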

Of course, for this specification to make sense a num- 
ber of consistency conditions must be satisfied, e.g., 
$R(x, e) \in M_{d_1}$, and it is possible that the continuous
variable may leave the region $M_d$ on which its dynamics
is defined before a transition is made, in which
case the dynamics becomes undefined. There are therefore
many checks that need to be made to ensure that,
given an initial condition in some set $\mathrm{Init} \subseteq M \times D$,
the solution is well defined. In the deterministic case 
uniqueness may also be an issue. 

A solution with an infinite sequence of transitions in 
finite time is called a Zeno solution. In computer sci- 
ence, where the transitions are executions in some pro- 
gram, this is not useful. In other contexts, however, the 
infinite sequence may have physical significance. Chat- 
tering in mechanical impacts can be modeled as a Zeno 
phenomenon. 
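
The bouncing ball gives a simple Zeno solution: with a coefficient of restitution below one, the flight times between impacts form a geometric series, so infinitely many impacts accumulate at a finite time. The sketch below assumes a ball dropped from rest with illustrative parameter values; the function names are hypothetical.

```python
import math

def impact_times(height=1.0, restitution=0.5, gravity=9.81, count=40):
    """Times of the first `count` impacts of a ball dropped from rest.

    Flow between impacts: free fall. Reset at impact: speed v -> restitution * v.
    Parameter values are illustrative assumptions.
    """
    t = math.sqrt(2.0 * height / gravity)   # time of the first impact
    v = gravity * t                         # speed at the first impact
    times = [t]
    for _ in range(count - 1):
        v *= restitution                    # reset map applied at the impact
        t += 2.0 * v / gravity              # flight time until the next impact
        times.append(t)
    return times

def zeno_time(height=1.0, restitution=0.5, gravity=9.81):
    """Accumulation point of the impact times (sum of the geometric series)."""
    t0 = math.sqrt(2.0 * height / gravity)
    return t0 * (1.0 + restitution) / (1.0 - restitution)

times = impact_times()
limit = zeno_time()
```

The impact times increase monotonically toward the finite limit, which is the time after which a simulation that processes one transition per step can never advance.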


Hybrid systems are used to examine reachability and 
verification problems in computer science. The reach- 
ability problem is that of determining all the possible 
states that can be reached from a given set of initial 
conditions. This is also relevant to control theory 
[IV.34], where it might be important to know that a 
given control system keeps a car on the road or avoids 
collisions. The existence of transitions or jumps makes 
hybrid systems useful in models of mechanical impacts 
as well (see slipping, sliding, rattling, and impact 
[VI.15]). Control systems with thresholds can also be 
modeled as hybrid systems. 

The existence of transitions means that many classic 
results for smooth dynamical systems, such as stability 
theorems, are made considerably harder to prove, but 
it could be argued that the piecewise-smooth nature of 
these models is a much more generally applicable fea- 
ture for modern applications in electronics and biology 
than the smooth dynamical systems approach that has 
dominated so much of dynamical systems theory. 

Further Reading 

Hristu-Varsakelis, D., and W. S. Levine, eds. 2005. Handbook
of Networked and Embedded Control Systems. Boston, MA:
Birkhäuser.

Lygeros, J. 2004. Lecture Notes on Hybrid Systems. (Digital
resource.)


II.19 Integral Transforms and Convolution


An integral transform is an operator $\mathcal{J}$ mapping a
function $f$ of a real variable to another function $\mathcal{J}f$
according to

\[ (\mathcal{J}f)(s) = \int_a^b K(s,t)\, f(t)\, dt, \]

where $K$ is called the kernel of the transformation.
There are many different integral transforms, depending
on the choice of $a$, $b$, and $K$. They are used to transform
one problem into another, the intent being that
the new problem is simpler than the original one. Having
solved the new problem, the solution $f$ is obtained
by applying the appropriate inverse operator, $\mathcal{J}^{-1}$; this
is often another integral transform.

Important special cases include the Fourier transform

\[ (\mathcal{F}f)(s) = \int_{-\infty}^{\infty} f(t)\, e^{\mathrm{i}st}\, dt \]

and the Laplace transform

\[ (\mathcal{L}f)(s) = \int_0^{\infty} f(t)\, e^{-st}\, dt. \]

This definition of $\mathcal{L}$ is standard, but the definition of
$\mathcal{F}$ is one of many; for example, some authors insert an
extra factor $(2\pi)^{-1/2}$, and some use $e^{-\mathrm{i}st}$. Always check
the author’s definition when reading a book or article
in which Fourier transforms are used!

Many integral transforms have an associated convolution;
given two functions of a real variable, $f$ and
$g$, their convolution is another function of a real variable,
denoted by $f * g$. It is defined so that $\mathcal{J}(f * g) =
(\mathcal{J}f)(\mathcal{J}g)$; the transform of the convolution is the
product of the transforms. For the Fourier transform

\[ (f * g)(t) = \int_{-\infty}^{\infty} f(t - s)\, g(s)\, ds, \]

while for the Laplace transform

\[ (f * g)(t) = \int_0^t f(t - s)\, g(s)\, ds. \]

It is easy to see that, in both cases, $f * g = g * f$. Convolution
is an important operation in signal processing
[IV.35] and in many applications involving Fourier
analysis and integral equations [IV.4].

There are also discrete versions of integral trans- 
forms in which the integral is replaced by a finite sum 
of terms. The discrete Fourier transform is especially 
important because it can be computed rapidly using 
the fast Fourier transform [II.10] (FFT).
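
The convolution property carries over to the discrete Fourier transform, with the integral replaced by a circular (periodic) sum. The sketch below checks this numerically on two short made-up sequences. It uses a naive transform written out from the definition, and it assumes the sign convention with a negative exponent, which is one of the several conventions mentioned above.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform (an FFT gives the same result faster)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * s * t / n) for t in range(n))
            for s in range(n)]

def circular_convolution(f, g):
    """(f * g)[t] = sum over s of f[(t - s) mod n] * g[s]."""
    n = len(f)
    return [sum(f[(t - s) % n] * g[s] for s in range(n)) for t in range(n)]

# Hypothetical sample sequences of equal length.
f = [1.0, 2.0, 0.0, -1.0]
g = [0.5, 0.0, 3.0, 1.0]

lhs = dft(circular_convolution(f, g))          # transform of the convolution
rhs = [a * b for a, b in zip(dft(f), dft(g))]  # product of the transforms
max_diff = max(abs(a - b) for a, b in zip(lhs, rhs))
```

The two sides agree to rounding error, and the same check confirms that the discrete convolution is commutative.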


II.20 Interval Analysis

Warwick Tucker 


Interval analysis is a calculus based on set-valued math- 
ematics. In its simplest (and by far most popular) form, 
it builds upon interval arithmetic, which is a natural 
extension of real-valued arithmetic. Despite its sim- 
plicity, this kind of set-valued mathematics has a very 
wide range of applications in computer-aided proofs 
for continuous problems. In a nutshell, interval arith- 
metic enables us to bound the range of a continuous 
function, i.e., it produces a set enclosing the range of 
a given function over a given domain. This, in turn, 
enables us to prove mathematical statements that use 
open conditions, such as strict inequalities, fixed-point 
theorems, etc. 

1 Interval Arithmetic 

In this section we will briefly describe the fundamentals 
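of interval arithmetic. Let $\mathbb{IR}$ denote the set of closed
intervals of the real line. For any element $\boldsymbol{a} \in \mathbb{IR}$, we
use the notation $\boldsymbol{a} = [\underline{a}, \overline{a}]$. If $\star$ is one of the operators
$+$, $-$, $\times$, $/$, we define arithmetic on elements $\boldsymbol{a}, \boldsymbol{b} \in \mathbb{IR}$
by

\[ \boldsymbol{a} \star \boldsymbol{b} = \{ a \star b : a \in \boldsymbol{a},\ b \in \boldsymbol{b} \}, \tag{1} \]

except that $\boldsymbol{a}/\boldsymbol{b}$ is undefined if $0 \in \boldsymbol{b}$. Working exclusively
with closed intervals, the resulting interval can be
expressed in terms of the endpoints of the arguments.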
This makes the arithmetic very easy to implement in 
software. 

Note that a generic element in $\mathbb{IR}$ has no additive
or multiplicative inverse. For example, we have
$[1,2] - [1,2] = [-1,1] \neq [0,0]$, and $[1,2]/[1,2] =
[\tfrac{1}{2}, 2] \neq [1,1]$. This is known as the dependency problem,
and it can cause large overestimations. In practice,
however, the use of high-order (e.g., Taylor series)
representations greatly mitigates this problem.
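
The endpoint formulas are short enough to write out in full. The class below is a minimal sketch of interval arithmetic on closed intervals; the class name and layout are illustrative, and rounding errors are deliberately ignored, so it is not a rigorous implementation. It reproduces the two dependency examples given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; rounding errors are ignored in this sketch."""
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The extreme products always occur at endpoint combinations.
        p = [self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi]
        return Interval(min(p), max(p))

    def __truediv__(self, other):
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("a/b is undefined if 0 lies in b")
        return self * Interval(1.0 / other.hi, 1.0 / other.lo)

a = Interval(1.0, 2.0)
difference = a - a   # Interval(lo=-1.0, hi=1.0), not [0, 0]
quotient = a / a     # Interval(lo=0.5, hi=2.0), not [1, 1]
```

The overestimation arises because the two occurrences of the same interval are treated as independent quantities.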

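A key feature of interval arithmetic is that it is inclusion
monotonic, i.e., if $\boldsymbol{a} \subseteq \boldsymbol{a}'$ and $\boldsymbol{b} \subseteq \boldsymbol{b}'$, then by (1) we
have

\[ \boldsymbol{a} \star \boldsymbol{b} \subseteq \boldsymbol{a}' \star \boldsymbol{b}'. \]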

This is of fundamental importance: it says that, if we 
can enclose the arguments, we can enclose the result. 

More generally, when we extend a real-valued function
$f$ to an interval-valued one $F$, we demand that it
satisfies the inclusion principle

\[ \operatorname{range}(f; \boldsymbol{x}) = \{ f(x) : x \in \boldsymbol{x} \} \subseteq F(\boldsymbol{x}). \tag{2} \]

If this can be arranged for a finite set of standard func- 
tions, then the inclusion principle will also hold for 
any elementary function constructed by arithmetic and 
composition applied to the set of standard functions. 

Multivariate functions can be handled by working
componentwise on interval vectors (boxes) $\boldsymbol{x} =
(\boldsymbol{x}_1, \dots, \boldsymbol{x}_n)$.

When implementing interval arithmetic on a computer,
the endpoints must be floating-point numbers
[II.13]. This introduces rounding errors, which
must be properly dealt with. As an example, interval
addition becomes

\[ \boldsymbol{a} + \boldsymbol{b} = [\nabla(\underline{a} + \underline{b}),\ \Delta(\overline{a} + \overline{b})]. \]

Here, $\nabla(x)$ is the largest floating-point number no
greater than $x$, and $\Delta(x) = -\nabla(-x)$. The IEEE standard
for floating-point computations guarantees that
this type of outward rounding preserves the inclusion 
principle for +, -, x, and /. For other operations (such 
as trigonometric functions) there are no such assur- 
ances; interval extensions of these functions must be 
built from scratch. 
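
Outward rounding can be imitated without changing the hardware rounding mode. The sketch below is an assumption-laden substitute for directed rounding: it computes each endpoint in the default round-to-nearest mode and then steps one floating-point number outward with `math.nextafter` (available in Python 3.9 and later). The result can be slightly wider than the tightest enclosure, but it still contains the exact sum; the function name is hypothetical.

```python
import math
from fractions import Fraction

def add_outward(a, b):
    """Enclose [a_lo, a_hi] + [b_lo, b_hi] with endpoints moved outward.

    Each endpoint is computed with round-to-nearest and then shifted by one
    floating-point number, so the exact sum of the endpoints is enclosed.
    """
    lo = math.nextafter(a[0] + b[0], -math.inf)
    hi = math.nextafter(a[1] + b[1], math.inf)
    return (lo, hi)

x = (0.1, 0.1)   # degenerate interval at the float nearest 0.1
y = (0.2, 0.2)   # degenerate interval at the float nearest 0.2
lo, hi = add_outward(x, y)
exact = Fraction(0.1) + Fraction(0.2)   # exact rational sum of the two floats
```

Exact rational arithmetic is used here only to confirm that the enclosure holds for this example.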
Figure 1 Successively tighter enclosures of a graph.


2 Interval Analysis 

The inclusion principle (2) enables us to capture contin- 
uous properties of a function, using only a finite num- 
ber of operations. Its most important use is to explic- 
itly bound discretization errors that naturally arise in 
numerical algorithms. 

As an example, consider the function $f(x) = \cos^3 x +
\sin x$ on the domain $\boldsymbol{x} = [-5, 5]$. For any decomposition
of the domain $\boldsymbol{x}$ into a finite set of subintervals
$\boldsymbol{x} = \bigcup_{i=1}^{n} \boldsymbol{x}_i$, we can form the set-valued graph
consisting of the pairs $(\boldsymbol{x}_1, F(\boldsymbol{x}_1)), \dots, (\boldsymbol{x}_n, F(\boldsymbol{x}_n))$. As
the partition is made finer (that is, as $\max_i \operatorname{diam}(\boldsymbol{x}_i)$ is
made smaller), the set-valued graph tends to the graph
of $f$ (see figure 1). And, most importantly, every such
set-valued graph contains the graph of $f$.

This way of incorporating the discretization errors 
is extremely useful for quadrature, optimization, and 
equation solving. As one example, suppose we wish to 
compute the definite integral $I = \int_0^8 \sin(x + e^x)\, dx$.

A MATLAB function simpson that implements a simple
textbook adaptive Simpson quadrature algorithm
produces the following result.

% Compute integral I with tolerance 1e-6.

>> I = simpson(@(x) sin(x + exp(x)), 0, 8)

I =

0.251102722027180

A (very naive) set-valued approach to quadrature is to
enclose the integral $I$ via

\[ I \in \sum_{i=1}^{n} F(\boldsymbol{x}_i)\, \operatorname{diam}(\boldsymbol{x}_i), \]


which, for a sufficiently fine partition, produces the 
integral enclosure 

\[ I \in 0.3474001726\ldots \]

Thus, it turns out that the result from simpson was 
completely wrong! This is one example of the impor- 
tance of rigorous computations. 
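
The naive enclosure is easy to reproduce in outline. The Python sketch below is an illustration, not a rigorous computation: it ignores floating-point rounding, uses a hand-written range function for the sine, and takes a uniform partition whose size is an arbitrary assumption. Because x + e^x is increasing, its range over a subinterval is given by its endpoint values, so the interval extension F is exact here.

```python
import math

def sin_range(lo, hi):
    """Range of sin over [lo, hi] (floating-point rounding ignored)."""
    s_lo, s_hi = sorted((math.sin(lo), math.sin(hi)))
    two_pi = 2.0 * math.pi
    # A maximum at pi/2 + 2k*pi inside [lo, hi] gives the upper bound 1.
    if math.floor((hi - math.pi / 2) / two_pi) >= math.ceil((lo - math.pi / 2) / two_pi):
        s_hi = 1.0
    # A minimum at -pi/2 + 2k*pi inside [lo, hi] gives the lower bound -1.
    if math.floor((hi + math.pi / 2) / two_pi) >= math.ceil((lo + math.pi / 2) / two_pi):
        s_lo = -1.0
    return s_lo, s_hi

def F(lo, hi):
    """Interval extension of f(x) = sin(x + exp(x)) on [lo, hi]."""
    return sin_range(lo + math.exp(lo), hi + math.exp(hi))

def enclose_integral(a=0.0, b=8.0, n=200_000):
    """Sum of F(x_i) * diam(x_i) over a uniform partition of [a, b]."""
    h = (b - a) / n
    total_lo = total_hi = 0.0
    for i in range(n):
        f_lo, f_hi = F(a + i * h, a + (i + 1) * h)
        total_lo += f_lo * h
        total_hi += f_hi * h
    return total_lo, total_hi

I_lo, I_hi = enclose_integral()
```

Even this coarse enclosure, which is far wider than the one quoted above, already excludes the value returned by the Simpson routine.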

3 Recent Developments 

There is currently an ongoing effort within the IEEE 
community to standardize the implementation of inter- 
val arithmetic. The hope is that we will enable com- 
puter manufacturers to incorporate these types of com- 
putations at the hardware level. This would remove 
the large computational penalty incurred by repeatedly
having to switch rounding modes, a task that
central processing units were not designed to perform
efficiently.
II.21 Invariants and Conservation Laws

Mark R. Dennis 

As important as the study of change in the mathemati- 
cal representation of physical phenomena is the study 
of invariants. Physical laws often depend only on the 
relative positions and times between phenomena, so 
certain physical quantities do not change, i.e., they are
invariant, under continuous translation or rotation of
the spatial axes. Furthermore, as the spatial configura- 
tion of a system evolves with time, quantities such as 
total energy may remain unchanged; that is, they are 
conserved. The study of invariants has been a remarkably
successful approach to the mathematical formulation
of physical laws, and the study of continuous
symmetries and conservation laws (which are related
by the result known as Noether’s theorem) has become
a systematic part of our description of physics over the
last century, from the atomic scale to the cosmic scale.

As an example of so-called Galilean invariance, New- 
ton’s force law keeps the same form when the velocity 
of the frame of reference (i.e., the coordinate system 
specified by x-, y-, and z-axes) is changed by adding 
a constant; this is equivalent to adding the same constant
velocity to all the particles in a mechanical system.
Other quantities do change under such a velocity
transformation, such as the kinetic energy $\tfrac{1}{2} m |v|^2$
(for a particle of mass $m$ and velocity $v$); however, for
an evolving, nondissipative system such as a bouncing,
perfectly elastic rubber ball, the total energy is constant
in time, that is, energy is conserved.

The development of our understanding of fundamen- 
tal laws of dynamics can be interpreted by progres- 
sively more sophisticated and general representations 
of space and time themselves: ancient Greek physi- 
cal science assumed absolute space with a privileged 
spatial point (the center of the Earth), through static 
Euclidean space where all spatial points are equivalent,
through classical mechanics [IV.19] where
all inertial frames, moving at uniform velocity with 
respect to each other, are equivalent according to 
Newton’s first law, to the modern theories of spe- 
cial and general relativity. The theory of relativity 
(both general and special) is motivated by Einstein’s 
principle of covariance, which is described below. 
In this theory, space and time in different frames 
of reference are treated as coordinate systems on a 
four-dimensional pseudo-Riemannian manifold (whose
mathematical background is described in tensors and
manifolds [II.33]), which manifestly combines conservation
laws and continuous geometric symmetries of
space and time. In special relativity (described in some
detail in this article), this manifold is flat Minkowski 
space-time, generalizing Euclidean space to include 
time in a physically natural way. In general relativity, 
described in detail in general relativity and cos- 
mology [IV.40], this manifold may be curved, depend- 
ing in part on the distribution of matter and energy 
according to Einstein’s field equations [III.10].

In quantum physics, the description of a system in 
terms of a complex vector in Hilbert space gives rise 
to new symmetries. An important example is the fact 


that physical phenomena do not depend on the overall 
phase (argument) of this vector. Extension of Noether’s 
theorem here leads to the conservation of electric 
charge, and extension to Yang-Mills theories provides 
other conserved quantities associated with the nuclear 
forces studied in contemporary fundamental particle 
physics. Other phenomena, such as the Higgs mecha- 
nism (leading to the Higgs boson recently discovered 
in high-energy experiments), are a consequence of the 
breaking of certain quantum symmetries in certain low- 
energy regimes. Symmetry and symmetry breaking in 
quantum theory are discussed briefly at the end of this 
article. 

Spatial vectors, such as $r = (x, y, z)$, represent the
spatial distance between a chosen point and the origin,
and of course the vector between two such points
$r_2 - r_1$ is independent of translations of this origin.
Similarly, the scalar product $r_2 \cdot r_1$ is unchanged under
rotation of the coordinate system by an orthogonal
matrix $R$, under which $r \to Rr$.

Continuous groups of transformations such as translation
and rotation, and their matrix representations,
are an important tool used in calculations of invariants.
For example, the set of two-dimensional matrices
$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$, representing rotations through angles $\theta$,
may be considered as a continuous one-parameter Abelian
group of matrices generated by the matrix exponential
[II.14] $e^{\theta A}$, where $A$ is the generator $\begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$.
The generator itself is found as the derivative of the
original matrix with respect to $\theta$, evaluated at $\theta = 0$.
Translations are less obviously represented by matrices;
one approach is to append an extra dimension to
the position vector with unit entry, such as $(1, x)$ specifying
one-dimensional position $x$; a translation by $X$ is
thus represented by

\[ \begin{pmatrix} 1 & 0 \\ X & 1 \end{pmatrix} = \exp\left( X \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix} \right). \tag{1} \]

When a physical system is invariant under a one- 
parameter group of transformations, the correspond- 
ing generator plays a role in determining the associated 
conservation law. 
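
Both exponentials can be verified numerically. The sketch below sums the power series of the matrix exponential for 2×2 matrices; the helper names and the sample values of θ and X are assumptions. The rotation generator reproduces the rotation matrix, while the translation generator is nilpotent, so its series terminates after the linear term.

```python
import math

def matmul(P, Q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def expm(M, terms=30):
    """Matrix exponential of a 2x2 matrix by its truncated power series."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        # term becomes M^k / k!
        term = [[entry / k for entry in row] for row in matmul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return result

theta, X = 0.7, 2.5                               # illustrative values
rotation = expm([[0.0, -theta], [theta, 0.0]])    # exp(theta * A)
translation = expm([[0.0, 0.0], [X, 0.0]])        # exp(X * N), with N nilpotent
```

Applying the translation matrix to the column vector (1, x) returns (1, x + X), as required.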


1 Mechanics in Euclidean Space 

It is conventional in classical mechanics to define the 
positions of a set of interacting particles in a vector 
space. However, we do not observe any unique origin 
to the three-dimensional space we inhabit, which we 
therefore take to be the Euclidean space $\mathbb{E}^3$; only relative
positions between different interacting subsystems
(i.e., positions relative to the common center of mass)
enter the equations of motion. The entire system may 
be translated in space without any effect on the phe- 
nomena. 

Of course, external forces acting on the system may 
prevent this, such as a rubber ball in a linear gravity 
field. (In such situations the source of the force, such 
as the Earth as the source of gravity, is not considered 
part of the system.) The gravitational force may be represented
by a potential $V = gz$ for height $z$ and gravitational
acceleration $g$; the ball’s mass $m$ times the negative
gradient, $-m\nabla V$, gives the downward force acting
on the ball. The contours of $V$, given by $z = \text{const.}$,
nevertheless have a symmetry: they are invariant to
translations of the horizontal coordinates $x$ and $y$.
Since the gradient of the potential is proportional to
the gravitational force (which, by virtue of Newton’s
law, equals the rate of change of the particle’s linear
momentum), the horizontal component of momentum
does not change and is therefore conserved even when
the particle bounces due to an impulsive, upward force
from the floor. The continuous, horizontal translational
symmetry of the system therefore leads to conservation
of linear momentum in the horizontal plane. In a
similar argument employing Newton’s laws in cylindrical
polar coordinates, the invariance of the potential to
rotations about the z-axis leads to the conservation of
the vertical component of a body’s angular momentum,
as observed for tops spinning frictionlessly.

2 Noether’s Theorem 

The Lagrangian framework for mechanics (classical
mechanics [IV.19 §2]), which describes systems acting
under forces defined by gradients of potentials (as in
the previous section), is a natural mathematical setting
in which to explore the connection between a system’s
symmetries and its conservation laws. Here, a mechanical
system evolving in time $t$ is described by $n$ generalized
coordinates $q_j(t)$ and their time derivatives $\dot{q}_j$,
for $j = 1, \dots, n$, where the initial values $q_j(t_0)$ at time
$t_0$ and final values $q_j(t_1)$ at $t_1$ are fixed. The action of
the system is the functional

\[ S[\{q_j\}] = \int_{t_0}^{t_1} L\, dt, \]

where $L(\{q_j\}, \{\dot{q}_j\}, t)$ is the Lagrangian; this is a function
of the coordinates, their corresponding velocities,
and maybe time, specified here by the total kinetic
energy minus the total potential energy of the system


(thereby capturing the forces as gradients of the poten- 
tial energy). Using the calculus of variations [IV.6], 
the functions qj(t) that satisfy the laws of mechanics 
are those that make the action stationary, and these 
satisfy Lagrange’s equations of motion 


\[ \frac{\partial L}{\partial q_j} - \frac{d}{dt} \frac{\partial L}{\partial \dot{q}_j} = 0, \qquad j = 1, \dots, n. \tag{2} \]


The argument of the time derivative in this expression,
$\partial L/\partial \dot{q}_j$, is called the canonical momentum $p_j$ for each
$j$. The set of equations (2) involves the combination
of partial derivatives of the Lagrangian with respect to
the coordinates and velocities, together with the total
derivative with respect to time. By the chain rule, this
total derivative affects explicit time dependence in $L$
and the implicit time dependence in each $q_j$ and $\dot{q}_j$.
Many of the conservation laws involving Lagrangians 
involve such an interplay of explicit and implicit time 
dependence. 

Any transformation of the coordinates qj that does 
not change the Lagrangian is a symmetry of the system. 
If L does not have explicit dependence on a coordinate 
$q_j$, then the first term in (2) vanishes: $dp_j/dt = 0$, i.e.,
the corresponding canonical momentum is conserved
in time. In the example from the last section of a particle
in a linear gravitational field, the coordinates can
be chosen to be Cartesian $x, y, z$, or cylindrical polars
$r, \phi, z$; $L$ is independent of $x$ and $y$, leading to conservation
of horizontal momentum, and also $\phi$, leading
to conservation of angular momentum about the
z-axis. The theorem is proved for symmetries of this
type in classical mechanics [IV.19 §2.3]: if a system 
is homogeneous in space (translation invariant), then 
linear momentum is conserved, and if it is isotropic 
(independent of rotations, such as the Newtonian gravi- 
tational potential around a massive point particle exert- 
ing a central force), then angular momentum is con- 
served (equivalent to Kepler's second law of planetary 
motion for gravity). 

Since it is the equations (2) that represent the phys- 
ical laws rather than the form of L or S, the system 
may admit a more general kind of symmetry whose 
transformation adds a time-dependent function to the 
Lagrangian $L$. If, under the transformation, the Lagrangian
transforms as $L \to L + dA/dt$, involving the total time
derivative of some function $A$, the action transforms as
$S \to S + A(t_1) - A(t_0)$. Thus the transformed action
is still made stationary by functions satisfying (2), so
transformations of this kind are symmetries of the system,
which are also continuous if $A$ also depends on a
continuous parameter $s$ so that its time derivative is
zero when $s = 0$. It is not then difficult to see that the
quantity

\[ \sum_{j=1}^{n} p_j \left.\frac{\partial q_j}{\partial s}\right|_{s=0} - \left.\frac{\partial A}{\partial s}\right|_{s=0}, \tag{3} \]


defined in terms of the generators of the transforma- 
tion on each coordinate and the Lagrangian, is con- 
stant in time. This is Noether’s theorem for classical 
mechanics. 

An important example is when the Lagrangian has
no explicit dependence on time, $\partial L/\partial t = 0$. In this
case, under an infinitesimal time translation $t \to t + \delta t$,
$L \to L + \delta t\, dL/dt$, so here $A$ is $L\,\delta t$, with $L$ evaluated at $t$,
and $\delta t$ plays the role of $s$. Under the same infinitesimal
transformation, $q_j \to q_j + \delta t\, \dot{q}_j$, so the relevant conserved
quantity (3) is $\sum_j p_j \dot{q}_j - L$, which is the Hamiltonian
of the system, which is equal to the total energy
in many systems of interest. It is apparently a funda- 
mental law of physics that the total energy in physical 
processes is conserved in time; energy can be in other 
forms such as electromagnetic, gravitational, or heat, 
as well as mechanical. Noether’s theorem states that 
the law of conservation of energy is equivalent to the 
fact that the physical laws of the system, characterized 
by their Lagrangian, do not change with time. 
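
As a numerical illustration of this statement (a minimal sketch that is not part of the original text), consider a plane pendulum. The Lagrangian L = (1/2) m l^2 w^2 + m g l cos(theta), the parameter values, and the fourth-order Runge-Kutta integrator are all assumptions chosen for the example. Since this L has no explicit time dependence, the quantity p w - L, with p = dL/dw, should stay constant along a computed trajectory up to integration error.

```python
import math

# Illustrative (assumed) pendulum parameters: mass, length, gravity.
M, ELL, G = 1.0, 1.0, 9.81


def lagrangian(theta, omega):
    """Kinetic minus potential energy; no explicit time dependence."""
    return 0.5 * M * ELL**2 * omega**2 + M * G * ELL * math.cos(theta)


def hamiltonian(theta, omega):
    """Noether quantity p*omega - L, with p the canonical momentum."""
    p = M * ELL**2 * omega
    return p * omega - lagrangian(theta, omega)


def rhs(theta, omega):
    """Lagrange's equation for this L: theta'' = -(g/l) sin(theta)."""
    return omega, -(G / ELL) * math.sin(theta)


def rk4_step(theta, omega, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = rhs(theta, omega)
    k2 = rhs(theta + 0.5 * dt * k1[0], omega + 0.5 * dt * k1[1])
    k3 = rhs(theta + 0.5 * dt * k2[0], omega + 0.5 * dt * k2[1])
    k4 = rhs(theta + dt * k3[0], omega + dt * k3[1])
    theta += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
    omega += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return theta, omega


def energy_drift(theta0=1.0, omega0=0.0, dt=1e-3, steps=5000):
    """Largest deviation of the Hamiltonian from its initial value."""
    theta, omega = theta0, omega0
    h0 = hamiltonian(theta, omega)
    worst = 0.0
    for _ in range(steps):
        theta, omega = rk4_step(theta, omega, dt)
        worst = max(worst, abs(hamiltonian(theta, omega) - h0))
    return worst


if __name__ == "__main__":
    print("max |H(t) - H(0)| =", energy_drift())
```

The Lagrangian itself varies along the trajectory; only the combination p w - L is conserved, and the reported drift should shrink as dt is reduced.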

The vanishing of the action functional’s integrand 
(i.e., the Lagrangian L) is equivalent to the existence 
of a first integral for the system of Lagrange equa- 
tions, which is interpreted in the mechanical setting 
as a constant of the motion of the system. In this 
sense, Noether’s theorem may be applied more gener- 
ally in other physical situations described by function- 
als whose physical laws are given by the corresponding 
Euler-Lagrange equations. In the case of the Lagran- 
gian approach applied to fields (i.e., functions of space 
and time), Noether’s theorem generalizes to give a continuous
density $\rho$ (such as mass or charge density)
and a flux vector $J$ satisfying the continuity equation
$\dot{\rho} + \nabla \cdot J = 0$ at every point in space and time.


3 Galilean Relativity 

Newton’s first law of motion can be paraphrased as “all 
inertial frames, traveling at uniform linear velocity with 
respect to each other, are equivalent for the formula- 
tion of mechanics”— that is, without action of external 
forces, a system will behave in the same way regardless 
of the motion of its center of mass. The behavior of a 
mechanical system is therefore independent of its over- 
all velocity; this is a consequence of Newton’s second 


law, that force is proportional to acceleration. Accord- 
ing to pre-Newtonian physics, forces were thought to be 
proportional to velocity (as the effect of friction was not 
fully appreciated), and it was not until Galileo’s thought 
experiments in friction-free environments that the pro- 
portionality of force to acceleration was appreciated. In 
spite of Galilean invariance, problems involving circu- 
lar motion do in fact seem to require a privileged frame 
of reference, called absolute space. One example due to 
Newton himself is the problem of explaining, without 
absolute space, the meniscus formed by the surface of 
water in a spinning bucket; such problems are properly 
overcome only in general relativity. 

With Galilean relativity, absolute position is no longer 
defined: events occurring at the same position but at 
different times in one frame (such as a moving train car- 
riage) occur at different positions in other frames (such 
as the frame of the train track). However, changes to the 
state of motion, i.e., accelerations, have physical conse- 
quences and are related to forces. This is an example of 
a covariance principle, whose importance for physical 
theories was emphasized by Einstein. According to this 
principle, from the statement of physical laws in one 
frame of reference (such as the laws of motion), one 
can derive their statement in a different frame of refer- 
ence from the application of the appropriate transfor- 
mation rule between reference frames. The statement 
in the new frame should have the same mathematical 
form as in the previous frame, although quantities may 
not take the same values in different frames. 

Transformations between different inertial frames 
are represented mathematically in a similar way to the 
translations of (1); events are labeled by their positions
in space and time, such as $(t, x)$ in one frame and
$(t, x')$ in another moving at velocity $v$ with respect to
the first. Since $x' = x - vt$, the transformation from
$(t, x)$ to $(t, x')$ is represented by the matrix $\begin{pmatrix} 1 & 0 \\ -v & 1 \end{pmatrix}$.
This Galilean transformation (or Galilean boost) differs
from (1) in that time t is here appended to the posi- 
tion vector, since the translation from the boost is time- 
dependent. Galilean boosts in three spatial dimensions, 
together with regular translations and rotations, define 
the Galilean group. It can be shown that the Lagrangian 
of a free particle follows directly from the covariance 
of the corresponding action under the Galilean group. 

Infinitesimal velocity boosts generate a Noetherian 
symmetry on systems of particles interacting via forces 
that depend only on the positions of the others. Consider
$N$ point particles of mass $m_k$ and position $r_k$ such
that the potential energy $V$ depends only on $r_k - r_\ell$ for $k, \ell = 1, \ldots, N$. Under
an infinitesimal boost by $\delta v$, for each $k$, $r_k \to r_k + t\,\delta v$,
$\dot{r}_k \to \dot{r}_k + \delta v$, and

\[ L \to L + \tfrac{1}{2}|\delta v|^2 \sum_k m_k + \delta v \cdot \sum_k m_k \dot{r}_k. \]

The resulting Noetherian conserved quantity (3) is

\[ C = t \sum_k m_k \dot{r}_k - \sum_k m_k r_k. \tag{4} \]

This quantity is constant in time; its value is minus the
product of the system’s total mass with the position of
its center of mass when $t = 0$. Indeed, $\dot{C} = t \sum_k m_k \ddot{r}_k$,
that is, $t$ times the sum of the forces on each of the $N$
particles, which is zero by assumption.
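
The conservation of this boost quantity can be checked numerically. The sketch below is an illustration under stated assumptions, not the article's own construction: it takes two particles on a line joined by a linear spring (so the force depends only on the difference of their positions), with made-up masses, stiffness, and initial data, and integrates with the velocity Verlet scheme. It evaluates C = t * sum(m_k * v_k) - sum(m_k * r_k) at every step.

```python
def accelerations(r, m, stiffness=2.0):
    """Spring force depends only on r[0] - r[1]; forces are equal and opposite."""
    force_on_0 = -stiffness * (r[0] - r[1])
    return [force_on_0 / m[0], -force_on_0 / m[1]]


def boost_invariant(t, r, v, m):
    """C = t * (total momentum) - (total mass) * (center-of-mass position)."""
    momentum = sum(mk * vk for mk, vk in zip(m, v))
    mass_moment = sum(mk * rk for mk, rk in zip(m, r))
    return t * momentum - mass_moment


def simulate(steps=2000, dt=1e-3):
    """Integrate with velocity Verlet and return C at every time step."""
    m = [1.0, 3.0]      # assumed masses
    r = [0.0, 1.5]      # assumed initial positions
    v = [0.4, -0.1]     # assumed initial velocities
    t = 0.0
    values = [boost_invariant(t, r, v, m)]
    a = accelerations(r, m)
    for _ in range(steps):
        v = [vk + 0.5 * dt * ak for vk, ak in zip(v, a)]   # half kick
        r = [rk + dt * vk for rk, vk in zip(r, v)]          # drift
        a = accelerations(r, m)
        v = [vk + 0.5 * dt * ak for vk, ak in zip(v, a)]   # half kick
        t += dt
        values.append(boost_invariant(t, r, v, m))
    return values


if __name__ == "__main__":
    c = simulate()
    print("C at start:", c[0], " C at end:", c[-1])
```

With these initial data the value is minus the total mass times the initial center-of-mass position, and it stays fixed even though the individual positions and velocities oscillate.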


4 Special Relativity 


Historically, the physics of fields, such as electromagnetic
radiation described by Maxwell’s equations
[III.22], was developed much later than the mechanics
of Galileo and Newton. The simplest wave equations, in 
fact, suggest a different invariant space-time from the 
Galilean framework above, which was first investigated 
by Hendrik Lorentz and Albert Einstein at the turn of 
the twentieth century. 

Consider the time-dependent wave equation $\nabla^2 \varphi - c^{-2}\ddot{\varphi} = 0$
for some scalar function $\varphi(t, r)$ and $\nabla^2$ the
Laplace operator [III.18]. If all observers agree on the
same value of the wave speed $c$ (as proposed by Einstein
for $c$ the speed of light, and verified by experiments),
this suggests that the transformation between
moving inertial frames should be in terms of Lorentz
transformations (or Lorentz boosts) in the $x$-direction,

\[ \begin{pmatrix} ct' \\ x' \end{pmatrix} = \gamma(v) \begin{pmatrix} 1 & -v/c \\ -v/c & 1 \end{pmatrix} \begin{pmatrix} ct \\ x \end{pmatrix}, \]

where (ct',x') denotes position in space and time (mul- 
tiplied by the invariant speed c) in a frame moving 
along the x-axis at velocity v with respect to the frame 
with space-time coordinates (ct,x), and 

\[ \gamma(v) = (1 - v^2/c^2)^{-1/2}. \]

Requiring $\gamma(v)$ to be finite, positive, and real-valued
implies $|v| < c$. According to the Lorentz transformation
with $x = 0$, $t' = \gamma(v)t$, suggesting that $\gamma(v)$ may
be interpreted as the rate of flow of time in the transformed
frame moving at speed $v$ with respect to time
in the reference frame. 
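
A short sketch may make the boost concrete. It assumes the standard form of a Lorentz boost along the x-axis, ct' = gamma (ct - (v/c) x) and x' = gamma (x - (v/c) ct), with units in which c = 1 by default; the function names are illustrative. It checks the time-dilation factor at x = 0 and that the combination x^2 - (ct)^2 takes the same value in both frames.

```python
import math


def gamma(v, c=1.0):
    """Lorentz factor (1 - v^2/c^2)^(-1/2); requires |v| < c."""
    return 1.0 / math.sqrt(1.0 - (v / c) ** 2)


def lorentz_boost(ct, x, v, c=1.0):
    """Coordinates (ct', x') in a frame moving at velocity v along x."""
    g = gamma(v, c)
    beta = v / c
    return g * (ct - beta * x), g * (x - beta * ct)


def interval(ct, x):
    """Quadratic form x^2 - (ct)^2, unchanged by a boost."""
    return x * x - ct * ct


if __name__ == "__main__":
    ct_p, x_p = lorentz_boost(1.0, 0.0, 0.6)
    print("event (ct, x) = (1, 0) boosted by v = 0.6c:", ct_p, x_p)
    print("interval before:", interval(1.0, 0.0), "after:", interval(ct_p, x_p))
```

Boosting by v and then by -v returns the original coordinates, which is one quick consistency check on the signs.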

According to Einstein’s theory of special relativity, 
all physical laws should be invariant to translations, 
rotations, and Lorentz boosts, and Galilean transforma- 
tions and Newton’s laws arise in the low-velocity limit. 


Rotations and Lorentz boosts, together with time reflection
and spatial inversion, define the Lorentz group;
the Poincaré group is the semidirect product group of
the Lorentz group with space-time translations. Special
relativity is therefore also based on the principle of
covariance, but with a different set of transformations
between frames of reference. All of the familiar results
of classical mechanics [IV.19] are recovered when
the boost transformations are restricted to $v \ll c$.

In special relativity different observers measure dif- 
ferent time intervals between events, as well as differ- 
ent spatial separations. For a relativistic particle, differ- 
ent observers will disagree on the particle’s relativistic 
energy E and momentum p. Nevertheless, according to 
the principle of special relativity, all observers agree on 
the quantity 

\[ E^2 - |p|^2 c^2 = m^2 c^4, \]

so $m$ is an invariant on which all observers agree, called
the rest mass of the particle. In the frame in which the
particle is at rest, $E = mc^2$ arises as a special case.
Otherwise, $E = m\gamma(v)c^2$, so in special relativity the
moving particle’s energy is explicitly related to the flow 
of time in the particle’s rest frame with respect to that 
in the reference frame. 
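
The invariance of the rest mass can be verified directly for a particle moving along one axis. This is a small sketch under an assumption not stated in the text, namely the standard relativistic momentum p = m gamma(v) v; the energy E = m gamma(v) c^2 is the expression given above.

```python
import math


def energy_momentum(m, v, c=1.0):
    """Return (E, p) for rest mass m and speed v, assuming p = m*gamma*v."""
    g = 1.0 / math.sqrt(1.0 - (v / c) ** 2)
    return g * m * c**2, g * m * v


def invariant_mass(energy, p, c=1.0):
    """Recover m from E^2 - p^2 c^2 = m^2 c^4."""
    return math.sqrt(energy**2 - (p * c) ** 2) / c**2


if __name__ == "__main__":
    for speed in (0.0, 0.3, 0.9):
        e, p = energy_momentum(2.0, speed)
        print(speed, e, p, invariant_mass(e, p))
```

Observers moving at different speeds relative to the particle compute different E and p, but the same m.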

The Lagrangian formalism can be generalized to special
relativity, with all observers agreeing on the form
of the Euler-Lagrange equations; instead of being equal
to its kinetic energy, the Lagrangian of a free particle is
$-mc^2/\gamma(v)$.

The set of space-time events $(ct, r)$ described by
special relativity is known as Minkowski space or
Minkowski space-time (discussed in tensors and manifolds
[II.33]). It is a flat four-dimensional manifold
with one time coordinate and three space coordinates 
(i.e., a manifold with a 3 + 1 pseudo-Euclidean metric), 
and the inertial frames with perpendicular spatial axes 
are analogous to Cartesian coordinates in Euclidean 
space. Time intervals and (simultaneous) spatial distances
between events are no longer separately invariant;
only the space-time interval

\[ s^2 = |\Delta r|^2 - c^2 \Delta t^2 \]

between two events with spatial separation $\Delta r$ and time
separation $\Delta t$ is invariant to Lorentz transformations
and hence takes the same value in all inertial frames.
$s^2$ may be positive, zero, or negative: if positive, there
is a frame in which the pair of events are simultaneous;
if zero, the events lie on the trajectory of a light ray; if
negative, there is a frame in which the events occur at
the same position. Since nothing can travel faster than
$c$, only events with $s^2 \le 0$ can be causally connected;
the events that may be affected by a given space-time 
point are those within its future light cone. 

Since Minkowski space is a manifold, its symmetries
are easily characterized, and its continuous symmetries
are generated by vector fields, called Killing vector
fields; dynamics in the space-time manifold is
not changed under infinitesimal pointwise translation
along these fields. The symmetries are translations in
four linearly independent space-time directions, rotations
about three linearly independent spatial axes, and
Lorentz boosts in three linearly independent spatial
directions (which are equivalent to rotations in space-time
that mix the $t$-direction with a spatial direction).
There are therefore ten independent symmetry
transformations in Minkowski space-time (generating
the Poincaré group), such that, for special relativistic
systems not experiencing external forces, relativistic
energy, momentum, angular momentum, and the
analogue of C in (4) are conserved. 

Both electromagnetic fields described by Maxwell’s
equations [III.22] and relativistic quantum matter
waves (strictly for a single particle) described by the
Dirac equation [III.9] can be expressed as Lagrangian
field theories; the combination of these is called
classical field theory. As described above, Noether’s 
theorem relates continuous symmetries of these the- 
ories to continuity equations of relativistic 4-currents. 
For instance, space-time translation symmetry ensures 
that the relativistic rank-2 stress-energy-momentum 
tensor, which describes the space-time flux of energy 
(including matter), is divergence free. 

5 Other Theories 

Einstein’s theory of general relativity is a yet more gen- 
eral approach, which admits transformations between 
any coordinate systems on space-time (not simply 
inertial frames). This general formulation now applies 
to an arbitrary coordinate system, such as rotating 
coordinates (resolving the paradoxes implicit in New- 
ton’s bucket problem). This is formulated on a possi- 
bly curved, pseudo-Riemannian manifold where local 
neighborhoods of space-time events are equivalent to 
Minkowski space and free particles follow geodesics 
on the manifold. The geometry of the manifold,
expressed by the Einstein curvature tensor, is proportional,
by Einstein’s field equations [III.10], to
the stress-energy-momentum tensor, and gravitational
forces are fictitious forces as freely falling particles


follow curved space-time geodesics. The symmetries 
of general relativistic space-time manifolds are much 
more complicated than those for Minkowski space but 
are still formulated in terms of Killing vector fields that 
generate transformations that keep equations invari- 
ant. 

In quantum physics (in a Galilean or special rela- 
tivistic setting), the ideas described in this article are 
very important in quantum field theory, required to 
describe the quantum nature of electromagnetic and 
other fields, and systems involving many quantum par- 
ticles. In quantum field theory, all energies in the sys- 
tem are quantized, often around a minimum-energy 
configuration. However, many field theories involve
potentials whose minimum energy is not the most symmetric
choice of origin; for instance, a “Mexican hat
potential” of the form $|v|^4 - 2a|v|^2$ for some field
vector $v$ and constant $a > 0$ is rotationally symmetric
around $v = 0$ but has a minimum for any $v$ with
$|v| = \sqrt{a}$. A minimum-energy excitation of the system
breaks this symmetry by choosing an appropriate
minimum energy v. Many phenomena in the quantum 
theory of condensed matter (such as the Meissner effect 
in superconductors) and fundamental particles (such 
as the Higgs boson) arise from this type of symmetry 
breaking. 
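
The location of the minimum of the Mexican hat potential is easy to confirm numerically. The following sketch assumes a two-component field vector and an illustrative value a = 2; it scans the potential |v|^4 - 2a|v|^2 along a radial line and compares the result with |v| = sqrt(a), where the potential equals -a^2.

```python
import math


def mexican_hat(vx, vy, a):
    """Potential |v|^4 - 2a|v|^2 for a two-component vector v."""
    r2 = vx * vx + vy * vy
    return r2 * r2 - 2.0 * a * r2


def radial_minimum(a, n=20001, r_max=3.0):
    """Brute-force scan along the positive vx-axis for the lowest value."""
    best_r, best_u = 0.0, mexican_hat(0.0, 0.0, a)
    for i in range(n):
        r = r_max * i / (n - 1)
        u = mexican_hat(r, 0.0, a)
        if u < best_u:
            best_r, best_u = r, u
    return best_r, best_u


if __name__ == "__main__":
    a = 2.0  # assumed constant for the illustration
    r_min, u_min = radial_minimum(a)
    print("minimum near |v| =", r_min, "expected", math.sqrt(a))
    print("minimum value", u_min, "expected", -a * a)
```

Because the potential depends only on |v|, every point on the circle of that radius is a minimum; picking one of them is the symmetry-breaking choice described above.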

In classical mechanics, the time direction is privi- 
leged (it is common to all observers), so its space- 
time structure cannot be phrased simply in terms of 
manifolds. Nevertheless, the Galilean group of transformations
naturally has a fiber bundle structure, with
the base space given by the one-dimensional Euclidean
line $E^1$ describing time and the fiber being the spatial
coordinates given by $E^3$.

Galilean transformations can naturally be built into
this fiber bundle structure, known as Newton-Cartan
space-time, by defining the paths of free inertial particles
(i.e., straight lines) in the connections of the bundle.
This provides a useful mathematical framework
to compare the physical laws and symmetries of Newtonian
and relativity theories, and interesting connections
exist between general relativity and Newtonian
gravity in the Newton-Cartan formalism.


II.22 The Jordan Canonical Form

Nicholas J. Higham 


A canonical form for a class of matrices is a form of 
matrix— usually chosen to be as simple as possible— 
to which all members of the class can be reduced by 
transformations of a specified kind. The Jordan canonical
form (JCF) is associated with similarity transformations
on $n \times n$ matrices. A similarity transformation of
a matrix $A$ is a transformation from $A$ to $X^{-1}AX$, where
X is nonsingular. The JCF is the simplest form that can 
be achieved by similarity transformations, in the sense 
that it is the closest to a diagonal matrix. 

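The JCF of a complex $n \times n$ matrix $A$ can be written
$A = ZJZ^{-1}$, where $Z$ is nonsingular and the Jordan
matrix $J$ is a block-diagonal matrix

\[ J = \begin{bmatrix} J_1 & & & \\ & J_2 & & \\ & & \ddots & \\ & & & J_p \end{bmatrix} \]

with diagonal blocks of the form

\[ J_k = J_k(\lambda_k) = \begin{bmatrix} \lambda_k & 1 & & \\ & \lambda_k & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_k \end{bmatrix}. \]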


Here, blanks denote zero blocks or zero entries. The 
matrix $J$ is unique up to permutation of the diagonal
blocks, but $Z$ is not. Each $\lambda_k$ is an eigenvalue of
$A$ and may appear in more than one Jordan block. All
the eigenvalues [I.2 §20] of the Jordan block $J_k$ are
equal to $\lambda_k$. By definition, an eigenvector of $J_k$ is a
nonzero vector $x$ satisfying $J_k x = \lambda_k x$, and all such $x$
are nonzero multiples of the vector $x = [1\ 0\ \cdots\ 0]^T$.
Therefore $J_k$ has only one linearly independent eigenvector.
Expand $x$ to a vector $\tilde{x}$ with $n$ components by
padding it with zeros in positions corresponding to
each of the other Jordan blocks $J_i$, $i \ne k$. The vector
$\tilde{x}$ has a single 1, in the $r$th component, say. A
corresponding eigenvector of $A$ is $Z\tilde{x}$, since $A(Z\tilde{x}) =
ZJZ^{-1}(Z\tilde{x}) = ZJ\tilde{x} = \lambda_k Z\tilde{x}$; this eigenvector is the $r$th
column of $Z$.


If every block Jk is 1 x 1 then J is diagonal and A is 
similar to a diagonal matrix; such matrices A are called 
diagonalizable. For example, real symmetric matrices
are diagonalizable; moreover, the eigenvalues are
real and the matrix $Z$ in the JCF can be taken to be
orthogonal. A matrix that is not diagonalizable is defective;
such matrices do not have a complete set of linearly
independent eigenvectors or, equivalently, their
Jordan form has at least one block of dimension 2 or 
greater. 

To give a specific example, the matrix 


A = 



1 1 
1 0 
0 1 


has a JCF with 



" 0 | r 


r 1 

2 

0 0 

z = 

-1 -§ 0 

J = 

0 

1 1 


O 

O 

i-H 


0 

0 1 


( 1 ) 


As the partitioning of $J$ indicates, there are two Jordan
blocks: a $1 \times 1$ block with eigenvalue $\tfrac12$ and a $2 \times 2$
block with eigenvalue 1. The eigenvalue $\tfrac12$ of $A$ has an
associated eigenvector equal to the first column of $Z$.
For the double eigenvalue 1 there is only one linearly
independent eigenvector, namely the second column,
$z_2$, of $Z$. The third column, $z_3$, of $Z$ is a generalized
eigenvector: it satisfies $Az_3 = z_2 + z_3$.

The JCF provides complete information about the 
eigensystem. The geometric multiplicity of an eigen- 
value, defined as the number of associated linearly 
independent eigenvectors, is the number of Jordan 
blocks in which that eigenvalue appears. The algebraic 
multiplicity of an eigenvalue, defined as its multiplicity
as a zero of the characteristic polynomial $q(t) =
\det(tI - A)$, is the number of copies of the eigenvalue
among all the Jordan blocks. For the matrix (1) above,
the geometric multiplicity of the eigenvalue 1 is 1 and
the algebraic multiplicity is 2, while the eigenvalue $\tfrac12$
has geometric and algebraic multiplicities both equal 
to 1. 

The minimal polynomial of a matrix is the unique
monic polynomial $\psi$ of lowest degree such that $\psi(A) =
0$. The degree of $\psi$ is certainly no larger than $n$
because the Cayley-Hamilton theorem [IV.10 §5.3]
states that $q(A) = 0$. The minimal polynomial of an
$m \times m$ Jordan block $J_k(\lambda_k)$ is $(t - \lambda_k)^m$. The minimal
polynomial of $A$ is therefore given by

\[ \psi(t) = \prod_{i=1}^{s} (t - \lambda_i)^{m_i}, \]

where $\lambda_1, \ldots, \lambda_s$ are the distinct eigenvalues of $A$ and
$m_i$ is the dimension of the largest Jordan block in
which $\lambda_i$ appears. An $n \times n$ matrix is derogatory if
the minimal polynomial has degree less than $n$. This is
equivalent to some eigenvalue appearing in more than
one Jordan block. The matrix $A$ in (1) is defective but
not derogatory. The $n \times n$ identity matrix is derogatory
for $n > 1$: it has characteristic polynomial $(t - 1)^n$ and
minimal polynomial $t - 1$.

Two questions that arise in many situations are, “Do 
the powers of the matrix A converge to zero?” and “Are 
the powers of $A$ bounded?” The answers to both questions
are easily obtained using the JCF. If $A = ZJZ^{-1}$
then $A^2 = ZJZ^{-1} \cdot ZJZ^{-1} = ZJ^2Z^{-1}$ and, in general,
$A^k = ZJ^kZ^{-1}$. Therefore the powers of $A$ converge to
zero precisely when the powers of $J$ converge to zero,
and this in turn holds when the powers of each individual
Jordan block converge to zero. The powers of
a $1 \times 1$ Jordan block $J_i = (\lambda_i)$ obviously converge to
zero when $|\lambda_i| < 1$. In general, since $J_i(\lambda_i)^k$ has diagonal
elements $\lambda_i^k$, for the powers of $J_i(\lambda_i)$ to converge
to zero it is necessary that $|\lambda_i| < 1$, and this condition
turns out to be sufficient. Therefore $A^k \to 0$ as $k \to \infty$
precisely when $\rho(A) < 1$, where $\rho$ is the spectral radius,
defined as the largest absolute value of any eigenvalue
of $A$.

Turning to the question of whether the powers of 
A are bounded, by the argument in the previous para- 
graph it suffices to consider an individual Jordan block. 
The powers of $J_k(\lambda_k)$ are clearly bounded when $|\lambda_k| <
1$, as we have just seen, and unbounded when $|\lambda_k| > 1$.
When $|\lambda_k| = 1$ the powers are bounded if the block
is $1 \times 1$, but they are unbounded for larger blocks. For
example, $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}^k = \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$, which is unbounded as $k \to \infty$.

The conclusion is that the powers of $A$ are bounded as
long as $\rho(A) \le 1$ and any eigenvalues of modulus 1 are
in Jordan blocks of size 1. Thus the powers of $A$ in (1)
are not bounded. 
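
The behavior of powers of individual Jordan blocks can be observed directly. The sketch below uses small helper functions written for this illustration (plain nested lists, no external library) to build a Jordan block, raise it to a power by repeated multiplication, and report its largest entry.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def jordan_block(lam, m):
    """m x m block with lam on the diagonal and ones on the superdiagonal."""
    return [[lam if i == j else (1 if j == i + 1 else 0) for j in range(m)]
            for i in range(m)]


def matrix_power(a, k):
    """a^k by repeated multiplication (k >= 0)."""
    n = len(a)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, a)
    return result


def max_abs_entry(a):
    """Largest absolute value among the entries of a."""
    return max(abs(x) for row in a for x in row)


if __name__ == "__main__":
    # Eigenvalue of modulus 1 in a 2 x 2 block: the (1, 2) entry grows like k.
    print(matrix_power(jordan_block(1, 2), 50))
    # Eigenvalue inside the unit circle: powers decay after a transient.
    for k in (1, 10, 100, 400):
        print(k, max_abs_entry(matrix_power(jordan_block(0.9, 2), k)))
```

For the block with eigenvalue 0.9 the largest entry first grows before it decays, a reminder that convergence of the powers to zero does not rule out transient growth.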

In one sense, defective matrices (those with nontrivial
Jordan structure) are very rare because the diagonalizable
matrices are dense in the set of all matrices.
Therefore if you generate matrices randomly you will 
be very unlikely to generate one that is not diagonal- 
izable (this is true even if you generate matrices with 
random integer entries). But in another sense, defective 
matrices are quite common. Certain types of bifurca- 
tions [IV.21] in dynamical systems are characterized 
by the presence of nontrivial Jordan blocks in the Jaco- 
bian matrix, while in problems where some function of 


the eigenvalues of a matrix is optimized the optimum 
often occurs at a defective matrix. 

While the JCF provides understanding of a variety of 
matrix problems, it is not suitable as a computational 
tool. The JCF is not a continuous function of the entries 
of the matrix and can be very sensitive to perturbations. 
For example, for $\epsilon \ne 0$, $\begin{bmatrix} 1 & \epsilon \\ 0 & 1 \end{bmatrix}$ (one Jordan block) and
$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ (two Jordan blocks) have different Jordan structures,
even though the matrices can be made arbitrarily
close by taking $\epsilon$ sufficiently small. In practice, it
is very difficult to compute the JCF in floating-point
arithmetic due to the unavoidable perturbations caused
by rounding errors. As a general principle, the Schur
decomposition [IV.10 §5.5] is preferred for practical
computations. 


II.23 Krylov Subspaces

Valeria Simoncini 


1 Definition and Properties 

The $m$th Krylov subspace of the matrix $A \in \mathbb{C}^{n \times n}$ and
the vector $v \in \mathbb{C}^n$ is

\[ \mathcal{K}_m(A, v) = \operatorname{span}\{v, Av, \ldots, A^{m-1}v\}. \]

The dimension of $\mathcal{K}_m(A, v)$ is at most $m$, and it is
less if an invariant subspace of $A$ with respect to $v$ is
obtained for some $m_* < m$. In general, $\mathcal{K}_m(A, v) \subseteq
\mathcal{K}_{m+1}(A, v)$ (the spaces are nested); if $m_* = n$, then
$\mathcal{K}_n(A, v)$ spans the whole of $\mathbb{C}^n$.

Let $v_1 = v/\|v\|_2$, with $\|v\|_2 = (v^*v)^{1/2}$ the 2-norm,
and let $\{v_1, v_2, \ldots, v_m\}$ be an orthonormal basis
of $\mathcal{K}_m(A, v)$. Setting $V_m = [v_1, v_2, \ldots, v_m]$, from the
nesting property it follows that the next basis vector
$v_{m+1}$ can be computed by the following Arnoldi
relation:

\[ AV_m = [V_m, v_{m+1}] H_{m+1,m}, \]

where $H_{m+1,m} \in \mathbb{C}^{(m+1) \times m}$ is an upper Hessenberg
matrix (upper triangular plus nonzero entries immediately
below the diagonal) whose columns contain the
coefficients that make $v_{m+1}$ orthogonal to the already
available basis vectors $v_1, \ldots, v_m$.
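
The Arnoldi relation can be sketched in a few lines. The version below is an illustration under stated assumptions rather than a production implementation: it handles real matrices stored as lists of rows (the article allows complex entries), uses modified Gram-Schmidt for the orthogonalization, and raises an error if an invariant subspace is reached before m steps. The function and variable names are chosen for this example.

```python
import math


def matvec(a, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(x, y):
    """Euclidean inner product of two real vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def arnoldi(a, v, m):
    """Return (basis, h) with A V_m = V_{m+1} H_{m+1,m}.

    basis is a list of m+1 orthonormal vectors; h is an (m+1) x m upper
    Hessenberg matrix stored as a list of rows.
    """
    norm = math.sqrt(dot(v, v))
    basis = [[x / norm for x in v]]
    h = [[0.0] * m for _ in range(m + 1)]
    for j in range(m):
        w = matvec(a, basis[j])
        for i in range(j + 1):                  # modified Gram-Schmidt
            h[i][j] = dot(basis[i], w)
            w = [wk - h[i][j] * vk for wk, vk in zip(w, basis[i])]
        h[j + 1][j] = math.sqrt(dot(w, w))
        if h[j + 1][j] < 1e-12:                 # invariant subspace reached
            raise ValueError("breakdown at step %d" % (j + 1))
        basis.append([wk / h[j + 1][j] for wk in w])
    return basis, h


if __name__ == "__main__":
    a = [[2.0, 1.0, 0.0, 0.0],
         [1.0, 3.0, 1.0, 0.0],
         [0.0, 1.0, 4.0, 1.0],
         [0.0, 0.0, 1.0, 5.0]]
    basis, h = arnoldi(a, [1.0, 0.0, 0.0, 0.0], 3)
    for row in h:
        print(row)
```

Column j of h holds exactly the coefficients that were subtracted to orthogonalize A times the j-th basis vector, followed by the norm of what remained, which is why h is upper Hessenberg.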

Suppose we wish to approximate a vector $y$ by a vector
$x \in \mathcal{K}_m(A, v)$, measuring error in the 2-norm. Any
such $x$ can be written as a polynomial in $A$ of degree
at most $m - 1$ times $v$: $x = \sum_{i=0}^{m-1} \alpha_i A^i v$. If $A$ is Hermitian,
then by using a spectral decomposition we can
reduce $A$ to diagonal form by unitary transformations,
which do not change the 2-norm, and it is then clear
that the eigenvalues of $A$ and the decomposition of $v$
in terms of the eigenvectors of $A$ drive the approximation
properties of the space. For non-Hermitian $A$ the
approximation error is harder to analyze, especially for 
highly nonnormal or nondiagonalizable matrices. 

By replacing $v$ by an $n \times s$ matrix $V$, with $s \ge 1$,
spaces of dimension at most $ms$ can be obtained. An
immediate matrix counterpart is

\[ \mathbb{K}_m(A, V) = \Bigl\{ \sum_{i=0}^{m-1} \gamma_i A^i V : \gamma_i \in \mathbb{C} \text{ for all } i \Bigr\}. \]

A richer version is obtained by working with linear
combinations of all the available vectors:

\[ \mathbb{K}_m^{\square}(A, V) = \operatorname{range}([V, AV, \ldots, A^{m-1}V]). \]

Methods based on this latter space are called “block”
methods, since all matrix structure properties are generalized
to blocks (e.g., $H_{m+1,m}$ will be block upper
Hessenberg, with $s \times s$ blocks). Block spaces are appropriate,
for instance, in the presence of multiple eigenvalues
or if the original application requires using the
same A and different vectors v. 

2 Applications and Generalizations 

Krylov subspaces are used in projection methods for 
solving large algebraic linear systems, eigenvalue prob- 
lems, and matrix equations; for approximating a wide 
range of matrix functions (analytic functions, trace, 
determinant, transfer functions, etc.); and in model 
order reduction. 

The general idea is to project the original problem of
size $n$ onto the Krylov subspace of dimension $m \ll n$
and then solve the smaller $m \times m$ reduced problem
with a more direct method (one that would be too
computationally expensive if applied to the original
$n \times n$ problem). If the Krylov subspace is good enough,
then the projected problem retains sufficient informa- 
tion from the original problem that the sought after 
quantities are well approximated. 

When equations are involved, Krylov subspaces usu- 
ally play a role as approximation spaces, as well as 
test spaces. The actual test space used determines 
the resulting method and influences the convergence 
properties. 

Generalized spaces have emerged as second-generation
Krylov subspaces. In the eigenvalue context, the
“shift-and-invert” Krylov subspace $\mathcal{K}_m((A - \sigma I)^{-1}, v)$
is able to efficiently approximate eigenvalues in a neighborhood
of a fixed scalar $\sigma \in \mathbb{C}$; here $I$ is the identity
matrix of size $n$. In matrix function evaluations
and matrix equations, the extended space $\mathcal{K}_m(A, v) +
\mathcal{K}_m(A^{-1}, A^{-1}v)$ has shown some advantages over the
classical space, while for $\sigma_1, \ldots, \sigma_m \in \mathbb{C}$, the use of the
more general rational space

\[ \operatorname{span}\{(A - \sigma_1 I)^{-1} v, \ldots, (A - \sigma_m I)^{-1} v\} \]

has recently received a lot of attention for its potential 
in a variety of advanced applications beyond eigenvalue 
problems, where it was first introduced in the 1980s. All 
these generalized spaces require solving systems with 
some shifted forms of A, so that they are in general 
more expensive to build than the classical one, depend- 
ing on the computational cost involved in solving these 
systems. However, the computed space is usually richer 
in spectral information, so that a much smaller space 
dimension is required to satisfactorily approximate the 
requested quantities. The choice among these variants 
thus depends on the spectral and sparsity properties 
of the matrix A. 

Further Reading 

Freund, R. W. 2003. Model reduction methods based on
Krylov subspaces. Acta Numerica 12:126-32.
Liesen, J., and Z. Strakoš. 2013. Krylov Subspace Methods:
Principles and Analysis. Oxford: Oxford University Press.
Saad, Y. 2003. Iterative Methods for Sparse Linear Systems,
2nd edn. Philadelphia, PA: SIAM.
Watkins, D. S. 2007. The Matrix Eigenvalue Problem: GR and
Krylov Subspace Methods. Philadelphia, PA: SIAM.


II.24 The Level Set Method

Fadil Santosa 


1 The Basic Idea 

The level set method is a numerical method for repre- 
senting a closed curve or surface. It uses an implicit rep- 
resentation of the geometrical object in question. It has 
found widespread use in problems where the closed 
curve evolves or needs to be recomputed often. A main 
advantage of the method is that such a representation 
is very flexible and calculation can be done on a regular 
grid. In computations where surfaces evolve, changes in 
the topology of the surface are easily handled. 

Consider an example in two dimensions in the (x,y)- 
plane. Suppose one is interested in the motion of a 
curve under external forcing terms. Let C(t) denote the 
curve as a function of time $t$. One method for solving
this problem is to track the curve, which can be
done by choosing marker points, $(x_i(t), y_i(t)) \in C(t)$,
$i = 1, \ldots, n$, whose motions are determined by the forcing.
The curve itself may be recovered at any time $t$ by
a prescribed spline interpolation.

Figure 1 The graphs of a level set function $z = \varphi(x, y, t)$
for three values of $t$ are shown at the bottom of the figure.
The plane $z = 0$ intersects these functions. The domains
$D(t) = \{(x, y) : \varphi(x, y, t) > 0\}$ are shown above. Note that,
in this example, $D(t)$ has gone through a topological change
as $t$ varies.

The level set method takes a different approach; it 
represents the curve as the zero level set of a function
$\varphi(x, y, t)$. That is, the curve is given by

\[ C(t) = \{(x, y) : \varphi(x, y, t) = 0\}. \]

One can set up the function so that the interior of the
curve $C(t)$ is the set

\[ D(t) = \{(x, y) : \varphi(x, y, t) > 0\}. \]

In figure 1 the level set function $z = \varphi(x, y, t)$ can be
seen to intersect the plane z = 0 at various times t. The 
sets of D(t) (not up to the same scale) are shown above 
each three-dimensional figure. 

An advantage of the level set method is demon- 
strated in the figure. One can see that a topological 
change in D(t) has occurred as t is varied. The level 
set method allows for such a change without the need 
to redefine the representation, as would be the case for 
the front-tracking method described previously. 

2 Discretization 

One of the attractive features of the level set method 
is that calculations are done on a regular Cartesian 
grid. Suppose we have discretized the computational
domain and the nodes are at coordinates $(x_i, y_j)$ for
$i = 1, \ldots, m$ and $j = 1, \ldots, n$. The values of the level set
function $\varphi(x, y, t)$ are then stored at coordinate points
$x = x_i$ and $y = y_j$. At any time, if one is interested in
the curve $C(t)$, the zero level set, the set

\[ C(t) = \{(x, y) : \varphi(x, y, t) = 0\} \]

needs to be approximated from the data $\varphi(x_i, y_j, t)$.
One is typically interested in such quantities as the
normal to the curve and the curvature at a point on
the curve. These quantities are easily calculated by
evaluating finite-difference approximations of

\[ \nu = \frac{\nabla\varphi}{|\nabla\varphi|} \]

and

\[ \kappa = \nabla \cdot \nu, \]

where the gradient operator $\nabla = [\partial/\partial x \ \ \partial/\partial y]^T$.
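
As an illustration of these finite-difference evaluations, the sketch below applies central differences with spacing h to a level set function supplied as a Python callable. The test function, a circle of radius 1 with the level set function positive inside, is an assumption made for the example; for it the unit normal points toward the center and the divergence of the normal equals -1/r.

```python
import math


def grad(phi, x, y, h):
    """Central-difference approximation of the gradient of phi at (x, y)."""
    return ((phi(x + h, y) - phi(x - h, y)) / (2 * h),
            (phi(x, y + h) - phi(x, y - h)) / (2 * h))


def unit_normal(phi, x, y, h):
    """Approximate normal: gradient divided by its length."""
    gx, gy = grad(phi, x, y, h)
    length = math.hypot(gx, gy)
    return gx / length, gy / length


def curvature(phi, x, y, h):
    """Approximate curvature: central-difference divergence of the normal."""
    nx_plus, _ = unit_normal(phi, x + h, y, h)
    nx_minus, _ = unit_normal(phi, x - h, y, h)
    _, ny_plus = unit_normal(phi, x, y + h, h)
    _, ny_minus = unit_normal(phi, x, y - h, h)
    return (nx_plus - nx_minus) / (2 * h) + (ny_plus - ny_minus) / (2 * h)


def circle_phi(x, y, radius=1.0):
    """Assumed example: positive inside the circle, zero on it."""
    return radius - math.hypot(x, y)


if __name__ == "__main__":
    print("normal at (0.6, 0.8):", unit_normal(circle_phi, 0.6, 0.8, 0.01))
    print("curvature at (0.6, 0.8):", curvature(circle_phi, 0.6, 0.8, 0.01))
```

When (x, y) is a grid node and h is the grid spacing, every value used lies on a neighboring node, so the same formulas apply to data stored on the Cartesian grid.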

In practice, it is not necessary to keep all values of the 
level set function on the nodes. Since one is often interested
only in the motion of the curve $C(t)$, the zero level
set, one needs only the values of the level set function in 
the neighborhood of the curve. Such approaches have 
been dubbed “narrow-band methods” and can poten- 
tially reduce the amount of computation in a problem 
involving complex evolution of surfaces. 

It must be noted that, in the two-dimensional exam- 
ple here, C(t) is a one-dimensional object, whereas 
the level set function $\varphi(x, y, t)$ is a two-dimensional
function. Thus, one might say that the ability to track 
topological changes is made at the cost of increased 
computational complexity. 

3 Applications 

A simple problem one may pose is that of tracking the 
motion of a curve for which every point on the curve 
is moving in the direction normal to the curve with a 
given velocity. If the velocity is $v$, then the equation for
the level sets is given by

$$\frac{\partial\varphi}{\partial t} + v\,|\nabla\varphi| = 0.$$

If one is interested in tracking the motion of the zero 
level set $C(t)$, then one must specify an initial condition

$$\varphi(x, y, 0) = \varphi_0(x, y),$$

where the initial zero level set is given by

$$C(0) = \{(x, y) : \varphi_0(x, y) = 0\}.$$

The evolution of such a curve may be very complicated
and go through topological changes. The power
of the level set method is demonstrated here because
all one needs to do is solve the initial-value problem for
$\varphi(x, y, t)$.
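
To make the initial-value problem concrete, here is a minimal Python sketch that advances a level set function in time. It assumes the standard form $\partial\varphi/\partial t + v|\nabla\varphi| = 0$ with a constant speed $v > 0$ and uses a first-order upwind discretization; the function name `advance_normal_speed`, the grid, and the time step are illustrative choices, not a scheme prescribed by the text.

```python
import numpy as np

def advance_normal_speed(phi, h, v, dt, steps):
    """First-order upwind stepping for phi_t + v * |grad phi| = 0, v > 0."""
    phi = phi.copy()
    for _ in range(steps):
        # One-sided differences. np.roll wraps around at the boundary, which
        # is harmless here because the front stays well inside the domain.
        dmx = (phi - np.roll(phi, 1, axis=0)) / h    # backward in x
        dpx = (np.roll(phi, -1, axis=0) - phi) / h   # forward in x
        dmy = (phi - np.roll(phi, 1, axis=1)) / h    # backward in y
        dpy = (np.roll(phi, -1, axis=1) - phi) / h   # forward in y
        # Upwind gradient magnitude for an outward-moving front (v > 0).
        grad = np.sqrt(np.maximum(dmx, 0)**2 + np.minimum(dpx, 0)**2
                       + np.maximum(dmy, 0)**2 + np.minimum(dpy, 0)**2)
        phi -= dt * v * grad
    return phi

n = 81
x = np.linspace(-2.0, 2.0, n)
h = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing="ij")
phi0 = np.sqrt(X**2 + Y**2) - 0.5        # initial curve: circle of radius 0.5
v, dt, steps = 1.0, 0.5 * h, 20          # final time = 20 * 0.025 = 0.5
phi = advance_normal_speed(phi0, h, v, dt, steps)

# Locate the zero crossing along the positive x-axis by linear interpolation.
row = phi[n // 2:, n // 2]
k = int(np.argmax(row > 0))              # first node outside the front
radius = x[n // 2 + k - 1] + h * (-row[k - 1]) / (row[k] - row[k - 1])
print(radius)
```

With unit normal speed the circle should grow from radius 0.5 to roughly radius 1 at time 0.5, so the printed value gives a quick sanity check on the scheme.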

Another simple problem is the classical motion by 
mean curvature. In this “flow,” one is interested in 
tracking the motion of a curve for which every point on 
the curve is moving normal to the curve at a velocity
proportional to the curvature. The evolution equation
is given by

$$\frac{\partial\varphi}{\partial t} = |\nabla\varphi|\,\nabla \cdot \frac{\nabla\varphi}{|\nabla\varphi|}.$$

A numerical solution of this evolution equation can 
be used to demonstrate the classical Grayson theo- 
rem, which asserts that, if the closed curve starts 
out without self-intersections, then it will never form 
self-intersections and it will become convex in finite 
time. 

Significant problems arising from applications from 
diverse fields have benefited from the level set treat- 
ment. The following is an incomplete list meant to give 
a sense of the range of applications. 

Image processing. The level set method can be used 
for segmentation of objects in a two-dimensional 
scene. It has also been demonstrated to be effective 
in modeling surfaces from point clouds. 

Fluid dynamics. Two-phase flows, which involve inter- 
faces separating the two phases, can be approached 
by the level set method. It is particularly effective for 
problems in which one of the phases is dispersed in 
bubbles. 

Inverse problems. Inverse problems exist in which the 
unknown that one wishes to reconstruct from data is 
the boundary of an object. Examples include inverse 
scattering. 

Optimal shape design. When the object is to design 
a shape that maximizes certain attributes (design 
objectives), it is often very convenient to represent 
the shape by a level set function. 

Computer animation. The need for physically based 
simulations in the animation industry has been par- 
tially met by solving equations of physics using 
the level set method to represent surfaces that are 
involved in the simulation. 

Current research areas include improving the accuracy of
the numerical schemes employed and applying the
method to ever more complex physics.

Further Reading 

Osher, S., and R. Fedkiw. 2003. Level Set Methods and 
Dynamic Implicit Surfaces. New York: Springer. 

Osher, S., and J. A. Sethian. 1988. Fronts propagating 
with curvature-dependent speed: algorithms based on 
Hamilton-Jacobi formulations. Journal of Computational 
Physics 79:12-49. 

Sethian, J. A. 1999. Level Set Methods and Fast Marching 
Methods: Evolving Interfaces in Computational Geometry, 
Fluid Mechanics, Computer Vision, and Materials Science. 
Cambridge: Cambridge University Press. 


II.25 Markov Chains

Beatrice Meini 


A Markov chain is a type of random process whose 
behavior in the future is influenced only by its current 
state and not by what happened in the past. A simple 
example is a random walk on Z, where a particle moves 
among the integers of the real line and is allowed to 
move one step forward with probability 0 < p < 1 and 
one step backward with probability q = 1 - p (see fig- 
ure 1). The position of the particle at time n + 1 (in the 
future) depends on the position of the particle at time n 
(at the present time), and what happened before time n 
(in the past) is irrelevant. 

To give a precise definition of a Markov chain we will
need some notation. Let $E$ be a countable set representing
the states, and let $\Omega$ be a set that represents the
sample space. Let $X, Y : \Omega \to E$ be two random variables.
We denote by $P[X = j]$ the probability that $X$
takes the value $j$, and we denote by $P[X = j \mid Y = i]$
the probability that $X$ takes the value $j$ given that the
random variable $Y$ takes the value $i$. A discrete stochastic
process is a family $(X_n)_{n \in \mathbb{N}}$ of random variables
$X_n : \Omega \to E$.

A stochastic process $(X_n)_{n \in \mathbb{N}}$ is called a Markov
chain if

$$P[X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n] = P[X_{n+1} = i_{n+1} \mid X_n = i_n]$$

at any time $n \ge 0$ and for any states $i_0, \ldots, i_{n+1} \in E$.
This means that the state $X_n$ at time $n$ is sufficient to
determine which state $X_{n+1}$ might be occupied at time
$n + 1$, and we may forget the past history $X_0, \ldots, X_{n-1}$.

It is often required that the laws that govern the 
evolution of the system be time invariant. The Markov 
chain is said to be homogeneous if the transitions from 
one state to another state are independent of the time 
n, i.e., if 

$$P[X_{n+1} = j \mid X_n = i] = p_{ij}$$

at any time $n \ge 0$ and for any states $i, j \in E$. The number
$p_{ij}$ represents the probability of passing from state
$i$ to state $j$ in one time step. The matrix $P = (p_{ij})_{i,j \in E}$
is called the transition matrix of the Markov chain. The
matrix $P$ is a stochastic matrix: that is, it has nonnegative
entries and unit row sums ($\sum_{j \in E} p_{ij} = 1$ for all
$i \in E$). The dynamic behavior of the Markov chain
is governed by the transition matrix P. In particular, 
the problem of computing the probability that, after $n$
steps, the Markov chain is in a given state reduces to
calculating the entries of the $n$th power of $P$, since

$$P[X_n = j \mid X_0 = i] = (P^n)_{ij},$$

where $(P^n)_{ij}$ denotes the $(i, j)$ entry of $P^n$.

Figure 1 A random walk on $\mathbb{Z}$.

In the random walk on $\mathbb{Z}$, the set of states is $E = \mathbb{Z}$,
and the random variable $X_n$ is the position of the
particle at time $n$. The stochastic process $(X_n)_{n \in \mathbb{N}}$ is a
homogeneous Markov chain. The transition matrix $P$ is
tridiagonal, with $p_{i,i+1} = p$, $p_{i,i-1} = q$, and $p_{ij} = 0$ for
$j \ne i - 1, i + 1$.

In many applications, the interest is in the asymp- 
totic behavior of the Markov chain. In particular, the 
question is to understand if $\lim_{n \to \infty} P[X_n = j \mid X_0 = i]$
exists and if that limit is independent of the initial state
$i$. If such a limit exists and is equal to $\pi_j > 0$ for any
initial state $i$, then the vector $\pi = (\pi_j)_{j \in E}$ is such that
$\sum_{j \in E} \pi_j = 1$ and $\pi^T P = \pi^T$. This vector $\pi$ is called
the steady state vector, or the probability invariant vector.
If $E$ is a finite set, then the steady state vector is a
left eigenvector of P corresponding to the eigenvalue 1; 
moreover, if in addition the matrix P is irreducible, then 
there exists a unique steady state vector. 
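
The two computations described so far, $n$-step probabilities as entries of $P^n$ and the steady state vector as a left eigenvector for the eigenvalue 1, can be sketched in a few lines of Python. The two-state matrix below is a hypothetical example chosen for illustration; it is not the matrix of equation (1).

```python
import numpy as np

# Hypothetical two-state transition matrix (nonnegative, unit row sums).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# n-step transition probabilities: entry (i, j) of P^n, here with n = 50.
P50 = np.linalg.matrix_power(P, 50)

# Steady state vector: a left eigenvector of P for the eigenvalue 1,
# i.e. an eigenvector of P^T, normalized so that its entries sum to 1.
eigenvalues, eigenvectors = np.linalg.eig(P.T)
k = int(np.argmin(np.abs(eigenvalues - 1.0)))
pi = np.real(eigenvectors[:, k])
pi = pi / pi.sum()

print(pi)    # steady state vector
print(P50)   # each row is close to pi
```

For this irreducible chain the rows of the high power agree with the steady state vector, which is $[5/6, 1/6]$ here, regardless of the initial state.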

The matrix 


l 

L 2 


1 

4 

1 

3 

1 

2 J 


(1) 


is the transition matrix of a Markov chain with state
space $E = \{1, 2, 3\}$. The transitions among the states can
be represented by the graph of figure 2. The powers of 
P converge: 


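$$\lim_{n \to \infty} P^n = \begin{bmatrix} 1/4 & 3/8 & 3/8 \\ 1/4 & 3/8 & 3/8 \\ 1/4 & 3/8 & 3/8 \end{bmatrix}.$$

Hence the steady state vector is $\pi^T = [\tfrac{1}{4} \;\; \tfrac{3}{8} \;\; \tfrac{3}{8}]$. A simple
computation shows that $\pi^T = \pi^T P$.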

Markov chains have applications across a wide range 
of topics in different areas, including mathematical 
biology, chemistry, queueing theory, information sci- 
ences, economics and finance, Internet applications, 
and more. Depending on the model, the transition matrix
$P$ can be finite or infinite dimensional and can have
specific structures, like sparsity or pattern structure. In
most of the applications of Markov chains, the interest
is in the computation of the steady state vector, assuming
that it exists. In this regard, specific algorithms can
be designed according to the properties of the matrix $P$.

Figure 2 Transitions of the Markov chain having the
matrix $P$ of equation (1) as transition matrix.

Further Reading 

Stewart (1994) is an introduction to numerical meth- 
ods for general Markov chains, while Bini et al. (2005) 
presents specific algorithms for Markov chains aris- 
ing in queueing models. Norris (1999) gives a complete 
treatise on Markov chains. 

Bini, D. A., G. Latouche, and B. Meini. 2005. Numerical 
Methods for Structured Markov Chains. New York: Oxford 
University Press. 

Norris, J. R. 1999. Markov Chains, 3rd edn. Cambridge: 
Cambridge University Press. 

Stewart, W. J. 1994. Introduction to the Numerical Solution of 
Markov Chains. Princeton, NJ: Princeton University Press. 


II.26 Model Reduction

Peter Benner 


“Model reduction” is an ambiguous term; in this arti- 
cle it is understood to mean the reduction of the com- 
plexity of a mathematical model by (semi-) automatic 
mathematical algorithms. Often such techniques are 
also called “dimension reduction,” “order reduction,” 
or, inspired by system-theoretic terminology, “model 
order reduction.” The concept is used in various appli- 
cation areas and in different contexts. Model order 
reduction has emerged in many disciplines, primarily in 
structural dynamics, systems and control theory, com- 
putational fluid mechanics, chemical process engineer- 
ing, and, more recently, circuit simulation, microelec- 
tromechanical systems, and computational electromag- 
netics. It now also finds its way into numerous other 
application areas such as image processing, computational
neuroscience, and computational/computer-aided
engineering in general.

Here, we will focus on the reduction of the dimension 
of the state space of a dynamical system, for which we 
use the following model equation: 

$$E(t, p)\dot{x}(t) = f(t, x(t), u(t), p), \tag{1a}$$

$$x(t_0) = x_0, \tag{1b}$$

$$y(t) = g(t, x(t), u(t), p). \tag{1c}$$

In this set of equations, $x(t) \in \mathbb{R}^n$ is the state of
the system at time $t \in [t_0, T]$ ($t_0 < T \le \infty$), $u(t) \in \mathbb{R}^m$
denotes inputs (control, time-varying parameters),
$p \in \Omega \subset \mathbb{R}^d$ is a vector of stationary (material, geometry,
design, etc.) parameters, $\Omega$ is usually a bounded
domain, and $x_0 \in \mathbb{R}^n$ is the initial state of the system.
The matrix $E(t, p) \in \mathbb{R}^{n \times n}$ determines the nature
of the system. When it is uniformly nonsingular, (1a)
represents a system of ordinary differential equations;
otherwise (1) is a descriptor system. If $d > 0$, i.e., when
parameters are present, one may also consider the case
$E(t, p) = 0$. Then (1a) becomes a system of (nonlinear)
algebraic equations. Model reduction preserving
the parameters as symbolic quantities will then accelerate
the approximate solution of (1a) in the case of varying
parameters (the “many-query context”). Equation
(1c) describes an output vector $y(t) \in \mathbb{R}^q$. It may
correspond to a practical setting where only a few measurements
of the system or observables are available.
This equation may also be used to identify quantities
of interest when full state information is not required
in the application. If full state information is needed,
one simply sets $y(t) = x(t)$. The functions $f$ and $g$
are assumed to have sufficient smoothness properties,
where for $f$, Lipschitz continuity is usually a minimum
requirement for ensuring existence and uniqueness of
local solutions of (1a).

The goal of model reduction is to replace (1) by a 
system of reduced state-space dimension $r \ll n$ of the
same form,

$$\hat{E}(t, p)\dot{\hat{x}}(t) = \hat{f}(t, \hat{x}(t), u(t), p), \tag{2a}$$

$$\hat{x}(t_0) = \hat{x}_0, \tag{2b}$$

$$\hat{y}(t) = \hat{g}(t, \hat{x}(t), u(t), p), \tag{2c}$$

with the inputs $u(t)$ and the parameter vector $p$
unchanged from (1), such that $\hat{y}$ matches $y$ as closely
as possible for all admissible control inputs and param- 
eters. Additionally, one may require the preservation 


of structural properties like stability, passivity, dissi- 
pativity, etc. It should be clear that it is difficult to ful- 
fill all these demands at once, and various model order 
reduction methods for specific applications have there- 
fore been developed. A common principle can be found 
in many of these, and the mathematical core of the 
methods is often very similar. 

1 The Basic Concept 

The basic principle behind most model order reduction
methods is the projection of the state equation (1a)
onto a low-dimensional subset $\mathcal{V} \subset \mathbb{R}^n$, possibly along
a complementary subspace $\mathcal{W}$ of the same dimension.
The projection onto nonlinear subsets like central or
(approximate) inertial manifolds is the topic of the
theory of dynamical systems and has been applied so
far mainly in reaction kinetics, process engineering,
and systems biology. Most other successful families
of methods, such as modal truncation, balanced truncation,
Padé approximation/Krylov subspace methods/
moment matching (which are all instances of rational
interpolation), and proper orthogonal decomposition
and reduced basis methods, use linear projection subspaces,
and they can mostly be categorized as (Petrov-)
Galerkin projection methods. Suppose we are given
an orthogonal basis of $\mathcal{V}$, represented by the columns
of $V \in \mathbb{R}^{n \times r}$, and a basis of $\mathcal{W}$ forming the columns
of $W \in \mathbb{R}^{n \times r}$ (where $\mathcal{W} = \mathcal{V}$ and $W = V$ in
the Galerkin case) such that $W^T V = I_r$ (the $r \times r$ identity
matrix). The reduced-order model is then obtained
by setting $\hat{x} = W^T x$, $\hat{x}_0 = W^T x_0$, and making the
residual of (1a) for $\tilde{x} = V\hat{x}$ orthogonal to $\mathcal{W}$:

$$\mathcal{W} \perp E(t, p)\dot{\tilde{x}} - f(t, \tilde{x}, u, p) \quad \forall \text{ admissible } t, u, p$$

$$\iff 0 = W^T \bigl(E(t, p)\dot{\tilde{x}} - f(t, \tilde{x}, u, p)\bigr).$$

The different methods now mainly differ in the way
$V$ (and $W$) are computed. The full state $x$ of (1) is
approximated by $\tilde{x} = V\hat{x}(t) = VW^T x(t)$.

As an example, consider a linear time-invariant sys- 
tem without parameters: 

E(t, p) = E, (3a) 

f(t,x,u,p) = Ax + Bu, (3b) 

g(t,x,u,p) = Cx, (3c) 

with $A, E \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{q \times n}$. For instance,
interpolation methods (and also balanced truncation)
utilize the transfer function

$$G(s) = C(sE - A)^{-1}B, \quad s \in \mathbb{C},$$

of (1), which in the linear time-invariant case is obtained
by taking Laplace transforms and inserting the transformed
equation (1a) into the transformed (1c). The
transfer function represents the mapping of inputs $u$
to outputs $y$. As a rational matrix-valued function of
a complex variable, it can be approximated in different
ways. In rational interpolation methods, $V$, $W$ are
computed so that

$$\frac{d^j}{ds^j} G(s_k) = \frac{d^j}{ds^j} \hat{G}(s_k), \quad k = 1, \ldots, K, \quad j = 0, \ldots, J_k,$$

for $K$ interpolation points $s_k$ and derivatives up to order
$J_k$ at each point. Here, $\hat{G}$ denotes the transfer function
of (2) and (3), defined by $\hat{A} = W^T AV$, $\hat{B} = W^T B$, and
$\hat{C} = CV$.
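
The projection step for a linear time-invariant system can be sketched in Python as follows. Everything specific here is an assumption made for illustration: the random symmetric test matrix, the two interpolation points, the choice of $V$ as an orthonormal basis of the vectors $(s_k E - A)^{-1}B$, the Galerkin choice $W = V$, and the reduced matrix `Er` formed as $W^T E V$. It is one simple instance of rational interpolation by projection, not the algorithm of any particular method named above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = -(M @ M.T) / n - np.eye(n)     # symmetric negative definite test matrix
E = np.eye(n)
B = rng.standard_normal((n, 1))    # m = 1 input
C = rng.standard_normal((1, n))    # one output

def transfer(E, A, B, C, s):
    """Evaluate G(s) = C (sE - A)^{-1} B."""
    return C @ np.linalg.solve(s * E - A, B)

# Interpolation points (assumed); V spans the vectors (s_k E - A)^{-1} B.
points = [0.0, 1.0]
K = np.hstack([np.linalg.solve(s * E - A, B) for s in points])
V, _ = np.linalg.qr(K)             # orthonormal basis, shape (n, 2)
W = V                              # Galerkin case, so W^T V = I_r

# Reduced-order matrices obtained by projection.
Er, Ar, Br, Cr = W.T @ E @ V, W.T @ A @ V, W.T @ B, C @ V

errors = [abs(transfer(E, A, B, C, s) - transfer(Er, Ar, Br, Cr, s)).item()
          for s in points]
print(errors)
```

With this choice of $V$ the reduced transfer function matches the full one at the chosen points up to rounding, which corresponds to the interpolation conditions with $J_k = 0$.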

In the nonlinear case, a further question is how to
obtain functions $\hat{f}$ and $\hat{g}$ allowing for fast evaluation.
Simply setting $\hat{f}(t, \hat{x}, u, p) = W^T f(t, V\hat{x}, u, p)$ obviously
does not lead to faster simulation in general.
Therefore, dedicated methods, such as (discrete) empirical
interpolation, are needed to obtain a “reduced” $\hat{f}$
and $\hat{g}$.

2 An Example 

As an example consider the mathematical model of a 
microgyroscope: a device used in stability control of 
vehicles. Finite-element discretization of this particu- 
lar model leads to a linear time-invariant system of 
$n = 34\,722$ linear ordinary differential equations with
$d = 4$ parameters, $m = 1$ input, and $q = 12$ outputs.
Using a reduced-order model of size $r = 289$, a parameter
study involving two parameters (defining the x- and
y-axes in figure 1) and the excitation frequency $\omega$ (i.e.,
the parametric transfer function $G(s, p)$ is evaluated
for $s = i\omega$ with varying $\omega$) could be accelerated by a
factor of approximately 90 without significant loss of 
accuracy. The output y was computed with an error 
of less than 0.01% in the whole frequency and param- 
eter domain. Figure 1 shows the response surfaces of 
the full and reduced-order models at one frequency for 
variations of two parameters. 

Further Reading 

Antoulas, A. 2005. Approximation of Large-Scale Dynamical 
Systems. Philadelphia, PA: SIAM. 

Benner, P., M. Hinze, and E. J. W. ter Maten, eds. 2011. 
Model Reduction for Circuit Simulation. Lecture Notes in 
Electrical Engineering, volume 74. Dordrecht: Springer. 



Figure 1 The parametric transfer function of a microdevice
(at $\omega = 0.025$): results from (a) the full model with
dimension 34 722 and (b) the reduced-order model with 
dimension 289. (Computations and graphics by L. Feng and 
T. Breiten.) 

Benner, P., V. Mehrmann, and D. C. Sorensen, eds. 2005. 
Dimension Reduction of Large-Scale Systems. Lecture 
Notes in Computational Science and Engineering, vol- 
ume 45. Berlin: Springer. 

Schilders, W. H. A., H. A. van der Vorst, and J. Rommes, eds. 
2008. Model Order Reduction: Theory, Research Aspects 
and Applications. Mathematics in Industry, volume 13. 
Berlin: Springer. 


II.27 Multiscale Modeling

Fadil Santosa 


To accurately model physical, biological, and other 
phenomena, one is often confronted with the need 
to capture complex interactions occurring at distinct 
temporal and spatial scales. In the language of multi- 
scale modeling, temporal scales are usually differenti- 
ated by slow, medium, and fast timescales. Spatially, 
the phenomena are separated into micro-, meso-, and 
macroscales. In modeling the deformation of solids, for
instance, the microscale phenomena could be atomistic
interactions occurring on a femtosecond ($10^{-15}$ second)
timescale. At the mesoscopic scale, one could be
interested in the behavior of the constituent macro- 
molecules, e.g., a tangled bundle of polymers. Finally, 
at the macroscopic scale, one might be interested in 
how a body, whose size could be in meters, deforms 
under an applied force. The challenge in multiscale 
modeling is that the interactions at one scale com- 
municate with interactions at other scales. Thus, in 
the example given, the question we wish to answer is 
how the applied forces affect the atomistic interactions, 
and how those interactions impact the behavior of the 
macromolecules, which in turn affects how the overall 
shape of the body deforms. 

Multiscale modeling is a rapidly developing field 
because of its enormous importance in applications. 
The range of applications is staggering. It has been 
applied in geophysics, biology, chemistry, meteorology, 
materials science, and physics. 

We give another concrete example that arises in solid 
mechanics. Suppose we have a block of pure aluminum 
whose crystalline structure is known. How can we calculate
its elastic properties, i.e., its Lamé modulus and
Poisson ratio, ab initio from knowledge of its atomistic
structure? Such a calculation would start by considering
Schrödinger’s equation for the multiparticle system.
By solving for the ground states of the system, one
can then extract the desired macroscopic properties of 
the bulk aluminum. 

A classical example of multiscale modeling in applied
mathematics is the homogenization method [II.17],
which allows for extraction of effective properties of
composite materials. Consider the steady-state distribution
of temperature in a rod of length $\ell$ made up of a
material with rapidly oscillating conductivity. The conductivity
is described by a periodic function $a(y) > 0$,
such that $a(y + 1) = a(y)$. A small scale $\varepsilon$ is introduced
to denote the actual period in the medium. The
governing equation for temperature $u(x)$ is

$$(a(x/\varepsilon)u')' = f, \quad 0 < x < \ell,$$

where a prime denotes differentiation with respect
to $x$. Here, $f$ is the heat source distribution, with $x$
measuring distance along the rod. To solve the problem,
the solution $u$ is developed in powers of $\varepsilon$. The
macroscopic behavior of u is identified with the zeroth 
order. This solution will be smooth as the small rapid 
oscillations are ignored. 
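
For this one-dimensional problem it is a standard result (stated here as background, not derived in the text) that the zeroth-order term solves the same equation with $a(x/\varepsilon)$ replaced by a constant effective conductivity equal to the harmonic mean of $a$ over one period. The Python sketch below compares the harmonic and arithmetic means for a hypothetical conductivity $a(y) = 2 + \sin(2\pi y)$; the function name and the quadrature rule are choices made for this illustration.

```python
import numpy as np

def conductivity(y):
    """Hypothetical 1-periodic conductivity a(y) > 0."""
    return 2.0 + np.sin(2.0 * np.pi * y)

# Midpoint rule over one period of a.
N = 2000
y = (np.arange(N) + 0.5) / N
arithmetic_mean = conductivity(y).mean()
harmonic_mean = 1.0 / (1.0 / conductivity(y)).mean()

print(arithmetic_mean)  # naive average of the conductivity
print(harmonic_mean)    # effective conductivity of the homogenized rod
```

For this choice of $a$ the harmonic mean is $\sqrt{3}$, noticeably smaller than the arithmetic mean of 2, which shows that simply averaging the conductivity would overestimate the effective property.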


Current research in multiscale modeling focuses 
on bridging the phenomena at the different scales 
and developing efficient numerical methods. There are 
efforts to develop rigorous multiscale models that
agree with their continuum counterparts; continuum
models [IV.26] are macroscale models derived from
first principles in which the material properties are
usually measured. Other efforts concentrate more on
developing accurate simulations, such as modeling the 
properties of Kevlar starting from the polymers in the 
resin and the carbon fibers used. All research in this 
area involves some numerical analysis and scientific 
computing. 

II.28 Nonlinear Equations and
Newton’s Method 

Marcos Raydan 

Nonlinear equations appear frequently in the mathematical
modeling of real-world processes. They are usually
written as a zero-finding problem: find $x_j \in \mathbb{R}$, for
$j = 1, 2, \ldots, n$, such that

$$f_i(x_1, \ldots, x_n) = 0 \quad \text{for } i = 1, 2, \ldots, n,$$

where the $f_i$ are given functions of $n$ variables. This
system of equations is nonlinear if at least one of the
functions $f_i$ depends nonlinearly on at least one of the
variables. Using vector notation, the problem can also
be written as find $x = [x_1, \ldots, x_n]^T \in \mathbb{R}^n$ such that

$$F(x) = [f_1(x), \ldots, f_n(x)]^T = 0.$$

If every function $f_i$ depends linearly on all the variables,
then it is usually written as a linear system of
equations $Ax = b$, where $b \in \mathbb{R}^n$ and $A$ is an $n \times n$
matrix.

The existence and uniqueness of solutions for non- 
linear systems of equations is more complicated than 
for linear systems of equations. For $Ax = b$,
the number of solutions must be either zero, one (when
$A$ is nonsingular), or infinitely many, whereas $F(x) = 0$ can have
zero, infinitely many, or any finite number of solutions.
Fortunately, in practice, it is usually sufficient to find a 
solution of the nonlinear system for which a reasonable 
initial approximation is known. 

Even in the simple one-dimensional case (n = 1), 
most nonlinear equations cannot be solved by a closed 
formula, i.e., using a finite number of operations. A 
well-known exception is the problem of finding the 
roots of polynomials of degree less than or equal to 
four, for which closed formulas have been known for
centuries. As a consequence, in general iterative methods
must be used to produce increasingly accurate
approximations to the solution. One of the oldest iterative
schemes, which has played an important role in
the numerical methods literature for solving $F(x) = 0$,
is Newton’s method.

Figure 1 One iteration of Newton’s method
in one dimension for $f(x) = 0$.

1 Newton’s Method 

Newton’s method for solving nonlinear equations was 
born in one dimension. In that case, the problem is to find
$x \in \mathbb{R}$ such that $f(x) = 0$, where $f : \mathbb{R} \to \mathbb{R}$ is differentiable
in the neighborhood of a solution $x_*$. Starting
from a given $x_0$, on the $k$th iteration Newton’s method
constructs the tangent line passing through the point
$(x_k, f(x_k))$,

$$M_k(x) = f(x_k) + f'(x_k)(x - x_k),$$

and defines the next iterate, $x_{k+1}$, as the root of the
equation $M_k(x) = 0$ (see figure 1). Hence, from a given
$x_0 \in \mathbb{R}$, Newton’s method generates the sequence $\{x_k\}$
of approximations to $x_*$ given by

$$x_{k+1} = x_k - f(x_k)/f'(x_k).$$

Notice that the tangent line or linear model $M_k(x)$ is
equal to the first two terms of the Taylor series of $f$
around $x_k$.
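
The one-dimensional iteration translates almost directly into code. The following Python sketch is a minimal version; the function name `newton_1d`, the stopping rule on $|f(x_k)|$, and the test equation $x^2 - 2 = 0$ are assumptions made for this example.

```python
def newton_1d(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method for f(x) = 0 starting from x0.

    Returns the final iterate and the number of Newton steps taken.
    """
    x = x0
    for k in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:            # stop once the residual is small
            return x, k
        x = x - fx / fprime(x)       # root of the tangent line M_k
    return x, max_iter

# Example: f(x) = x^2 - 2, whose positive root is sqrt(2).
root, iterations = newton_1d(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root, iterations)
```

Starting from $x_0 = 1$, only a handful of steps are needed, since the number of correct digits roughly doubles per iteration once the iterates are close to the root.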

Newton’s idea in one dimension can be extended to 
$n$-dimensional problems. In $\mathbb{R}^n$ the method approximates
the solution of a square nonlinear system of
equations by solving a sequence of square linear systems.
As in the one-dimensional case, on the $k$th iteration
the idea is to define $x_{k+1}$ as a zero of the linear
model given by

$$M_k(x) = F(x_k) + J(x_k)(x - x_k),$$

where the map $F : \mathbb{R}^n \to \mathbb{R}^n$ is assumed to be differentiable
in a neighborhood of a solution $x_*$ and
where $J(x_k)$ is the $n \times n$ Jacobian matrix with entries
$J_{ij}(x_k) = \partial f_i/\partial x_j(x_k)$ for $1 \le i, j \le n$. Therefore,
starting at a given $x_0 \in \mathbb{R}^n$, Newton’s method carries
out for $k = 0, 1, 2, \ldots$ the following two steps.

• Solve $J(x_k)s_k = -F(x_k)$ for $s_k$.

• Set $x_{k+1} = x_k + s_k$.
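
The two steps above can be sketched in Python with NumPy's dense linear solver. The function name `newton`, the stopping rule on the residual norm, and the two-equation test system are hypothetical choices for illustration.

```python
import numpy as np

def newton(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's method for F(x) = 0 with an explicitly supplied Jacobian J."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            return x, k
        s = np.linalg.solve(J(x), -Fx)   # solve J(x_k) s_k = -F(x_k)
        x = x + s                        # set x_{k+1} = x_k + s_k
    return x, max_iter

# Hypothetical test system: x^2 + y^2 = 4 and x * y = 1.
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] * x[1] - 1.0])

def J(x):
    return np.array([[2.0 * x[0], 2.0 * x[1]],
                     [x[1],       x[0]]])

solution, iterations = newton(F, J, x0=[2.0, 0.5])
print(solution, iterations)
```

The starting point is deliberately close to a solution; from a poor initial guess the same loop may diverge, which is why globalization strategies are used in practice.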

Notice that Newton’s method is scale invariant: if the
method is applied to the nonlinear system $AF(x) = 0$,
for any nonsingular $n \times n$ matrix $A$, the sequence
of iterates is identical to the one obtained when it
is applied to $F(x) = 0$. Another interesting theoretical
feature is its impressively fast local convergence.
Under some standard assumptions (namely that $J(x_*)$
is nonsingular, $J(x)$ is Lipschitz continuous in a neighborhood
of $x_*$, and the initial guess $x_0$ is sufficiently
close to $x_*$) the sequence $\{x_k\}$ generated by Newton’s
method converges q-quadratically to $x_*$; i.e., there
exist $c > 0$ and $\hat{k} > 0$ such that for all $k \ge \hat{k}$,

$$\|x_{k+1} - x_*\| \le c\,\|x_k - x_*\|^2.$$

Hence Newton’s method is theoretically attractive, but
it may be difficult to use in practice for various reasons,
including the need to calculate the derivatives,
the need to have a good initial guess to guarantee convergence,
and the cost of solving an $n \times n$ linear system
per iteration.

2 Practical Variants 

If the derivatives are not available, or are too expen- 
sive to compute, they can be approximated by finite 
differences. A standard option is to approximate the 
$j$th column of $J(x_k)$ by a forward difference quotient:
$(F(x_k + h_k e_j) - F(x_k))/h_k$, where $e_j$ denotes the $j$th
unit vector and $h_k > 0$ is a suitable small number.
Notice that, when using this finite-difference variant,
the map $F$ needs to be evaluated $n + 1$ times per iteration,
once for each column of the Jacobian and once for
the vector $x_k$. Therefore, this variant is attractive when
the evaluation of F is not expensive. 
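
A forward-difference Jacobian of this kind can be sketched as follows. The helper name `fd_jacobian`, the fixed step `h = 1e-7`, and the test function are assumptions for this example; in practice the step is usually scaled to the size of the iterate.

```python
import numpy as np

def fd_jacobian(F, x, h=1e-7):
    """Forward-difference approximation of the Jacobian, column by column."""
    Fx = F(x)                      # one evaluation at the current point
    n = x.size
    Jh = np.empty((Fx.size, n))
    for j in range(n):             # n further evaluations, one per column
        e_j = np.zeros(n)
        e_j[j] = 1.0
        Jh[:, j] = (F(x + h * e_j) - Fx) / h
    return Jh

# Hypothetical test function with a known Jacobian.
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] * x[1] - 1.0])

x = np.array([2.0, 0.5])
J_exact = np.array([[2.0 * x[0], 2.0 * x[1]],
                    [x[1],       x[0]]])
J_approx = fd_jacobian(F, x)
print(np.abs(J_approx - J_exact).max())
```

The loop makes the cost explicit: one evaluation of $F$ at the current point plus one per column, so $n + 1$ evaluations in total.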

Another option is to extend the well-known one-dimensional
secant method to the $n$-dimensional problem
$F(x) = 0$. The main idea, in these so-called secant
or quasi-Newton methods [IV.11 §4.2], is to generate
not only a sequence of iterates $\{x_k\}$ but also a sequence
of matrices $\{B_k\}$ that approximate $J(x_k)$ and satisfy the
secant equation $B_k s_{k-1} = y_{k-1}$, where $s_{k-1} = x_k - x_{k-1}$
and $y_{k-1} = F(x_k) - F(x_{k-1})$. In this case, an initial
matrix $B_0 \approx J(x_0)$ must be supplied. Clearly, infinitely
many $n \times n$ matrices satisfy the secant equation. As a
consequence, a wide variety of quasi-Newton methods 
(e.g., Broyden’s method) with different properties have 
been developed. 
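
As one concrete instance, the sketch below implements Broyden's rank-one update, often called the "good" Broyden method, in Python. The function name `broyden`, the stopping rule, and the test problem (with $B_0$ taken as the exact Jacobian at $x_0$) are assumptions for illustration.

```python
import numpy as np

def broyden(F, x0, B0, tol=1e-10, max_iter=100):
    """Quasi-Newton iteration with Broyden's rank-one update of B_k."""
    x = np.asarray(x0, dtype=float)
    B = np.asarray(B0, dtype=float).copy()
    Fx = F(x)
    for k in range(max_iter):
        if np.linalg.norm(Fx) < tol:
            return x, B, k
        s = np.linalg.solve(B, -Fx)      # quasi-Newton step
        x_new = x + s
        F_new = F(x_new)
        y = F_new - Fx
        # Rank-one correction chosen so that the new B satisfies B s = y.
        B = B + np.outer(y - B @ s, s) / (s @ s)
        x, Fx = x_new, F_new
    return x, B, max_iter

# Hypothetical test system: x^2 + y^2 = 4 and x * y = 1.
def F(x):
    return np.array([x[0]**2 + x[1]**2 - 4.0, x[0] * x[1] - 1.0])

x0 = np.array([2.0, 0.5])
B0 = np.array([[4.0, 1.0], [0.5, 2.0]])  # exact Jacobian at x0
x, B, iterations = broyden(F, x0, B0)
print(x, iterations)
```

Only evaluations of $F$ are needed after the initial matrix is supplied; the price is that the convergence is typically superlinear rather than quadratic.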

When using Newton’s method, or any of its deriva- 
tive-free variants, a linear system needs to be solved 
at each iteration. This linear system can be solved by 
direct methods (e.g., LU or QR factorization), but if n 
is large and the Jacobian matrix has a sparse struc- 
ture, it may be preferable to use an iterative method 
(e.g., a Krylov subspace method [IV.10 §9]). For that,
note that $x_k$ can be used as the initial guess for the
solution at iteration $k + 1$. One of the important features
of these so-called inexact variants of Newton’s
method is that modern iterative linear solvers do not
require explicit knowledge of the Jacobian; instead,
they require only the matrix-vector product $J(x_k)z$ for
any given vector $z$. This product can be approximated
using a forward finite difference:

$$J(x_k)z \approx (F(x_k + h_k z) - F(x_k))/h_k.$$

Hence, inexact variants of Newton’s method are also
suitable when derivatives are not available. In all the
discussed variants, the local q-quadratic convergence is
in general lost, but q-superlinear convergence can nevertheless
be obtained, i.e., $\|x_{k+1} - x_*\| / \|x_k - x_*\| \to 0$.

Finally, in general Newton’s method converges only 
locally, so it requires globalization strategies to be prac- 
tically effective. The two most popular and best-studied 
options are line searches and trust regions. In any case, 
a merit function $f : \mathbb{R}^n \to \mathbb{R}_+$ must be used to evaluate
the quality of all possible iterates. When solving
$F(x) = 0$, the natural choice is $f(x) = F(x)^T F(x)$.

Further Reading 

Dennis, J. E., and R. Schnabel. 1983. Numerical Methods 
for Unconstrained Optimization and Nonlinear Equations. 
Englewood Cliffs, NJ: Prentice Hall. (Republished by SIAM 
(Philadelphia, PA) in 1996.) 

Kelley, C. T. 2003. Solving Nonlinear Equations with New- 
ton's Method. Philadelphia, PA: SIAM. 

Ypma, T. J. 1995. Historical development of the Newton- 
Raphson method. SIAM Review 37(4):531-51.


II.29 Orthogonal Polynomials


Polynomials $p_0(x), p_1(x), \ldots$, where $p_i$ has degree $i$,
are orthogonal polynomials on an interval $[a, b]$ with
respect to a nonnegative weight function $w(x)$ if

$$\int_a^b w(x) p_i(x) p_j(x)\,dx = 0, \quad i \ne j,$$

that is, if all distinct pairs of polynomials are orthogonal
on $[a, b]$ with respect to $w$. For a given weight function
and interval, the orthogonality conditions determine
the polynomials $p_i$ uniquely up to a constant
factor.

Table 1 Parameters in the three-term recurrence (1) for
some classical orthogonal polynomials.

Polynomial   $[a, b]$               $w(x)$               $a_j$             $b_j$             $c_j$
Chebyshev    $[-1, 1]$              $(1 - x^2)^{-1/2}$   $2$               $0$               $1$
Legendre     $[-1, 1]$              $1$                  $(2j+1)/(j+1)$    $0$               $j/(j+1)$
Hermite      $(-\infty, \infty)$    $e^{-x^2}$           $2$               $0$               $2j$
Laguerre     $[0, \infty)$          $e^{-x}$             $-1/(j+1)$        $(2j+1)/(j+1)$    $j/(j+1)$

An important property of orthogonal polynomials is 
that they satisfy a three-term recurrence relation 

$$p_{j+1}(x) = (a_j x + b_j)p_j(x) - c_j p_{j-1}(x), \quad j \ge 1. \tag{1}$$

The weight functions, interval, and recurrence coefficients
for some classical orthogonal polynomials are
summarized in table 1, in which the normalization
$p_0(x) = 1$ is assumed, with $p_1(x) = x$ for the Chebyshev
and Legendre polynomials, $p_1(x) = 2x$ for the Hermite
polynomials, and $p_1(x) = 1 - x$ for the Laguerre
polynomials.
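
The recurrence gives a simple way to evaluate these polynomials without forming their coefficients. The Python sketch below encodes the standard recurrence coefficients of the four classical families (Chebyshev of the first kind, Legendre, physicists' Hermite, and Laguerre) together with the normalizations stated above; the dictionary `FAMILIES` and the function `evaluate` are names invented for this example.

```python
# (a_j, b_j, c_j, p_1) for each family, with p_0(x) = 1 throughout.
FAMILIES = {
    "chebyshev": (lambda j: 2.0, lambda j: 0.0, lambda j: 1.0,
                  lambda x: x),
    "legendre": (lambda j: (2 * j + 1) / (j + 1), lambda j: 0.0,
                 lambda j: j / (j + 1), lambda x: x),
    "hermite": (lambda j: 2.0, lambda j: 0.0, lambda j: 2.0 * j,
                lambda x: 2.0 * x),
    "laguerre": (lambda j: -1.0 / (j + 1), lambda j: (2 * j + 1) / (j + 1),
                 lambda j: j / (j + 1), lambda x: 1.0 - x),
}

def evaluate(family, degree, x):
    """Evaluate p_degree(x) using the three-term recurrence."""
    a, b, c, p1 = FAMILIES[family]
    prev, curr = 1.0, p1(x)
    if degree == 0:
        return prev
    for j in range(1, degree):
        # p_{j+1} = (a_j x + b_j) p_j - c_j p_{j-1}
        prev, curr = curr, (a(j) * x + b(j)) * curr - c(j) * prev
    return curr

for name in FAMILIES:
    print(name, evaluate(name, 3, 0.5))
```

As a check, the degree-3 Chebyshev polynomial is $4x^3 - 3x$ and the degree-3 Hermite polynomial is $8x^3 - 12x$, so at $x = 0.5$ the recurrence should reproduce $-1$ and $-5$ respectively.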

Orthogonal polynomials have many interesting prop- 
erties and find use in many different settings, e.g., in 
numerical integration, Krylov subspace methods, and 
the theory of continued fractions. In this volume they
arise in least-squares approximation [IV.9 §3.3],
numerical solution of partial differential equations
[IV.13 §6], random-matrix theory [IV.24], and
as special functions [IV.7 §7]. See special functions
[IV.7 §7] for more information.


II.30 Shocks

Barbara Lee Keyfitz 


1 What Are Shocks? 

“Shocks” (or “shock waves”) is another name for the
field of quasilinear hyperbolic PDEs, or conservation
laws [II.6]. When the mathematical theory of supersonic
flow was in its infancy, the first text on the subject
named it this way; and the first modern monograph to
focus on the mathematical theory of quasilinear hyperbolic
PDEs also used this terminology. Shocks are a
dominant feature of the subject, for, as noted with reference
to the Burgers equation [III.4] (see also partial
differential equations [IV.3 §3.6]), solutions to
initial-value problems, even with smooth data, are not 
likely to remain smooth for all time. We return to the 
derivation to see what happens when solutions are not 
differentiable. 

In the derivation of a system in one space dimension,

\[
u_t + f(u)_x = 0, \tag{1}
\]

one typically invokes the conservation of each component
$u_i$ of $u$. The rate of change of $u_i$ over a control
length $[x, x+h]$ is the net flux across the endpoints:

\[
\partial_t \int_x^{x+h} u_i(y,t)\,dy = f_i(u(x,t)) - f_i(u(x+h,t)). \tag{2}
\]

Under the assumption that $u$ is differentiable, the
mean value theorem of calculus yields (1) in the limit
$h \to 0$. However, (2) is also useful in a different case.
If $u$ approaches two different limits, $u_L(X(t),t)$ and
$u_R(X(t),t)$, on the left and right sides of a curve of
discontinuity, $x = X(t)$, then taking the limit $h \to 0$ in
(2) with $x$ and $x+h$ straddling the curve $X(t)$ yields
a relationship among $u_L$, $u_R$, and the derivative of the
curve:

\[
X'(t)\,\bigl(u_R(X(t),t) - u_L(X(t),t)\bigr)
  = f(u_R(X(t),t)) - f(u_L(X(t),t)). \tag{3}
\]

This is known as the (generalized) Rankine-Hugoniot
relation. The quantity $X'(t)$ measures the speed of
propagation of the discontinuity at $X(t)$.
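
As a small illustration of relation (3) in the scalar case, the following sketch computes the shock speed for the Burgers flux $f(u) = u^2/2$. The function names `burgers_flux` and `shock_speed` are hypothetical, chosen only for this example, and the left and right states are arbitrary illustrative values.

```python
def burgers_flux(u):
    """Flux f(u) = u^2 / 2 of the inviscid Burgers equation."""
    return 0.5 * u * u


def shock_speed(flux, u_left, u_right):
    """Speed X'(t) given by the scalar Rankine-Hugoniot relation (3).

    Solves X' * (u_R - u_L) = f(u_R) - f(u_L) for X'.
    """
    if u_left == u_right:
        raise ValueError("no discontinuity: the two states coincide")
    return (flux(u_right) - flux(u_left)) / (u_right - u_left)


# Illustrative states: u_L = 1 on the left, u_R = 0 on the right.
speed = shock_speed(burgers_flux, 1.0, 0.0)
print(speed)
```

For the Burgers flux the quotient reduces to the average $(u_L + u_R)/2$ of the two states, which is why the jump from 1 to 0 travels with speed 1/2.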

Because solutions of conservation laws are not expected
to be continuous for all time, even when the initial
data are smooth, it is necessary to allow shocks in
any formulation of what is meant by a “solution” of
(1). Conservation law theory states that a solution of
(1) may contain countably many shocks, the functions
$X(t)$ may be no smoother than Lipschitz continuous,
and there may be countably many points in physical
$(x,t)$-space at which shock curves intersect. In the case
of conservation laws in more than one space dimension,
the notion of a “shock curve” can be generalized to
that of a “shock surface” by supposing that the solution
is piecewise differentiable on each side of such a
surface. One obtains an equation similar to (3) that relates
the states on either side of the surface to the normal
to the surface at each point. However, as distinct from
the case in a single space dimension, it is not known
whether all solutions have this structure, or whether
more singular behavior is possible.


2 Entropy, Admissibility, and Uniqueness 


Although allowing for weak solutions, in the form of
solutions containing shocks, is forced upon us by both
mathematical considerations (they arise from almost
all data) and physical considerations (they are seen in
all the fluid systems modeled by conservation laws), a
new difficulty arises: if shocks are admitted as solutions
to a conservation law system, there may be too
many solutions (this is also known, somewhat illogically,
as “lack of uniqueness”). Here is a simple example,
involving the Burgers equation. If at $t = 0$ we are
given

\[
u(x,0) = \begin{cases} 0, & x \le 0, \\ 1, & x > 0, \end{cases}
\]

then

\[
u(x,t) = \begin{cases} 0, & x \le \tfrac{1}{2}t, \\ 1, & x > \tfrac{1}{2}t, \end{cases}
\]

is a shock solution in the sense of (3). But

\[
u(x,t) = \begin{cases} 0, & x \le 0, \\ x/t, & 0 < x \le t, \\ 1, & x > t, \end{cases}
\]

is also a solution, and in fact it is the latter, which
is described as a “rarefaction wave,” that is correct,
while the former, known as a “rarefaction shock”
[V.20 §2.2], can be ruled out on both mathematical and
physical grounds. A fluid that is rarefying (that is, one
in which the force of pressure is decreasing), be it a gas
or traffic, spreads out gradually and erases the initial
discontinuity, while a fluid that is being compressed
forms a shock.

Another mode of reasoning, which has both a mathematical
and a physical basis, goes as follows. Suppose
$\eta$ is a convex function of $u$ for which another function
$q(u)$ exists such that

\[
\eta(u)_t + q(u)_x = 0 \tag{4}
\]

whenever $u$ is a smooth solution of (1). When this is the
case, we say that (1) “admits a convex entropy.” A
calculation (easy for the Burgers equation and true in
general) shows that we should not expect (4) to be satisfied
(in the weak sense, as an additional Rankine-Hugoniot
relation like (3)) in regions containing shocks. But since
$\eta$ is convex, imposing the requirement that $\eta$ decrease
in time when shocks are present forces a bound on
solutions. For systems of conservation laws in a single
space dimension, this condition, which admits some
shocks and not others as weak solutions, is sufficient
to guarantee uniqueness.
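
The entropy test can be made concrete for the Burgers equation. The sketch below assumes the entropy pair $\eta(u) = u^2$, $q(u) = \tfrac{2}{3}u^3$, which is one standard choice satisfying (4) for smooth solutions but is not named in the text; the function name `entropy_production` is hypothetical. It evaluates the jump quantity $-X'(\eta_R - \eta_L) + (q_R - q_L)$, which must be nonpositive if $\eta$ is to decrease across the shock.

```python
def entropy_production(u_left, u_right):
    """Jump in the entropy balance across a Burgers shock.

    Assumes the entropy pair eta(u) = u^2, q(u) = 2 u^3 / 3 (an
    illustrative choice).  A nonpositive value means the shock
    dissipates entropy and is admissible.
    """
    flux = lambda u: 0.5 * u * u
    eta = lambda u: u * u
    q = lambda u: 2.0 * u ** 3 / 3.0
    # Shock speed from the Rankine-Hugoniot relation (3).
    speed = (flux(u_right) - flux(u_left)) / (u_right - u_left)
    return -speed * (eta(u_right) - eta(u_left)) + (q(u_right) - q(u_left))


compressive = entropy_production(1.0, 0.0)   # u drops from 1 to 0
rarefying = entropy_production(0.0, 1.0)     # u jumps from 0 to 1
print(compressive, rarefying)
```

Under these assumptions the compressive jump gives a negative value and the jump from 0 to 1 gives a positive one, so only the former passes the test; this agrees with ruling out the rarefaction shock in the example above.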

There are a number of other ways to formulate
admissibility conditions for shocks, including modifying
the system with so-called viscosity terms to make
it parabolic,

\[
u_t + f(u)_x = \varepsilon u_{xx}, \tag{5}
\]

and then admitting only shocks that are limits, as
$\varepsilon \to 0$, of classical solutions of this semilinear
parabolic system.

3 Shock Profiles 

To a fluid dynamicist, the viscous equation (5) is more
than an artifice to obtain uniqueness of solutions by
winnowing out shocks that fail some test. The hyperbolic
system (1) may be regarded as an approximation
to a more realistic physical situation that takes
into account viscosity, heat transfer effects, and even
the mean free path of particles (for a gas). If, for
example, viscous effects are included, the right-hand
side of (5) takes the form $B u_{xx}$, where $B = B(u)$ is
a matrix that is typically diagonal, typically
positive-semidefinite, and typically small when the system is
measured on the length scale of interest. In particular,
the hyperbolic system (1) gives a good description of
a flow on that length scale. However, across a shock
there is a rapid change in $u$, or at least in some of
its components, and the hyperbolic approximation is
not adequate. One approach here is to use the hyperbolic
system to uncover the macroscopic features of the
shock (the speed $c = X'$ and the states on either side,
using (3)) and then to formulate a traveling wave problem
for the solution inside the shock. That is, one uses
techniques from dynamical systems to study solutions
$u(\xi) = u(x - ct)$ of

\[
-c u' + A(u) u' = B(u) u'', \qquad
u(-\infty) = u_L, \quad u(\infty) = u_R,
\]

where $A$ is the Jacobian matrix of the flux function $f$.
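
For the scalar Burgers equation the traveling wave problem can be solved in closed form, which gives a simple check. The sketch below assumes $A(u) = u$, a constant $B = \varepsilon$, and the illustrative states $u_L = 1$, $u_R = 0$ (so $c = 1/2$); the profile $u(\xi) = \tfrac12\bigl(1 - \tanh(\xi/4\varepsilon)\bigr)$ and the names `profile` and `residual` are supplied for this example and do not come from the text.

```python
import math


def profile(xi, eps):
    """Viscous Burgers profile joining u_L = 1 to u_R = 0 (speed c = 1/2)."""
    return 0.5 * (1.0 - math.tanh(xi / (4.0 * eps)))


def residual(xi, eps, c=0.5, h=1e-3):
    """Central-difference residual of -c u' + u u' - eps u'' at xi."""
    um, u0, up = profile(xi - h, eps), profile(xi, eps), profile(xi + h, eps)
    d1 = (up - um) / (2.0 * h)            # approximates u'
    d2 = (up - 2.0 * u0 + um) / (h * h)   # approximates u''
    return -c * d1 + u0 * d1 - eps * d2


eps = 0.1
worst = max(abs(residual(-1.0 + 0.1 * i, eps)) for i in range(21))
print(worst)
```

The residual is small (limited by the finite-difference error), and the profile width scales with $\varepsilon$, so the wave steepens toward the discontinuous shock as $\varepsilon \to 0$.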

Further Reading 

Bressan, A. 2000. Hyperbolic Systems of Conservation Laws:
The One-Dimensional Cauchy Problem. Oxford: Oxford
University Press.

Courant, R., and K. O. Friedrichs. 1948. Supersonic Flow and
Shock Waves. New York: Wiley-Interscience.

Smoller, J. A. 1983. Shock Waves and Reaction-Diffusion 
Equations. New York: Springer. 


II.31 Singularities

P. A. Martin 


The word “singularity” has a variety of meanings in
mathematics but it usually means a place or point
where something bad happens. For example, the function
$f(x) = 1/x$ is not defined at $x = 0$; it has a
singularity at $x = 0$: it is “singular” there. In this simple
example, $f(x)$ is unbounded (infinite) at $x = 0$, but this
need not be a defining property of a singularity. Thus,
in complex analysis [IV.1], $f(z)$ is said to have a
singularity at a point $z_0$ when $f$ is not differentiable at $z_0$.
For example, $f_1(z) = 1/(z-1)^2$ has a singularity at
$z = 1$, and $f_2(z) = z^{1/2}$ has a singularity at $z = 0$. Note
that $f_1(z)$ is unbounded at $z = 1$, whereas $f_2(0) = 0$.
In complex analysis, singularities of $f(z)$ are exploited
to good effect, especially in the calculus of residues.

Before describing more benefits of singularities, let
us consider some of the trouble that they may cause.
Elementary examples occur in the evaluation of definite
integrals when the integrand is unbounded at some
point in the range of integration. For example, consider
$I = \int_0^1 x^\alpha\,dx$, where $\alpha$ is a parameter. Integrating,
$I = 1/(\alpha + 1)$ provided $\alpha > -1$; the integral diverges
for $\alpha \le -1$. (When $\alpha = -1$, use $\int x^{-1}\,dx = \log|x|$.)
When $\alpha < 0$, the integrand $f(x) = x^\alpha$ is unbounded
as $x \to 0$ through positive values. Nevertheless, even
though $f(x)$ has a singularity at $x = 0$, the singularity
is integrable when $-1 < \alpha < 0$, meaning that the area
under the graph, $y = f(x)$, $0 < x < 1$, is finite; the area
is infinite when $\alpha \le -1$.

In the example just described we were able to evaluate
the integral $I$ exactly, and this enabled us to examine
the effect of the parameter $\alpha$. In practice, we may
have to compute the value of an integral numerically 
using a quadrature rule (such as the trapezium rule or 
Simpson’s rule). When the integrand has a singularity, 
we are often obliged to use a specialized rule tailored 
to that specific kind of singularity or to use a substi- 
tution designed to remove the singularity. Generally, a 
blend of analytical and numerical techniques is needed 
so as to mollify the effects of the singularity. 
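
A short sketch of the substitution idea follows. The integrand $\cos(x)/\sqrt{x}$ on $(0, 1)$ is an illustrative choice, not taken from the text; it has an integrable singularity at $x = 0$. Substituting $x = t^2$ turns the integral into $\int_0^1 2\cos(t^2)\,dt$, whose integrand is smooth. The helper `midpoint` is a plain composite midpoint rule written for this example.

```python
import math


def midpoint(f, a, b, n):
    """Composite midpoint rule with n equal subintervals on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def singular(x):
    """Original integrand, unbounded as x -> 0 through positive values."""
    return math.cos(x) / math.sqrt(x)


def smooth(t):
    """Integrand after the substitution x = t^2 (dx = 2 t dt)."""
    return 2.0 * math.cos(t * t)


naive = midpoint(singular, 0.0, 1.0, 1000)
transformed = midpoint(smooth, 0.0, 1.0, 1000)
print(naive, transformed)
```

With the same number of function evaluations the two results differ noticeably: the rule applied to the singular integrand converges slowly, whereas the transformed integral is computed to many digits.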

Similar difficulties can occur in many other problems,
such as when solving boundary-value problems for a
partial differential equation [IV.3] (PDE). For a
specific example, consider Laplace’s equation [III.18],
$\nabla^2 u = 0$, in the region $r > 0$, $0 < \theta < \beta$, where $r$
and $\theta$ are plane polar coordinates and the angle $\beta$
satisfies $0 < \beta \le 2\pi$. We shall refer to this region as a
wedge. If we have boundary conditions $u = 0$ on both
sides of the wedge, $\theta = 0$ and $\theta = \beta$, one solution is
$u(r, \theta) = r^\nu \sin(\nu\theta)$, where $\nu = \pi/\beta$. This solution is
zero at the tip of the wedge (where $r = 0$), but derivatives
of $u$ may be unbounded there; for example, if we
take $\beta = \tfrac{3}{2}\pi$, then $\nu = \tfrac{2}{3}$ and the gradient of $u$ is
unbounded at $r = 0$. This singular behavior is typical at
corners and often degrades the performance of numerical
methods for solving PDEs. In fact, it is not even
necessary to have a corner, as a change in boundary
condition can have the same effect. For example, take
$\beta = \pi$, so that the wedge becomes the half-plane $y > 0$,
and then require that $u = 0$ at $\theta = 0$ and $\partial u/\partial\theta = 0$ at
$\theta = \pi$. Then one solution is $u = r^{1/2} \sin(\theta/2)$, and this
has an unbounded gradient at $r = 0$, which is the point
on the straight edge $y = 0$ where the boundary condition
changes. This phenomenon was built into the Motz
problem, devised in the 1940s for testing numerical
methods; it is still in use today.
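
The dependence of the corner singularity on the wedge angle can be sketched numerically. For $u = r^\nu \sin(\nu\theta)$ with $\nu = \pi/\beta$, a direct calculation in polar coordinates gives $|\nabla u| = \nu r^{\nu - 1}$, independent of $\theta$. The function names below are hypothetical, and the two wedge angles are illustrative values (a reentrant corner and a right-angled corner).

```python
import math


def wedge_solution(r, theta, beta):
    """u = r^nu sin(nu theta), nu = pi / beta, zero on both wedge sides."""
    nu = math.pi / beta
    return r ** nu * math.sin(nu * theta)


def gradient_norm(r, beta):
    """|grad u| = nu r^(nu - 1), from u_r and u_theta / r in polar form."""
    nu = math.pi / beta
    return nu * r ** (nu - 1.0)


reentrant = 1.5 * math.pi   # nu = 2/3 < 1: gradient blows up at the tip
convex = 0.5 * math.pi      # nu = 2 > 1: gradient tends to zero at the tip
for r in (1e-2, 1e-4, 1e-6):
    print(r, gradient_norm(r, reentrant), gradient_norm(r, convex))
```

The gradient is unbounded at the tip exactly when $\nu < 1$, that is, when $\beta > \pi$.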

Returning to the wedge, take $\beta = 2\pi$ so that the
wedge becomes the whole plane with a slit along the
positive $x$-axis. The two sides of the slit are defined by
$\theta = 0$ and $\theta = 2\pi$. Suppose that the boundary conditions
are $\partial u/\partial\theta = 0$ at $\theta = 0$ and $\theta = 2\pi$. One solution
of $\nabla^2 u = 0$ is then $u = r^{1/2} \cos(\theta/2)$, and this has an
unbounded gradient at $r = 0$. This solution is of interest
in the mechanics of solids [IV.32], where the slit
represents a crack, $u$ is a displacement, and the gradient
of $u$ is related to the elastic stresses. The linear
theory of elasticity predicts that the stresses are given
approximately by $r^{-1/2} K(\theta)$ when $r$ (the distance from
the crack tip) is small, where $K$ can be calculated. The
form of the singular behavior, namely $r^{-1/2}$, is given by
a local analysis near the crack tip, but the multiplier K, 
known as the stress intensity factor, requires knowledge 
of the geometry of the cracked body and the applied 
loads. Physically, the stresses cannot be infinite: the 
linear theory of elasticity breaks down at crack tips. 
A more accurate theory might involve plastic effects 
or consideration of atomic structures. However, useful 
predictions about when a cracked object will break as 
the applied loads are increased can be made by exam- 
ining K. This is at the heart of engineering theories 
of fracture mechanics. It is another example where the 
singular behavior can be exploited. 

Returning to mathematics, consider Laplace’s equation
in three dimensions, $\nabla^2 u = 0$. One solution
is

\[
G(x,y,z;x_0,y_0,z_0) = G(P,P_0) = R^{-1},
\]

where

\[
R = \{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2\}^{1/2}.
\]

Here we can regard $(x,y,z)$ and $(x_0,y_0,z_0)$ as being
the coordinates of points $P$ and $P_0$, respectively, and we
have $\nabla^2 G = 0$ for fixed $P_0$, provided $P \ne P_0$. As $R$ is the
distance between $P$ and $P_0$, we see that $G$ is singular as
$P \to P_0$. The function $G$ is an example of a Green function,
named after George Green. (It is conventional to
abandon the rules of English grammar and to speak of
“a Green’s function”; opposing this widespread misuse
appears futile.) One use of $G$ comes when we want to
solve Poisson’s equation, $\nabla^2 u = f$, where $f$ is a given
function. Thus

\[
u(P_0) = \int G(P,P_0) f(P)\,dP,
\]

where the integration is over all $P$. Notice that $G$ could
be replaced by $A/R + H(P,P_0)$, where $A$ is a constant
and $H$ is any solution of $\nabla^2 H = 0$. In principle, $H$ can be
chosen so that $G$ satisfies additional conditions, such as
boundary conditions; indeed, this was how Green conceived
of his function, as the electrostatic field inside a
conductor due to a point charge. However, in practice,
it is usual to use a simple G and then to impose bound- 
ary conditions on u by solving an integral equa- 
tion [IV.4]. We note that the alternative terminology 
fundamental solution is often used, meaning a simple 
singular solution of a governing PDE. 

Singularities can occur in many other contexts. We 
mention two. Suppose we want to solve an initial-value 
problem for a nonlinear PDE, where the initial state 
at time t = 0 is specified and the goal is to calculate 
the solution as t increases. It is then possible that the 
solution becomes unbounded as $t \to t_c$, where $t_c$ is
some finite critical time. This is known as “blow-up”:
there is a finite-time singularity at $t = t_c$ (see partial
differential equations [IV.3 §3.6]). There is much
interest within fluid dynamics [IV.28] in the existence 
or otherwise of finite-time singularities because it is 
thought that they may be relevant in understanding the 
nature of turbulence [V.21]. 

Finally, we cannot end an article on singularities 
without mentioning cosmology. It is generally accepted 
that the Big Bang theory gives a model for how the 
universe evolves, starting from an initial singularity. 
Some cosmologists believe that the universe will end 
with a singularity; the trouble caused by this singu- 
larity (if it exists) is unlikely to bother readers of this 
article.


II.32 The Singular Value Decomposition

Nicholas J. Higham 


One of the most useful matrix factorizations is the
singular value decomposition (SVD), which is defined for
an arbitrary rectangular matrix $A \in \mathbb{C}^{m \times n}$. It takes the
form

\[
A = U \Sigma V^*, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_p) \in \mathbb{R}^{m \times n}, \tag{1}
\]

where $p = \min(m, n)$, $\Sigma$ is a diagonal matrix with diagonal
elements $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$, and $U \in \mathbb{C}^{m \times m}$
and $V \in \mathbb{C}^{n \times n}$ are unitary. The $\sigma_i$ are the singular
values of $A$, and they are the nonnegative square roots
of the $p$ largest eigenvalues of $A^*A$. The columns of
$U$ and $V$ are the left and right singular vectors of $A$,
respectively.

Postmultiplying (1) by $V$ gives $AV = U\Sigma$ since $V^*V = I$,
which shows that the $i$th columns of $U$ and $V$ are
related by $A v_i = \sigma_i u_i$ for $i = 1\colon p$. Similarly,
$A^* u_i = \sigma_i v_i$ for $i = 1\colon p$. A geometrical interpretation of
the former equation is that the singular values of $A$
are the lengths of the semiaxes of the hyperellipsoid
$\{Ax : \|x\|_2 = 1\}$.

Assuming that $m \ge n$ for notational simplicity, from
(1) we have

\[
A^*A = V(\Sigma^*\Sigma)V^*, \tag{2}
\]

with $\Sigma^*\Sigma = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2)$, which shows that the
columns of $V$ are eigenvectors of the matrix $A^*A$ with
corresponding eigenvalues the squares of the singular
values of $A$. Likewise, the columns of $U$ are eigenvectors
of the matrix $AA^*$.

The SVD reveals a great deal about the matrix A and 
the key subspaces associated with it. The rank, r, of A 
is equal to the number of nonzero singular values, and 
the range and the null space of A are spanned by the 
first r columns of U and the last n - r columns of V, 
respectively. 

The SVD reveals not only the rank but also how close 
A is to a matrix of a given rank, as shown by a classic 
1936 theorem of Eckart and Young. 


Theorem 1 (Eckart-Young). Let $A \in \mathbb{C}^{m \times n}$ have the
SVD (1). If $k < r = \operatorname{rank}(A)$, then for the 2-norm and
the Frobenius norm,

\[
\min_{\operatorname{rank}(B) = k} \|A - B\| = \|A - A_k\| =
\begin{cases}
\sigma_{k+1}, & \text{2-norm}, \\[4pt]
\Bigl(\sum_{i=k+1}^{r} \sigma_i^2\Bigr)^{1/2}, & \text{F-norm},
\end{cases}
\]

where

\[
A_k = U D_k V^*, \qquad D_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0).
\]

Figure 1 Photo of a blackboard, inverted so that white
and black are interchanged in order to show more clearly
the texture of the board: (a) original 1067 x 1600 image;
(b) image compressed using rank-40 approximation $A_{40}$
computed from SVD.

In many situations the matrices that arise are nec- 
essarily of low rank but errors in the underlying data 
make the matrices actually obtained of full rank. The 
Eckart-Young result tells us that in order to obtain a 
lower-rank matrix we are justified in discarding (i.e., 
setting to zero) singular values that are of the same 
order of magnitude as the errors in the data. 

The SVD (1) can be written as an outer product
expansion

\[
A = \sum_{i=1}^{p} \sigma_i u_i v_i^*,
\]

and $A_k$ in the Eckart-Young theorem is given by the
same expression with $p$ replaced by $k$. If $k \ll p$ then
$A_k$ requires much less storage than $A$ and so the SVD
can provide data compression (or data reduction). As
an example, consider the monochrome image in figure
1(a) represented by a 1067 x 1600 array of RGB
values (R = G = B since the image is monochrome).
Let $A \in \mathbb{R}^{1067 \times 1600}$ contain the values from any one of
the three channels. The singular values of $A$ range from
$8.4 \times 10^4$ down to $1.3 \times 10^1$. If we retain only the singular
values down to the 40th, $\sigma_{40} = 2.1 \times 10^3$ (a somewhat
arbitrary cutoff since there is no pronounced gap in
the singular values), we obtain the image in figure 1(b).
The reduced SVD requires only 6% of the storage of
the original matrix. Some degradation is visible in the
compressed image (and more can be seen when it is
viewed at 100% size on screen), but it retains all the
key features of the original image. While this example
illustrates the power of the SVD, image compression
is in general done much more effectively by the JPEG
scheme [VII.7 §5].
A pleasing feature of the SVD is that the singular values
are not unduly affected by perturbations. Indeed,
if $A$ is perturbed to $A + E$ then no singular value of $A$
changes by more than $\|E\|_2$.

The SVD is a valuable tool in applications where
two-sided orthogonal transformations can be carried
out without “changing the problem,” as it allows the
matrix of interest to be diagonalized. Foremost among
such problems is the linear least-squares problem
[IV.10 §7.1] $\min_{x \in \mathbb{C}^n} \|b - Ax\|_2$.
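
Both statements can be illustrated numerically. The sketch below uses small random real matrices as stand-ins (an assumption made for the example) and assumes $A$ has full column rank, so that the least-squares solution is $x = V \Sigma^{-1} U^* b$ in terms of the reduced SVD; the result is compared with NumPy's own least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))
E = 1e-3 * rng.standard_normal((8, 5))   # illustrative perturbation
b = rng.standard_normal(8)

# Perturbation bound: |sigma_i(A + E) - sigma_i(A)| <= ||E||_2 for all i.
sigma = np.linalg.svd(A, compute_uv=False)
sigma_pert = np.linalg.svd(A + E, compute_uv=False)
max_shift = np.max(np.abs(sigma_pert - sigma))
bound = np.linalg.norm(E, 2)
print(max_shift, bound)

# Least squares via the reduced SVD (full column rank assumed).
U, s, Vh = np.linalg.svd(A, full_matrices=False)
x_svd = Vh.conj().T @ ((U.conj().T @ b) / s)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_svd - x_ref))
```

When $A$ is rank deficient or nearly so, dividing by small singular values is unsafe; a common remedy, in the spirit of the Eckart-Young discussion, is to discard singular values below a tolerance before inverting.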

The SVD was first derived by Beltrami in 1873. The 
first reliable method for computing it was published 
by Golub and Kahan in 1965; this method applies two- 
sided unitary transformations to A and does not form 
and solve the equation (2), or its analogue for AA*. 
Once software for computing the SVD became readily 
available, in the 1970s, the use of the SVD proliferated. 
Among the wide variety of uses of the SVD are for text 
mining [VII.24], deciphering encrypted messages, and 
image deblurring. 

Further Reading 

Eldén, L. 2007. Matrix Methods in Data Mining and Pattern
Recognition. Philadelphia, PA: SIAM.

Golub, G. H., and C. F. Van Loan. 2013. Matrix Computations,
4th edn. Baltimore, MD: Johns Hopkins University Press.

Moler, C. B., and D. Morrison. 1983. Singular value analysis
of cryptograms. American Mathematical Monthly 90:78-


II.33 Tensors and Manifolds

Mark R. Dennis 


We know that the surface of the Earth is curved, despite
the fact that it appears flat. This is easily understood
from the fact that the Earth’s radius of curvature
is over 6000 km, vast on a human scale. This picture
motivates the mathematical definition of a manifold
(properly a Riemannian manifold): a space that
appears to be Euclidean locally in a neighborhood of
each point (or pseudo-Euclidean, as defined below) but
globally may have curvature, such as the surface of a
sphere.

Manifolds are most simply defined in terms of the 
coordinate systems on them, and of course there are 
uncountably many such systems. Tensors are mathe- 
matical objects defined on manifolds, such as vector 
fields, which are in a natural sense independent of the 
coordinate system used to define them and their com- 
ponents. The importance of tensors in physics stems 
from the fact that the description of physical phenom- 
ena ought to be independent of any coordinate system 
we choose to impose on space and hence should be 
tensorial. 

Our description of manifolds and tensors will be 
rather informal. For instance, we will picture vector 
or tensor fields as defining a vector or tensor at 
each point of the manifold itself rather than more 
abstractly as a section of the appropriate tangent 
bundle. In applications, tensors are frequently used 
in the study of general relativity and cosmol- 
ogy [IV.40], which involves describing the dynam- 
ics of matter and fields using any reference frame 
(coordinate system), assuming space-time is a four- 
dimensional pseudo-Riemannian curved manifold, as 
described below. 

An $n$-dimensional manifold is a topological space
such that a neighborhood around each point is equivalent
(i.e., homeomorphic) to a neighborhood of a point
in $n$-dimensional Euclidean space [I.2 §19.1]. More
formally, it can be defined as the set of smooth coordi- 
nate systems that can be defined on the space, together 
with transformation rules between them. In a neigh- 
borhood around each point, a coordinate system can 
always be found that looks locally Cartesian, regardless 
of any global curvature (which can cause the system to 
fail to be Cartesian at other points). 

In practice, each coordinate system on a Riemannian
manifold has a metric, defined below, which is
possibly position dependent. This enables inner products
between pairs of vectors at each point in the
space to be defined. The situation is complicated by the
fact that, at each point, most coordinate systems are
oblique, as in figure 1. The following description uses
“index notation,” which suggests the explicit choice of
a coordinate system. However, such expressions are
always valid since all tensorial expressions should be
the same in every coordinate system; that is, they are
covariant. Explicitly coordinate-free formulations exist,
although they require a greater familiarity with differential
geometry than is assumed here and require the
definition of much new notation.

Figure 1 The components of a vector $v$ in an oblique basis
of unit vectors $\{e_1, e_2\}$: contravariant components $v^1$, $v^2$
follow from the parallelogram rule, and covariant components
$v_1$, $v_2$ are found by dropping perpendiculars onto the
basis vectors. For basis vectors with arbitrary lengths, each
covariant component is the length shown on the diagram
multiplied by the norm of the corresponding basis vector.

Consider a vector $v$ in $n$-dimensional Euclidean space
with oblique coordinate axes locally defined by linearly
independent but not necessarily orthonormal basis vectors
$e_j$, $j = 1, \dots, n$. The contravariant components of
$v$, represented with upper indices $v^j$ for $j = 1, \dots, n$,
are defined by the parallelogram law; that is,

\[
v = \sum_{j=1}^{n} v^j e_j \qquad (v^j \text{ contravariant}).
\]

Throughout this article, we will assume that the symbol
$v^j$ represents the vector itself, and not simply the
components. This will also apply to objects with multiple
indices.

The covariant components $v_j$, represented with
lower indices, are those defined by the scalar product
with the basis vectors (i.e., formally, covariant components
are the components of the vectors in the dual
space to the tangent vector space, with “covariant” here
not to be confused with the sense previously), i.e.,

\[
v_j = v \cdot e_j, \quad j = 1, \dots, n \qquad (v_j \text{ covariant}).
\]

If the $e_j$ are orthonormal, then the covariant components
$v_j$ are the same as the contravariant $v^j$. The
components of the metric tensor $g_{ij}$ for the coordinate
system at this point are then given by $g_{ij} = e_i \cdot e_j$,
with $g^{ij} = (g_{ij})^{-1}$, the inverse of $g_{ij}$ considered as a
matrix.


All of the geometry of the local basis is encoded in
$g_{ij}$; for instance, covariant and contravariant components
are found from each other using the metric tensor
to “raise” and “lower” indices, such as $v^i = g^{ij} v_j$
and $v_i = g_{ij} v^j$. In these expressions, and for the
remainder of the article, we adopt the Einstein summation
convention; that is, when an index symbol is
repeated, once each in an upper (contravariant) and a
lower (covariant) position, we assume that the indices
are summed from 1 to $n$. The summed index symbol
$i, j, \dots$ is itself arbitrary, i.e., a “dummy index.”
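
The relations between the two kinds of components can be checked on a small example. The sketch below uses an illustrative two-dimensional oblique basis (unit vectors at 60 degrees) and an arbitrary vector, neither of which comes from the text; the rows of the array `e` hold the basis vectors $e_1$ and $e_2$ in Cartesian form.

```python
import numpy as np

# Rows are the oblique basis vectors e_1, e_2 (illustrative choice).
e = np.array([[1.0, 0.0],
              [np.cos(np.pi / 3), np.sin(np.pi / 3)]])
v = np.array([2.0, 1.0])            # the vector, in Cartesian components

g = e @ e.T                         # metric g_ij = e_i . e_j
g_inv = np.linalg.inv(g)            # g^ij, the matrix inverse of g_ij

v_co = e @ v                        # covariant:     v_j = v . e_j
v_contra = np.linalg.solve(e.T, v)  # contravariant: v = sum_j v^j e_j

print(g_inv @ v_co, v_contra)       # raising an index: v^i = g^ij v_j
print(g @ v_contra, v_co)           # lowering an index: v_i = g_ij v^j
print(v_co @ v_contra, v @ v)       # contraction v_i v^i = |v|^2
```

The contraction of the covariant with the contravariant components reproduces the ordinary squared length, even though neither set of components alone does so in an oblique basis.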

This procedure generalizes the inner product to
oblique axes, defining inner multiplication, or index
contraction. As usual in linear algebra, objects with multiple
indices can be defined, e.g., $T^{ij} = u^i v^j + p^i q^j$ for
vectors $u^i$, $v^j$, $p^i$, $q^j$. Forming a product of objects
such as $u^i v_j$ without contracting the indices is referred
to as outer multiplication, and in coordinate-free form,
$u^i v_j$ is written as $u \otimes v$.

The components of the position vector, in terms of
the chosen coordinate system, are denoted $x^i$, i.e., in
Cartesian coordinates $x^1 = x$, $x^2 = y$, .... Differentiation
with respect to a set of contravariant indices is in
fact covariant, as can easily be verified from a Taylor
expansion by a small displacement $\delta x^i$ of a scalar field
$f(x^i)$ around a chosen point $x^i$:

\[
f(x^i + \delta x^i) = f(x^i) + \delta x^i\,\partial_i f + O((\delta x)^2).
\]

The term $\delta x^i\,\partial_i f$ is first order in $\delta x^i$, so it must
be the same in all coordinate systems; since $\delta x^i$ is
contravariant, $\partial_i = \partial/\partial x^i$ must be covariant.

As coordinate systems on curved manifolds are not 
typically orthonormal everywhere, we must assume in 
general that the components of a vector are different 
from those of its dual (which is used to take inner prod- 
ucts). Any vector object may be represented with upper 
or lower indices, which are related by the metric tensor. 
Objects with multiple indices, such as the metric ten- 
sor itself, can also have both indices upper, both lower, 
or a mixture. In fact, since $g^{ij}$ is the matrix inverse of
$g_{ij}$, the matrix representation of the mixed metric tensor
$g_i{}^k = g_{ij} g^{jk}$ is the identity matrix and so is often
written $\delta_i^k$, to be understood as a generalization of the
usual Kronecker symbol with mixed upper and lower 
indices. 

With this formalism an alternative coordinate system
has components $x^{i'}$, the different system being represented
by the prime on the coordinate symbol; at each
point, the linear transformation between the systems is
given by the Jacobian $\partial x^{i'}/\partial x^j = \Lambda^{i'}{}_{j}$; therefore, a
vector’s contravariant components are transformed to the
new system by $v^{i'} = \Lambda^{i'}{}_{j} v^j$. The inverse transformation
$\Lambda^{j}{}_{i'}$ not only transforms contravariant coordinates back
from the primed to the unprimed system but also transforms
unprimed covariant components to the primed
system, i.e., $v_{i'} = \Lambda^{j}{}_{i'} v_j$. This equivalence guarantees
that the squared length of the vector is independent
of the coordinate system, i.e.,

\[
v_{i'} v^{i'} = v_j \Lambda^{j}{}_{i'} \Lambda^{i'}{}_{k} v^k = v_j \delta^j_k v^k = v_j v^j,
\]

where $\delta^j_k$ is the Kronecker symbol with one upper index
and one lower index (both unprimed), which may be
thought of as representing the identity transformation
from the unprimed coordinate system to itself.
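
The invariance of the contraction can be verified numerically for a linear change of coordinates. In the sketch below the matrix `Lam` plays the role of the Jacobian at one point (entry `[i', j]`), and both it and the component arrays are arbitrary illustrative values.

```python
import numpy as np

Lam = np.array([[2.0, 1.0],
                [0.5, 3.0]])          # Jacobian, entry [i', j]
Lam_inv = np.linalg.inv(Lam)          # inverse Jacobian, entry [j, i']

v_contra = np.array([1.0, -2.0])      # contravariant components v^j
v_co = np.array([0.5, 4.0])           # covariant components v_j

v_contra_new = Lam @ v_contra         # v^{i'} = Lam[i', j] v^j
v_co_new = Lam_inv.T @ v_co           # v_{i'} = Lam_inv[j, i'] v_j

print(v_co @ v_contra, v_co_new @ v_contra_new)
```

The two printed numbers agree because the Jacobian and its inverse cancel in the contraction, exactly as in the displayed identity.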

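A general tensor is therefore a (possibly) multicomponent
object, all of whose components transform
between coordinate systems according to the local
Jacobian transformation, namely,

\[
T^{i'j'\cdots}{}_{k'\ell'\cdots}
  = \Lambda^{i'}{}_{i}\,\Lambda^{j'}{}_{j} \cdots \Lambda^{k}{}_{k'}\,\Lambda^{\ell}{}_{\ell'} \cdots
    T^{ij\cdots}{}_{k\ell\cdots}.
\]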

A scalar is a tensorial object with no indices, a vec- 
tor has one index, and in general a tensor with m dis- 
tinct indices is said to have rank m. The ordering of 
the indices of a tensor is of course important, and 
this is maintained regardless of whether they appear 
in upper or lower positions. The principle of covariance 
(either special covariance or general covariance), due to 
Einstein, states that physical laws should be express- 
ible in tensorial form, that is, they should be covariant 
under the class of coordinate transformations being 
considered. 

Although it is tempting to identify tensors such 
as gij as arrays of numbers in particular coordinate 
systems, a tensor on a manifold is in fact properly 
defined independently of any particular coordinate sys- 
tem, and instead its tensorality follows from its indices 
transforming in the appropriate way under coordinate 
transformations. 

Coordinate derivatives $\partial_i v_j$ (also written $v_{j,i}$) should
not be expected to be tensorial, as they follow the possibly
curved coordinate lines of the arbitrarily chosen
coordinate system. The covariant derivative

\[
\nabla_i v_j = v_{j;i} = \partial_i v_j - \Gamma^k_{ij} v_k \tag{1}
\]

is tensorial, where the connection coefficients or Christoffel
symbols $\Gamma^k_{ij}$ denote a nontensorial object defined
in terms of coordinate derivatives of the metric tensor
($\Gamma^k_{ij} = \tfrac{1}{2} g^{k\ell}(\partial_j g_{i\ell} + \partial_i g_{\ell j} - \partial_\ell g_{ij})$), whose
combination with the coordinate derivative in (1) does indeed
yield a tensor. Covariant derivatives of general tensors
pick up a $\Gamma$ term for each index (e.g.,
$T_{ij;\ell} = T_{ij,\ell} - \Gamma^k_{i\ell} T_{kj} - \Gamma^m_{j\ell} T_{im}$). The derivative of a scalar is
therefore automatically tensorial.

No part of the discussion up to this point explic- 
itly involves the manifold’s curvature. Indeed, a com- 
plicated coordinate system may have nonzero connec- 
tion coefficients and yet describe Euclidean space. The 
tensorality of the covariant derivative is an expression 
of the parallel transport of a vector along a curve. A 
geodesic on a manifold is a curve that is as straight as 
possible, given that the manifold may be curved. Such 
a curve can be constructed by parallel transporting a 
vector as a tangent vector along a curve; a geodesic 
curve $z^i(s)$, parametrized by $s$, therefore satisfies the
geodesic equation

\[
\frac{d^2 z^k}{ds^2} + \Gamma^k_{ij}\,\frac{dz^i}{ds}\,\frac{dz^j}{ds} = 0,
\]

which generalizes the equation for a Euclidean straight 
line in Cartesian coordinates (for which the connection 
is zero), and is in general nonlinear. 
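
The remark that a coordinate system may have nonzero connection coefficients and yet describe Euclidean space can be illustrated with plane polar coordinates, for which the metric is $\operatorname{diag}(1, r^2)$. The sketch below evaluates $\Gamma^k_{ij} = \tfrac{1}{2} g^{k\ell}(\partial_j g_{i\ell} + \partial_i g_{\ell j} - \partial_\ell g_{ij})$ with central differences; the function names and the evaluation point are illustrative choices.

```python
import numpy as np


def metric_polar(x):
    """Metric of the Euclidean plane in polar coordinates x = (r, theta)."""
    r, _theta = x
    return np.array([[1.0, 0.0],
                     [0.0, r ** 2]])


def christoffel(metric, x, h=1e-5):
    """Connection coefficients gamma[k, i, j] from a metric function."""
    x = np.asarray(x, dtype=float)
    n = x.size
    g_inv = np.linalg.inv(metric(x))
    # dg[l, i, j] approximates the derivative of g_ij with respect to x^l.
    dg = np.empty((n, n, n))
    for l in range(n):
        step = np.zeros(n)
        step[l] = h
        dg[l] = (metric(x + step) - metric(x - step)) / (2.0 * h)
    gamma = np.empty((n, n, n))
    for k in range(n):
        for i in range(n):
            for j in range(n):
                gamma[k, i, j] = 0.5 * sum(
                    g_inv[k, l] * (dg[j, i, l] + dg[i, l, j] - dg[l, i, j])
                    for l in range(n)
                )
    return gamma


gamma = christoffel(metric_polar, [2.0, 0.3])
print(gamma[0, 1, 1])   # Gamma^r_{theta theta}, analytically -r
print(gamma[1, 0, 1])   # Gamma^theta_{r theta}, analytically 1/r
```

These nonzero values reflect only the curvature of the coordinate lines; straight lines in the plane still satisfy the geodesic equation when it is written with these coefficients.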

Curvature exists around a point on a manifold when
there exists a vector that, upon being parallel transported
from the point around some infinitesimal closed
path back to the point, does not return to its original
direction. This is equivalent to the failure of covariant
derivatives to commute. For any vector field $v_i$,

\[
\nabla_j \nabla_k v_i - \nabla_k \nabla_j v_i = R^{\ell}{}_{ijk} v_\ell, \tag{2}
\]

defining the Riemann curvature tensor $R^{\ell}{}_{ijk}$. This tensor
has rank 4: the indices $j$ and $k$ are related to the
plane defined by the derivatives (i.e., the closed path),
and $i$ is related to the original direction of the vector,
whose change in direction is related to $\ell$. The Riemann
curvature tensor has many symmetries, such as
$R_{\ell ijk} = -R_{i\ell jk} = R_{jk\ell i}$, with the result that, although a general
rank-4 tensor has $n^4$ components, only $n^2(n^2 - 1)/12$
of the Riemann tensor’s components are independent:
one for $n = 2$, six for $n = 3$, twenty for $n = 4$,
etc. A manifold where the curvature tensor vanishes
everywhere is said to be flat.

The symmetric, rank-2 “trace” of the Riemann tensor,
$R_{ik} = R^{j}{}_{ijk}$, is known as the Ricci tensor. It plays a
crucial role in the theory of general relativity, as the part
of the curvature of the space-time manifold affected by
energy-momentum and the cosmological constant. Its
trace $R = g^{ik} R_{ik}$ is called the curvature scalar.

In many applications it is mathematically easier to
restrict the class of coordinate systems being considered
in order to gain mathematical tractability at
the cost of generality. For example, in problems of
classical mechanics [IV.19] in three-dimensional flat
Euclidean space described by a position vector $r$, it
is conventional to work with Cartesian tensors; all
coordinate systems are Cartesian, and the transformations
between them are simply rotations and translations.
In this case, covariant and contravariant indices
transform in the same way, so the metric becomes the
Kronecker symbol in all systems, $g_{ij} = \delta_{ij}$ (the indices
are both lower here, appropriate for a metric tensor).

Examples of Cartesian tensors include the inertia
tensor of a solid body with density $\rho(r)$,

\[
I_{ij} = \int_{\text{body}} \rho(r)\,(r_k r^k \delta_{ij} - r_i r_j)\,dV,
\]

which relates the rotating body’s angular momentum
$L_i = I_{ij}\omega^j$ to its angular velocity $\omega^j$. Another example
that is important for continuum mechanics is a body’s
Cauchy stress tensor $\sigma_{ij}$; at each point, $f_j = \sigma_{ij} n^i$ is
the force acting on a surface perpendicular to the unit
vector $n^i$.
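
A discrete analogue makes the inertia tensor concrete. The sketch below replaces the integral over a continuous body by a sum over point masses, which is an assumption made for the example; the masses, positions, and angular velocity are arbitrary illustrative values, and `inertia_tensor` is a hypothetical helper name.

```python
import numpy as np


def inertia_tensor(masses, positions):
    """Sum of m * (|r|^2 delta_ij - r_i r_j) over point masses."""
    inertia = np.zeros((3, 3))
    for m, r in zip(masses, positions):
        r = np.asarray(r, dtype=float)
        inertia += m * ((r @ r) * np.eye(3) - np.outer(r, r))
    return inertia


masses = [2.0]
positions = [[1.0, 0.0, 0.0]]       # one mass on the x-axis
I = inertia_tensor(masses, positions)

omega = np.array([0.0, 0.0, 3.0])   # rotation about the z-axis
L = I @ omega                       # angular momentum L_i = I_ij omega_j
print(I)
print(L)
```

For a single mass on the $x$-axis the tensor is diagonal with no resistance to rotation about that axis; in general $I_{ij}$ is not diagonal, and $L$ need not be parallel to $\omega$.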

The tensorial framework for space-time was introduced
into physics by Einstein, who applied it to the
four-dimensional manifold of space-time events, in
which all vectors become four-dimensional 4-vectors.
In this formalism, time becomes a spatial coordinate
$ct = x^0$ (or, in older literature, $x^4$), where $c$
is the speed of light (approximately $3 \times 10^8$ meters
per second), so that physical laws satisfy the principle
of special relativity (or special covariance); they
take the same form in all inertial frames, which are
regarded effectively as orthonormal coordinate systems
on flat space-time. This generalization of Newton’s
first law of mechanics was necessary to accommodate
Maxwell’s equations of electrodynamics, famously
requiring all inertial observers to agree on the value
of $c$. In Einstein’s theory, all inertial observers agree
on the value of $x^2 + y^2 + z^2 - c^2 t^2$, written more
compactly via the summation convention as $\eta_{ab} x^a x^b$,
or $x_a x^a$; this is the “space-time separation” of the
event $x^a = (ct, x, y, z)$ from the space-time origin,
where $\eta_{ab}$ is the Minkowski tensor with the form
$\operatorname{diag}(-1, +1, +1, +1)$. Conventionally, indices $a, b, \dots$
are used to denote four-dimensional space-time indices
$0, \dots, 3$, whereas indices $i, j, \dots$ are used to denote
spatial indices 1, 2, 3. The different continuous symmetries
of Euclidean, Newtonian, and Minkowski space-time
are discussed in invariants and conservation
laws [II.21].

The four-dimensional flat manifold that admits the
Minkowski tensor $\eta_{ab}$ at each point is called Minkowski
space. Notably, $\eta_{ab}$ has some negative as well as positive entries (as such it is pseudo-Euclidean); Minkowski
space has many similarities geometrically to Euclidean space. However, the crucial difference is that the
squared length $x^a x_a$ of a space-time 4-vector may be
positive, negative, or zero.

In special relativity, therefore, space-time is repre- 
sented by Minkowski space, a flat pseudo-Riemannian 
manifold. Space-time coordinate transformations on 
Minkowski space are called Lorentz transformations. 
The choice of the overall sign of the tensor $\eta_{ab}$ (i.e.,
$\operatorname{diag}(-1, +1, +1, +1)$ or $\operatorname{diag}(+1, -1, -1, -1)$) is a convention, and it is common, particularly in the literature on relativistic quantum theory, to use the opposite sign to that chosen here (it is also common to represent the four space-time indices by Greek characters,
reserving Roman characters for the space-like
indices). The signature of a metric is the sum of signs
of its entries in a coordinate system when it is diag- 
onal: in a Euclidean space it is n, but in Minkowski 
space it is 2 (or -2, depending on the sign convention). 
The neighborhood of each point in a Riemannian man- 
ifold admits a coordinate system that is locally Euclid- 
ean. Manifolds in which the neighborhood of each point 
admits a coordinate system with metric signature not 
equal to n, such as Minkowski space, are said to be 
pseudo-Riemannian. 
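
The invariance of the space-time separation under a Lorentz transformation can be checked numerically. The sketch below assumes the standard boost along the x-axis with speed $v = \beta c$, a formula not written out in the text, and uses the sign convention $\operatorname{diag}(-1, +1, +1, +1)$; the function names and the sample event are illustrative.

```python
import math


def interval(event):
    """Return x^2 + y^2 + z^2 - (ct)^2 for an event given as (ct, x, y, z)."""
    ct, x, y, z = event
    return x * x + y * y + z * z - ct * ct


def boost_x(event, beta):
    """Apply a Lorentz boost along x with speed v = beta * c, where |beta| < 1."""
    gamma = 1.0 / math.sqrt(1.0 - beta * beta)
    ct, x, y, z = event
    return (gamma * (ct - beta * x), gamma * (x - beta * ct), y, z)


event = (2.0, 1.0, 0.5, -3.0)
boosted = boost_x(event, 0.6)
# The two values agree up to rounding, although the coordinates differ.
print(interval(event), interval(boosted))
```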

In general relativity, Einstein proposed the principle
of general covariance, i.e., that physical laws be tensorial with respect to all observers (i.e., all frames on
space-time), not just the inertial ones (for which tensors
in special relativity become analogous to the restriction in Euclidean space to Cartesian tensors). In general
relativity, therefore, space-time is no longer required
to be flat (i.e., the neighborhood of each point locally
looks like Minkowski space), allowing for the possibility that at larger length scales, space-time itself may be
curved. The Einstein field equations [III.10] relate
the Ricci curvature of a manifold to the distribution of 
matter, energy, and momentum in space-time, thereby 
fundamentally tying the nature of space and time to the 
phenomena that occupy it. 

Further Reading 

Bishop, R. L., and S. I. Goldberg. 1980. Tensor Analysis on Manifolds. New York: Dover.

Lorentz, H. A., A. Einstein, H. Minkowski, and H. Weyl. 1952. The Principle of Relativity. New York: Dover.

Schutz, B. 1980. Geometrical Methods of Mathematical Physics. Cambridge: Cambridge University Press.





II.34 Uncertainty Quantification

Youssef Marzouk and Karen Willcox 


Uncertainty quantification (UQ) involves the quantita- 
tive characterization and management of uncertainty 
in a broad range of applications. It employs both com- 
putational models and observational data, together 
with theoretical analysis. UQ encompasses many dif- 
ferent tasks, including uncertainty propagation, sensi- 
tivity analysis, statistical inference and model calibra- 
tion, decision making under uncertainty, experimental 
design, and model validation. UQ therefore draws upon 
many foundational ideas and techniques in applied 
mathematics and statistics (e.g., approximation theory, 
error estimation, stochastic modeling, and Monte Carlo 
methods) but focuses these techniques on complex 
models (e.g., of physical or sociotechnical systems) that 
are primarily accessible through computational simulation. UQ has become an essential aspect of the development and use of predictive computational simulation
tools. 

Modeling endeavors may contain multiple sources 
of uncertainty. A widely used classification contrasts 
aleatory or irreducible uncertainty, resulting from 
some inherent variability, with epistemic or reducible 
uncertainty that reflects a lack of knowledge. The more 
detailed classification of uncertainties below was pro- 
posed in the seminal work of Kennedy and O’Hagan, 
and it provides a useful foundation on which to estab- 
lish mathematical approaches. 

Parameter uncertainty refers to uncertain inputs to 
or parameters of a model. For example, parameters rep- 
resenting physical properties (permeability, porosity) 
of the Earth may be unknown in a computational model 
of the subsurface. Parametric variability captures the 
uncertainty due to uncontrolled or unspecified condi- 
tions in inputs or parameters. For example, aircraft 
design must account for uncertain operating condi- 
tions that arise from varying atmospheric conditions 
and gust encounters. Residual variability describes the 
uncertainty due to intrinsic random variation in the 
underlying physics being modeled or induced by behav- 
ior at physical scales that are not resolved by the model. 
Examples of residual variability include the uncertainty 
due to using models of turbulence that approximate 
the effects of the small scales that are not resolved. 
Code uncertainty refers to the uncertainty associated
with not knowing the output of a computer model given
any particular configuration until the code is run. For
example, a Gaussian process emulator used as a surrogate for a higher-fidelity model has code uncertainty at
parameter values away from those at which the emulator was calibrated. Observation error is the uncertainty associated with actual observations and measurements; it plays an important role in model calibration and inverse problems. Model discrepancy captures the uncertainty due to limitations of and assumptions in the model. This uncertainty is present in almost
every model used in science and engineering.

This article takes a probabilistic view of all these 
sources of uncertainty. While other approaches, e.g., 
interval analysis, fuzzy set theory, and Dempster- 
Shafer theory, have also been employed to analyze par- 
ticular UQ problems, the probabilistic approach to UQ 
offers a particularly rich and flexible structure, and has 
seen extensive development over the past two decades. 

The challenge of model validation is closely related 
to UQ. A recent National Academy of Sciences report 
defines validation as “the process of determining the 
degree to which a model is an accurate representation 
of the real world from the perspective of the intended 
uses of the model.” This process therefore involves 
assessing how the sources of uncertainty described 
above contribute to any prediction of interest. Quan- 
tifying model discrepancy is a particularly challeng- 
ing aspect of the validation process; posterior predic- 
tive checks, cross validation, and other ways of assess- 
ing model error are important in this regard. In situa- 
tions where data are few in number or where the model 
is intended for use in an extrapolatory setting, how- 
ever, traditional techniques for model checking may 
not apply. This is important and somewhat uncharted 
territory, for which new mathematical and statistical 
approaches are being developed. 

1 Characterizing Uncertainty 

Probabilistic approaches characterize uncertain quan- 
tities using probability density or mass functions. 
These approaches require the specification of suffi- 
cient information to endow model inputs with prob- 
ability distributions. This information can be drawn 
from various types of prior knowledge, including his- 
torical databases, physical constraints, previous com- 
putations, and the elicitation of expert opinion. The 
principle of maximum entropy is sometimes used to 
map from a few specified characteristics of the uncer- 
tainty (e.g., minimum and maximum values, or mean, 





variance, and other moments) to a probability distri- 
bution by determining the maximum entropy distribu- 
tion that is compatible with the specified constraints. 
Inferential approaches, described below, can update 
probabilistic characterizations of uncertain quantities 
by conditioning these quantities on new observational 
data. 

2 Forward Propagation of Uncertainty 

Forward propagation of uncertainty addresses how 
uncertainty in model inputs translates into uncertainty 
in model outputs. The goal is often to provide distribu- 
tional information (output means and variances, event 
probabilities) in support of uncertainty assessment or 
decision making. The most flexible method for estimat- 
ing distributional information is Monte Carlo simula- 
tion, which draws random samples from the joint dis- 
tribution of inputs and evaluates the output value cor- 
responding to each input sample. Expectations over the 
output distribution are then estimated from these samples. Monte Carlo estimates converge slowly, typically
requiring many samples to achieve acceptable levels
of accuracy; the error in a simple Monte Carlo estimator converges as $O(N^{-1/2})$, where $N$ is the number of
Monte Carlo samples. Variance-reduction techniques,
such as importance sampling and the use of control 
variates, are therefore of great interest in UQ. Also, 
quasi-Monte Carlo approaches can offer faster conver- 
gence rates than random (or pseudorandom) sampling, 
while maintaining good scalability with respect to the 
number of uncertain parameters. 
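
The slow convergence of simple Monte Carlo can be seen in a small experiment. The following Python sketch is only an illustration under stated assumptions: the "model" is the hypothetical function $e^x$ of a single standard normal input, chosen because its exact mean $e^{1/2}$ is known, and the function names are not from any particular UQ library.

```python
import math
import random


def model(x):
    """Hypothetical model output: a smooth nonlinear function of one input."""
    return math.exp(x)


def monte_carlo_mean(n, rng):
    """Estimate E[model(X)] for X ~ N(0, 1) from n random input samples."""
    return sum(model(rng.gauss(0.0, 1.0)) for _ in range(n)) / n


rng = random.Random(0)
exact = math.exp(0.5)  # E[exp(X)] = exp(1/2) for a standard normal X
for n in (100, 1_000, 10_000, 100_000):
    estimate = monte_carlo_mean(n, rng)
    # The error shrinks roughly like N^(-1/2), with random fluctuations.
    print(n, estimate, abs(estimate - exact))
```

Each factor of 100 in the number of samples buys about one additional decimal digit of accuracy, which is why variance reduction matters when each model evaluation is expensive.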

Approaches that rely on polynomial chaos and other 
spectral representations of random quantities, such as 
stochastic Galerkin and stochastic collocation meth- 
ods, are important and widely used alternatives to 
Monte Carlo simulation. By exploiting the regularity 
of the input-output relationship induced by a model, 
these approaches can provide greater accuracy and effi- 
ciency than Monte Carlo simulation for moderate num- 
bers of parameters. They can also provide more infor- 
mation about this input-output relationship, including 
sensitivity indices and approximations that are use- 
ful in inverse problems (see below). Stochastic spectral 
techniques have been developed for many classes of 
partial differential equations with random parameters
or input data, as well as for more general black-box
models. Methods for identifying and exploiting sparsity in polynomial representations, for evaluating the
model on sparse grids in high-dimensional parameter
spaces, and for performing dimension reduction have
proven quite successful in expanding the range and 
size of problems to which spectral techniques can be 
successfully applied. 

3 Sensitivity Analysis 

Sensitivity analysis aims to elucidate how the uncertain 
inputs of a system contribute to system output uncer- 
tainty. Variance-based sensitivity analysis apportions 
the variance of an output quantity of interest among 
contributions from each of the system inputs and their 
interactions. This apportionment is based on the law 
of total variance, which for a given output quantity of
interest $Q$ and a given factor $X_i$ is written as

$$\operatorname{Var}(Q) = \mathbb{E}[\operatorname{Var}(Q \mid X_i)] + \operatorname{Var}(\mathbb{E}[Q \mid X_i]).$$

From this, a main effect sensitivity index is defined as
the expected fraction of the variance of $Q$ that would
be removed if the variance of $X_i$ were reduced to zero:

$$S_i = \frac{\operatorname{Var}(\mathbb{E}[Q \mid X_i])}{\operatorname{Var}(Q)}.$$

The results of a global sensitivity analysis can be 
used for factor prioritization (identifying those inputs 
where future research is expected to bring the largest 
reduction in variance) and factor fixing (identifying 
those inputs that may be fixed to a deterministic 
value without substantially affecting probabilistic mod- 
el outputs). Algorithms for computing main and total 
effect sensitivity indices may rely on Monte Carlo or 
quasi-Monte Carlo sampling, sparse quadrature, or 
post-processing of the stochastic spectral expansions 
described above. 
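
A main effect index can be estimated by sampling. The sketch below uses one common Monte Carlo device, often called a pick-freeze estimator, which is an assumption of this example and not a method named in the text: the covariance between two model runs that share input $i$ but have an independently resampled other input equals $\operatorname{Var}(\mathbb{E}[Q \mid X_i])$. The linear model with two independent standard normal inputs is hypothetical and is chosen because its exact indices are 0.9 and 0.1.

```python
import random


def model(x1, x2):
    """Hypothetical model: Q = 3 X1 + X2, so Var(Q) = 9 + 1 = 10."""
    return 3.0 * x1 + 1.0 * x2


def main_effect_index(which, n, rng):
    """Pick-freeze Monte Carlo estimate of Var(E[Q | X_i]) / Var(Q)."""
    qa, qb = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y1, y2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        qa.append(model(x1, x2))
        # Keep ("freeze") input `which` and resample the other input.
        qb.append(model(x1, y2) if which == 1 else model(y1, x2))
    mean_a = sum(qa) / n
    mean_b = sum(qb) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(qa, qb)) / (n - 1)
    var = sum((a - mean_a) ** 2 for a in qa) / (n - 1)
    return cov / var


rng = random.Random(0)
for which in (1, 2):
    print(which, main_effect_index(which, 20_000, rng))
```

In the language of factor prioritization, the first input dominates here, and the second is a candidate for factor fixing.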

4 Inverse Problems and Data Assimilation

Inverse problems [IV.15] arise from indirect observation of a quantity of interest. For example, one may
wish to estimate certain parameters of a system given 
limited and noisy observations of the system’s out- 
puts. From the UQ perspective, one seeks not only a 
point estimate of the parameters but also a quantitative 
assessment of their uncertainty. UQ in inverse prob- 
lems can therefore be addressed through the perspec- 
tive of statistical inference. The statistical inference 
problems that arise in this context typically involve 
a likelihood function that contains a complex physi- 
cal model (described by ordinary or partial differential 
equations) and inversion “parameters” that are in fact 
functions, and hence are infinite dimensional.

The Bayesian statistical approach provides a natural route to quantifying uncertainty in inverse problems by characterizing the posterior probability distribution. And yet the application of Bayesian inference to inverse problems raises a number of important computational challenges and foundational issues.
From the computational perspective, important tasks
include the design of Markov chain Monte Carlo sampling schemes for complicated and high-dimensional
posterior distributions; the construction of controlled
approximations to the likelihood function or forward 
model, whether through statistical emulation, model 
reduction, or function approximation; and the devel- 
opment of more efficient alternative algorithms, includ- 
ing variational approaches, for characterizing the pos- 
terior distribution. From the foundational perspective, 
important efforts center on incorporating model error 
or model discrepancy into the solution of the inverse 
problem and subsequent predictions, and on design- 
ing classes of prior distributions that are sufficiently 
rich or expressive to capture available information 
about the inversion parameters, while ensuring that the 
Bayesian formulation is well-posed and discretization 
invariant. We also note that not all methods for char- 
acterizing uncertainty in the inversion parameters and 
subsequent predictions are Bayesian; many frequentist 
methods for uncertainty assessment (using, for exam- 
ple, Tikhonov regularized estimators) have also been 
developed. 

Data assimilation encompasses a related set of prob- 
lems for which the goal is to estimate the time- 
evolving state of a dynamical system, given a sequence 
of observations. Applications include ocean modeling 
and numerical weather prediction [V.18]. Observations are typically available sequentially, and one therefore seeks algorithms that can be applied recursively,
updating the state estimates as new data become available. When one’s goal is to condition the state at time
t on observations received up to time t, the inference problem is known as filtering; when the state
at time t is conditioned on observations up to some 
time T > t, the problem becomes one of smoothing. 
For linear models and Gaussian error distributions, the 
recursive Kalman formulas for filtering and smoothing 
apply. For nonlinear models and non-Gaussian distribu- 
tions, a host of other algorithms can be used to approx- 
imate the posterior distribution. Chief among these 
are ensemble methods, including ensemble Kalman fil- 
ters and smoothers, and weighted particle methods, 
including many types of particle filters and smoothers. 


Practical applications of data assimilation involve high- 
dimensional states and chaotic dynamics, and the 
development of efficient and accurate probabilistic 
approaches for such problems is an active area of 
research. 
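
For the linear-Gaussian case, the recursive structure of filtering can be sketched in a few lines. The example below assumes a scalar state with model $x_k = a x_{k-1} + w_k$ and observation $y_k = h x_k + v_k$, where $w_k$ and $v_k$ are zero-mean Gaussian with variances $q$ and $r$; these symbols and the function name are illustrative choices, not notation from the text.

```python
def kalman_step(mean, var, obs, a=1.0, q=0.0, h=1.0, r=1.0):
    """One predict/update cycle of a scalar Kalman filter."""
    # Predict: push the current estimate through the dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: condition the predicted state on the new observation.
    gain = var_pred * h / (h * h * var_pred + r)
    mean_new = mean_pred + gain * (obs - h * mean_pred)
    var_new = (1.0 - gain * h) * var_pred
    return mean_new, var_new


# Static state (a = 1, q = 0) with prior N(0, 1), observed twice with unit noise.
mean, var = 0.0, 1.0
for y in (2.0, 4.0):
    mean, var = kalman_step(mean, var, y)
    print(mean, var)
```

Processing the observations one at a time gives the same posterior as conditioning on both at once, which is the point of a recursive algorithm.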

5 Decision Making under Uncertainty 

The results of uncertainty analysis and inference are 
often a prelude to designing a system, executing some 
control action, or otherwise making a decision, per- 
haps in an iterative fashion. Optimization techniques 
that account for uncertainty are therefore an important 
component of an end-to-end UQ approach. For exam- 
ple, robust optimization methods define an objective 
function that can incorporate both the mean of some 
performance metric and its variance, thus making a 
trade between absolute performance and variability. 
Multiobjective formulations allow this trade-off to be 
controlled and explored completely. One can also intro- 
duce chance constraints that specify an acceptable level 
of reliability for a system (e.g., by requiring the proba- 
bility of failure to be less than a specified value); in the 
engineering literature, these are known as reliability- 
based design-optimization approaches. More general 
approaches to decision making under uncertainty use 
the framework of decision theory to incorporate the 
probabilistic background of any choice, e.g., choosing 
the action that maximizes some expected utility. Even 
questions of optimal experimental design can be cast 
in this framework: choosing where to place a sensor 
or how to interrogate a system, before the outcome of 
the experiment is known and while other uncertainties 
remain in the model, is yet another instance of decision 
making under uncertainty.


II.35 Variational Principle


A variational principle is a method in the calculus of 
variations [IV.6] for determining a function by iden- 
tifying it as a minimum or maximum of a functional, 
which is a function that maps functions into scalars. An 
example of a functional on a Hilbert space [I.2 §19.4]
$H$ is the mapping from $f \in H$ to the inner product
$(f, g)$, where $g$ is any fixed element in $H$. In particular,
$\int_0^1 f(x)\,dx$ is a functional on $C[0, 1]$.

Many partial differential equations (PDEs) have the 
property that their solution is a minimum of a certain Lagrangian functional. The PDE is known as the
Euler-Lagrange equation [III.12] (see also partial
differential equations [IV.3 §4.3]) of the functional.

In matrix analysis an example of a variational principle is the expression $\lambda_1 = \max_{x \ne 0} x^*Ax/(x^*x)$ for
the largest eigenvalue of a Hermitian matrix $A$, which
is a particular case of the Courant-Fischer theorem [IV.10 §5.4]. The expression $x^*Ax/(x^*x)$ is called
a Rayleigh quotient, and generalizations of it arise 
in the Rayleigh-Ritz approximation problem of find- 
ing optimal approximate eigenvectors of A given an 
approximate invariant subspace. 
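
The variational characterization of the largest eigenvalue can be tried on a small example. The sketch below assumes a real symmetric matrix (a special case of a Hermitian matrix) of order 2, so that every direction is $(\cos t, \sin t)$ and the maximum can be found by a simple scan; the matrix and the function name are illustrative.

```python
import math


def rayleigh_quotient(a, x):
    """Return x^T A x / (x^T x) for a real symmetric matrix A (nested lists)."""
    n = len(x)
    ax = [sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(x[i] * ax[i] for i in range(n)) / sum(xi * xi for xi in x)


A = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues 1 and 3
# Scan directions x = (cos t, sin t); the quotient depends only on direction.
angles = [k * math.pi / 1000 for k in range(1000)]
values = [rayleigh_quotient(A, (math.cos(t), math.sin(t))) for t in angles]
largest = max(values)
print(largest)  # close to the largest eigenvalue of A
```

The maximum is attained in the direction of the corresponding eigenvector, here $(1, 1)$, and the minimum of the same quotient gives the smallest eigenvalue.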


II.36 Wave Phenomena


Waves are everywhere. We immediately think of waves 
on the ocean, sound waves, electromagnetic waves such 
as light and radio, and seismic waves caused by earth- 
quakes. Less obviously, there are waves of traffic on 
busy roads, ultrasonic waves used to image the insides 
of our bodies, and the Mexican wave around a football 
stadium. 


From these examples, we all have a good idea of what 
a wave is, but it is not so easy to define a wave. Perhaps 
the key property is propagation: disturbances are propagated. Think of the Mexican wave; people in the crowd
stand and sit in an organized manner so as to gener- 
ate a disturbance that propagates around the stadium. 
Notice that the people do not propagate! Similarly, in 
a sound wave air particles move about their equilib- 
rium positions. Electromagnetic waves do not require a 
medium in order to exist; they can propagate through 
empty space whereas sound waves cannot. 

In addition to propagating disturbances, waves are 
often associated with the transfer of energy, and this is 
one reason they are useful. Waves can also interact with 
objects (giving rise to reflection, refraction, diffraction, 
or scattering), or even with other waves. 

For a simple formula, suppose that the $x$-axis points
to the right. A disturbance $u(x, t)$ at position $x$ and
time $t$ will be a wave propagating to the right if it has
the form

$$u(x, t) = f(x - ct),$$

where $f$ is a function of one variable and $c$ is a constant
(the speed of propagation, or the phase speed). For
an explanation, see the article on the wave equation 
[III.31]. In this very simple one-dimensional example, 
the disturbance propagates without change of shape. In 
reality, the shape may change as the wave propagates 
(think of ocean waves approaching and then breaking 
on a beach). 
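
The statement that $f(x - ct)$ propagates without change of shape can be checked directly. In the Python sketch below, the Gaussian profile and the value $c = 2$ are arbitrary illustrative choices.

```python
import math


def f(s):
    """Hypothetical wave profile: a Gaussian pulse centred at s = 0."""
    return math.exp(-s * s)


def u(x, t, c=2.0):
    """Right-propagating wave u(x, t) = f(x - c t) with phase speed c."""
    return f(x - c * t)


# The value seen at x at time 0 is seen at x + c*dt at time dt.
x, dt, c = 1.0, 0.5, 2.0
print(u(x, 0.0, c), u(x + c * dt, dt, c))
```

The peak of the pulse, initially at $x = 0$, sits at $x = ct$ at time $t$.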

Wave phenomena continue to fascinate and provoke 
the creation of new mathematics.


III.1 Benford’s Law

Theodore P. Hill 


Benford’s law, also known as the first-digit law or the 
significant-digit law, is the empirical observation from 
statistical folklore that in many naturally occurring 
tables of numerical data the leading significant digits 
are not equally likely. In particular, more than 30% of 
the leading significant (nonzero) digits are 1 and less 
than 5% are 9. 

1 The First-Digit Law 

Benford’s law asserts that, instead of being uniformly 
distributed, as might be expected, the first significant 
decimal digit often tends to follow the logarithmic 
distribution 

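$$\operatorname{Prob}(D_1 = d) = \log_{10}\Bigl(1 + \frac{1}{d}\Bigr), \quad d = 1, 2, \dots, 9,$$

so

$$\begin{aligned}
\operatorname{Prob}(D_1 = 1) &= \log_{10}(2) = 0.3010\ldots, \\
\operatorname{Prob}(D_1 = 2) &= \log_{10}(3/2) = 0.1760\ldots, \\
&\;\;\vdots \\
\operatorname{Prob}(D_1 = 9) &= \log_{10}(10/9) = 0.04575\ldots,
\end{aligned}$$

where $D_1$ represents the first significant decimal digit
(e.g., $D_1(0.0203) = D_1(203) = 2$).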
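
The full table of first-digit probabilities is easy to generate. The following Python sketch simply evaluates the logarithmic distribution above; the function name is illustrative.

```python
import math


def benford_first_digit(d):
    """Prob(D1 = d) = log10(1 + 1/d) for d = 1, ..., 9."""
    return math.log10(1.0 + 1.0 / d)


probs = {d: benford_first_digit(d) for d in range(1, 10)}
for d, p in probs.items():
    print(d, round(p, 4))
# The probabilities telescope: the product of (d + 1)/d over d = 1..9 is 10.
print(sum(probs.values()))
```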

2 History 

The earliest known reference to this logarithmic dis- 
tribution is a short article in 1881 by polymath Simon 
Newcomb in the American Journal of Mathematics, and 
the article contained not only the first-digit law above
but also the second-digit law. This paper was forgotten, and in 1938 Frank A. Benford published an article containing the same first- and second-digit laws, as
well as extensive empirical evidence of the law in tables 
ranging from baseball statistics to square-root tables 
and atomic weights. This article attracted much atten- 
tion but Newcomb’s note continued to be overlooked 
for decades and the name “Benford” came to be asso- 
ciated with the significant-digit law. Since then, over 
700 articles have appeared on applications, statistical 
tests, and mathematical proofs of Benford’s law. 

3 Empirical Evidence 

Many common tables of numerical data do not follow 
Benford’s law. For example, the proportion of positive 
integers that begin with 1, i.e., $\{1, 10, 11, \dots, 19, 100, 101, \dots, 199, 1000, 1001, \dots\}$, oscillates between $\tfrac{1}{9}$ and
$\tfrac{5}{9}$ as the sample size increases. The prime numbers also
do not follow Benford’s law. Similarly, many tables of 
real-world data, such as telephone numbers and lottery 
numbers, do not follow Benford’s law. 

On the other hand, in addition to Benford’s original 
empirical data, an abundance of subsequent empirical 
evidence of Benford's law has appeared in a wide range 
of fields. This includes numbers gleaned from newspa- 
per articles and almanacs, tables of physical constants 
and half-lives of radioactive substances, stock markets 
and financial data, demographic and geographical data, 
scientific calculations, eBay prices, and the collection of 
all numbers on the Internet, as reflected by magnitudes 
of Google hits on numbers. 

4 Applications 

One of the main applications of Benford's law has been 
for fraud detection. Since certain types of true tax 
data have been found to be a close fit to Benford’s 






law (e.g., about 30% of the numbers begin with 1), chi- 
squared goodness-of-fit tests have been used success- 
fully to detect fraud or error by checking conformance 
of the first digits of the data to Benford’s law. Benford’s 
law goodness-of-fit tests have also been used to identify 
anomalous signals in data. This has been employed, for 
example, to detect mild earthquakes, and to evaluate 
the detection efficiency of lightning location networks. 

The output of numerical algorithms and other digi- 
tal computer calculations often follows Benford’s law, 
and based on a hypothesis of the output following 
Benford’s law it is possible to obtain improved esti- 
mates of expected roundoff errors, and of likelihoods 
of overflow and underflow errors. 

Benford's law has also been used as an effective 
teaching tool, to introduce students to basic concepts 
in statistics such as goodness-of-fit tests and basic 
data-collection methods, and to demonstrate tools in 
Mathematica. 

5 The General-Digit Law 

The general form of Benford’s law is a statement about 
the joint distribution of all decimal digits, namely: 

$$\operatorname{Prob}(D_1 = d_1, D_2 = d_2, \dots, D_m = d_m) = \log_{10}\Biggl(1 + \Bigl(\sum_{i=1}^{m} 10^{m-i} d_i\Bigr)^{-1}\Biggr)$$

holds for all $m$-tuples $(d_1, d_2, \dots, d_m)$, where $d_1$ is an
integer in $\{1, 2, \dots, 9\}$ and, for $j \ge 2$, $d_j$ is an integer
in $\{0, 1, \dots, 9\}$. Here $D_2$, $D_3$, $D_4$, etc., represent the second, third, fourth, etc., significant decimal digits, e.g.,
$D_2(0.0203) = 0$, $D_3(0.0203) = 3$.

Thus, for example, this general form of Benford’s law 
implies that 

$$\operatorname{Prob}(D_1 = 3, D_2 = 1, D_3 = 4) = \log_{10}(315/314) = 0.001380\ldots.$$

A corollary of the general form of Benford’s law is that
the significant digits are dependent, and not independent as one might expect.
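
Both the worked value and the dependence of the digits can be verified numerically. In the sketch below the function names are illustrative; the second-digit marginal is obtained by summing the joint law over the nine possible first digits.

```python
import math


def benford_digits(digits):
    """Joint probability that the leading significant digits equal `digits`."""
    value = 0
    for d in digits:
        value = 10 * value + d  # builds the sum of 10^(m-i) d_i
    return math.log10(1.0 + 1.0 / value)


def second_digit(d2):
    """Marginal Prob(D2 = d2), summing the joint law over D1 = 1, ..., 9."""
    return sum(benford_digits([d1, d2]) for d1 in range(1, 10))


print(benford_digits([3, 1, 4]))
# Dependence: Prob(D1 = 1, D2 = 0) differs from Prob(D1 = 1) * Prob(D2 = 0).
print(benford_digits([1, 0]), benford_digits([1]) * second_digit(0))
```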

Letting $S(x)$ denote the (floating-point) significand
(see floating-point arithmetic [II.13]) of the positive number $x$, e.g., $S(0.0203) = S(2.03 \times 10^{-2}) = 2.03$,
a more compact form of the general Benford’s law is

$$\operatorname{Prob}(S \le t) = \log_{10} t \quad \text{for all } 1 \le t < 10.$$

Analogs of Benford's law for nondecimal bases b are 
obtained by simply replacing the decimal base by the 
new base, both in the significant digits or significand 
and in the logarithm. 


6 The Mathematical Framework 

One of the main tools used to study Benford’s law is the 
fact that a data set X (e.g., a sequence, function, or ran- 
dom variable) is Benford if and only if log X is uniformly 
distributed modulo 1, that is, if and only if the fractional part of log X is uniformly distributed between 0
and 1. 

Two key characterizations of Benford's law are scale 
invariance and base invariance. The Benford distribu- 
tion is the only distribution of significant digits that 
does not change under multiplicative changes of scale. 
For example, if a data set originally in euros or meters 
follows Benford’s law, conversion of the data into dol- 
lars or feet will also follow Benford’s law. Similarly, 
the Benford distribution is the only distribution of sig- 
nificant digits that is continuous and invariant under 
changes of base. 

7 Sequences and Functions 

Many common sequences follow Benford’s law exactly. 
That is, the proportion of times that particular signif- 
icant digits appear in the elements of the sequence 
converges to the exact Benford’s law probabilities. For 
example, the sequence of powers of 2 (and of 3 or 5), 
the Fibonacci and Lucas numbers, and the sequence 
of factorials $1!, 2!, 3!, 4!, \ldots = 1, 2, 6, 24, \ldots$ all follow
Benford’s law exactly. 
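
The behavior of the powers of 2 is easy to observe with exact integer arithmetic. The following Python sketch tallies the first digits of $2^1, \dots, 2^{1000}$ and prints them beside the Benford probabilities; the cutoff of 1000 terms and the function name are arbitrary illustrative choices.

```python
import math
from collections import Counter


def first_digit_frequencies(count):
    """Relative frequency of each leading digit among 2^1, ..., 2^count."""
    tally = Counter()
    value = 1
    for _ in range(count):
        value *= 2
        tally[int(str(value)[0])] += 1
    return {d: tally[d] / count for d in range(1, 10)}


freq = first_digit_frequencies(1000)
for d in range(1, 10):
    print(d, freq[d], round(math.log10(1 + 1 / d), 4))
```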

Sequences exhibiting exponential growth (or decay) 
generally obey Benford’s law for almost all starting 
points and almost all bases. Similarly, many general 
classes of algorithms, including Newton’s method, and 
multidimensional systems such as Markov chains can 
also be shown to obey Benford’s law. Continuous functions with exponential or super-exponential growth or
decay also typically exhibit Benford’s law behavior,
and thus wide classes of initial-value problems obey
Benford’s law exactly. 

8 Random Variables and 
Probability Distributions 

None of the classical probability distributions (uniform, exponential, normal, Pareto, etc.) is arbitrarily
close to Benford’s law for any values of the parameters,
although the standard Cauchy distribution comes quite 
close. On the other hand, it is easy to construct distri- 
butions that satisfy Benford’s law exactly, such as the 
continuous distribution on [1, 10) with density propor- 
tional to 1/x. Some of the basic probabilistic Benford’s 
law results are given below.

• X is Benford if and only if 1/X is Benford.

• If a random variable $X$ has a density, then the powers $X^n$ of $X$ converge in distribution to Benford’s
law (i.e., $P(S(X^n) \le t) \to \log_{10} t$).

• If X is Benford and Y is positive and independent 
of X, then XY is Benford. 

• The product of independent and identically dis- 
tributed continuous random variables converges 
in distribution to Benford’s law. 

• If random samples are taken from probability dis- 
tributions chosen at random (in an unbiased way), 
the combined sample will converge to Benford’s 
law.


III.2 Bessel Functions

P. A. Martin 


F. W. Bessel (1784-1846) was a German astronomer. In 
an analysis of planetary motions, published in 1826, he 
investigated properties of the integral 

$$J_n(z) = \frac{1}{2\pi} \int_0^{2\pi} \cos(n\theta - z\sin\theta)\, d\theta.$$

(He denoted the integral by /”.) Schlomilch (1856) 
regarded $J_n(z)$ as coefficients in the expansion

$$\exp\{(z/2)(t - 1/t)\} = \sum_{n=-\infty}^{\infty} t^n J_n(z); \tag{1}$$

the left-hand side is called a generating function.
Both approaches are convenient when $n$ is an integer.
Another approach, which generalizes more easily, is via
an ordinary differential equation,

$$z^2 w''(z) + z w'(z) + (z^2 - \nu^2) w(z) = 0, \tag{2}$$


in which the independent variable, $z$, and the parameter $\nu$ can be complex. Solutions of Bessel’s equation (2)
can be constructed by the method of Frobenius. One of
these is

$$J_\nu(z) = \sum_{m=0}^{\infty} \frac{(-1)^m (z/2)^{\nu+2m}}{m!\,\Gamma(\nu + m + 1)},$$

which is known as the Bessel function of the first kind.
Here $\Gamma$ denotes the gamma function [III.13]. If $\nu$ is not
an integer, $J_{-\nu}(z)$ gives a second, linearly independent,
solution of (2). However, when $\nu = n$, an integer, we
have $J_n(z) = (-1)^n J_{-n}(z)$ (to see this, replace $t$ by
$-1/t$ in (1)), so a new solution is required. It is

$$Y_\nu(z) = \frac{J_\nu(z)\cos\nu\pi - J_{-\nu}(z)}{\sin\nu\pi},$$

with $Y_n = \lim_{\nu \to n} Y_\nu$. This defines the Bessel function of
the second kind.
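
For integer order, Bessel's integral and the Frobenius series define the same function, and this can be confirmed numerically. The Python sketch below is restricted, as an assumption of the example, to real $z > 0$ and real $\nu \ge 0$; the truncation at 40 terms and the 2000-point quadrature are arbitrary choices, and the function names are illustrative.

```python
import math


def bessel_j_series(nu, z, terms=40):
    """Truncated Frobenius series for J_nu(z), with real nu >= 0 and z > 0."""
    total = 0.0
    for m in range(terms):
        total += (-1) ** m * (z / 2) ** (nu + 2 * m) / (
            math.factorial(m) * math.gamma(nu + m + 1)
        )
    return total


def bessel_j_integral(n, z, steps=2000):
    """Bessel's integral for integer n, by an equally spaced rule on [0, 2 pi]."""
    h = 2 * math.pi / steps
    total = sum(math.cos(n * k * h - z * math.sin(k * h)) for k in range(steps))
    return total * h / (2 * math.pi)


print(bessel_j_series(1, 2.5), bessel_j_integral(1, 2.5))
# 2.404825557695773 approximates the first positive zero of J_0.
print(bessel_j_series(0, 2.404825557695773))
```

The equally spaced rule is very accurate here because the integrand is smooth and periodic over the interval of integration.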

The functions $J_\nu(z)$ and $Y_\nu(z)$ are examples of special
functions [IV.7 §9] of two variables, $\nu$ and $z$.
In elementary applications, $\nu$ is an integer, $z$ is real,
and both are nonnegative. For example, the vibrational
modes of a circular membrane of radius $a$ (such as a
drumhead) have the form

$$J_n(kr)\cos(n\theta)\cos(\omega t),$$

where $r$ and $\theta$ are plane polar coordinates, $t$ is time,
$\omega$ is frequency, $k = \omega/c$, $c$ is the speed of sound
in the membrane (it is a constant depending on what
the membrane is made from), and $ka$ is chosen as any
one of the positive zeros of $J_n(x)$ (there are infinitely
many).
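
To make the drumhead condition concrete, the following sketch locates the first few positive zeros of $J_n$ by scanning for sign changes and bisecting, and converts them to frequencies via $\omega = ck$. It is only an illustration: the helper names, the scan step, and the use of a truncated series (adequate for moderate arguments only) are assumptions made here.

```python
import math

def bessel_j(n, x, terms=60):
    """Truncated Frobenius series for J_n(x), integer n >= 0, moderate x."""
    return sum(
        (-1) ** m * (x / 2.0) ** (n + 2 * m)
        / (math.factorial(m) * math.factorial(n + m))
        for m in range(terms)
    )

def positive_zeros(n, count, step=0.1):
    """First `count` positive zeros of J_n: scan for sign changes, then bisect."""
    zeros = []
    a, fa = step, bessel_j(n, step)
    while len(zeros) < count:
        b = a + step
        fb = bessel_j(n, b)
        if fa * fb < 0.0:
            lo, hi, flo = a, b, fa
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                fmid = bessel_j(n, mid)
                if flo * fmid <= 0.0:
                    hi = mid
                else:
                    lo, flo = mid, fmid
            zeros.append(0.5 * (lo + hi))
        a, fa = b, fb
    return zeros

def mode_frequencies(n, count, radius, sound_speed):
    """Frequencies omega = c * k, where k * radius is a positive zero of J_n."""
    return [sound_speed * z / radius for z in positive_zeros(n, count)]

if __name__ == "__main__":
    print(positive_zeros(0, 3))
    print(mode_frequencies(0, 3, radius=1.0, sound_speed=1.0))
```

The zeros are not equally spaced multiples of the first one, which is why the overtones of an ideal circular membrane are not harmonic.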

Much is known about the properties of Bessel func- 
tions. The standard reference is G. N. Watson’s 800- 
page book A Treatise on the Theory of Bessel Func- 
tions (Cambridge University Press, Cambridge, 2nd edn, 
1944). However, a good place to start is chapter 10 of 
NIST Handbook of Mathematical Functions (Cambridge 
University Press, Cambridge, 2010), edited by F. W. J. 
Olver, D. W. Lozier, R. F. Boisvert, and C. W. Clark. (An
electronic version of the book is available at http://dlmf.nist.gov.)


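III.3 The Black-Scholes Equation


The Black-Scholes equation is a linear parabolic partial
differential equation of the form

$$\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS\frac{\partial V}{\partial S} - rV = 0.$$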


It is associated with the problem of pricing a financial
option whose value is $V = V(S,t)$. The underlying asset
has price $S \ge 0$ at time $t \in [0, T]$, where $T$ is the expiry
time of the option. The equation also involves the
volatility $\sigma$ and the interest rate $r$. Appropriate boundary
conditions must be added in order to determine
$V$ uniquely. Under appropriate changes of variable the
Black-Scholes equation transforms into the diffusion
(heat) equation [III.8]. The Black-Scholes equation is

used in the derivation of the Black-Scholes option
pricing formula [IV.14 §2.2].


III.4 The Burgers Equation


The one-dimensional Burgers equation is a nonlinear
partial differential equation [IV.3] (PDE) for
$u(x,t)$,

$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} = \nu\frac{\partial^2 u}{\partial x^2}, \tag{1}$$

where $\nu$ is a positive constant. It is named after J. M.
Burgers (1895-1981), although the importance of (1)
had been first recognized (1915) by Bateman in his
study of the Navier-Stokes equations [III.23] as the
viscosity $\nu \to 0$.

When $\nu = 0$, (1) simplifies to

$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} = 0. \tag{2}$$

This is known as the inviscid Burgers equation, although
the word “inviscid” is sometimes omitted.

There is a connection between (1) and the diffusion
equation [III.8] known as the Cole-Hopf transformation.
Thus, if $w(x,t)$ solves

$$\frac{\partial^2 w}{\partial x^2} = \frac{1}{\nu}\frac{\partial w}{\partial t},$$

then

$$u = -\frac{2\nu}{w}\frac{\partial w}{\partial x}$$

solves (1). This is an example of solving a nonlinear PDE
using solutions of a related linear PDE.

The Cole-Hopf transformation can be used to show
that, if (1) is solved with specified initial conditions at
$t = 0$, then the solution is smooth for all $x$ and for all
$t > 0$. On the other hand, the inviscid Burgers equation
(2) can have discontinuous solutions (“shocks”).
We can say that the presence of the diffusive term on
the right-hand side of (1) “regularizes” the problem and
prevents shocks from appearing. Energy is removed
because $\nu > 0$.

If we change the sign of the diffusive term, energy
is added. To compensate for this, another term can
be appended. One PDE of this kind is the Kuramoto-Sivashinsky
equation,

$$\frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} = -\frac{\partial^2 u}{\partial x^2} - \frac{\partial^4 u}{\partial x^4}.$$

It is encountered in the modeling of several physical
phenomena, it has been studied extensively, and it has
many interesting kinds of solutions.


III.5 The Cahn-Hilliard Equation

Amy Novick-Cohen


1 The Equation

The partial differential equation for $u = u(x,t) \in \mathbb{R}$,

$$u_t = M\Delta(-u + u^3 - \epsilon^2\Delta u), \quad (x,t) \in \Omega_T, \tag{1}$$

is known as the Cahn-Hilliard equation. Here, $\Delta := \sum_{i=1}^{N} \partial^2/\partial x_i^2$
and $\Omega_T = \Omega \times (0,T)$, where $\Omega \subset \mathbb{R}^N$, $T > 0$.
It was proposed in 1958 by John W. Cahn and John E.
Hilliard to describe phase separation in binary alloys.
In that context, $u$ represents the locally defined mass
fraction of one of the two components of the binary
alloy, $M$ is the “mobility,” and $\epsilon$ measures the effective
length scale of the interatomic forces.

2 Structure 

Equation (1) may be written in the coupled form

$$u_t = \nabla\cdot(M\nabla\mu), \quad (x,t) \in \Omega_T,$$
$$\mu = f'(u) - \epsilon^2\Delta u, \quad (x,t) \in \Omega_T,$$

where $\mu = \mu(x,t)$ is the “chemical potential,” and
$f(u) := \tfrac{1}{4}(1 - u^2)^2$ has minima at $u = \pm 1$ and is
referred to as a “double-well potential.”

Typically, $\Omega \subset \mathbb{R}^N$ is a bounded domain, $N = 3$, and
periodic or Neumann and no flux boundary conditions
($n\cdot\nabla u = n\cdot\nabla\mu = 0$ along $\partial\Omega$ for $n \perp \partial\Omega$) are imposed.
In both these cases, integrating (1) over $\Omega$, and then
integrating by parts, yields that

$$\frac{d}{dt}\int_\Omega u(x,t)\,dx = 0,$$

which may be interpreted as stating that mass is conserved
in the system.

Multiplying the first equation of the coupled form by $\mu$ and then
integrating over $\Omega$, we get

$$\frac{d}{dt}\int_\Omega \left(f(u) + \tfrac{1}{2}\epsilon^2|\nabla u|^2\right)dx = -\int_\Omega M|\nabla\mu|^2\,dx,$$

which describes energy dissipation in the system.

3 Dynamics 

It is often reasonable to assume that

$$u(x,0) = \bar{u} + \tilde{u}(x,0),$$

where $\bar{u} \in \mathbb{R}$ and $\tilde{u}(x,0)$ is a small disturbance
or “perturbation” of $\bar{u}$ that satisfies $\int_\Omega \tilde{u}(x,0)\,dx = 0$.
The subsequent dynamics can be described by an
early “spinodal” regime followed by a late “coarsening”
regime. During the spinodal or “linear” regime, the
Cahn-Hilliard dynamics is dominated by its linearization
about $\bar{u}$. Spatial perturbations of sufficiently long
wavelength grow, while shorter-wavelength perturbations
decay exponentially. Locally spatially uniform
domains begin to form. The larger domains start to
grow at the expense of the smaller domains; this behavior
is known as “coarsening.” The coarsening dynamics
can be described via a free boundary problem for
the motion of the interfaces, $\Gamma(t)$, which partition $\Omega$
into locally uniform subdomains. In this free boundary
problem, known as the “Mullins-Sekerka” problem,
the normal velocity of $\Gamma(t)$ is proportional to the jump
in the normal derivative across $\Gamma(t)$ of $\mu$, the chemical
potential. Moreover, $\mu = \kappa$ along $\Gamma(t)$, where $\kappa$ denotes
the mean curvature of $\Gamma(t)$, and $\Delta\mu = 0$ away from the
interfaces, in $\Omega \setminus \Gamma(t)$. The late-time dynamics predict
that the average size of the uniform subdomains grows
at a rate proportional to $t^{1/3}$.
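
The statement about long and short wavelengths in the spinodal regime can be illustrated with a small calculation. Assuming constant mobility $M$ and the double-well potential $f(u) = \tfrac{1}{4}(1-u^2)^2$, linearizing about $\bar{u}$ and inserting a Fourier mode $e^{\sigma t + \mathrm{i}k\cdot x}$ gives the growth rate $\sigma(k) = M|k|^2\big((1 - 3\bar{u}^2) - \epsilon^2|k|^2\big)$. This derivation and the function names below are supplied here as a sketch and are not taken from the article.

```python
import math

def growth_rate(k, u_bar=0.0, eps=0.1, mobility=1.0):
    """Linear growth rate sigma(k) of a Fourier mode with wavenumber magnitude k.

    Obtained by linearizing u_t = M * Laplacian(f'(u) - eps^2 * Laplacian(u))
    about the constant state u_bar, with f'(u) = u^3 - u.
    """
    return mobility * k ** 2 * ((1.0 - 3.0 * u_bar ** 2) - eps ** 2 * k ** 2)

def critical_wavenumber(u_bar=0.0, eps=0.1):
    """Wavenumber below which perturbations grow (0.0 if no mode is unstable)."""
    s = 1.0 - 3.0 * u_bar ** 2
    return math.sqrt(s) / eps if s > 0.0 else 0.0

def fastest_growing_wavenumber(u_bar=0.0, eps=0.1):
    """Maximizer of sigma(k); it is the critical wavenumber divided by sqrt(2)."""
    return critical_wavenumber(u_bar, eps) / math.sqrt(2.0)

if __name__ == "__main__":
    for k in (2.0, 5.0, 10.0, 15.0):
        print(k, growth_rate(k))
```

In this linear picture, modes with $|k|$ below the critical value grow and the rest decay, and no mode grows at all once $|\bar{u}| \ge 1/\sqrt{3}$.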

4 Applications 

Although the mass-conservative Cahn-Hilliard equa- 
tion was conceived as a model for phase separation in 
binary alloys, the features of its dynamics, with an early 
linear regime followed by a late coarsening regime, 
make it an appropriate model in a wide range of set- 
tings. It has been used to describe pattern formation 
in populations, structure formation in biofilms, galaxy 
formation, as well as in image processing. 

5 Generalizations 

Many generalizations of the Cahn-Hilliard equation 
have been suggested. These include conserved phase 
field models (models that couple the Cahn-Hilliard 
equation with thermal effects), models for simultane- 
ous phase separation and ordering (models coupling 
Cahn-Hilliard and Allen-Cahn equations), models cou- 
pling the Cahn-Hilliard equation with hydrodynamics 
effects, phase field crystal models (generalizations that
include crystalline anisotropy effects), and more. 

Further Reading 

Cahn, J. W. 1961. On spinodal decomposition. Acta Metal- 
lurgica 9:795-801. 

Cahn, J. W., and J. E. Hilliard. 1958. Free energy of a 
nonuniform system. I. Interfacial free energy. Journal of 
Chemical Physics 28:258-67. 

Novick-Cohen, A. Forthcoming. The Cahn-Hilliard Equa- 
tion: From Backwards Diffusion to Surface Diffusion. Cam- 
bridge: Cambridge University Press. 
III.6 The Cauchy-Riemann Equations

Let $u(x,y)$ and $v(x,y)$ be two real functions of two
real variables, $x$ and $y$. The Cauchy-Riemann equations
are

$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y} \quad\text{and}\quad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}. \tag{1}$$

These equations arise in complex analysis [IV.1]: if
$f(z) = u(x,y) + \mathrm{i}v(x,y)$ is an analytic function of
$z = x + \mathrm{i}y$, where $u$ and $v$ are real, then $u$ and $v$ satisfy
the Cauchy-Riemann equations.

Eliminating $v$ from (1) shows that Laplace's equation
[III.18] is satisfied by $u$:

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0;$$

$v$ satisfies the same equation.

In plane potential flow of a fluid, $u$ is identified as
a velocity potential and $v$ as a stream function. If we
define vectors $n = (n_1, n_2)$ and $t = (-n_2, n_1)$, so that
$n \cdot t = 0$, from (1) we have

$$n \cdot \operatorname{grad} u = t \cdot \operatorname{grad} v.$$

If we take $n$ as being a unit normal vector to a curve $C$,
then $t$ is a unit tangent vector. In fluid flow, a typical
boundary condition is zero normal velocity on $C$,
$n \cdot \operatorname{grad} u = 0$. This derivative condition on $u$ can then be
replaced by $v = \text{const.}$ on $C$, a condition that is often
easier to enforce.
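
These relations are easy to test numerically. The sketch below takes one analytic function (the choice $f(z) = z^3 + 2z$ is arbitrary and made here only for illustration), forms $u$ and $v$ as its real and imaginary parts, and uses central differences to evaluate the Cauchy-Riemann residuals, the Laplacian of $u$, and the identity $n \cdot \operatorname{grad} u = t \cdot \operatorname{grad} v$.

```python
import math

def f(z):
    """An analytic test function (illustrative choice)."""
    return z ** 3 + 2.0 * z

def u(x, y):
    return f(complex(x, y)).real

def v(x, y):
    return f(complex(x, y)).imag

def grad(g, x, y, h=1e-5):
    """Central-difference gradient of a function g(x, y)."""
    gx = (g(x + h, y) - g(x - h, y)) / (2.0 * h)
    gy = (g(x, y + h) - g(x, y - h)) / (2.0 * h)
    return gx, gy

def laplacian(g, x, y, h=1e-3):
    """Five-point finite-difference Laplacian of g(x, y)."""
    return (g(x + h, y) + g(x - h, y) + g(x, y + h) + g(x, y - h)
            - 4.0 * g(x, y)) / h ** 2

def cauchy_riemann_residuals(x, y):
    """Returns (u_x - v_y, u_y + v_x); both vanish for an analytic f."""
    ux, uy = grad(u, x, y)
    vx, vy = grad(v, x, y)
    return ux - vy, uy + vx

def normal_tangent_residual(x, y, angle):
    """n . grad u - t . grad v for n = (cos a, sin a), t = (-sin a, cos a)."""
    n = (math.cos(angle), math.sin(angle))
    t = (-n[1], n[0])
    ux, uy = grad(u, x, y)
    vx, vy = grad(v, x, y)
    return (n[0] * ux + n[1] * uy) - (t[0] * vx + t[1] * vy)

if __name__ == "__main__":
    print(cauchy_riemann_residuals(0.7, -0.4))
    print(laplacian(u, 0.7, -0.4), laplacian(v, 0.7, -0.4))
    print(normal_tangent_residual(0.7, -0.4, 1.1))
```

All printed values should be zero up to finite-difference and rounding error.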


III.7 The Delta Function and
Generalized Functions

P. J. Upton 


The Dirac $\delta$ function, denoted by $\delta(x)$ and named after
its inventor, the twentieth-century theoretical physicist
P. A. M. Dirac, may be thought of as a function
that is so tightly peaked about the value $x = 0$ that
it is zero everywhere except at $x = 0$ and yet its integral
$\int_{-\infty}^{\infty} \delta(x)\,dx = 1$. This means that, strictly speaking,
$\delta(x)$ is not a function at all, since its value at
$x = 0$ is ill defined. However, as will become clear later
on, the “function” $\delta(x)$ can be used as a convenient
shorthand notation for the limit of a sequence of well-defined
functions, $g_n(x)$, as $n \to \infty$. The $\delta$ function
is used widely across physical applied mathematics, to
model, for example, impulses in mechanics and electric
circuit theory, and point-like charge distributions
in electromagnetism or gravity.





So why define such a strange object as $\delta(x)$? One
answer is that many physical situations can be described
well using the $\delta$ function. For example, in one-dimensional
mechanics a particle with mass $m$ may be
subject to a force $f(t)$ present for only a very short
time interval $-t_0 < t < t_0$. Consider the total impulse
defined by the integral

$$\int_{-t_0}^{t_0} f(t)\,dt = \int_{-t_0}^{t_0} m\frac{dv}{dt}\,dt = mv(t_0) - mv(-t_0),$$

where the first equality follows from Newton's second
law, and $v(t)$ is the velocity of the particle at time
$t$. Suppose that the force acts in such a way that it
causes the particle, initially at rest, to move off with
constant velocity and its momentum set to unity, so
that $v(-t_0) = 0$ and $mv(t_0) = 1$. Furthermore, we
require that the force acts over such a tiny time interval
that $t_0$ is practically zero. In other words, the force
acts instantaneously, as might be expected if $f(t)$ were
the result of a sudden hammer blow or kick, and the
precise form of $f(t)$ when $-t_0 < t < t_0$ does not significantly
affect the outcome. In order to model this kick,
$f(t)$ must be zero everywhere except at $t = 0$ but with
unit total integral, corresponding to a unit impulse. We
are therefore led to a $\delta$ function as a model for an
impulsive force. The subsequent motion (or response)
of a particle due to an impulsive force is an important
function of $t$. It is an initial-value Green function, $G(t)$,
named after the early nineteenth-century mathematical
physicist George Green. Green functions are, more generally,
the solutions of nonhomogeneous differential
equations with a $\delta$ function as the source term.


1 Basic Properties

As discussed above, the $\delta$ function has the property
that

$$\int_a^b \delta(x)\,dx = \begin{cases} 1 & \text{if } a < 0 < b, \\ 0 & \text{if } 0 \notin [a,b]; \end{cases}$$

cases where $a = 0$ or $b = 0$ will not be considered. The $\delta$
function is not strictly a function at all, as $x = 0$ is in the
domain of the “function” yet $\delta(0)$ cannot be assigned
a value in the codomain. Strictly speaking, $\delta(x)$ has a
meaning only when it appears under an integral sign,
in which case the following important basic property is
satisfied:

$$\int_{-\infty}^{\infty} f(x)\delta(x)\,dx = f(0) \tag{1}$$

for a sufficiently well-behaved function $f(x)$. The identity
$\int_{-\infty}^{\infty} \delta(x)\,dx = 1$ is a special case of (1).

2 Green Functions

Linear initial-value problems, such as those in which we
seek the position of a particle, $x(t)$, at time $t$, involve
solving a differential equation $Lx = f(t)$, where $L$ is
a linear differential operator and $f(t)$ some arbitrary
force. The general solution is given by $x = x_c + x_p$. Here,
$x_c$ is the complementary function and it solves $Lx_c = 0$,
while $x_p$, the particular integral, can be expressed in
terms of a Green function, $G(t)$, that is, the response
to the impulsive force $\delta(t)$. Thus, with $LG = \delta(t)$, we
have $x_p(t) = \int_{-\infty}^{\infty} G(t - t')f(t')\,dt'$.

The $\delta$ function defined on $\mathbb{R}^3$, where

$$\delta(r) = \delta(x)\delta(y)\delta(z) \quad\text{for } r = (x,y,z) \in \mathbb{R}^3,$$

can be used to model a point-like charge distribution.
This then enables the determination of the potential,
$\phi(r)$, $r \in \mathbb{R}^3$, for general charge distributions, $\rho(r)$,
which requires solving Poisson's equation [III.18]
$-\Delta\phi = \rho(r)$ ($\Delta$ denotes the Laplace operator). Again,
this can be done using a Green function, $G(r)$, which
solves $-\Delta G = \delta(r)$, so that $G(r)$ is the potential for
a point-like charge, from which we find that $G(r) = 1/(4\pi|r|)$,
i.e., the Coulomb potential, and that
$\phi(r) = \int_{\mathbb{R}^3} G(r - r')\rho(r')\,dr'$.

When treating problems involving both time evolution
and spatial variation, such as in quantum theory
and the theory of Brownian motion, the Green function
depends on time as well as position. In these cases,
$G(t,r)$ is often referred to as a propagator.

3 Delta-Convergent Sequences



The $\delta$ function can be constructed by taking a limit of
a sequence of well-defined functions $g_n(x)$ as $n \to \infty$.
If $g_n$ is smooth (that is, differentiable at all orders)
for all $n$, then the sequence is said to define a generalized
function. In addition, for all $n$, we insist on
the condition $\int_{-\infty}^{\infty} g_n(x)\,dx = 1$. An example of such
a $\delta$-convergent sequence is given by the sequence of
Gaussian functions

$$g_n(x) = \frac{n}{\sqrt{2\pi}}\,e^{-(nx)^2/2}, \quad n \ge 1. \tag{2}$$

Examples of this $g_n(x)$ are plotted in figure 1(a). As
$n$ increases, the functions $g_n(x)$ get more strongly
peaked (both narrower and taller) about $x = 0$. Indeed,
from the expression in (2) it follows that as $n \to \infty$,
$g_n(x) \to 0$ for $x \ne 0$ but $g_n(0) \to \infty$, as expected if
the limit is $\delta(x)$. Also, $\int_{-\infty}^{\infty} g_n(x)\,dx = 1$, $n \ge 1$, must
hold, in view of the expression $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$ for
the integral of a Gaussian function. Thus, taking into
account these properties of $g_n(x)$ in (2), it begins to
look plausible that $g_n(x) \to \delta(x)$ (in some sense) as
$n \to \infty$. Indeed, one can prove that for sufficiently well-behaved
functions $f(x)$ (such as Lipschitz-continuous
functions), the limit

$$\lim_{n\to\infty}\int_{-\infty}^{\infty} g_n(x)f(x)\,dx = f(0) \tag{3}$$

holds. Hence, the limit of $g_n(x)$ as $n \to \infty$ satisfies the
basic property of the $\delta$ function given by (1), at least
for Lipschitz-continuous functions. But with a different
$\delta$-convergent sequence, $g_n(x)$, $n \ge 1$, one can prove
that (3) holds for all continuous functions $f(x)$. In this
case, the functions $g_n(x)$ are smooth but compactly
supported, that is, nonzero only on bounded intervals
of $\mathbb{R}$.

Figure 1 (a) The functions $g_n(x)$ as defined in (2). (b) Their
derivatives $g_n'(x)$ for $n = 1$, 2, 5, and 10. These sequences
converge to $\delta(x)$ and $\delta'(x)$, respectively.
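
The limit (3) can be observed numerically for the Gaussian sequence (2). In the sketch below the test function $f(x) = \cos x + |x|$ is an arbitrary Lipschitz-continuous choice made here, with $f(0) = 1$; for it the integral has the closed form $e^{-1/(2n^2)} + \sqrt{2/\pi}/n$, which the quadrature should reproduce. The quadrature rule, cut-off, and names are illustrative assumptions.

```python
import math

def g(n, x):
    """Gaussian delta-convergent sequence (2)."""
    return n / math.sqrt(2.0 * math.pi) * math.exp(-0.5 * (n * x) ** 2)

def smeared_value(f, n, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of the integral of g_n(x) * f(x).

    The integration range is [-half_width/n, half_width/n]; the Gaussian
    tail outside this range is negligible for slowly growing f.
    """
    a = -half_width / n
    h = 2.0 * half_width / (n * steps)
    return h * sum(g(n, a + (k + 0.5) * h) * f(a + (k + 0.5) * h)
                   for k in range(steps))

def f(x):
    """Lipschitz-continuous test function with f(0) = 1 (illustrative)."""
    return math.cos(x) + abs(x)

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, smeared_value(f, n))
```

The printed values approach $f(0) = 1$ as $n$ grows, with an error of order $1/n$ caused by the kink of $|x|$ at the origin.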

4 Derivatives of the Delta Function 

One can define a generalized function corresponding
to the derivative of the delta function, $\delta'(x)$, through a
($\delta'$-convergent) sequence of functions, $g_n'(x)$, $n \ge 1$.
An example of such a sequence is illustrated in figure
1(b), where $g_n'(x)$ is the derivative of $g_n(x)$ given
by (2). Such a process has an appealing physical interpretation.
Just as we can think of $\delta(x)$ as a model for
a one-dimensional charge distribution of a point-like
electric charge, we can regard $\delta'(x)$ as that of an electric
dipole consisting of two oppositely signed point-like
charges that are infinitesimally close together.
From the limiting properties of $g_n'(x)$, one can show
that the dipole moment of this charge distribution is
given by $\int_{-\infty}^{\infty} x\delta'(x)\,dx = -1$. Indeed, this is a special
case of the more general result for derivatives of
general order $k$,

$$\int_{-\infty}^{\infty} \delta^{(k)}(x)f(x)\,dx = (-1)^k f^{(k)}(0),$$

for sufficiently well-behaved functions $f(x)$.
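
The case $k = 1$ of this result can be checked against the Gaussian sequence (2), whose derivative is $g_n'(x) = -n^2 x\,g_n(x)$. The sketch below is an illustration under that assumption, with arbitrarily chosen test functions and quadrature settings; for $f(x) = x$ the integral equals the dipole moment $-1$ for every $n$, and for $f(x) = \sin 2x$ it equals $-2e^{-2/n^2}$, which tends to $-f'(0) = -2$.

```python
import math

def g_prime(n, x):
    """Derivative of the Gaussian g_n in (2): g_n'(x) = -n^2 * x * g_n(x)."""
    return (-(n ** 3) * x / math.sqrt(2.0 * math.pi)
            * math.exp(-0.5 * (n * x) ** 2))

def pair_with_g_prime(f, n, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of the integral of g_n'(x) * f(x)."""
    a = -half_width / n
    h = 2.0 * half_width / (n * steps)
    return h * sum(g_prime(n, a + (k + 0.5) * h) * f(a + (k + 0.5) * h)
                   for k in range(steps))

if __name__ == "__main__":
    print(pair_with_g_prime(lambda x: x, 7))                  # dipole moment
    print(pair_with_g_prime(lambda x: math.sin(2.0 * x), 50))  # close to -2
```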

5 Toward a Rigorous Theory 

A rigorous theory of generalized functions, also called
distributions, can be constructed by defining them as
continuous linear functionals, $T(\varphi)$, on the space of all
smooth, compactly supported test functions $\varphi(x)$ on
$\mathbb{R}$. The derivative of $T(\varphi)$ is defined by $dT(\varphi)/dx = -T(\varphi')$.
If $T(\varphi)$ can be expressed in the form
$T(\varphi) = \int_{-\infty}^{\infty} g(x)\varphi(x)\,dx$, where $g(x)$ is a well-defined locally
integrable function, then $T(\varphi)$ is said to be a regular
generalized function; otherwise, it is called singular.
The $\delta$ function is the singular generalized function
defined by $T(\varphi) = \varphi(0)$. So, regular generalized
functions assign a functional $T(\varphi)$ to each well-defined
function $g(x)$. But, by extending the range of allowed
$T(\varphi)$ to include singular generalized functions, one
can think of $g(x)$ as being part of a much larger set,
hence the term “generalized” function.

Another approach to the rigorous treatment of the
$\delta$ function, and one that is particularly useful in probability
theory, is to regard it as a measure. The Dirac
$\delta$ measure centered at a point $a \in \mathbb{R}$, denoted by $\delta_a$,
acts on sets $A \subset \mathbb{R}$ and is defined by the property
that $\delta_a(A) = 1$ if $a \in A$ and $\delta_a(A) = 0$ otherwise.
A measure-theoretic version of the basic property (1)
then follows from $\int_{\mathbb{R}} f(x)\,\delta_a(dx) = f(a)$.

Further Reading 
Generalised Functions. Cambridge: Cambridge University
Press. 




III.8 The Diffusion Equation 


The one-dimensional diffusion (or heat) equation is
a partial differential equation [IV.3] (PDE) for
$u(x,t)$,

$$\frac{\partial^2 u}{\partial x^2} = \frac{1}{k}\frac{\partial u}{\partial t}, \tag{1}$$

where $k$ is a constant. It is classified as a linear second-order
homogeneous parabolic equation.

The diffusion equation has solutions that are separated,
$u(x,t) = e^{\gamma x}e^{\gamma^2 kt}$, where $\gamma$ is an arbitrary constant.
There are also nonseparable solutions such as
$u(x,t) = t^{-1/2}e^{-\phi}$, where $\phi(x,t) = x^2/(4kt)$.

Equation (1) describes the conduction of heat along
a bar of metal and many other diffusion processes. In
financial mathematics, there is a famous PDE known as
the Black-Scholes equation [III.3]

$$\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2\frac{\partial^2 V}{\partial S^2} + rS\frac{\partial V}{\partial S} - rV = 0, \tag{2}$$

where $\sigma$ and $r$ are constants and the independent
variables are $S$ and $t$. Solutions of this PDE can be
constructed from solutions of the diffusion equation.
Thus, the substitutions $u(x,t) = e^{\alpha x + \beta t}V$ and $e^x = S$
in (1) show that $V$ solves (2) if we take $k = -\frac{1}{2}\sigma^2$,
$\alpha = -\frac{1}{2}(1 + r/k)$, and $\beta = \frac{1}{4}k(1 - r/k)^2$.
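
The stated values of $k$, $\alpha$, and $\beta$ can be checked numerically. The sketch below starts from the separated solution $u = e^{\gamma x}e^{\gamma^2 kt}$ of (1), forms $V = e^{-\alpha x - \beta t}u$ with $x = \log S$, and evaluates the left-hand side of (2) by central differences. The parameter values ($\sigma = 0.3$, $r = 0.05$, $\gamma = 0.8$) and the function names are arbitrary illustrative assumptions.

```python
import math

SIGMA = 0.3   # volatility (illustrative value)
R = 0.05      # interest rate (illustrative value)
GAMMA = 0.8   # separation constant in the diffusion solution (illustrative)

K = -0.5 * SIGMA ** 2
ALPHA = -0.5 * (1.0 + R / K)
BETA = 0.25 * K * (1.0 - R / K) ** 2

def u(x, t):
    """Separated solution of the diffusion equation (1) with constant K."""
    return math.exp(GAMMA * x + GAMMA ** 2 * K * t)

def V(S, t):
    """Candidate Black-Scholes solution: V = exp(-alpha x - beta t) u, x = log S."""
    x = math.log(S)
    return math.exp(-ALPHA * x - BETA * t) * u(x, t)

def black_scholes_residual(S, t, h=1e-3):
    """Left-hand side of (2), evaluated with central differences."""
    V_t = (V(S, t + h) - V(S, t - h)) / (2.0 * h)
    V_S = (V(S + h, t) - V(S - h, t)) / (2.0 * h)
    V_SS = (V(S + h, t) - 2.0 * V(S, t) + V(S - h, t)) / h ** 2
    return V_t + 0.5 * SIGMA ** 2 * S ** 2 * V_SS + R * S * V_S - R * V(S, t)

if __name__ == "__main__":
    for S, t in [(0.8, 0.1), (1.0, 0.5), (1.7, 2.0)]:
        print(S, t, black_scholes_residual(S, t))
```

The residuals should be zero to within finite-difference error. Note that $k$ is negative here, reflecting the fact that (2) is a backward diffusion equation in $t$.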

The Burgers equation [III.4] is another PDE that can
be solved using solutions of (1).
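
As an illustration of this remark, the sketch below uses the Cole-Hopf transformation described in the article on the Burgers equation [III.4]: with $k = \nu$, if $w$ solves (1) then $u = -2\nu w_x/w$ solves $u_t + uu_x = \nu u_{xx}$. Here $w = 1 + e^{\gamma x + \gamma^2\nu t}$, a sum of two separated solutions, is an arbitrary choice made for this example, as are the parameter values and names; the Burgers residual is evaluated by central differences.

```python
import math

NU = 0.1     # viscosity, playing the role of k in (1) (illustrative value)
GAMMA = 1.5  # separation constant (illustrative value)

def w(x, t):
    """Solution of w_xx = (1/NU) w_t: constant plus a separated solution."""
    return 1.0 + math.exp(GAMMA * x + GAMMA ** 2 * NU * t)

def w_x(x, t):
    """Exact x-derivative of w."""
    return GAMMA * math.exp(GAMMA * x + GAMMA ** 2 * NU * t)

def u(x, t):
    """Cole-Hopf transform of w."""
    return -2.0 * NU * w_x(x, t) / w(x, t)

def burgers_residual(x, t, h=1e-3):
    """u_t + u u_x - NU u_xx, evaluated with central differences."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2.0 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2.0 * h)
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / h ** 2
    return u_t + u(x, t) * u_x - NU * u_xx

if __name__ == "__main__":
    for x, t in [(-1.0, 0.0), (0.2, 0.5), (1.5, 2.0)]:
        print(x, t, u(x, t), burgers_residual(x, t))
```

The resulting $u$ is a smooth front joining $0$ to $-2\nu\gamma$, and the residuals vanish up to discretization error.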


III.9 The Dirac Equation 

Mark R. Dennis 


In quantum theory, the Dirac equation is the relativistic
counterpart to Schrödinger's equation [III.26],
representing the space-time dependence of an electron
wave packet. It is one of the most fundamental equations
in physics, combining the formalism of quantum
physics with special relativity, and its solutions
naturally lead to the concept of antimatter.

In quantum mechanics [IV.23], when considering
the wave function $\Psi(r,t)$ of a quantum particle, quantities
such as position in 3-space $r$, time $t$, momentum
$p$, and energy $E > 0$ are related to multiplication operators
or the product of differential operators multiplied
by the imaginary unit i times the quantum of action,
Planck's constant $\hbar$. These include the energy operator
$\mathrm{i}\hbar\partial_t$ and the momentum operator with components
$-\mathrm{i}\hbar\partial_j$, with $\partial_j$ for $j = 1,2,3$ denoting derivatives on
the spatial coordinates of $r$. The Schrödinger equation
$\mathrm{i}\hbar\partial_t\Psi = H\Psi$ thus expresses equality of the energy operator
and a Hamiltonian operator $H$ when each acts on
the quantum wave function $\Psi(r,t)$. However, in special
relativity, the energy-momentum relationship is

$$E^2 = |p|^2c^2 + m^2c^4 \tag{1}$$


for a particle of rest mass m, with constant c the speed 
of light. We will refer to the quantum particle of interest 
as an “electron” and denote its electric charge by e. 

The Klein-Gordon equation is one possible quantum
equation corresponding to (1),

$$c^{-2}\partial_t^2\varphi - \nabla^2\varphi + \frac{m^2c^2}{\hbar^2}\varphi = 0, \tag{2}$$

for a generally complex-valued wave function $\varphi(r,t)$.
However, there are physical problems in interpreting
a solution of (2) as the relativistic counterpart of
a Schrödinger wave function: unlike the Schrödinger
equation, (2) involves the second derivative of time, so
two initial conditions at $t = 0$ are required to specify
a solution (whereas only one is required for the
Schrödinger equation), and there are problems with
defining a continuity equation for $|\varphi|^2$ as a probability
density. Furthermore, (2) does not include the two components
of electron spin. Therefore, to write down a
quantum mechanical partial differential equation (PDE)
corresponding to (1) that is first order in time, one must
find an appropriate square root of the overall operator
acting on the left-hand side of (2).

Paul Dirac famously resolved this problem algebraically,
proposing the Dirac equation

$$\mathrm{i}\hbar\partial_t\psi = \sum_{j=1}^{3}\alpha_j c p_j\psi + \beta mc^2\psi. \tag{3}$$

Unlike the Schrödinger and Klein-Gordon equations, (3)
is a vector-matrix equation, where $\psi = \psi(r,t)$ is a
wave function with four complex components, called
a Dirac spinor (the components are not related to four-dimensional
space-time), and $\alpha_j$, $\beta$ are the $4\times4$ Dirac
matrices, most commonly written in block form:

$$\alpha_j = \begin{pmatrix} 0 & \sigma_j \\ \sigma_j & 0 \end{pmatrix}, \qquad \beta = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$

Here, 0 and 1 denote the $2\times2$ zero and identity
matrices, and the $\sigma_j$ denote the three Pauli matrices:

$$\sigma_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad \sigma_2 = \begin{pmatrix} 0 & -\mathrm{i} \\ \mathrm{i} & 0 \end{pmatrix}, \qquad \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}.$$

The square of each Dirac matrix so defined is the identity
matrix, and otherwise the matrices anticommute;
that is, if $i \ne j$ then $\alpha_i\alpha_j + \alpha_j\alpha_i = \alpha_j\beta + \beta\alpha_j = 0$. These
properties imply that the square of the operators on
the left- and right-hand sides of (3) gives the operator
equivalent of (1). Thus, the Dirac matrices themselves
form a Clifford algebra, known as the Dirac algebra.

When considering solutions in the nonrelativistic,
low-velocity limit $|p|/m \ll c$, the Dirac equation
asymptotically approaches the Schrödinger equation
for a two-component spin (in the first two components
of the Dirac spinor; the second two are vanishingly
small), recovering the familiar nonrelativistic quantum
behavior of the electron. Dirac therefore found a
relativistic, quantum PDE that is first order in space
and time, albeit requiring the quantum electron to be
described by the four-component Dirac spinor $\psi$, each
component of which, in fact, satisfies the Klein-Gordon
equation (2). The fact that the Pauli spin matrices
(previously included ad hoc in the Schrödinger equation
to explain the behavior of the electron's quantum
mechanical spin) appear naturally in the Dirac algebra
has been seen as one of the great successes of the Dirac
equation as the fundamental quantum equation for the
electron.

Although a matrix-vector equation, the form (3) of
the Dirac equation is similar to the Schrödinger equation,
and indeed the operator $\sum_{j=1}^{3}\alpha_j c p_j + \beta mc^2$ is
referred to as the Dirac Hamiltonian. The Dirac equation
may also be written in a relativistically covariant
form, by defining the set of four gamma matrices
$\gamma^0 = \beta$, $\gamma^j = \gamma^0\alpha_j$, enabling (3) to be rewritten as

$$\mathrm{i}\hbar\sum_{a=0}^{3}\gamma^a\partial_a\psi = mc\psi, \tag{4}$$

with $\partial_0 = c^{-1}\partial_t$. Usually, this form of the Dirac equation
is written in the Einstein summation convention
(omitting explicit summation symbols for repeated
4-vector components, as in the article on tensors and
manifolds [II.33]).

It is natural to define a current 4-vector
$j^a = \psi^\dagger\gamma^0\gamma^a\psi$, where $a = 0,\ldots,3$, and $\psi^\dagger$ is the adjoint
(conjugate transpose) of $\psi$. The 0-component $\psi^\dagger\psi$
is the nonnegative-definite probability associated with
the Dirac particle described by $\psi$, and the other components
$\psi^\dagger\alpha_j\psi$ give a 3-velocity field, and hence the electric
current on multiplication by $e$. The 4-divergence
of this current vanishes: $\sum_a \partial_a j^a = 0$. This is interpreted
as both the local conservation of probability
and, after multiplication by $e$, the local conservation of
the electric charge distribution determined by $\psi$. The
3-velocity so defined is different from the 3-momentum,
$\psi^\dagger p_j\psi$, $j = 1,2,3$. This distinction between velocity
and momentum gives rise to some surprising aspects
of Dirac particles, such as the difference between quantum
numbers associated with the electron's magnetic
moment (related to velocity) and its angular momentum
(related to momentum). For the remainder of this
article we will adopt the convention in relativistic quantum
theory of working in units where the constants $c$
and $\hbar$ are unity.

The Dirac equation, being a PDE for a multicomponent
wave field dependent on space and time compatible
with special relativity, is comparable to the set of
Maxwell's equations [III.22], which has similar properties
(particularly in the case when $m = 0$, which is
sometimes referred to as Weyl's equation). In particular,
it has four linearly independent plane-wave solutions,
of the form $\exp(\mathrm{i}(-Et + p\cdot r))$ times the constant
unnormalized Dirac spinors

$$\psi_1^{\mathrm{pw}} = (E + m,\ 0,\ p_3,\ p_1 + \mathrm{i}p_2),$$
$$\psi_2^{\mathrm{pw}} = (0,\ E + m,\ p_1 - \mathrm{i}p_2,\ -p_3),$$
$$\psi_3^{\mathrm{pw}} = (-p_3,\ -(p_1 + \mathrm{i}p_2),\ -E + m,\ 0),$$
$$\psi_4^{\mathrm{pw}} = (-(p_1 - \mathrm{i}p_2),\ p_3,\ 0,\ -E + m).$$

The solutions involving $\psi_1^{\mathrm{pw}}$ and $\psi_2^{\mathrm{pw}}$ are interpreted as
electron plane waves in spin up and spin down states,
respectively, especially in the nonrelativistic regime
$p \ll m$. However, $\psi_3^{\mathrm{pw}}$, $\psi_4^{\mathrm{pw}}$ appear to be negative
energy solutions, reflecting the fact that the relativistic
energy relation (1), involving only $E^2$, should mathematically
admit negative energies as well as positive
energies.
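
Both the algebraic properties of the Dirac matrices and the plane-wave solutions can be verified by direct matrix arithmetic. The sketch below assumes the standard block (Dirac) representation, $\alpha_j$ with off-diagonal blocks $\sigma_j$ and $\beta = \operatorname{diag}(1, 1, -1, -1)$, works in units with $c = \hbar = 1$, and uses helper names and numerical values chosen here purely for illustration. It checks that the spinors are eigenvectors of the Hamiltonian $\sum_j \alpha_j p_j + \beta m$, the first two with $E = +\sqrt{|p|^2 + m^2}$ and the last two with $E = -\sqrt{|p|^2 + m^2}$.

```python
import math

def block(a, b, c, d):
    """Assemble a 4x4 matrix from 2x2 blocks [[a, b], [c, d]]."""
    return [a[0] + b[0], a[1] + b[1], c[0] + d[0], c[1] + d[1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def anticommutator(A, B):
    AB, BA = matmul(A, B), matmul(B, A)
    return [[AB[i][j] + BA[i][j] for j in range(4)] for i in range(4)]

def max_abs(A):
    return max(abs(x) for row in A for x in row)

ZERO2 = [[0, 0], [0, 0]]
ONE2 = [[1, 0], [0, 1]]
MINUS_ONE2 = [[-1, 0], [0, -1]]
PAULI = [[[0, 1], [1, 0]], [[0, -1j], [1j, 0]], [[1, 0], [0, -1]]]
ALPHA = [block(ZERO2, s, s, ZERO2) for s in PAULI]
BETA = block(ONE2, ZERO2, ZERO2, MINUS_ONE2)
IDENTITY4 = block(ONE2, ZERO2, ZERO2, ONE2)

def hamiltonian(p, m):
    """Dirac Hamiltonian sum_j alpha_j p_j + beta m for numeric momentum p."""
    return [[sum(p[j] * ALPHA[j][r][c] for j in range(3)) + m * BETA[r][c]
             for c in range(4)] for r in range(4)]

def plane_wave_spinors(p, m, E):
    """The four unnormalized spinors listed in the text, for a given E."""
    p1, p2, p3 = p
    return [
        [E + m, 0, p3, p1 + 1j * p2],
        [0, E + m, p1 - 1j * p2, -p3],
        [-p3, -(p1 + 1j * p2), -E + m, 0],
        [-(p1 - 1j * p2), p3, 0, -E + m],
    ]

def eigen_residual(H, psi, E):
    """Largest component of H psi - E psi in absolute value."""
    return max(abs(sum(H[r][c] * psi[c] for c in range(4)) - E * psi[r])
               for r in range(4))

if __name__ == "__main__":
    p, m = (0.3, -0.4, 1.2), 1.0
    E = math.sqrt(sum(q * q for q in p) + m * m)
    H = hamiltonian(p, m)
    positive = plane_wave_spinors(p, m, E)[:2]
    negative = plane_wave_spinors(p, m, -E)[2:]
    print([eigen_residual(H, s, E) for s in positive])
    print([eigen_residual(H, s, -E) for s in negative])
```

Squaring the Hamiltonian numerically, with `matmul(H, H)`, returns $(|p|^2 + m^2)$ times the identity, which is the matrix form of the energy relation (1).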

The existence of negative energy solutions is one of
the most striking aspects of the Dirac equation (similar
solutions also exist for the Klein-Gordon equation (2)),
not least because physically they appear to preclude
energetic ground states; it suggests that relativistic
electrons may constantly decay to successively lower
energies without bound, at odds with our physical experience.
Dirac himself proposed the following resolution
to this problem: the negative energies are already filled
with a Dirac sea of electrons. Free electrons cannot then
occupy these filled negative energy states and must
have nonnegative energy. The existence of this postulated
Dirac sea has the further consequence that a
negative energy electron might, through some process,
become excited into a positive energy state, leaving a
positively charged “hole” in the sea whose plane-wave
components act according to the negative energy solutions
$\psi_3^{\mathrm{pw}}$ and $\psi_4^{\mathrm{pw}}$ (in a similar way to “holes” appearing
in filled valence bands in semiconductors). Dirac
originally identified these “positive charge electrons” as
protons, despite the fact that a proton has a mass that
is very different from the electron mass $m$. However,
positrons, which appeared to have the same mass as
electrons but opposite charge, were discovered experimentally
by Anderson in 1932, just four years after
Dirac's theory. This prediction of a new kind of particle
earned Dirac the Nobel Prize in Physics in 1933,
shared with Schrödinger.

To incorporate the interaction between an electron
and the electromagnetic field, the momentum operator
$p_j$ must be replaced with the appropriate canonical
momentum for a charged particle moving in a
field from the Lorentz force equation (as in classical
mechanics [IV.19]). In a scalar potential $V$ and vector
potential $A$, giving a 4-potential $A_a$, this replacement in
(3), (4) requires replacing the usual partial derivatives
with the gauge-covariant derivative

$$\partial_a \to D_a = \partial_a + \mathrm{i}eA_a, \quad a = 0, 1, 2, 3.$$

Finding self-consistent solutions of Maxwell’s equa- 
tions and the Dirac equation with interactions is analyt- 
ically difficult. Nevertheless, very good approximations 
are possible that agree well with experiment, particu- 
larly in quantum electrodynamics, which is a system- 
atic quantum approach to solving systems with many 
electrons interacting quantum mechanically with the 
electromagnetic field. Quantum electrodynamics, like 
other quantum field theories, requires a more sophis- 
ticated mathematical approach based on the Dirac and 
Maxwell equations and relates the negative energy solu- 
tions of the Dirac and Klein-Gordon equations to pos- 
itive energy states of antimatter (i.e., positrons are 
“anti-electrons”). Despite many mathematical compli- 
cations, this theory successfully describes the evolu- 
tion of many interacting quantum particles, including 
the possibility of their creation and annihilation in a 
fluctuating vacuum. 

Further Reading 

Dirac, P. A. M. 1928a. The quantum theory of the electron.
Proceedings of the Royal Society of London A 118:351-61.
Folland, G. B. 2008. Quantum Field Theory: A Tourist Guide
ical Society.

Lounesto, P. 2001. Clifford Algebras and Spinors, 2nd edn.
Cambridge: Cambridge University Press.

Zee, A. 2010. Quantum Field Theory in a Nutshell, 2nd edn.


III.10 Einstein's Field Equations

Malcolm A. H. MacCallum 


Einstein's general theory of relativity generalizes Newton's
gravity theory to one compatible with special relativity.
It models space and time points as a (pseudo-)Riemannian
four-dimensional manifold (see tensors
and manifolds [II.33]) with a metric $g_{ab}$ of signature
$\pm 2$ (the sign choice is conventional). Test particles move
on space-time's geodesics. This formulation ensures
the “weak equivalence principle” or “universality of free
fall,” which states that free fall under gravity depends
only on a body's initial position and momentum. The
other fundamental part of the theory is Einstein's field
equations (EFEs), which relate the metric to the matter
present. For more on the theory and its applications see
general relativity and cosmology [IV.40].

By considering a “gedanken” experiment in which 
bodies released at relative rest in a laboratory freely 
falling toward the Earth will appear to move toward one 
another, Einstein recognized that geodesics that are ini- 
tially parallel meet due to gravity. This is described by 
the metric’s curvature. To generalize Newton’s theory, 
the curvature must be related to the space-time distribution
of the energy-momentum tensor of the matter
content, $T^{ab}$.

$T^{ab}$ is assumed to obey $T^{ab}{}_{;b} = 0$, the generalization
to curved space of the Noetherian conservation laws
obtained when the matter obeys a variational principle.
Here, “$;b$” denotes a covariant derivative with respect
to the $b$ index, while “$,b$” will denote a partial derivative
below.

The formulas relating the metric, the connection
$\Gamma^a{}_{bc}$, and the Riemannian curvature, in coordinate
components, are

$$\Gamma^a{}_{bc} = \tfrac{1}{2}g^{ad}(g_{bd,c} + g_{dc,b} - g_{bc,d}),$$
$$R^a{}_{bcd} = \Gamma^a{}_{bd,c} - \Gamma^a{}_{bc,d} + \Gamma^e{}_{bd}\Gamma^a{}_{ec} - \Gamma^e{}_{bc}\Gamma^a{}_{ed},$$

where $g^{ad}$ is the inverse of $g_{bc}$.
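
To see the connection formula at work in the simplest curved example, the sketch below evaluates $\Gamma^a{}_{bc}$ for the unit 2-sphere, with metric $ds^2 = d\theta^2 + \sin^2\theta\,d\phi^2$, by differencing the metric numerically. The sphere is chosen here only as a familiar two-dimensional illustration (it is not a space-time), and the function names and step size are assumptions; the known results are $\Gamma^\theta{}_{\phi\phi} = -\sin\theta\cos\theta$ and $\Gamma^\phi{}_{\theta\phi} = \cot\theta$.

```python
import math

DIM = 2  # coordinates x = (theta, phi)

def metric(x):
    """Metric g_{bc} of the unit 2-sphere (illustrative example)."""
    theta = x[0]
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]

def inverse_2x2(g):
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]

def metric_partial(x, d, h=1e-5):
    """Central-difference approximation of g_{bc,d} at the point x."""
    xp, xm = list(x), list(x)
    xp[d] += h
    xm[d] -= h
    gp, gm = metric(xp), metric(xm)
    return [[(gp[b][c] - gm[b][c]) / (2.0 * h) for c in range(DIM)]
            for b in range(DIM)]

def christoffel(x):
    """Gamma[a][b][c] = (1/2) g^{ad} (g_{bd,c} + g_{dc,b} - g_{bc,d})."""
    g_inv = inverse_2x2(metric(x))
    dg = [metric_partial(x, d) for d in range(DIM)]  # dg[d][b][c] = g_{bc,d}
    return [[[0.5 * sum(g_inv[a][d] * (dg[c][b][d] + dg[b][d][c] - dg[d][b][c])
                        for d in range(DIM))
              for c in range(DIM)]
             for b in range(DIM)]
            for a in range(DIM)]

if __name__ == "__main__":
    theta = 1.0
    gamma = christoffel([theta, 0.3])
    print(gamma[0][1][1], -math.sin(theta) * math.cos(theta))
    print(gamma[1][0][1], math.cos(theta) / math.sin(theta))
```

Each printed pair should agree to within the finite-difference error; the same routine extends to four dimensions once `metric` and the matrix inverse are replaced.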

Taking a weak-field, slow-motion limit, comparison
of the geodesic equation with Newtonian free fall identifies
corrections to an approximating flat (special-relativistic)
metric with the Newtonian gravitational
potential $\Phi$. One therefore wants to find equations
relating the second derivative of the metric to $T_{ab}$.
Defining the Ricci tensor $R_{ab}$ and the Ricci scalar $R$ by

$$R_{bd} := R^a{}_{bad}, \qquad R := g^{ab}R_{ab},$$

the EFEs

$$G_{ab} := R_{ab} - \tfrac{1}{2}Rg_{ab} = \kappa T_{ab} + \Lambda g_{ab} \tag{1}$$

achieve this relation. Here, $\Lambda$ is a constant and $G_{ab}$ is
called the Einstein tensor.

To agree with the Newtonian limit, the constant $\kappa$
has to be $8\pi G/c^4$, where $G$ is the Newtonian constant
of gravitation and $c$ the speed of light.

The full curvature can be expressed, in four dimensions,
as

$$R^{ab}{}_{cd} = C^{ab}{}_{cd} - \tfrac{1}{3} R\, \delta^a{}_{[c} \delta^b{}_{d]} + 2 \delta^{[a}{}_{[c} R^{b]}{}_{d]},$$

where square brackets denote antisymmetrization, so
that, for any tensor, $T_{[ab]} := \tfrac{1}{2}(T_{ab} - T_{ba})$. The Weyl
tensor $C^a{}_{bcd}$ thus defined is conformally invariant and
describes tidal gravitational forces and gravitational
waves. Its value depends on distant matter and the
boundary conditions: it is nonlocally determined by
certain first-order differential equations, the Bianchi
identities, with a source given by derivatives of $R_{ab}$.

There are a number of other ways to write (1). A
first-order system arises by taking the metric and its
connection as variables. Tetrad bases are widely used
in place of coordinate bases. The EFEs themselves are
given by a variational principle (assuming that $T^{ab}$ is),
with the $G_{ab}$ part coming from an action $\int R \sqrt{-g}\, d^4x$,
where $g = \det(g_{ab})$.

Calculating in coordinates, the ten components of (1)
each have on the order of $10^4$ terms in components of
$g_{ab}$, they are quasilinear in second derivatives of $g_{ab}$,
and they are nonlinear of degree 8 in $g_{ab}$ itself (after
clearing denominators). The tensorial EFEs are therefore
a set of coupled nonlinear inhomogeneous partial
differential equations. Lovelock proved that in four
dimensions, $G_{ab}$ is the only symmetric divergence-free
tensorial concomitant of $g_{ab}$ that is linear in second
derivatives, so the EFEs are the unique field equations
of this character.

No general solution is known, except in the sense of 
the integral form given by Sciama, Waylen, and Gilman, 
but many specific solutions have been found. Their 
local geometry can be completely characterized by 
components of the curvature tensor and its covariant 
derivatives. 

General relativity shares with other physical theo- 
ries the property that the evolution is unique given ini- 
tial values of the field and its first derivative. In this 
case these are the induced metric on an initial space- 
like surface, and its derivative off the surface (the first 
and second “fundamental forms”). Four of the EFEs
constrain these initial values, the remaining equations
giving six second-order evolution equations. The con- 
straint equations are elliptic and the evolution equa- 
tions hyperbolic; the characteristic speed is that of 
light. 

Existence and uniqueness theorems have been ob- 
tained for the evolution equations in this form. Typ- 
ically, the functions that appear lie in appropriate 
Sobolev spaces. 

As well as such Cauchy problems, the EFEs can be 
studied in a “2 + 2” formalism, where data is given 
on a pair of intersecting two-dimensional characteristic 
surfaces. 

It was recently recognized that the EFEs in standard 
form are only weakly hyperbolic, which explained prob- 
lems that had arisen in numerical integrations, and 
strongly hyperbolic reformulations are now available 
that allow fully four-dimensional numerical computations
(see NUMERICAL RELATIVITY [V.15]).

In both numerical and analytic studies, important 
problems such as the generation of gravitational waves 
require characterization of isolated bodies. This is 
achieved by defining asymptotic flatness, when at spa- 
tial or light-like (null) infinity the geometry approaches 
that of special relativity’s empty Minkowski space. A 
conformal transformation by a factor $\Omega$ is used in the
precise definition; this transformation preserves null
directions and gives a single point, denoted by $i^0$, at
spatial infinity. The definition, which is abbreviated as
AEFANSI (for “asymptotically empty and flat at null and
spatial infinity”), specifies the behavior of $\Omega$ and the
Ricci tensor near $i^0$.

Space-times may have more than one asymptotic 
infinity (see, for example, the Kruskal diagram in GENERAL
RELATIVITY AND COSMOLOGY [IV.40]). Weakly
asymptotically simple (WAS) space-times are those for 
which only one asymptotic region need be considered. 
There is a large class of WAS space-times for which 
null infinity is smooth, for which fields such as the 
Weyl tensor can be expanded in inverse powers of a 
radial distance r, and for which there is an asymp- 
totic symmetry, the Bondi-Metzner-Sachs group. These 
results have been generalized to cases where log r 
terms appear in expansions; it is conjectured that only 
a rather restricted set of cases avoids such terms. 

There is no locally defined energy of the gravita- 
tional field in general relativity; such an energy could 
not be compatible with the local special relativistic 
limit required by the principle of equivalence. However,
asymptotically flat spaces do possess globally
defined energies, using integrals at infinity (the ADM
energy for a spatial infinity, named after the formalism
introduced by Arnowitt, Deser, and Misner, and the
Trautman-Bondi energy for a future null infinity), thus
enabling definition of the total gravitational energy of
an isolated system.

The ADM energy has been proved to be nonnegative
(assuming that the matter that is present
obeys the dominant energy condition, namely that $T_{ab} v^a v^b \geq 0$
and that $T_{ab} v^b$ is non-space-like for all time-like vectors $v^a$),
and it is zero only in Minkowski space. The Trautman-
Bondi energy is monotone decreasing with time, as radi- 
ation carries energy away, and agrees with ADM energy 
in its past limit and hence must also be nonnegative 
(as can also be proved directly). These results show in 
particular that gravitational waves carry energy. How- 
ever, the waves cannot carry away more energy than an 
isolated system has initially, as total energy would then 
become negative. The positive energy results also imply 
that there is no analogue of the possibility, present in 
Newtonian gravity theory, that negative gravitational 
potential energy could be greater than the positive 
energy of the field’s source. 

Further Reading 

Griffiths, J., and J. Podolsky. 2009. Exact Space-Times in Ein- 
stein’s General Relativity. Cambridge: Cambridge Univer- 
sity Press. 

Hawking, S., and G. Ellis. 1973. The Large Scale Structure of 
Space-Time. Cambridge: Cambridge University Press. 
Stephani, H., D. Kramer, M. MacCallum, C. Hoenselaers, and 
E. Herlt. 2003. Exact Solutions of Einstein’s Field Equa- 
tions, 2nd edn. Cambridge: Cambridge University Press. 
(Corrected paperback reprint, 2009.) 

Wald, R. 1984. General Relativity. Chicago, IL: University of 
Chicago Press. 

See also http://relativity.livingreviews.org/ for many valu- 
able articles. 


III.11 The Euler Equations

P. A. Martin


The motion of a fluid is well modeled by the NAVIER-STOKES
EQUATIONS [III.23]. An underlying assumption
is that the fluid is viscous (meaning that it is sticky, 
in the sense that there is some resistance to shearing 
motions). Assume further that the fluid is incompress- 
ible (meaning that the density is constant). Then, if 
the viscous effects are removed from the Navier-Stokes 
equations (even though real fluids are always viscous to 
some extent), the result is known as the Euler equations. 


To state them, let $\mathbf{x} = (x_1, x_2, x_3)$ be the position vector
of a point in the fluid and let $\mathbf{u}(\mathbf{x}, t) = (u_1, u_2, u_3)$
be the fluid velocity at $\mathbf{x}$ at time $t$. Then

$$\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} + \frac{1}{\rho} \nabla p = 0, \qquad (1)$$

where $p(\mathbf{x}, t)$ is the pressure and $\rho$ is the density. The
vector equation (1) is to be solved together with the
incompressibility constraint, which is

$$\nabla \cdot \mathbf{u} = 0. \qquad (2)$$

There are thus four partial differential equations (PDEs)
for the four unknowns, $u_1$, $u_2$, $u_3$, and $p$. These form
the incompressible Euler equations for inviscid (zero-viscosity)
flows; there are also compressible Euler equations
for flows in which the fluid density is not constant
but has to be calculated. If body forces are acting (the
most important of these is gravity), there would be an
extra term on the right-hand side of (1).

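To clarify the notation used in (1) and (2), we can
write them in component form:

$$\frac{\partial u_i}{\partial t} + \sum_{j=1}^{3} u_j \frac{\partial u_i}{\partial x_j} + \frac{1}{\rho} \frac{\partial p}{\partial x_i} = 0, \qquad \sum_{j=1}^{3} \frac{\partial u_j}{\partial x_j} = 0,$$

where $i = 1, 2, 3$. Alternatively, if we use $\mathbf{x} = (x, y, z)$
and $\mathbf{u} = (u, v, w)$, we can write out Euler’s equations
explicitly:

$$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y} + w \frac{\partial u}{\partial z} + \frac{1}{\rho} \frac{\partial p}{\partial x} = 0,$$

$$\frac{\partial v}{\partial t} + u \frac{\partial v}{\partial x} + v \frac{\partial v}{\partial y} + w \frac{\partial v}{\partial z} + \frac{1}{\rho} \frac{\partial p}{\partial y} = 0,$$

$$\frac{\partial w}{\partial t} + u \frac{\partial w}{\partial x} + v \frac{\partial w}{\partial y} + w \frac{\partial w}{\partial z} + \frac{1}{\rho} \frac{\partial p}{\partial z} = 0,$$

$$\frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} + \frac{\partial w}{\partial z} = 0.$$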


An important special case arises when the motion is
irrotational, which means that the vorticity $\nabla \times \mathbf{u} = 0$. In
this case, $\mathbf{u} = \nabla\phi$ for some scalar (potential) function
$\phi$. From (2), $\nabla^2 \phi = 0$; that is, $\phi$ solves LAPLACE’S EQUATION
[III.18]. Moreover, the nonlinear PDEs (1) reduce to
a formula for $p$ that is known as Bernoulli’s equation.
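The reduction to Bernoulli’s equation can be sketched as follows, assuming no body forces and constant $\rho$; the function $f(t)$ is an arbitrary function of time introduced here by the integration, not a symbol from the text above.

```latex
% Standard vector identity:
%   (u . grad) u = grad( |u|^2 / 2 ) - u x (curl u).
% For irrotational flow, curl u = 0 and u = grad(phi), so (1) becomes
\nabla\!\left( \frac{\partial \phi}{\partial t}
      + \tfrac{1}{2}\,|\nabla\phi|^{2}
      + \frac{p}{\rho} \right) = 0 ,
% and integrating in space gives Bernoulli's equation for the pressure:
\qquad
\frac{p}{\rho} = f(t) - \frac{\partial \phi}{\partial t}
      - \tfrac{1}{2}\,|\nabla\phi|^{2} .
```

Once $\phi$ has been found from Laplace’s equation, the pressure therefore follows without solving any further differential equation.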

Irrotational flows of inviscid incompressible fluids 
have been studied extensively since the nineteenth cen- 
tury. However, it is also known that the underlying 
assumptions are too restrictive in some circumstances 
because they lead to some results that do not agree 
with our experience. Perhaps the most glaring exam- 
ple is the d’Alembert paradox, a mathematical theorem 
asserting that irrotational flow of an inviscid incompressible
fluid about a rigid body generates no drag
force on the body. The conventional way to overcome
the paradox is to bring back viscosity but only inside a
thin BOUNDARY LAYER [II.2] attached to the body (see
also FLUID MECHANICS [IV.28 §7.2]).


III.12 The Euler-Lagrange Equations

Paul Glendinning


The function $y(x)$ with derivative $y' = dy/dx$ that
maximizes or minimizes the integral

$$\int F(y, y', x)\, dx$$

with given endpoints satisfies the Euler-Lagrange equation

$$\frac{d}{dx}\left(\frac{\partial F}{\partial y'}\right) - \frac{\partial F}{\partial y} = 0. \qquad (1)$$

There are many variants of this equation to deal with
further complications, e.g., if $y$ or $x$ or both are vectors,
and more details are given in CALCULUS OF VARIATIONS
[IV.6], but this simple version is sufficient to demonstrate
the power and ubiquity of variational problems
of this form.

If $F = F(y, y')$ has no explicit $x$-dependence ($x$ is
said to be absent), then the Euler-Lagrange equations
can be simplified by finding a first integral. Using (1) it
is straightforward to show that

$$\frac{d}{dx}\left(y' \frac{\partial F}{\partial y'} - F\right) = 0,$$

and hence that

$$y' \frac{\partial F}{\partial y'} - F = A \qquad (2)$$

for some constant $A$.


Application 1: Potential Forces

Classical mechanics can be formulated as a problem of
minimizing the integral of a function called the Lagrangian,
$\mathcal{L}$, which is the kinetic energy minus the potential
energy. For a particle moving in one dimension with
position $q$ (so the dependent variable $q$ plays the role
of $y$ above and time $t$ plays the role of the independent
variable $x$) in a potential $V(q)$, the Lagrangian
is $\mathcal{L} = \tfrac{1}{2} m \dot{q}^2 - V(q)$ and the Euler-Lagrange equation
(1) is simply Newton’s law for the acceleration,
$m\ddot{q} = -V'(q)$ (where the prime denotes differentiation
with respect to $q$), while the autonomous version
(2) shows that $\tfrac{1}{2} m \dot{q}^2 + V(q)$ is constant, which is
the conservation of energy (see CLASSICAL MECHANICS
[IV.19]).

The power of this approach (and a related version
due to Hamilton) is such that much of modern theoretical
physics revolves round a generalization called an
action.
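The conservation law for a particle in a potential can be checked numerically. The sketch below integrates $m\ddot{q} = -V'(q)$ with the velocity-Verlet scheme and records the energy $\tfrac{1}{2} m \dot{q}^2 + V(q)$ along the trajectory. The quartic potential $V(q) = q^4/4$, the initial data, the step size, and the choice of integrator are all assumptions made for illustration.

```python
def potential(q):
    """Illustrative potential V(q) = q^4 / 4 (an assumed example)."""
    return 0.25 * q ** 4

def force(q):
    """Force -V'(q) for the potential above."""
    return -q ** 3

def energy(m, q, qdot):
    """Kinetic plus potential energy, the quantity conserved according to (2)."""
    return 0.5 * m * qdot ** 2 + potential(q)

def integrate(m=1.0, q=1.0, qdot=0.0, dt=1e-3, steps=20_000):
    """Velocity-Verlet integration of m q'' = -V'(q); returns the energy history."""
    energies = [energy(m, q, qdot)]
    acc = force(q) / m
    for _ in range(steps):
        q += qdot * dt + 0.5 * acc * dt * dt
        new_acc = force(q) / m
        qdot += 0.5 * (acc + new_acc) * dt
        acc = new_acc
        energies.append(energy(m, q, qdot))
    return energies

energies = integrate()
drift = max(abs(e - energies[0]) for e in energies)
print("initial energy:", energies[0])
print("largest deviation from initial energy:", drift)
```

The deviation is limited by the discretization error of the integrator, which shrinks as `dt` is reduced.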


Application 2: The Catenary 

The problem of determining the curve describing the
rest state of a heavy chain or cable with fixed endpoints
can also be solved using the Euler-Lagrange
formulation, although the original seventeenth-century
solution uses simple mechanics. In the rest state the
chain will assume a shape $y = y(x)$ that minimizes
the potential energy $g \int y\, ds$, where $g$ is the acceleration
due to gravity and $s$ is the arc length along the
chain. The length of the chain is $\int ds$, and since this
length is assumed to be constant, $L$ say, $\int ds = L$. This
acts as a constraint on the solutions of the energy-minimization
problem and so the full problem can be
approached by introducing a LAGRANGE MULTIPLIER
[I.3 §10], $\lambda$. Scaling out the constant $g$ and noting that
$ds = \sqrt{1 + y'^2}\, dx$, the shape of the curve minimizes

$$\int y \sqrt{1 + y'^2}\, dx - \lambda \left( \int \sqrt{1 + y'^2}\, dx - L \right). \qquad (3)$$

(The second term represents the constraint and is zero
when the constraint is satisfied.) The Euler-Lagrange
equation with

$$F(y, y', \lambda) = y \sqrt{1 + y'^2} - \lambda \sqrt{1 + y'^2}$$

can now be used since the $\lambda L$ term of (3) is constant
with respect to variations in $y$. The Euler-Lagrange
equation is supplemented by an additional equation
obtained by extremizing with respect to the Lagrange
multiplier, i.e., setting the derivative of (3) with respect
to $\lambda$ to zero, but this is just the length constraint again.
Since $x$ is absent, (2) implies that

$$(y - \lambda) \left( \frac{y'^2}{\sqrt{1 + y'^2}} - \sqrt{1 + y'^2} \right) = A.$$

Tidying up the left-hand side and rearranging gives
$A^2 (1 + y'^2) = (y - \lambda)^2$. Rewriting this as an expression
for $y'$ gives a differential equation that can be solved
by separation of variables to give

$$y = \lambda - A \cosh\left( \frac{x - B}{A} \right),$$

where $B$ is a further constant of integration. This is the
catenary curve, and the constants are determined by
the endpoints of the chain and the constraint that the
total length is $L$.
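To show how the constants are fixed in practice, the sketch below treats a symmetric special case: endpoints at $(\pm a, 0)$ and total length $L > 2a$. Writing the catenary as $y = \lambda + c \cosh(x/c)$ with $c > 0$ (so $c$ plays the role of $|A|$ and $B = 0$ by symmetry), the length constraint becomes $2c \sinh(a/c) = L$, which is solved here by bisection. The symmetric setup, the function names, and the bisection bounds are assumptions made for this example.

```python
import math

def catenary_scale(a, length, tol=1e-12):
    """Solve 2*c*sinh(a/c) = length for c > 0 by bisection (needs length > 2*a)."""
    if length <= 2 * a:
        raise ValueError("the chain must be longer than the distance between supports")

    def excess(c):
        # Arc length of the catenary between -a and a, minus the required length.
        return 2 * c * math.sinh(a / c) - length

    lo = a / 50.0          # small c gives a very long, deeply sagging chain
    if excess(lo) <= 0:
        raise ValueError("chain too long for the bracket used in this sketch")
    hi = a
    while excess(hi) > 0:  # arc length decreases toward 2*a as c grows
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def catenary_height(x, a, length):
    """Height y(x) of the chain hung from (-a, 0) and (a, 0)."""
    c = catenary_scale(a, length)
    lam = -c * math.cosh(a / c)   # chosen so that y(+a) = y(-a) = 0
    return lam + c * math.cosh(x / c)

c = catenary_scale(1.0, 3.0)
print("scale c:", c)
print("sag at midpoint:", -catenary_height(0.0, 1.0, 3.0))
```

For endpoints at different heights, $B$ is no longer zero and two endpoint conditions plus the length constraint must be solved simultaneously.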





III.13 The Gamma Function

Euler’s gamma function, $\Gamma$, is defined by

$$\Gamma(x) = \int_0^\infty t^{x-1} e^{-t}\, dt, \quad x > 0.$$

One integration by parts shows that

$$\Gamma(x + 1) = x\Gamma(x), \quad x > 0.$$
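The integral definition and the recurrence can be checked numerically. The sketch below approximates the integral by a midpoint rule on a truncated interval and compares it with `math.gamma` from the Python standard library. The truncation point, the number of subintervals, and the function name `gamma_integral` are assumptions of this example, and the rule is used here only for $x \geq 1$, where the integrand is bounded at $t = 0$.

```python
import math

def gamma_integral(x, upper=60.0, n=100_000):
    """Midpoint-rule approximation to the integral of t^(x-1) e^(-t) over [0, upper].

    Intended for x >= 1; the tail beyond `upper` is neglected.
    """
    h = upper / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * h
        total += t ** (x - 1) * math.exp(-t)
    return h * total

x = 2.5
print(gamma_integral(x), math.gamma(x))          # integral definition
print(math.gamma(x + 1), x * math.gamma(x))      # recurrence
print(math.gamma(5), math.factorial(4))          # value at a positive integer
```

For $0 < x < 1$ the integrand is singular at the origin and a plain midpoint rule converges slowly, so a different quadrature would be needed there.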


A direct calculation gives $\Gamma(1) = 1$, and then an inductive
argument gives $\Gamma(n) = (n - 1)!$ when $n$ is any positive
integer. For this reason, the alternative notation
$x! = \Gamma(x + 1)$ is also used.

Much is known about $\Gamma$ and its properties. It is classified
as a SPECIAL FUNCTION [IV.7] of one variable.
According to Davis (1959), of all special functions, $\Gamma$
“is undoubtedly the most fundamental.” It is also ubiquitous,
appearing in countless applications. In COMPLEX
ANALYSIS [IV.1], $\Gamma(z)$ is defined as a function of a
complex variable, $z$, for all $z \neq 0, -1, -2, \ldots$.

An alternative but equivalent definition of $\Gamma$ is

$$\Gamma(x) = \frac{e^{-\gamma x}}{x} \prod_{n=1}^{\infty} \frac{n\, e^{x/n}}{n + x},$$

where $\gamma = 0.5772\ldots$ is Euler’s constant, defined by

$$\gamma = \lim_{n \to \infty} \left( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} - \log n \right).$$

The definition of $\Gamma$ as an infinite product shows clearly
that $\Gamma(x)$ is not defined when $x = 0, -1, -2, \ldots$, and it
reveals the singular nature at these points.

Further Reading

Davis, P. J. 1959. Leonhard Euler’s integral: a historical
profile of the gamma function. American Mathematical
Monthly 66:849-69.


III.14 The Ginzburg-Landau Equation

S. Jonathan Chapman


The Ginzburg-Landau equations were written down in
1950 to describe the change of phase of a superconducting
material in the presence of a magnetic field.
They are partial differential equations for the complex-valued
superconducting order parameter $\Psi$ and the
(real-valued) magnetic vector potential $\mathbf{A}$. The parameter
$\Psi$ can be thought of as a kind of macroscopic wave
function and is such that $|\Psi|^2$ is the number density
of superconducting electrons, while $\mathbf{A}$ is such that the
magnetic field is $\operatorname{curl} \mathbf{A}$.

In their normalized form the equations are

$$\left( \frac{1}{\kappa} \nabla - i\mathbf{A} \right)^2 \Psi = (|\Psi|^2 - 1)\Psi,$$

$$\operatorname{curl}^2 \mathbf{A} = -\frac{i}{2\kappa} (\Psi^* \nabla \Psi - \Psi \nabla \Psi^*) - |\Psi|^2 \mathbf{A},$$

where $i = \sqrt{-1}$, $\kappa$ is a material constant known as the
Ginzburg-Landau parameter, and the asterisk denotes
complex conjugation.

1 The Ginzburg-Landau Free Energy

The Ginzburg-Landau equations arise from minimizing
the Ginzburg-Landau free energy. For small applied
magnetic fields the superconducting solution ($|\Psi| = 1$)
has a lower energy than the nonsuperconducting (normal)
solution ($\Psi = 0$), while for high applied magnetic
fields the normal solution has the lower energy. At
the critical magnetic field $H_c$, the two states have the
same energy, and a normal-superconducting transition
region is possible.

The main aim in developing the Ginzburg-Landau
equations was to determine the energy of such a transition
region (the so-called surface energy), since this
determines the scale of the pattern of normal and
superconducting domains when both are present. In
particular, it was desirable to demonstrate that the surface
energy was positive. In fact, it turns out that the
surface energy is positive only if $\kappa < 1/\sqrt{2}$. Values of
the Ginzburg-Landau parameter above this threshold
were dismissed at the time as being unphysical.


2 Type-II Superconductors 

A few years later, in 1957, Abrikosov published solutions
of the equations for values of $\kappa > 1/\sqrt{2}$ that had
a quite different structure. When surface energy is negative,
a normal region shrinks until it is just a point.
However, the scale of the solution is prevented from
being infinitely fine by the complex nature of the order
parameter; each zero of $\Psi$ has a winding number that
is a topological invariant. Abrikosov’s solutions are of
the form $\Psi = f(r) e^{in\theta}$, $\mathbf{A} = A(r)\mathbf{e}_\theta$, where $r$ and $\theta$ are
polar coordinates, the integer $n$ is the winding number,
and $\mathbf{e}_\theta$ is the unit vector in the azimuthal direction. The
electric current associated with such a solution is

$$\mathbf{j} = f^2 \left( \frac{n}{\kappa r} - A \right) \mathbf{e}_\theta,$$

which is why these solutions are known as superconducting
vortices. Such solutions demonstrate the
quantum nature of superconductivity on a macroscopic
scale. Superconductors with $\kappa > 1/\sqrt{2}$ are now known
as type-II superconductors, while those with $\kappa < 1/\sqrt{2}$
are known as type-I superconductors.

For type-II superconductors, the critical magnetic
field $H_c$ splits into two critical fields, $H_{c1}$ and $H_{c2}$.
Below $H_{c1}$ the superconducting state is energetically
preferred, above $H_{c2}$ the normal state is preferred, but
in between $H_{c1}$ and $H_{c2}$ a mixed state comprising a
periodic array of superconducting vortices exists.

Following Abrikosov’s work this lattice of vortices 
was demonstrated experimentally. It is an amazing tri- 
umph of the Ginzburg-Landau theory that it predicted 
the existence of such structures, not only before they 
had been observed experimentally but when the very 
idea of them was disturbing. Almost all technolog- 
ical applications of superconductivity involve type-II 
superconductors in the vortex state. 

Further Reading 

Tinkham, M. 1996. Introduction to Superconductivity. New 

York: McGraw-Hill. 


III.15 Hooke’s Law

P. A. Martin 


Strictly, Hooke’s law is not a law, as it is readily and 
frequently violated. Nevertheless, it can be a useful 
approximation, it can be generalized, and it leads to 
ideas of elasticity and constitutive equations. 

In 1678, Robert Hooke (1635-1703) described his
experiments in which he fixed one end of a long vertical
wire to the ceiling and hung various weights from the
other end. When there are no hanging weights, the wire
has length $L$, say. When a weight of mass $m$ is added,
the wire extends by an amount $\ell$, so that the new length
is $L + \ell$. If the weight is removed, the wire’s length
returns to $L$: this is the signature of elastic behavior.
Hooke also showed that doubling the mass (from $m$
to $2m$) gives double the extension. He inferred that the
restoring force exerted by the wire on the mass, $F$, is
proportional to the displacement of the mass from its
equilibrium position,

$$F = k\ell,$$

where $k$ is the constant of proportionality, the spring
constant. This is Hooke’s law. Hooke also did experiments
on the stretching and compression of springs
and on the lateral deflections of wooden beams. He
asserted that his law was applicable to “every springing
body” and that it could be used to understand
vibrations of such bodies.

To see this, return to the mass $m$ hanging on the
wire. In equilibrium, its weight is balanced by the
restoring force, $mg = k\ell$, where $g$ is the acceleration
due to gravity. If the mass is pulled down further
and then released, it will oscillate about its equilibrium
position. In detail, if the mass is displaced
by an amount $x$, the downward force on the mass is
$mg - k(\ell + x) = -kx$, so, by Newton’s second law,
force = mass × acceleration, the equation of motion
is $-kx = m(d^2x/dt^2)$. As $k = mg/\ell$, we obtain
$d^2x/dt^2 = -\omega^2 x$, where $\omega = (g/\ell)^{1/2}$. This differential
equation for $x(t)$ has the general solution $x =
x_0 \cos(\omega t + \delta)$, where $x_0$ and $\delta$ are arbitrary constants.
The oscillating mass exhibits simple harmonic motion
with frequency $\omega$.
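A short numerical illustration of these formulas follows. Given a measured static extension $\ell$ under a known mass, it computes the spring constant $k = mg/\ell$ and the frequency $\omega = (g/\ell)^{1/2}$, and then checks by finite differences that $x_0 \cos(\omega t + \delta)$ satisfies $-kx = m\, d^2x/dt^2$. The numerical values (mass, extension, amplitude, phase) are invented for the example.

```python
import math

G = 9.81  # acceleration due to gravity in m/s^2 (assumed value)

def spring_constant(m, ell):
    """k from the equilibrium condition m g = k * ell."""
    return m * G / ell

def angular_frequency(ell):
    """omega = sqrt(g / ell); note that it does not depend on the mass."""
    return math.sqrt(G / ell)

def displacement(t, ell, x0, delta):
    """General solution x(t) = x0 * cos(omega * t + delta)."""
    return x0 * math.cos(angular_frequency(ell) * t + delta)

def motion_residual(t, m, ell, x0, delta, h=1e-4):
    """m * x'' + k * x, with x'' estimated by a central second difference."""
    k = spring_constant(m, ell)
    x_plus = displacement(t + h, ell, x0, delta)
    x_mid = displacement(t, ell, x0, delta)
    x_minus = displacement(t - h, ell, x0, delta)
    accel = (x_plus - 2 * x_mid + x_minus) / (h * h)
    return m * accel + k * x_mid

m, ell = 0.5, 0.02           # 0.5 kg stretches the wire by 2 cm (hypothetical data)
x0, delta = 0.005, 0.4       # amplitude in metres and phase in radians
print("k =", spring_constant(m, ell), "N/m")
print("omega =", angular_frequency(ell), "rad/s")
print("period =", 2 * math.pi / angular_frequency(ell), "s")
print("residual of equation of motion:", motion_residual(0.3, m, ell, x0, delta))
```

The residual is not exactly zero because the second derivative is approximated numerically.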

Hooke’s law is an approximation. It models the
mechanical behavior of his wire and many other elastic
bodies: doubling the load doubles the extension. It
is a linear approximation, where the constant of proportionality
can be found by experiment; recall that
$k = mg/\ell$. However, there will be limits to the validity
of Hooke’s law: if a very large load is applied, the
wire will extend plastically (which means that the wire’s
length will not return to $L$ if the load is removed) and
then it might break.

Returning to Hooke’s experiments, suppose we hang
a mass $m$ on a wire of length $2L$; the extension doubles
to $2\ell$. This implies that $k$ must be proportional
to $L^{-1}$ (the left-hand side of $mg = k \times \text{extension}$ has
not changed). Similarly, if we suspend the mass by two
wires in parallel, each of length $L$, we see half the extension;
we can say that $k$ must be proportional to $A$,
the cross-sectional area of the wire. Thus, we rewrite
Hooke’s law as

$$\frac{F}{A} = \frac{kL}{A} \frac{\ell}{L}.$$

The dimensionless quantity $\ell/L$ is the extension per
unit length of wire; it is a measure of strain in the wire.
The quantity $F/A$ is the force per unit area; it is a measure
of stress in the wire. The quantity $kL/A$ does not
depend on $L$ or $A$; it depends on what the wire is made
from, so it is a material constant. Therefore, we write

$$\sigma = E\varepsilon, \qquad (1)$$

stating that the stress, $\sigma$, is proportional to the strain,
$\varepsilon$. The material constant $E$ is known as Young’s modulus.
The formula (1) is a basic assumption in the one-dimensional
theory of the strength of materials. Suitably
generalized, it is fundamental to the linear theory
of elasticity.

Hooke’s law is a constitutive relation : it is a model of 
how certain materials behave when they are subjected 
to forces. Constitutive relations are part of all contin- 
uum theories. Hooke’s law is useful for elastic materi- 
als, but, as we saw for the oscillating mass, there is no 
damping; the oscillations do not decay with time. Incor- 
porating damping leads to viscoelastic models. For flu- 
ids, it is common to replace Hooke’s law with a rela- 
tion between stresses and velocities; this leads to the 
Navier-Stokes equations. 

Developing and selecting constitutive relations re- 
quires an interplay between good modeling of exper- 
imental observations, essential mathematical proper- 
ties (such as causality and frame indifference), and 
simplicity. 


III.16 The Korteweg-de Vries Equation

Willy A. Hereman 


1 Historical Perspective

In 1895 Diederik Korteweg (1848-1941) and Gustav
de Vries (1866-1934) derived a partial differential equation
(PDE) that models the “great wave of translation”
that naval engineer John Scott Russell had observed in
the Union Canal in 1834.

Assuming that the wave propagates in the $X$-direction,
the evolution of the surface elevation $\eta(X, T)$
above the undisturbed water depth $h$ at time $T$ can be
modeled by the Korteweg-de Vries (KdV) equation:

$$\frac{\partial \eta}{\partial T} + \sqrt{gh}\, \frac{\partial \eta}{\partial X} + \frac{3}{2} \frac{\sqrt{gh}}{h}\, \eta \frac{\partial \eta}{\partial X} + \frac{1}{2} h^2 \sqrt{gh} \left( \frac{1}{3} - \frac{\mathcal{T}}{\rho g h^2} \right) \frac{\partial^3 \eta}{\partial X^3} = 0, \qquad (1)$$

where $g$ is the gravitational acceleration, $\rho$ is the density,
and $\mathcal{T}$ is the surface tension. The dimensionless
parameter $\mathcal{T}/\rho g h^2$, called the Bond number, measures
the relative strengths of surface tension and the gravitational
force. Equation (1) is valid for long waves of
relatively small amplitude, $|\eta|/h \ll 1$.

In dimensionless variables, (1) can be written as

$$u_t + \alpha u u_x + u_{xxx} = 0, \qquad (2)$$

where subscripts denote partial derivatives. The term
$\sqrt{gh}\, \eta_X$ in (1) has been removed by an elementary
transformation. Conversely, a linear term in $u_x$ can be
added to (2). The parameter $\alpha$ can be scaled to any
real number. Commonly used values are $\alpha = \pm 1$ and
$\alpha = \pm 6$.

The term $u_t$ describes the time evolution of the wave.
Therefore, (2) is called an evolution equation. The nonlinear
term $\alpha u u_x$ accounts for steepening of the wave.
The linear dispersive term $u_{xxx}$ describes spreading of
the wave.

It is worth noting that the KdV equation had already
appeared in seminal work on water waves published by
Joseph Boussinesq about twenty years earlier.

2 Solitary Waves and Periodic Solutions

The balance of the steepening and spreading effects
gives rise to a stable solitary wave,

$$u(x, t) = \frac{\omega - 4k^3}{\alpha k} + \frac{12 k^2}{\alpha} \operatorname{sech}^2(kx - \omega t + \delta), \qquad (3)$$

where the wave number $k$, the angular frequency $\omega$,
and the phase $\delta$ are arbitrary constants. Requiring that
$\lim_{x \to \pm\infty} u(x, t) = 0$ for all $t$ leads to $\omega = 4k^3$, in which
case (3) reduces to

$$u(x, t) = 12 (k^2/\alpha) \operatorname{sech}^2(kx - 4k^3 t + \delta). \qquad (4)$$

This hump-shaped solitary wave of finite amplitude
$12k^2/\alpha$ travels to the right at constant phase speed
$v = \omega/k = 4k^2$, and it models Scott Russell’s “great
wave of translation” that traveled without change of
shape over a fairly long distance.
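That the sech-squared profile with $\omega = 4k^3$ solves the dimensionless KdV equation $u_t + \alpha u u_x + u_{xxx} = 0$ can be checked numerically. The sketch below evaluates the left-hand side with central finite differences at a few sample points. The parameter values, the step size, and the function names are assumptions of this example; the residual is small but nonzero because of discretization error.

```python
import math

def soliton(x, t, k=1.0, alpha=6.0, delta=0.0):
    """Solitary wave u = (12 k^2 / alpha) sech^2(k x - 4 k^3 t + delta)."""
    phase = k * x - 4.0 * k ** 3 * t + delta
    return 12.0 * k * k / alpha / math.cosh(phase) ** 2

def kdv_residual(u, x, t, alpha=6.0, h=1e-3):
    """Central-difference estimate of u_t + alpha * u * u_x + u_xxx at (x, t)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xxx = (u(x + 2 * h, t) - 2 * u(x + h, t)
             + 2 * u(x - h, t) - u(x - 2 * h, t)) / (2 * h ** 3)
    return u_t + alpha * u(x, t) * u_x + u_xxx

sample_points = (-2.0, -0.5, 0.0, 1.2, 3.0)
residuals = [abs(kdv_residual(soliton, x, 0.3)) for x in sample_points]
print("largest residual:", max(residuals))

# The peak sits where the phase vanishes, i.e. at x = 4 k^2 t for delta = 0.
print("u at the peak:", soliton(4.0 * 0.3, 0.3))
```

With the default values $k = 1$ and $\alpha = 6$, the amplitude is $12k^2/\alpha = 2$ and the speed is $4k^2 = 4$.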

As shown by Korteweg and de Vries, (2) also has a
periodic solution:

$$u(x, t) = \frac{\omega - 4k^3 (2m - 1)}{\alpha k} + 12 (k^2/\alpha)\, m \operatorname{cn}^2(kx - \omega t + \delta; m). \qquad (5)$$

They called this the cnoidal wave solution because it
involves Jacobi’s elliptic cosine function, cn, with modulus
$m$, $0 < m < 1$. In the limit $m \to 1$, $\operatorname{cn}(\xi; m) \to
\operatorname{sech} \xi$ and (5) reduces to (3).

3 Modern Developments 

The solitary wave was, for many years, considered an 
unimportant curiosity in the field of nonlinear waves. 
That changed in 1965, when Zabusky and Kruskal real- 
ized that the KdV equation arises as the continuum 
limit of a one-dimensional anharmonic lattice used by 
Fermi, Pasta, and Ulam in 1955 to investigate how 
energy is distributed among the many possible oscillations
in the lattice. Since taller solitary waves travel
faster than shorter ones, Zabusky and Kruskal simulated
the collision of two waves in a nonlinear crystal
lattice and observed that each retains its shape and
speed after collision. Interacting solitary waves merely 
experience a phase shift, advancing the faster wave and 
retarding the slower one. In analogy with colliding par- 
ticles, they coined the word “solitons” to describe these 
elastically colliding waves. 

To model water waves that are weakly nonlinear,
weakly dispersive, and weakly two-dimensional, with
all three effects being comparable, Kadomtsev and
Petviashvili (KP) derived a two-dimensional version of
(2) in 1970:

$$(u_t + 6 u u_x + u_{xxx})_x + 3\sigma^2 u_{yy} = 0, \qquad (6)$$

where $\sigma^2 = \pm 1$ and the $y$-axis is perpendicular to the
direction of propagation of the wave (along the $x$-axis).

The KdV and KP equations, and the NONLINEAR
SCHRÖDINGER EQUATION [III.26]

$$i u_t + u_{xx} + \kappa |u|^2 u = 0 \qquad (7)$$

(where $\kappa$ is a constant and $u(x, t)$ is a complex-valued
function), are famous examples of so-called completely
integrable nonlinear PDEs. This means that they can
be solved with the inverse scattering transform, a
nonlinear analogue of the Fourier transform.

The inverse scattering transform is not applied to (2)
directly but to an auxiliary system of linear PDEs,

$$\psi_{xx} + (\lambda + \tfrac{1}{6} \alpha u) \psi = 0, \qquad (8)$$

$$\psi_t + \tfrac{1}{2} \alpha u_x \psi + \alpha u \psi_x + 4 \psi_{xxx} = 0, \qquad (9)$$

which is called the Lax pair for the KdV equation. Equation
(8) is a linear Schrödinger equation for an eigenfunction
$\psi$, a constant eigenvalue $\lambda$, and a potential
$-(\alpha u)/6$. Equation (9) governs the time evolution of $\psi$.
The two equations are compatible, i.e., $\psi_{xxt} = \psi_{txx}$,
if and only if $u(x, t)$ satisfies (2). For given $u(x, 0)$
decaying sufficiently fast as $|x| \to \infty$, the inverse scattering
transform solves (8) and (9) and finally determines
$u(x, t)$.

4 Properties and Applications 

Scientists remain intrigued by the rich mathemati- 
cal structure of completely integrable nonlinear PDEs. 
These PDEs can be written as infinite-dimensional bi- 
Hamiltonian systems and have additional, remarkable 
features. For example, they have an associated Lax pair, 
they can be written in Hirota’s bilinear form, they admit 
Bäcklund transformations, and they have the Painlevé
property. They have an infinite number of conserved 


quantities, infinitely many higher-order symmetries, 
and an infinite number of soliton solutions. 

As well as being applicable to shallow-water waves, 
the KdV equation is ubiquitous in applied science. It 
describes, for example, ion-acoustic waves in a plasma, 
elastic waves in a rod, and internal waves in the atmo- 
sphere or ocean. The KP equation models, for exam- 
ple, water waves, acoustic waves, and magnetoelastic 
waves in anti-ferromagnetic materials. The nonlinear 
Schrödinger equation describes weakly nonlinear and
dispersive wave packets in physical systems, e.g., light 
pulses in optical fibers, surface waves in deep water, 
Langmuir waves in a plasma, and high-frequency vibra- 
tions in a crystal lattice. Equation (7) with an extra lin- 
ear term V(x)u to account for the external potential 
V(x) also arises in the study of Bose-Einstein conden- 
sates, where it is referred to as the time-dependent 
Gross-Pitaevskii equation. 

Further Reading 

Ablowitz, M. J. 2011. Nonlinear Dispersive Waves: Asymp- 
totic Analysis and Solitons. Cambridge: Cambridge Univer- 
sity Press. 

Ablowitz, M. J., and P. A. Clarkson. 1991. Solitons, Nonlinear 
Evolution Equations and Inverse Scattering. Cambridge: 
Cambridge University Press. 

Kasman, A. 2010. Glimpses of Soliton Theory. Providence,
RI: American Mathematical Society. 

Osborne, A. R. 2010. Nonlinear Ocean Waves and the Inverse 
Scattering Transform. Burlington, MA: Academic Press. 


III.17 The Lambert W Function

Robert M. Corless and David J. Jeffrey 


1 Definition and Basic Properties 

For a given complex number z, the equation 
ive w = z 

has a countably infinite number of solutions, which are 
denoted by (z) for integers k. Each choice of k speci- 
fies a branch of the Lambert W function. By convention, 
only the branches k = 0 (called the principal branch) 
and k = - 1 are real-valued for any z; the range of every 
other branch excludes the real axis, although the range 
of W\ (z) includes (-oo, -1/e] in its closure. Only Wo(z) 
contains positive values in its range (see figure 1). When 
z = -1/e (the only nonzero branch point), there is a 
double root w = -1 of the basic equation we w = z. 
The conventional choice of branches assigns

$$W_0(-1/e) = W_{-1}(-1/e) = -1$$

and implies that $W_1(-1/e - \mathrm{i}\varepsilon^2) = -1 + O(\varepsilon)$ is arbitrarily
close to -1 because the conventional branch choice
means that the point -1 is on the border between these
three branches. Each branch is a single-valued complex
function, analytic away from the branch point and
branch cuts.

Figure 1 Real branches of the Lambert W function. The
solid line is the principal branch $W_0$; the dashed line is $W_{-1}$,
which is the only other branch that takes real values. The
small filled circle at the branch point corresponds to the
one in figure 2.

Figure 2 Images of circles and rays in the z-plane under the
maps $z \mapsto W_k(z)$ (horizontal axis: $\mathrm{Re}(W_k(z))$). The circle
with radius $e^{-1}$ maps to a curve that goes through the branch
point, as does the ray along the negative real axis. This graph
was produced in Maple by numerical evaluation of
$\omega(x + \mathrm{i}y) = W_{\mathcal{K}(\mathrm{i}y)}(e^{x+\mathrm{i}y})$ first
for a selection of fixed x and varying y, and then for a
selection of fixed y and varying x. These two sets produce
orthogonal curves as images of horizontal and vertical lines
in x and y under $\omega$ or, equivalently, images of circles with
constant $r = e^{x}$ and rays with constant $\theta = y$ under W.

The set of all branches is often referred to, loosely, as 
the Lambert W “function”; but of course W is multival- 
ued. Depending on context, the symbol W(z) can refer 
to the principal branch (k = 0) or to some unspecified 
branch. Numerical computation of any branch of W is 
typically carried out by Newton’s method or a variant 
thereof. Images of $W_k(re^{\mathrm{i}\theta})$ for various k, r, and $\theta$ are
shown in figure 2.
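
As a rough illustration of the Newton iteration mentioned above, the following Python sketch applies Newton's method to $f(w) = w e^{w} - z$. The function name `lambert_w`, the starting guesses, and the tolerances are assumptions made for this example; they are not taken from Maple or from any library. The sketch makes no attempt to be robust near the branch point or on the branch cuts, where a real starting guess can stall.

```python
import cmath
import math


def lambert_w(z, k=0, tol=1e-14, max_iter=100):
    """Approximate the branch W_k(z) by Newton's method on f(w) = w*exp(w) - z."""
    z = complex(z)
    if z == 0:
        if k == 0:
            return 0j
        raise ValueError("W_k(0) is not finite for k != 0")
    if k == 0 and abs(z) < 3 and z != -1:
        # Heuristic starting guess for the principal branch near the origin.
        w = cmath.log(1 + z)
    else:
        # First two terms of the large-z series: ln_k(z) - ln(ln_k(z)).
        lk = cmath.log(z) + 2j * math.pi * k
        w = lk - cmath.log(lk)
    for _ in range(max_iter):
        ew = cmath.exp(w)
        step = (w * ew - z) / (ew * (1 + w))  # f(w) / f'(w)
        w -= step
        if abs(step) <= tol * (1 + abs(w)):
            break
    return w


if __name__ == "__main__":
    print(lambert_w(1.0))        # principal branch at z = 1
    print(lambert_w(10.0, k=1))  # a complex-valued branch
```

Production codes typically use a higher-order variant (for example Halley's method) and more careful starting values; plain Newton is used here only because it is the method named in the text.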

In contrast to more commonly encountered multibranched
functions, such as the inverse sine or cosine,
the branches of W are not linearly related. However, by
rephrasing things slightly, in terms of the unwinding
number

$$\mathcal{K}(z) := \frac{z - \ln(e^{z})}{2\pi\mathrm{i}}$$

and the related single-valued function

$$\omega(z) := W_{\mathcal{K}(z)}(e^{z}),$$

which is called the Wright $\omega$ function, we do have the
somewhat simple relationship between branches that
$W_k(z) = \omega(\ln_k z)$, where $\ln_k z$ denotes $\ln z + 2\pi\mathrm{i}k$ and
$\ln z$ is the principal branch of the logarithm, having
$-\pi < \mathrm{Im}(\ln z) \le \pi$.

The Wright $\omega$ function helps to solve the equation
$y + \ln y = z$. We have that, if $z \ne t \pm \mathrm{i}\pi$ for $t < -1$,
then $y = \omega(z)$. If $z = t - \mathrm{i}\pi$ for $t < -1$, then there is
no solution to the equation; if $z = t + \mathrm{i}\pi$ for $t < -1$,
then there are two solutions: $\omega(z)$ and $\omega(z - 2\pi\mathrm{i})$.

1.1 Derivatives

Implicit differentiation yields

$$W'(z) = \frac{e^{-W(z)}}{1 + W(z)}$$

as long as $W(z) \ne -1$. The derivative can be simplified
to the rational differential equation

$$\frac{dW}{dz} = \frac{W}{z(1 + W)}$$

if, in addition, $z \ne 0$. Higher derivatives follow naturally.

1.2 Integrals

Integrals containing W(x) can often be performed analytically
by the change of variable w = W(x), used in
an inverse fashion: $x = w e^{w}$. Thus,

$$\int \sin W(x)\,dx = \int (1 + w)e^{w}\sin w\,dw,$$

and integration using usual methods gives

$$\tfrac12 (1 + w)e^{w}\sin w - \tfrac12 w e^{w}\cos w,$$

which eventually gives

$$2\int \sin W(x)\,dx = \Bigl(x + \frac{x}{W(x)}\Bigr)\sin W(x) - x\cos W(x) + C.$$

More interestingly, there are many definite integrals for
W(z), including one for the principal branch that is due
to Poisson and is listed in the famous table of integrals
by D. Bierens de Haan. The following integral, which
is of relatively recent construction and which is valid
for z not in $(-\infty, -1/e]$, can be computed with spectral
accuracy by the trapezoidal rule:

$$\frac{W(z)}{z} = \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{(1 - v\cot v)^2 + v^2}{z + v\csc v\, e^{-v\cot v}}\,dv.$$
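
The trapezoidal-rule evaluation of this integral can be sketched in a few lines of Python. The function name `w_over_z_trapezoid`, the node count, and the overflow guard are assumptions of this example. The integrand tends to zero at $v = \pm\pi$ and its numerator vanishes at v = 0, so those points are skipped rather than evaluated.

```python
import math


def w_over_z_trapezoid(z, n=200):
    """Trapezoidal-rule estimate of W(z)/z from the integral over [-pi, pi]."""
    h = 2 * math.pi / n
    total = 0.0
    # The integrand tends to 0 at both endpoints, so only interior nodes contribute.
    for j in range(1, n):
        v = -math.pi + j * h
        if v == 0.0:
            continue  # the numerator vanishes at v = 0
        vcot = v * math.cos(v) / math.sin(v)
        if -vcot > 700:
            continue  # exp(-vcot) would overflow; the integrand is negligible here
        numerator = (1 - vcot) ** 2 + v * v
        denominator = z + (v / math.sin(v)) * math.exp(-vcot)
        total += numerator / denominator
    return h * total / (2 * math.pi)


if __name__ == "__main__":
    # W(e) = 1, so W(e)/e should be close to 1/e.
    print(w_over_z_trapezoid(math.e), 1 / math.e)
```

Because the integrand and all its derivatives vanish at the endpoints, the equally spaced rule behaves like a periodic trapezoidal rule, which is the usual explanation for the fast convergence.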


1.3 Series and Generating Functions 


Euler was the first to notice, using a series due to Lambert,
that what we now call the Lambert W function has
a convergent series expansion around z = 0:

$$W(z) = \sum_{n \ge 1} \frac{(-n)^{n-1}}{n!}\, z^{n}.$$

Euler knew that this series converges for $-1/e \le z \le 1/e$.
The nearest singularity is the branch point z = -1/e.
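
A short Python sketch can confirm the series numerically inside its interval of convergence. The function name `lambert_w_series` and the number of terms are assumptions of this example; the check is simply that the partial sum w satisfies $w e^{w} \approx z$.

```python
import math


def lambert_w_series(z, terms=60):
    """Partial sum of W_0(z) = sum_{n>=1} (-n)^(n-1) / n! * z^n."""
    total = 0.0
    for n in range(1, terms + 1):
        # Integer arithmetic for the coefficient avoids float overflow in n^(n-1).
        total += (-n) ** (n - 1) / math.factorial(n) * z ** n
    return total


if __name__ == "__main__":
    z = 0.2
    w = lambert_w_series(z)
    print(w, w * math.exp(w))  # the second value should be close to z
```

Convergence slows markedly as |z| approaches 1/e, so the truncation point would need to grow for arguments near the branch point.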

W can also be expanded in series about the branch
point. The series at the branch point can be expressed
most cleanly using the tree function T(z) = -W(-z)
rather than W or $\omega$, but keeping with W we have

$$W_0(-e^{-1-z^2/2}) = -\sum_{n \ge 0} (-1)^n a_n z^n,$$

$$W_{-1}(-e^{-1-z^2/2}) = -\sum_{n \ge 0} a_n z^n,$$

where the $a_n$ are given by $a_0 = a_1 = 1$ and

$$a_n = \frac{1}{(n+1)a_1}\Bigl(a_{n-1} - \sum_{k=2}^{n-1} k\, a_k\, a_{n+1-k}\Bigr).$$


These give an interesting variation on Stirling’s formula
[IV.7 §3] for the asymptotics of n!. Euler’s integral

$$n! = \int_0^{\infty} t^{n} e^{-t}\,dt$$

is split at the maximum of the integrand (t = n),
and each integral is transformed using the substitutions
$t = -nW_k(-e^{-1-z^2/2})$, where k = 0 is used for
$t \le n$ and k = -1 otherwise. The integrands then
simplify to $t^{n}e^{-t} = n^{n}e^{-n}e^{-nz^2/2}$ and the differentials
dt are obtained as series from the above expansions.
Term-by-term integration leads to

$$n! \sim \frac{n^{n+1}}{e^{n}} \sum_{k \ge 0} (2k+1)\, a_{2k+1} \Bigl(\frac{2}{n}\Bigr)^{k+1/2} \Gamma\bigl(k + \tfrac12\bigr),$$

where $\Gamma$ is the gamma function.
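
The recurrence for the $a_n$ and the resulting variant of Stirling's formula can be checked with a small Python sketch. The function names `branch_point_coefficients` and `stirling_from_w`, and the choice of three terms, are assumptions of this example. Exact rational arithmetic is used for the coefficients so that the first few values can be compared directly.

```python
from fractions import Fraction
import math


def branch_point_coefficients(count):
    """Return [a_0, ..., a_{count-1}] from the recurrence, as exact fractions."""
    a = [Fraction(1), Fraction(1)]
    for n in range(2, count):
        s = sum(k * a[k] * a[n + 1 - k] for k in range(2, n))
        a.append((a[n - 1] - s) / ((n + 1) * a[1]))
    return a[:count]


def stirling_from_w(n, terms=3):
    """Approximate n! by truncating the series above after `terms` terms."""
    a = branch_point_coefficients(2 * terms)
    total = 0.0
    for k in range(terms):
        total += ((2 * k + 1) * float(a[2 * k + 1])
                  * (2 / n) ** (k + 0.5) * math.gamma(k + 0.5))
    return n ** (n + 1) / math.exp(n) * total


if __name__ == "__main__":
    print(branch_point_coefficients(6))
    print(stirling_from_w(10), math.factorial(10))
```

The k = 0 term alone reproduces the familiar leading approximation $\sqrt{2\pi n}\,n^{n}e^{-n}$, and each further term adds the next correction in powers of 1/n.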

Asymptotic series for $z \to \infty$ have been known since
de Bruijn’s work in the 1960s. He also proved that
the asymptotic series are actually convergent for large
enough z. The series begin as follows:
$W_k(z) \sim \ln_k(z) - \ln(\ln_k(z)) + o(\ln\ln_k z)$. Somewhat
surprisingly, these series can be reversed to give a simple
(though apparently useless) expansion for the logarithm
in terms of compositions of W:

$$\ln z = W(z) + W(W(z)) + W(W(W(z))) + \cdots + W^{(N)}(z) + \ln W^{(N)}(z)$$

for a suitably restricted domain in z. The series
obtained by omitting the term $\ln W^{(N)}(z)$ is not convergent
as $N \to \infty$, but for fixed N if we let $z \to \infty$
the approximation improves, although only tediously
slowly.


2 Applications 

Because W is a so-called implicitly elementary function,
meaning it is defined as an implicit solution of
an equation containing only elementary functions, it 
can be considered an “answer” rather than a question. 
That it solves a simple rational differential equation 
means that it occurs in a wide range of mathematical 
models. Out of many applications, we mention just two 
favorites. 

First, a serious application. W occurs in a chemical
kinetics model of how the human eye adapts to
darkness after exposure to bright light: a phenomenon
known as bleaching. The model differential equation is

$$\frac{d}{dt}O_p(t) = -\frac{K_m O_p(t)}{\tau\,(K_m + O_p(t))},$$

and its solution in terms of W is

$$O_p(t) = K_m W\Bigl(\frac{B}{K_m}\, e^{B/K_m - t/\tau}\Bigr),$$

where the constant B is the initial value $O_p(0)$,
that is, the amount of initial bleaching. The constants
$K_m$ and $\tau$ are determined by experiment. The solution
in terms of W enables more convenient analysis by
allowing the use of known properties.
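
A quick numerical check that the W-based formula satisfies the model equation can be sketched as follows. The helper names `lambert_w0` and `bleaching`, and the sample values of B, $K_m$, and $\tau$ used in the demonstration, are hypothetical and chosen only for illustration; the model is taken here with a negative right-hand side, so that $O_p$ decays, which is what the solution formula requires.

```python
import math


def lambert_w0(x, tol=1e-14):
    """Principal branch W_0(x) for real x >= 0, by Newton's method."""
    w = math.log1p(x)  # starting guess
    for _ in range(100):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (1 + w))
        w -= step
        if abs(step) <= tol * (1 + abs(w)):
            break
    return w


def bleaching(t, B, Km, tau):
    """O_p(t) = Km * W((B/Km) * exp(B/Km - t/tau))."""
    return Km * lambert_w0((B / Km) * math.exp(B / Km - t / tau))


if __name__ == "__main__":
    B, Km, tau = 2.0, 0.5, 3.0  # hypothetical values
    t, h = 1.0, 1e-5
    slope = (bleaching(t + h, B, Km, tau) - bleaching(t - h, B, Km, tau)) / (2 * h)
    Op = bleaching(t, B, Km, tau)
    print(slope, -Km * Op / (tau * (Km + Op)))  # the two values should agree
```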

The second application we mention is nearly frivolous.
W can be used to explore solutions of the so-called
astrologer’s equation, $\dot{y}(t) = a y(t + 1)$. In this equation,
the rate of change of y is supposed to be proportional
to the value of y one time unit into the future.
Dependence on past times instead leads to delay differential
equations [I.2 §12], which of course are of
serious interest in applications, and again W is useful
there in much the same way as for this frivolous
problem.

Frivolity can be educational, however. Notice first
that, if $e^{\lambda t}$ satisfies the equation, then $\lambda = a e^{\lambda}$, and
therefore $\lambda = -W_k(-a)$. For the astrologer’s equation,
any function y(t) that can be expressed as a finite linear
combination $y(t) = \sum_{k \in M} c_k e^{-W_k(-a)t}$ for $0 \le t \le 1$
and some finite set M of integers then solves the
astrologer’s equation for all time. Thus, perfect knowledge
of y(t) on the time interval $0 \le t \le 1$ is sufficient
to predict y(t) for all time. However, if the knowledge
of y(t) is imperfect, even by an infinitesimal amount
(omitting a single term $\varepsilon e^{-W_K(-a)t}$, say, where K is some
large integer), then since the real parts of $-W_K(-a)$ go
to infinity as $K \to \infty$ by the first two terms of the logarithmic
series for $W_K$ given above, the “true” value of
y(t) can depart arbitrarily rapidly from the prediction.
This seems completely in accord with our intuitions
about horoscopes.
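
The growth of the real parts of $-W_K(-a)$ can be seen in a small numerical experiment. In the Python sketch below, the helper name `lambert_w_branch` and the value a = -1 are assumptions chosen only for illustration, and the Newton iteration is started from the first two terms of the logarithmic series, which is adequate for the nonprincipal branches used here.

```python
import cmath
import math


def lambert_w_branch(z, k, max_iter=100, tol=1e-14):
    """Newton iteration for W_k(z), k != 0, started from ln_k(z) - ln(ln_k(z))."""
    lk = cmath.log(z) + 2j * math.pi * k
    w = lk - cmath.log(lk)
    for _ in range(max_iter):
        ew = cmath.exp(w)
        step = (w * ew - z) / (ew * (1 + w))
        w -= step
        if abs(step) <= tol * (1 + abs(w)):
            break
    return w


a = -1.0  # hypothetical coefficient, chosen only for illustration
# Candidate exponents lambda_k = -W_k(-a) for k = 1, ..., 5.
rates = [-lambert_w_branch(-a, k) for k in range(1, 6)]
# Each exponent should satisfy the characteristic equation lambda = a * exp(lambda).
residuals = [abs(lam - a * cmath.exp(lam)) for lam in rates]

if __name__ == "__main__":
    for k, lam in enumerate(rates, start=1):
        print(k, lam)
```

The printed real parts increase with k, which is the mechanism by which an omitted high-index term can eventually dominate the prediction.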

Returning to serious applications, we note that the
tree function T(z) has huge combinatorial significance
for all kinds of enumeration. Many instances can be
found in Knuth’s selected papers, for example. Additionally,
a key reference to the tree function is a note
by Borel in Comptes Rendus de l’Académie des Sciences
(volume 214, 1942; reprinted in his Œuvres). The generating
function for probabilities of the time between
periods when a queue is empty, given Poisson arrivals
and service time $\sigma$, is $T(\sigma e^{-\sigma} z)/\sigma$.

3 Solution of Equations 

Several equations containing algebraic quantities together
with logarithms or exponentials can be manipulated
into either the form $y + \ln y = z$ or $w e^{w} = z$,
and hence solved in terms of the Lambert W function.
However, it appears that not every exponential polynomial
equation, or even most of them, can be solved in
this way. We point out one equation, here, that starts
with a nested exponential and can be solved in terms
of branch differences of W: a solution of

$$z + v\csc v\, e^{-v\cot v} = 0$$

is $v = (W_k(z) - W_\ell(z))/(2\mathrm{i})$ for some pair of integers
k and $\ell$; moreover, every such pair generates a solution.
This bi-infinite family of solutions has accumulation
points of zeros near odd multiples of $\pi$, which
in turn implies that the denominator in the above definite
integral for W(z)/z has essential singularities at
$v = \pm\pi$. This example underlines the importance of
the fact that the branches of W are not trivially related.

Another equation of popular interest occurs in the
analysis of the limit of the recurrence relation

$$a_{n+1} = z^{a_n}$$

starting with, say, $a_0 = 1$. This sequence has $a_1 = z$,
$a_2 = z^{z}$, $a_3 = z^{z^{z}}$, and so on. If this sequence
converges, it does so to a solution of the equation
$a = z^{a}$. By inspection, the limit that is of interest
is $a = -W(-\ln z)/\ln z$. Somewhat surprisingly, this
recurrence relation, which defines the so-called tower
of exponentials, diverges for small enough z, even
if z is real. Specifically, the recurrence converges for
$e^{-e} \le z \le e^{1/e}$ if z is real and diverges if
$z < e^{-e} = 0.0659880\ldots$. This fact was known to Euler. The
detailed convergence properties for complex z were
settled only relatively recently. Describing the regions
in the complex plane where the recurrence relation converges
to an n-cycle is made possible by a transformation
that is itself related to W: if $\zeta = -W(-\ln z)$, then
the iteration converges if $|\zeta| < 1$, and also if $\zeta = e^{\mathrm{i}\theta}$
for $\theta$ equal to some rational multiple of $\pi$, say $m\pi/k$.
Regions where the iteration converges to a k-cycle may
touch the unit circle at those points.
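
For real z inside the convergence interval, the tower of exponentials and the closed form in terms of W can be compared directly. The function names below (`lambert_w0`, `tower_limit`, `tower_closed_form`) and the test value $z = \sqrt{2}$ are assumptions of this sketch; the Newton iteration is only intended for real arguments greater than -1/e.

```python
import math


def lambert_w0(x, tol=1e-14):
    """Principal branch W_0(x) for real x > -1/e, by Newton's method."""
    w = math.log1p(x)  # starting guess
    for _ in range(200):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (1 + w))
        w -= step
        if abs(step) <= tol * (1 + abs(w)):
            break
    return w


def tower_limit(z, iterations=500):
    """Iterate a_{n+1} = z**a_n starting from a_0 = 1."""
    a = 1.0
    for _ in range(iterations):
        a = z ** a
    return a


def tower_closed_form(z):
    """The limit a = -W(-ln z)/ln z (requires z != 1)."""
    lnz = math.log(z)
    return -lambert_w0(-lnz) / lnz


if __name__ == "__main__":
    z = math.sqrt(2.0)
    print(tower_limit(z), tower_closed_form(z))  # both should be close to 2
```

For $z = \sqrt{2}$ the fixed-point equation $a = z^{a}$ has the solution a = 2, which gives an independent check on both routes.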

4 Retrospective 

The Lambert W function crept into the mathematics lit- 
erature unobtrusively, and it now seems natural there. 
There is even a matrix version of it, although the solution
of the matrix equation $S e^{S} = A$ is not always W(A).

Hindsight can, as it so often does, identify the pres- 
ence of W in writings by Euler, Poisson, and Wright 
and in many applications. Its implementation in Maple 
in the early 1980s was a key step in its eventual 
popularity. 

Indeed, its recognition and naming supports Alfred 
North Whitehead’s opinion that: 

By relieving the brain of all unnecessary work, a good 
notation sets it free to concentrate on more advanced 
problems. 



Further Reading


Corless, R. M., G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and
D. E. Knuth. 1996. On the Lambert W function. Advances
in Computational Mathematics 5(1):329-59.

Knuth, D. E. 2000. Selected Papers on the Analysis of Algorithms.
Palo Alto, CA: Stanford University Press.

Knuth, D. E. 2003. Selected Papers on Discrete Mathematics.
Stanford, CA: CSLI Publications.

Lamb, T., and E. Pugh. 2004. Dark adaptation and the
retinoid cycle of vision. Progress in Retinal and Eye
Research 23(3):307-80.

Olver, F. W. J., D. W. Lozier, R. F. Boisvert, and C. W.
Clark, eds. 2010. NIST Handbook of Mathematical Functions.
Cambridge: Cambridge University Press. (Electronic
version available at http://dlmf.nist.gov.)


III. 18 Laplace’s Equation 

P. A. Martin 


In 1789, Pierre-Simon Laplace (1749-1827) wrote down
an equation,

$$\frac{\partial^2 V}{\partial x^2} + \frac{\partial^2 V}{\partial y^2} + \frac{\partial^2 V}{\partial z^2} = 0, \qquad (1)$$

that now bears his name. Today, it is arguably the 
most important partial differential equation (PDE) in 
mathematics. 

The left-hand side of (1) defines the Laplacian of V,
denoted by $\nabla^2 V$ or $\Delta V$:

$$\nabla^2 V = \Delta V = \nabla\cdot\nabla V = \operatorname{div}\operatorname{grad} V.$$

Laplace’s equation, $\nabla^2 V = 0$, is classified as a linear
homogeneous second-order elliptic PDE for V(x, y, z).
The inhomogeneous version, $\nabla^2 V = f$, where f is
a given function, is known as Poisson’s equation. The
fact that there are three independent variables (x, y,
and z) in (1) can be indicated by calling it the
three-dimensional Laplace equation. The two-dimensional
version,

$$\nabla^2 V \equiv \frac{\partial^2 V}{\partial x^2} + \frac{\partial^2 V}{\partial y^2} = 0, \qquad (2)$$

also has important applications; it is a PDE for V(x, y).
There is a natural generalization to n independent
variables. Usually, the number of terms in $\nabla^2 V$ is
determined by the context.


1 Harmonic Functions 

Solutions of Laplace’s equation are known as harmonic
functions. It is easy to see (one of Laplace’s
favorite phrases) that there are infinitely many different
harmonic functions. A short list of solutions, V(x, y),
for (2) follows:

• there are polynomial solutions, such as 1, x, y,
xy, and $x^2 - y^2$;

• there are solutions such as $e^{\alpha x}\cos\alpha y$ and
$e^{\alpha x}\sin\alpha y$, where $\alpha$ is an arbitrary parameter; and

• $\log r$, $\theta$, $r^{\alpha}\cos\alpha\theta$, and $r^{\alpha}\sin\alpha\theta$ are solutions,
where $x = r\cos\theta$ and $y = r\sin\theta$ define plane
polar coordinates, r and $\theta$.

Further solutions can be found by differentiating or
integrating any solution with respect to x or y; for
example, $(\partial/\partial x)\log r = x/r^2$ is harmonic. One can also
differentiate or integrate with respect to any parameter;
for example,

$$\int_{\alpha_1}^{\alpha_2} g(\alpha)\, e^{\alpha x}\sin\alpha y\, d\alpha$$

is harmonic, where $g(\alpha)$ is an arbitrary (integrable)
function of the parameter $\alpha$ and the integration could
be over the real interval $\alpha_1 < \alpha < \alpha_2$ or along a contour
in the complex $\alpha$-plane.

A list of solutions, V(x, y, z), for (1) follows:

• any solution of (2) also solves (1);

• there are polynomial solutions, such as xyz and
$x^2 + y^2 - 2z^2$;

• there are solutions such as $e^{\gamma z}\cos\alpha x\cos\beta y$,
where $\gamma^2 = \alpha^2 + \beta^2$ and $\alpha$ and $\beta$ are arbitrary
parameters;

• $e^{\alpha z}J_n(\alpha r)\cos n\theta$ is a solution, where $x = r\cos\theta$,
$y = r\sin\theta$, and $J_n$ is a Bessel function; and

• $R^{-1}$ is a solution, where $R = (x^2 + y^2 + z^2)^{1/2}$ is
a spherical polar coordinate.


Again, further solutions can be obtained by differentiating
or integrating any solution with respect to x, y,
z, or any parameter. For example, if i, j, and k are any
nonnegative integers,

$$V(x, y, z) = \frac{\partial^{i+j+k}}{\partial x^{i}\,\partial y^{j}\,\partial z^{k}}\,\frac{1}{R}$$

is harmonic; it is known as a Maxwell multipole.

As Laplace’s equation is linear and homogeneous,
more solutions can be constructed by superposition;
if $V_1$ and $V_2$ satisfy $\nabla^2 V = 0$, then so does $AV_1 + BV_2$,
where A and B are arbitrary constants.
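
Several of the listed solutions can be spot-checked numerically. The Python sketch below estimates the Laplacian with a second-order central-difference formula at one sample point; the function names, the step size h, and the parameter values $\alpha = 1$, $\beta = 2$ are assumptions of this example, and a small nonzero result is expected from truncation and rounding error.

```python
import math


def laplacian(f, x, y, z, h=1e-3):
    """Second-order central-difference estimate of the Laplacian of f at (x, y, z)."""
    center = f(x, y, z)
    return (
        f(x + h, y, z) + f(x - h, y, z)
        + f(x, y + h, z) + f(x, y - h, z)
        + f(x, y, z + h) + f(x, y, z - h)
        - 6 * center
    ) / (h * h)


def polynomial(x, y, z):
    return x * x + y * y - 2 * z * z


def inverse_distance(x, y, z):
    return 1 / math.sqrt(x * x + y * y + z * z)


def exponential(x, y, z, alpha=1.0, beta=2.0):
    gamma = math.sqrt(alpha ** 2 + beta ** 2)  # gamma^2 = alpha^2 + beta^2
    return math.exp(gamma * z) * math.cos(alpha * x) * math.cos(beta * y)


if __name__ == "__main__":
    point = (0.3, -0.7, 0.5)
    for f in (polynomial, inverse_distance, exponential):
        print(f.__name__, laplacian(f, *point))
```

By contrast, applying the same estimate to a nonharmonic function such as $x^2$ returns a value close to 2 rather than 0.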


2 Boundary-Value Problems 

Although Laplace’s equation has many solutions, it is
usual to seek solutions that also satisfy boundary conditions.
A basic problem is to solve $\nabla^2 V = 0$ inside a
bounded region D subject to V = g (a given function)
on the boundary of D, $\partial D$. This boundary-value problem
(BVP) is known as the interior Dirichlet problem.
Under certain mild conditions, this problem has exactly
one solution. This solution can be constructed explicitly
(by the method of separation of variables) when D
has a simple shape, such as a circular disk or a rectangle
in two dimensions, and a ball or a cube in three dimensions.
For more complicated geometries, progress can
be made by reducing the BVP to a boundary integral
equation around $\partial D$, or by solving the BVP numerically,
using finite elements, for example.

Other BVPs can be formulated. For example, instead
of specifying V on $\partial D$, the normal derivative of V,
$\partial V/\partial n$, could be given (this is the interior Neumann
problem). One could specify a linear combination of V
and $\partial V/\partial n$ at each point on $\partial D$, or one could specify V
on part of $\partial D$ and $\partial V/\partial n$ on the rest of $\partial D$ (this is the
mixed problem). One could specify both V and $\partial V/\partial n$
on part of $\partial D$ but give no information on the rest of $\partial D$;
this is called the Cauchy problem. There are also exterior
versions of all these problems, where the goal is to
solve $\nabla^2 V = 0$ in the unbounded region outside $\partial D$; for
such problems, one also has to specify the behavior of
V “at infinity,” far from $\partial D$.
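
As a minimal numerical illustration of the interior Dirichlet problem, the Python sketch below solves the two-dimensional equation on the unit square by Gauss-Seidel relaxation of the standard five-point finite-difference scheme. This is a simpler technique than the finite elements mentioned above, and the function names, grid size, and number of sweeps are assumptions of this example. The boundary data come from the harmonic polynomial $x^2 - y^2$, for which the five-point scheme is exact, so the computed interior values can be compared with the known solution.

```python
def solve_dirichlet(g, n=10, sweeps=500):
    """Gauss-Seidel relaxation for the 5-point Laplacian on the unit square.

    g(x, y) supplies the boundary values; the grid has (n + 1) x (n + 1) nodes.
    """
    h = 1.0 / n
    V = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(n + 1):
            if i in (0, n) or j in (0, n):
                V[i][j] = g(i * h, j * h)
    for _ in range(sweeps):
        for i in range(1, n):
            for j in range(1, n):
                # Each interior value is replaced by the mean of its four neighbours.
                V[i][j] = 0.25 * (V[i + 1][j] + V[i - 1][j] + V[i][j + 1] + V[i][j - 1])
    return V


def boundary(x, y):
    return x * x - y * y  # a harmonic polynomial, used here as test data


V = solve_dirichlet(boundary)
max_error = max(
    abs(V[i][j] - boundary(i / 10, j / 10)) for i in range(11) for j in range(11)
)

if __name__ == "__main__":
    print(max_error)
```

The averaging step is the discrete counterpart of the mean-value property of harmonic functions; for general boundary data the result is only an approximation whose accuracy improves as the grid is refined.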

3 Applications 

Laplace’s 1789 application of (1) was to gravitational
attraction and the rings of Saturn. Gravitational forces
can be written as F = grad V, where $\nabla^2 V = 0$ outside
regions containing matter. In general, vector fields that
can be written as the gradient of a scalar are called
conservative; the scalar, V, is called a potential. Equivalently,
conservative fields F satisfy curl F = 0: they are
irrotational.

The velocity v for the motion of an incompressible
(constant-density) fluid satisfies the continuity equation,
div v = 0. If the motion is irrotational, then there
exists a velocity potential $\phi$ such that v = grad $\phi$. The
continuity equation then shows that $\nabla^2\phi = 0$.

Similar equations are encountered in electrostatics
and magnetostatics. For example, in empty space, the
electric field, E, satisfies div E = 0 and curl E = 0. Thus,
we can write E = grad $\varphi$, where the potential $\varphi$ solves
$\nabla^2\varphi = 0$.

4 Analytic Function Theory 

In this section we use z = x + iy to denote a complex
variable. Let f(z) = u(x, y) + i v(x, y) be a function
of z. If f is analytic (that is, a differentiable function
of z), u and v satisfy the Cauchy-Riemann equations:

$$\frac{\partial u}{\partial x} = \frac{\partial v}{\partial y} \quad\text{and}\quad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}.$$

Eliminating v between these equations shows that
u(x, y) satisfies the two-dimensional Laplace equation,
(2); v(x, y) solves the same PDE. The real and
imaginary parts of an analytic function are said to be
harmonic.
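
The elimination step can be written out explicitly. The short derivation below assumes that u and v have continuous second partial derivatives, so that the mixed partial derivatives of v commute:

```latex
\frac{\partial^2 u}{\partial x^2}
  = \frac{\partial}{\partial x}\!\left(\frac{\partial v}{\partial y}\right)
  = \frac{\partial}{\partial y}\!\left(\frac{\partial v}{\partial x}\right)
  = \frac{\partial}{\partial y}\!\left(-\frac{\partial u}{\partial y}\right)
  = -\frac{\partial^2 u}{\partial y^2}
\quad\Longrightarrow\quad
\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 .
```

For instance, $f(z) = z^2$ gives $u = x^2 - y^2$ and $v = 2xy$, both of which satisfy (2).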

This connection between functions of a complex vari- 
able and two-dimensional harmonic functions leads to 
powerful methods for solving BVPs for (2), especially 
when conformal mappings [II. 5] are employed. 

5 Generalizations 

The Laplacian occurs in PDEs other than Laplace’s
equation. Here are a few examples:

$\nabla^2 u = c^{-2}(\partial^2 u/\partial t^2)$, the wave equation,

$\nabla^2 u = k^{-1}(\partial u/\partial t)$, the heat/diffusion equation,

$\nabla^2 u + k^2 u = 0$, the Helmholtz equation,

$\nabla^2 u + \mathrm{i} f(\partial u/\partial t) = W u$, the Schrödinger equation.

In these equations, c, k, f, and W are given (the first
three are often constants) and t is time. One reason that
$\nabla^2$ occurs frequently is that space is often assumed to
be isotropic, meaning that there is no preferred direction.
Thus, in (1), x, y, and z are Cartesian coordinates,
but the value of $\nabla^2 V$ does not change if the coordinates
are rotated or translated. The Laplacian is the simplest
linear second-order operator with this property.

There is an anisotropic version of $\nabla^2 u$, namely,
div(A grad u), where A is an n × n matrix with entries
that could be constants or functions of the n independent
variables. If A depends on u, a nonlinear operator
is obtained. Another well-studied nonlinear variant
is the p-Laplacian, defined by $\operatorname{div}(|\operatorname{grad} u|^{p-2}\operatorname{grad} u)$
with $1 < p < \infty$.

A useful fourth-order operator is $\nabla^2\nabla^2 = \nabla^4$. The
biharmonic equation, $\nabla^4 u = 0$, arises in the theory of
thin elastic plates, for example.


III. 19 The Logistic Equation 

Paul Glendinning 


The logistic equation is a simple differential equation
or difference equation with quadratic nonlinearity. It
arises naturally in population models [I.5 §3] (for
example) as a model of a process with population-limited
growth, i.e., models that include the inhibitory
effects of overcrowding.

The continuous-time version is a differential equation
for a real variable x that represents the size of a
population. It is

$$\frac{dx}{dt} = r x\Bigl(1 - \frac{x}{K}\Bigr),$$

although there are higher-dimensional analogues and
extensions to partial differential equations. The application
to population dynamics means that it is usual
to assume that $x(0) = x_0 > 0$. The two parameters are
the reproduction rate r > 0 (the difference between the
birth rate and the death rate), and the carrying capacity
K, which, as shown below, is the equilibrium population
level.

The logistic equation is separable and solutions
may be calculated explicitly using partial fractions
(see ORDINARY DIFFERENTIAL EQUATIONS [IV.2 §2]). The
solution is

$$x(t) = \frac{K x_0}{x_0 + (K - x_0)e^{-rt}},$$

so all solutions with $x_0 > 0$ tend to the carrying
capacity, K, as t tends to $\infty$.
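
The closed-form solution can be compared with a direct numerical integration of $dx/dt = rx(1 - x/K)$. In the Python sketch below, the function names and the sample values of r, K, and $x_0$ are hypothetical, and the classical fourth-order Runge-Kutta method is used purely as an independent check.

```python
import math


def logistic_exact(t, x0, r, K):
    """Closed-form solution x(t) = K*x0 / (x0 + (K - x0)*exp(-r*t))."""
    return K * x0 / (x0 + (K - x0) * math.exp(-r * t))


def logistic_rk4(t_end, x0, r, K, steps=1000):
    """Integrate dx/dt = r*x*(1 - x/K) with the classical Runge-Kutta method."""
    def f(x):
        return r * x * (1 - x / K)

    h = t_end / steps
    x = x0
    for _ in range(steps):
        k1 = f(x)
        k2 = f(x + 0.5 * h * k1)
        k3 = f(x + 0.5 * h * k2)
        k4 = f(x + h * k3)
        x += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return x


if __name__ == "__main__":
    r, K, x0 = 0.8, 100.0, 5.0  # hypothetical parameter values
    print(logistic_rk4(10.0, x0, r, K), logistic_exact(10.0, x0, r, K))
```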

The right-hand side of the logistic equation can be
interpreted as the first two terms of the Taylor series
expansion of a function f with f(0) = 0, so the
equation is a natural model in many other contexts.

The discrete-time version of the logistic equation is
often called the logistic map. It is

$$x_{n+1} = \mu x_n(1 - x_n), \qquad \mu > 0.$$

This difference equation is one of the paradigmatic
examples of systems with chaotic attractors. One early
interpretation is again from population biology: the
equation may describe the successive population levels
of an organism that has discrete generations, so $x_n$
is the normalized number of insects (for example) in
the nth generation, and this depends on the population
size of the previous generation.

The logistic map illustrates the different dynamics
that can exist in general unimodal, or one-hump, maps.
If $0 < \mu \le 4$ then all solutions that start in [0, 1] stay in
that interval, so there is (at least) one bounded attractor
of the system. The way this attractor varies with $\mu$ can
be very complicated.

If $\mu \in (0, 1)$ then the origin is a stable fixed point (if
$x_0 = 0$ then $x_n = 0$ for all $n \in \mathbb{N}$), and for all initial
conditions $x_0 \in [0, 1]$, $x_n \to 0$ as $n \to \infty$. If $\mu \in (1, 3)$
then there is a stable nontrivial fixed point $(\mu - 1)/\mu$
that attracts all initial conditions in (0, 1). At $\mu = 3$
this fixed point undergoes a period-doubling bifurcation
[IV.21]. The effect of this bifurcation is to create
a stable period-two orbit as $\mu$ increases through 3
and the fixed point becomes unstable. As $\mu$ increases
further there is a sequence of period-doubling bifurcations
at parameter values $\mu_n$ at which orbits of period
$2^n$ lose stability, and a stable periodic orbit of period
$2^{n+1}$ is created.
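
The first few regimes are easy to observe by direct iteration. In the Python sketch below, the function name `logistic_orbit`, the starting value, and the transient length are assumptions of this example, and the map parameter is written `mu`.

```python
def logistic_orbit(mu, x0=0.2, transient=1000, keep=4):
    """Iterate x_{n+1} = mu*x_n*(1 - x_n), discard a transient, return `keep` iterates."""
    x = x0
    for _ in range(transient):
        x = mu * x * (1 - x)
    orbit = []
    for _ in range(keep):
        orbit.append(x)
        x = mu * x * (1 - x)
    return orbit


if __name__ == "__main__":
    print(logistic_orbit(0.5))  # decays toward the origin
    print(logistic_orbit(2.5))  # settles on the fixed point (mu - 1)/mu
    print(logistic_orbit(3.2))  # alternates between two values
```

For parameter values just above 3 the iterates alternate between two values, which is the stable period-two orbit created in the first period-doubling bifurcation.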

These bifurcation values accumulate geometrically
at a special value $\mu_\infty$, above which there are infinitely
many periodic orbits in the system and the system
is chaotic [II.3], although the chaotic set may not
be attracting. In this chaotic region of parameters
there are “windows” (i.e., intervals of the parameter) 
for which the attracting behavior is again periodic. 
These orbits then lose stability, undergoing their own 
period-doubling sequences and having a similar bifur- 
cation structure to the original map, but over a smaller 
parameter interval. 

There are many interesting results about the dynam- 
ics in the chaotic region. For example, the set of param- 
eters for which the map has a chaotic attractor has 
positive measure; and between any two parameters 
with chaotic attractors there are parameters with stable 
periodic behavior. 

The order in which periodic orbits are created satisfies
Sharkovskii’s theorem (valid for any continuous
map of the interval). Sharkovskii’s theorem defines an
order on the natural numbers that reflects the order in
which different periods appear in families of maps. The
Sharkovskii order, $\prec$, is a complete order on the positive
integers, so for any positive integers p and q with
$p \ne q$, either $p \prec q$ or $q \prec p$. The order is defined by
the list below. To interpret this list, imagine you have
two positive integers and think about where each one
appears in the list (they must both appear somewhere!).
If p appears before q, then $p \prec q$. The list is

$$1 \prec 2 \prec 2^2 \prec 2^3 \prec \cdots$$

$$\cdots \prec 2^{n+1}\times 9 \prec 2^{n+1}\times 7 \prec 2^{n+1}\times 5 \prec 2^{n+1}\times 3$$

$$\cdots \prec 2^{n}\times 9 \prec 2^{n}\times 7 \prec 2^{n}\times 5 \prec 2^{n}\times 3$$

$$\cdots \prec 9 \prec 7 \prec 5 \prec 3,$$

i.e., powers of two ascending, then for each n, $2^n$
times the odds descending, with n descending to zero.
Sharkovskii’s theorem states that, if a continuous map
of the interval has an orbit of period p, then it also has
orbits of period q for all $q \prec p$ in this order.
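
The ordering described by the list can be encoded as a sort key. In the Python sketch below, the names `sharkovskii_key` and `precedes` are assumptions of this example: each positive integer is written as $2^a$ times an odd number, pure powers of two are placed first in ascending order, and the remaining integers follow with a descending and the odd part descending.

```python
def sharkovskii_key(n):
    """Sort key: p precedes q in the Sharkovskii order iff key(p) < key(q).

    n must be a positive integer.
    """
    a = 0
    while n % 2 == 0:
        n //= 2
        a += 1
    if n == 1:
        return (0, a, 0)   # powers of two come first, in ascending order
    return (1, -a, -n)     # then 2^a * odd, with a and the odd part descending


def precedes(p, q):
    """True if p appears before q in the list defining the order."""
    return sharkovskii_key(p) < sharkovskii_key(q)


if __name__ == "__main__":
    print(sorted(range(1, 13), key=sharkovskii_key))
```

Since 3 is the last entry of the list, every other positive integer precedes it, which is why an orbit of period three forces orbits of all other periods.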

The logistic equation has different orbits of the same
period that can be described by labeling the periodic
points $x_1 < x_2 < \cdots < x_p$ with $f(x_i) = x_{\pi(i)}$ for
some permutation $\pi$ of {1, ..., p}. Sharkovskii’s theorem
does not distinguish between orbits with the same
period and different permutation type. It turns out that
the period and associated permutation type define a
complete order on the periodic orbits arising in the
logistic equation.

Further Reading 

Collet, P., and J.-P. Eckmann. 1980. Iterated Maps of the
Interval as Dynamical Systems. Basel: Birkhäuser.

de Melo, W., and S. van Strien. 1993. One-Dimensional
Dynamics. Berlin: Springer.


III. 20 The Lorenz Equations 

Paul Glendinning 


The Lorenz equations provided one of the earliest 
examples of a nonlinear differential equation with 
chaotic behavior. They were derived in the early 1960s 
when the growing use and power of computers raised 
the exciting prospect of greatly improved weather fore- 
casts, and particularly more reliable long-range fore- 
casts. This generated work on improving the numeri- 
cal techniques applied to meteorological models, and 
it also led to the study of simplified models and their 
properties. Saltzman used a crude truncation of the 
Fourier expansion of solutions to convert the par- 
tial differential equations describing convection in a 
horizontal layer into a finite set of coupled ordinary 
differential equations that are significantly easier to 
solve numerically. Saltzman concentrated on a fifty- 
two-mode truncation, but in an article published the 
next year (1963), Ed Lorenz took the model to its over- 
simplified extreme, retaining just three of the Galerkin 
modes. 

Lorenz's model is normally written as a set of coupled
differential equations in three variables, now
known as the Lorenz equations:

$$\dot{x} = \sigma(y - x),$$
$$\dot{y} = rx - y - xz,$$
$$\dot{z} = -bz + xy,$$

where the dot denotes differentiation with respect to
the independent variable t, and where $\sigma$ is a normalized
Prandtl number, r a normalized Rayleigh number,
and b the aspect ratio of the convecting cell. The
parameters used by Lorenz were

$$\sigma = 10, \quad r = 28, \quad b = \tfrac{8}{3}.$$

Figure 1 The Lorenz attractor projected
onto the (x, z) coordinates.


The insight derived through the investigation of these 
equations has motivated a greater understanding of 
the mathematics of chaos [II. 3] and has had a major 
impact on the way weather forecasts are created and 
reported. A sample trajectory of the Lorenz equations 
at the parameter values given above is shown in fig- 
ure 1. The solution clearly settles down to a bounded 
attracting set (the Lorenz attractor), but it appears 
not to be periodic. Moreover, solutions that start close 
together eventually diverge and behave completely dif- 
ferently, a property now called sensitive dependence 
on initial conditions, which is one of the hallmarks of 
chaos. (A little care is needed here. Sensitive depend- 
ence on initial conditions as usually defined in the 
twentieth century is not a good definition of chaos; 
some exponential divergence in time is necessary in 
more modern definitions (see chaos and ergodic- 
ity [II.3]).) The story goes that Lorenz discovered this 
phenomenon by a fortuitous mistyping of an initial 
condition when checking a modification of his com- 
puter code. Lorenz’s genius was to recognize this as 
a significant property of the solutions. 
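
Sensitive dependence is easy to reproduce numerically. The Python sketch below integrates the Lorenz equations at the standard parameter values with the classical fourth-order Runge-Kutta method and tracks the distance between two trajectories whose initial conditions differ by $10^{-8}$ in z. The function names, step size, integration time, and initial conditions are assumptions of this example, not a reconstruction of Lorenz's own computation.

```python
import math


def lorenz(state, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Right-hand side of the Lorenz equations."""
    x, y, z = state
    return (sigma * (y - x), r * x - y - x * z, -b * z + x * y)


def rk4_step(state, h):
    """One classical Runge-Kutta step of size h."""
    def shifted(s, k, c):
        return tuple(si + c * ki for si, ki in zip(s, k))

    k1 = lorenz(state)
    k2 = lorenz(shifted(state, k1, 0.5 * h))
    k3 = lorenz(shifted(state, k2, 0.5 * h))
    k4 = lorenz(shifted(state, k3, h))
    return tuple(
        s + h * (a + 2 * b2 + 2 * c + d) / 6
        for s, a, b2, c, d in zip(state, k1, k2, k3, k4)
    )


def max_separation(p, q, h=0.01, steps=4000):
    """Largest distance between two trajectories over `steps` RK4 steps."""
    largest = 0.0
    for _ in range(steps):
        p, q = rk4_step(p, h), rk4_step(q, h)
        largest = max(largest, math.dist(p, q))
    return largest


separation = max_separation((1.0, 1.0, 1.0), (1.0, 1.0, 1.0 + 1e-8))

if __name__ == "__main__":
    print(separation)
```

Although the two starting points are indistinguishable at plotting accuracy, the separation grows to the size of the attractor itself within a few tens of time units, which is the behavior described above.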

The Lorenz attractor motivated several developments
in bifurcation theory and chaos. In the late 1970s
mathematicians developed a simple geometric model
of the flow that could be reduced to the analysis of a 
one-dimensional map with a discontinuity that could 
be proved to be chaotic. At the same time, researchers 
started to describe the development of the attractor as 
a function of the parameter r, showing that the solu- 
tions in the strange attractor are created as an initially 
unstable set by a global bifurcation [IV. 21]. Spar- 
row brought all these results together, giving a descrip- 
tion of the bifurcations of the Lorenz attractor as r 
is changed, using a mixture of numerical simulations, 
mathematical proofs, and conjectured links between 
the two. 

Sparrow's work provides strong evidence for the fol- 
lowing description of the changes in the attractors of 
the Lorenz equations as r increases. The origin, which 
is a globally attracting stationary point if 0 < r < 1, 
loses stability as r increases through 1 via a pitchfork 
bifurcation that creates a stable symmetric pair of sta- 
tionary points corresponding to convective rolls (see 
bifurcation theory [IV.21 §2] for a description of different
types of bifurcation). At $r = r_H \approx 24.74$ there
is a pair of subcritical Hopf bifurcations. The effect of
this bifurcation is that the stationary points lose stability
as r increases through $r_H$, and a pair of unstable
periodic orbits is created in $r < r_H$. Where did these
orbits come from? Yorke and coworkers had shown
in 1979 that the answer lies in a homoclinic bifurcation
at $r \approx 13.926$. This creates an unstable chaotic
set containing infinitely many unstable periodic orbits,
many of which are destroyed in bifurcations before the
chaotic set becomes attracting by a mechanism involving
the two simple periodic orbits that are the orbits
involved in the Hopf bifurcation at $r \approx 24.06$. There
is therefore a brief interval of r values for which a
complicated attractor coexists with the stable stationary
points, and if r is a little greater than $r_H$, the only
attractor is the chaotic set. This is initially similar to the 
geometric model, but it develops contracting regions 
(“hooks”) as r increases further, which allows for the 
possibility of the creation of stable periodic orbits. At 
very large r the only attractor is a simple symmetric 
periodic orbit, so a sequence of bifurcations destroying 
the orbits of the chaotic set needs to occur. 

Despite a host of theoretical results on the geomet- 
ric models and on the bifurcations in systems such as 
the Lorenz equations, a proof that the Lorenz equa- 
tions at the standard parameter values really do have 
a strange attractor remained open until 2002, when 
Tucker used a combination of rigorous numerically 


computed bounds and mathematical analysis to show 
that the attractor has the chaotic properties required 
(see dynamical systems [IV.20 §4.5] for more details). 

While the mathematical issues were being resolved, 
questions about the physical relevance of the Lorenz 
equations continued to cause controversy. As more and 
more of the ignored modes of Saltzman’s truncated 
model are added back into the equations, the chaotic 
region exists for larger and larger values of r and is no 
longer present in the full partial differential equations. 
More imaginative physical situations, such as convec- 
tion in a rotating hoop, have been devised, and for these 
the Lorenz equations are a good model. 

Even though it is not an accurate model of convection 
in a fluid layer, the physical insight Lorenz brought to 
the problem of weather forecasting was hugely influ- 
ential. As a result of the recognition that sensitive 
dependence on initial conditions can be a problem, 
forecasters now routinely run computer simulations 
of their models with a variety of initial conditions so 
that when it is appropriate they can comment on the 
probability of rain rather than issue a simple state- 
ment that it will or will not rain. The understanding of 
how chaos can be accommodated in nonlinear predic- 
tion has also influenced the way in which sophisticated 
models of the climate and other nonlinear phenomena 
are interpreted. 

Further Reading 

Kaplan, J. L., and J. A. Yorke. 1979. Preturbulence: a regime 
observed in a fluid flow model of Lorenz. Communications 
in Mathematical Physics 67:93-108. 

Lorenz, E. 1963. Deterministic nonperiodic flow. Journal of 
the Atmospheric Sciences 20:130-41. 

Sparrow, C. T. 1982. The Lorenz Equations: Bifurcations, 
Chaos, and Strange Attractors. New York: Springer. 

Tucker, W. 2002. A rigorous ODE solver and Smale’s 14th 
problem. Foundations of Computational Mathematics 2: 
53-117. 


III.21 Mathieu Functions 

Julio C. Gutierrez-Vega 


Mathieu functions are solutions of the ordinary Mathieu equation

\[ \frac{d^2 y}{d\theta^2} + (a - 2q\cos 2\theta)\,y = 0, \tag{1} \]

where $q$ is a free parameter and $a$ is the eigenvalue of
the equation. Mathieu's equation was first studied by
Émile Mathieu in 1868 in the context of the vibrational
modes of an elliptic membrane.

For arbitrary parameters $(a, q)$, (1) is a linear second-order
differential equation with two independent solutions. In general,
these solutions are not periodic functions and their behavior
depends on the initial conditions $y(0)$ and $y'(0)$. Of particular
interest are the periodic solutions with period $\pi$ or $2\pi$. In
this case, according to Sturm-Liouville theory, there exists a
countably infinite set of characteristic eigenvalues $a_m(q)$ that
yield even periodic solutions of (1), and another set of
characteristic eigenvalues $b_m(q)$ that yield odd periodic
solutions of (1). The eigenfunctions associated with these sets of
eigenvalues are known as the even and odd Mathieu functions:



\[ y_m = \begin{cases} \mathrm{ce}_m(\theta; q), & m = 0, 1, 2, \ldots, \\ \mathrm{se}_m(\theta; q), & m = 1, 2, 3, \ldots, \end{cases} \]


where m is the order. The notation ce and se comes 
from cosine-elliptic and sine-elliptic, and it is now a 
widely accepted notation for the periodic Mathieu func- 
tions. 

Mathieu functions occur in two main categories of 
physical problems. First, they appear in applications 
involving the separation of the wave equation in elliptic 
coordinates, e.g., the vibrating modes in elliptic mem- 
branes, the propagating modes in elliptic pipes, or the 
oscillations of water in a lake of elliptic shape. Second, 
they occur in phenomena that involve periodic motion, 
e.g., the trajectory of an electron in a periodic array of 
atoms, the mechanics of the quantum pendulum, or the 
oscillations of floating vessels. 

The behavior of Mathieu functions is rather complicated, and their
analysis is difficult using standard methods, mainly due to their
nontrivial dependence on the parameters $(a, q)$. In figure 1 we plot
the functions $\mathrm{ce}_m(\theta; q)$ and $\mathrm{se}_m(\theta; q)$ for several values of
$m$ over the plane $(\theta, q)$. Note that Mathieu’s equation becomes
the harmonic equation when $q = 0$. Evidently, $\mathrm{ce}_m$ and $\mathrm{se}_m$
converge to the trigonometric functions $\cos(m\theta)$ and
$\sin(m\theta)$ as $q$ tends to zero.
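
The dependence of the eigenvalue on $q$ can be explored numerically. The sketch below is one possible approach, not a method taken from the text: it assumes that the lowest even solution with period pi is singled out by the symmetry conditions y'(0) = 0 and y'(pi/2) = 0, integrates (1) with a Runge-Kutta scheme, and finds the eigenvalue a_0(q) by bisection. The bracket [-1, 1], the step count, and the function names are assumptions chosen for small q.

```python
import math


def slope_at_half_pi(a, q, steps=400):
    """Integrate y'' + (a - 2 q cos 2θ) y = 0 from θ = 0 with y = 1, y' = 0
    using classical RK4, and return y'(π/2)."""
    h = (math.pi / 2.0) / steps

    def rhs(theta, y, v):
        return v, -(a - 2.0 * q * math.cos(2.0 * theta)) * y

    theta, y, v = 0.0, 1.0, 0.0
    for _ in range(steps):
        k1y, k1v = rhs(theta, y, v)
        k2y, k2v = rhs(theta + h / 2, y + h / 2 * k1y, v + h / 2 * k1v)
        k3y, k3v = rhs(theta + h / 2, y + h / 2 * k2y, v + h / 2 * k2v)
        k4y, k4v = rhs(theta + h, y + h * k3y, v + h * k3v)
        y += h * (k1y + 2 * k2y + 2 * k3y + k4y) / 6.0
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6.0
        theta += h
    return v


def a0(q, lo=-1.0, hi=1.0, tol=1e-12):
    """Bisection for the eigenvalue a_0(q); the bracket is an assumption
    that is adequate for small q."""
    f_lo = slope_at_half_pi(lo, q)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = slope_at_half_pi(mid, q)
        if f_mid == 0.0:
            return mid
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for q in (0.0, 0.1, 0.5):
    print(f"q = {q:.1f}  a_0 = {a0(q):+.8f}")
```

At $q = 0$ the computed eigenvalue is zero, in line with the harmonic limit; for small $q$ it can be compared with the perturbation series $a_0 \approx -q^2/2 + 7q^4/128$.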

The parity, periodicity, and normalization of the periodic Mathieu
functions are exactly the same as for their trigonometric
counterparts. That is, $\mathrm{ce}_m$ is even and $\mathrm{se}_m$ is odd, and they
have period $\pi$ when $m$ is even or period $2\pi$ when $m$ is odd. The
Mathieu functions have $m$ real zeros in the open interval
$\theta \in (0, \pi)$, and the zeros cluster around $\pi/2$ as $q$
increases. Because the Mathieu equation is of Sturm-Liouville type,
the Mathieu functions form a complete family of orthogonal functions
whose normalization conditions are

\[ \int_0^{2\pi} \mathrm{ce}_m \mathrm{ce}_n \, d\theta = \int_0^{2\pi} \mathrm{se}_m \mathrm{se}_n \, d\theta = \pi \delta_{m,n}. \tag{2} \]

If a function $f(\theta)$ is periodic with period $\pi$ or $2\pi$, then
it can be expanded as a series of orthogonal Mathieu functions.

Figure 1 The behavior of Mathieu functions over the plane
$(\theta, q)$. The range of the plots has been limited to $[0, \pi]$,
since their behavior over the entire range can be deduced
from the parity and symmetry relations. (a) $\mathrm{ce}_0(\theta; q)$.
(b) $\mathrm{ce}_1(\theta; q)$. (c) $\mathrm{ce}_2(\theta; q)$. (d) $\mathrm{se}_1(\theta; q)$.
(e) $\mathrm{se}_2(\theta; q)$. (f) $\mathrm{se}_3(\theta; q)$.

Further Reading 

Mathieu, É. 1868. Le mouvement vibratoire d’une membrane
de forme elliptique. Journal de Mathématiques Pures et
Appliquées 13:137-203.

McLachlan, N. W. 1951. Theory and Application of Mathieu 
Functions. Oxford: Oxford University Press. 


III.22 Maxwell’s Equations 

Mark R. Dennis 


In the early 1860s James Clerk Maxwell wrote down 
a set of equations summarizing the spatial and tem- 
poral behavior of electric and magnetic fields; these 
equations are the foundation of the physical theory 
of electromagnetism. In modern notation they are usu- 
ally written as first-order vector differential equations 
depending on time t and three-dimensional position 
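$\mathbf{r}$. With $\nabla$ denoting the gradient operator and a dot
above a quantity denoting its time derivative, Maxwell’s
equations are usually written

\begin{align}
\nabla \cdot \mathbf{B} &= 0, \tag{1} \\
\nabla \times \mathbf{E} + \dot{\mathbf{B}} &= 0, \tag{2} \\
\nabla \cdot \mathbf{D} &= \rho, \tag{3} \\
\nabla \times \mathbf{H} - \dot{\mathbf{D}} &= \mathbf{J}. \tag{4}
\end{align}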

The various electromagnetic fields are called the electric
field $\mathbf{E}$, the electric displacement $\mathbf{D}$, the magnetic
induction $\mathbf{B}$, and the magnetic field $\mathbf{H}$. $\mathbf{E}$ and $\mathbf{B}$ are
usually viewed as more fundamental, and $\mathbf{D}$ and $\mathbf{H}$ can
often be expressed as functions of them. Equations (3)
and (4) are inhomogeneous and depend on sources
determined by the scalar electric charge density $\rho$ and
vector current density $\mathbf{J}$. Equations (1)-(4) may also be
expressed in integral form, in terms of surface and volume
integrals. Conventionally, they are referred to as
Gauss’s law (3), Gauss’s law for magnetism (1) (also
sometimes referred to as “no magnetic monopoles”),
Faraday’s law (2), and the Maxwell-Ampère law (4).

All electromagnetic phenomena are described by 
Maxwell’s equations, including the behavior of elec- 
tricity and electronic circuits, motors and dynamos, 
light and optics, wireless communication, microwaves, 
etc. The phenomena they describe are ubiquitous in 
the modern world, and much of the machinery of
applied mathematics, such as partial differential equations,
Green functions, delta functions [III.7], and
vector calculus, was introduced in part to describe
electromagnetic situations. Maxwell's equations take
on various special forms depending on the system 
being considered. For instance, when the fields are 
static (i.e., when time derivatives are zero), the electric 
field around charges resembles the gravitational field 
around masses determined by Newton’s law. 

In free space, all sources are zero, and $\mathbf{D} = \varepsilon_0 \mathbf{E}$,
$\mathbf{H} = \mu_0^{-1} \mathbf{B}$, with constants the permittivity of free space
$\varepsilon_0 = 8.85 \times 10^{-12}\ \mathrm{F\,m^{-1}}$ and the permeability of free
space $\mu_0 = 4\pi \times 10^{-7}\ \mathrm{N\,A^{-2}}$, given in conventional SI
units, originally determined experimentally using electrical
currents, charges, and magnets. Written like this,
Maxwell’s equations are symmetric in form between $\mathbf{E}$
and $\mathbf{B}$ (with the exception of a minus sign), and can be
combined to give the d’Alembert equation for $\mathbf{E}$ and
$\mathbf{B}$, propagating at speed
$c = (\varepsilon_0 \mu_0)^{-1/2} = 2.998 \times 10^{8}\ \mathrm{m\,s^{-1}}$.
Maxwell himself originally noticed this fact, realizing that $c$ was
close to the experimentally measured speed of light and therefore
that light itself is an electromagnetic wave. Plane-wave solutions of
Maxwell’s equations in free space are
$\mathbf{E} = \mathbf{E}_0 \cos(\mathbf{k} \cdot \mathbf{r} - \omega t)$,
$\mathbf{B} = \mathbf{B}_0 \cos(\mathbf{k} \cdot \mathbf{r} - \omega t)$, where $\mathbf{E}_0$, $\mathbf{B}_0$ are
constant polarization vectors that form a right-handed orthogonal
triple $(\mathbf{E}_0, \mathbf{B}_0, \mathbf{k})$ with the wave vector $\mathbf{k}$, and
$c = \omega/|\mathbf{k}|$, with $\omega$ the angular frequency. Free-space
solutions of Maxwell’s equations are studied systematically in the
field of optics and photonics [V.14].
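
The step from Maxwell's equations to the d'Alembert (wave) equation can be sketched as follows. The sketch assumes free space (no charges or currents, $\mathbf{D} = \varepsilon_0 \mathbf{E}$, $\mathbf{H} = \mathbf{B}/\mu_0$) and uses the standard vector identity $\nabla \times (\nabla \times \mathbf{E}) = \nabla(\nabla \cdot \mathbf{E}) - \nabla^2 \mathbf{E}$, which is not stated in the text.

```latex
\begin{align*}
\nabla \times (\nabla \times \mathbf{E}) &= -\nabla \times \dot{\mathbf{B}}
  && \text{(curl of (2))} \\
\nabla(\nabla \cdot \mathbf{E}) - \nabla^2 \mathbf{E} &= -\mu_0 \varepsilon_0 \ddot{\mathbf{E}}
  && \text{(from (4): } \nabla \times \mathbf{B} = \mu_0 \varepsilon_0 \dot{\mathbf{E}}\text{)} \\
\nabla^2 \mathbf{E} &= \frac{1}{c^2}\, \ddot{\mathbf{E}},
  \qquad c = (\varepsilon_0 \mu_0)^{-1/2}
  && \text{(from (3): } \nabla \cdot \mathbf{E} = 0\text{)}
\end{align*}
```

The same steps applied to the curl of (4), together with (1) and (2), give the identical equation for $\mathbf{B}$.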

Many naturally occurring materials without free
charges (dielectric materials) are linear: $\mathbf{D}$ and $\mathbf{H}$ are
linear functions of $\mathbf{E}$ and $\mathbf{B}$, but with different
permittivities and permeabilities (possibly in different
directions). The ratio of $c$ to the speed of electromagnetic wave
propagation in these materials is called the refractive index.
Laws of refraction, reflection,
and transmission can all in principle be derived
by applying Maxwell's equations to the interface region 
between materials. Other materials can have more com- 
plicated dependence between the various field quanti- 
ties and sources. Some of the simplest nonlinear phe- 
nomena occur in materials where D depends quadrati- 
cally on E. 

The homogeneity of (1) and (2) suggests that $\mathbf{E}$ and
$\mathbf{B}$ may themselves be expressed as derivatives of the
scalar potential $V$ and the vector potential $\mathbf{A}$, via

\[ \mathbf{B} = \nabla \times \mathbf{A}, \qquad \mathbf{E} = -\nabla V - \dot{\mathbf{A}}. \tag{5} \]

In terms of these potentials, (1) and (2) are automatically
satisfied, and (3) and (4) become second-order differential
equations in $V$ and $\mathbf{A}$. As only their derivatives
are defined by the electromagnetic fields, the absolute
values of $V$ and $\mathbf{A}$ are somewhat arbitrary, and choices
of potential field may be changed by a gauge transformation
that preserves the derivative relations in (5),
$\mathbf{A} \to \mathbf{A}' = \mathbf{A} + \nabla\chi$, $V \to V' = V - \dot{\chi}$, for some
sufficiently differentiable scalar field $\chi$. It is often
convenient to simplify (1) and (2) by reexpressing them in
terms of the potentials and then choosing a restriction
on the gauge transformations to be considered (known
as fixing the gauge), by requiring the potential fields to
satisfy some extra equation compatible with (5), such
as $\nabla \cdot \mathbf{A} = 0$ (the Coulomb gauge) or $\nabla \cdot \mathbf{A} + c^{-2}\dot{V} = 0$
(the Lorenz gauge).
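
That a gauge transformation leaves the fields unchanged can be checked directly. In the short worked calculation below, $\chi$ denotes the scalar field of the gauge transformation and primes denote the transformed potentials and fields; the only facts used beyond (5) are that the curl of a gradient vanishes and that space and time derivatives commute.

```latex
\begin{align*}
\mathbf{B}' &= \nabla \times \mathbf{A}'
  = \nabla \times \mathbf{A} + \nabla \times \nabla\chi
  = \nabla \times \mathbf{A} = \mathbf{B}, \\
\mathbf{E}' &= -\nabla V' - \dot{\mathbf{A}}'
  = -\nabla V + \nabla\dot{\chi} - \dot{\mathbf{A}} - \nabla\dot{\chi}
  = -\nabla V - \dot{\mathbf{A}} = \mathbf{E}.
\end{align*}
```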

Maxwell’s equations describe how electric and mag- 
netic fields depend spatially and temporally on elec- 
tric charges and their motion. They do not, however, 
describe directly how the motion of charges depends 
on fields; the force $\mathbf{F}$ on a particle of electric charge $q$
at position $\mathbf{r}$ due to electromagnetic fields is given by
the Lorentz force law

\[ \mathbf{F} = q(\mathbf{E} + \dot{\mathbf{r}} \times \mathbf{B}). \tag{6} \]

The combined theory of electromagnetic fields, charged
matter, and their mutual dependence is called
electrodynamics.
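
As a small illustration of the Lorentz force law (6), the sketch below evaluates the force on a point charge from given field and velocity vectors. The function names and the use of plain 3-tuples are assumptions made for this example; units are taken to be consistent SI units.

```python
def cross(a, b):
    """Vector product a x b of two 3-tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def lorentz_force(q, E, v, B):
    """Force q (E + v x B) on a charge q moving with velocity v."""
    v_cross_b = cross(v, B)
    return tuple(q * (e + c) for e, c in zip(E, v_cross_b))


# A unit charge moving along x through a magnetic field along z
# is pushed along -y, perpendicular to both.
force = lorentz_force(1.0, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
print(force)
```

Because the magnetic part of the force is always perpendicular to the velocity, it changes the direction of motion but does no work on the particle.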

A particularly interesting aspect of Maxwell’s equa- 
tions from the physical point of view is that they appear 
to apply equally well at many different length scales, 
from the astronomical (used to understand electromag- 
netic radiation in the cosmos) down to the microscopic 
(and even interactions at the scale of subatomic par- 
ticles). Inside materials, at the atomic level, electrons 
and nuclei are separated by free space, and only the 
fundamental fields E and B play a role. In situations 
where length scales are longer, the electric displace- 
ment D and magnetic field H arise through homoge- 
nization [IV.6 §13.3] of length scales. Maxwell’s equa- 
tions can also be made compatible with the laws of 
quantum mechanics [IV.23], both in treatments with 
the field as classical and in treatments in which the
electromagnetic fields themselves are quantized, with
their energy quanta $\hbar\omega$ referred to as photons, where
$\hbar$ is Planck's constant. The combined quantum theory
of charged matter and electromagnetic fields is quan- 
tum electrodynamics. Gauge transformations take on 
an important role and a new physical interpretation in 
quantum theory. 

Unlike Newton’s equations in classical mechanics [IV.19],
Maxwell’s equations are not invariant with
respect to Galilean coordinate transformations. Historically,
this provided Einstein with the main motivation
for the theory of special relativity, as providing
the appropriate set of transformations keeping
Maxwell’s equations covariant. Maxwell's equations in
free space can thus be expressed in 4-vector notation
(which is explained in tensors and manifolds [II.33]),
in which the electromagnetic field is specified by an 
antisymmetric rank-2 tensor, the Faraday tensor: 


\[ F^{ab} = \begin{pmatrix} 0 & -E_x/c & -E_y/c & -E_z/c \\ E_x/c & 0 & -B_z & B_y \\ E_y/c & B_z & 0 & -B_x \\ E_z/c & -B_y & B_x & 0 \end{pmatrix}. \]


Reexpressed in tensor form, Maxwell’s equations are

\[ \partial_a F_{bc} + \partial_b F_{ca} + \partial_c F_{ab} = 0, \qquad \partial_a F^{ab} = \mu_0 J^b, \]

with the first equation corresponding to (1) and (2),
and the second to (3) and (4), with the current 4-vector
$J^b = (c\rho, J_x, J_y, J_z)$. These equations transform covariantly
under Lorentz transformations. The Lorentz law
(6) can also be relativistically generalized to the equation
for a force 4-vector $f^a = qF^{ab}u_b$, acting on a
particle with velocity 4-vector $u^b$. These equations can
be further generalized to fields in curved space-time,
in which they play a role in general relativity and
cosmology [IV.40].

Further Reading 


III.23 The Navier-Stokes Equations 

H. K. Moffatt 


The Navier-Stokes equations are the partial differential
equations that govern the flow of a fluid (a liquid or a
gas) that is regarded as a continuum; that is to say, a
medium whose density field $\rho(\mathbf{x},t)$ and (vector) velocity
field $\mathbf{u}(\mathbf{x},t)$ may be considered to be smooth functions
of position $\mathbf{x} = (x,y,z)$ and time $t$. These equations
express in mathematical form the physical principles
of conservation of mass and balance of momentum
for each small element (or parcel) of fluid in the course
of its motion.

1 The Mass-Conservation Equation 

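This equation relates $\rho(\mathbf{x},t)$ and $\mathbf{u}(\mathbf{x},t)$ in a very
simple way:

\[ \frac{D\rho}{Dt} \equiv \frac{\partial \rho}{\partial t} + \mathbf{u} \cdot \nabla\rho = -\rho\,\nabla \cdot \mathbf{u}. \]

Here, the symbol $\nabla$ represents the vector differential
operator $(\partial/\partial x, \partial/\partial y, \partial/\partial z)$. The operator
$D/Dt = \partial/\partial t + \mathbf{u} \cdot \nabla$ is the Lagrangian derivative (or
“derivative following the fluid”); this consists of two parts, the
local time derivative $\partial/\partial t$ and the convective derivative
$\mathbf{u} \cdot \nabla$. The equation indicates that the density
of a fluid element decreases or increases according
to whether the local divergence $\nabla \cdot \mathbf{u}$ is positive or
negative, respectively.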
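
The role of the divergence can be made concrete with a small numerical sketch. The two velocity fields below are illustrative choices, not taken from the text: a cellular flow whose divergence vanishes everywhere, and a radial outflow whose divergence is 2, so that the density of each fluid element would decrease. The divergence is estimated with central differences; the function names and the step size are assumptions.

```python
import math


def divergence(u, x, y, h=1e-5):
    """Central-difference estimate of the divergence of a 2D field
    u(x, y) -> (u_x, u_y) at the point (x, y)."""
    dux_dx = (u(x + h, y)[0] - u(x - h, y)[0]) / (2.0 * h)
    duy_dy = (u(x, y + h)[1] - u(x, y - h)[1]) / (2.0 * h)
    return dux_dx + duy_dy


def cellular_flow(x, y):
    """Divergence-free field: density of each element stays constant."""
    return (math.sin(x) * math.cos(y), -math.cos(x) * math.sin(y))


def radial_outflow(x, y):
    """Field with divergence 2: each element expands, so density falls."""
    return (x, y)


print(divergence(cellular_flow, 0.3, 0.7))
print(divergence(radial_outflow, 0.3, 0.7))
```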

An important subclass of flows is described as incompressible.
For these, the density of any element of fluid
is constant: $D\rho/Dt = 0$, and so, from the above,
$\nabla \cdot \mathbf{u} = 0$. Attention is focused on incompressible flows in the
following section.

2 The Momentum Equation

This equation represents the fundamental Newtonian 
balance 

mass x acceleration = force 


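for each fluid element. Of course, density is just mass
per unit volume, and the (Lagrangian) acceleration of
a fluid element is $D\mathbf{u}/Dt = (\partial/\partial t + \mathbf{u} \cdot \nabla)\mathbf{u}$. The
left-hand side of the following momentum equation should
therefore come as no surprise:

\[ \rho\left[\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u}\right] = -\nabla p + \mu\nabla^2\mathbf{u} + \mathbf{f}. \]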



The right-hand side contains three terms that represent
the forces that may act on a fluid element. The
term $-\nabla p$ represents (minus) the gradient of the local
fluid pressure $p(\mathbf{x},t)$; the second term, $\mu\nabla^2\mathbf{u}$, represents
the net force associated with viscosity $\mu$ ($> 0$)
(i.e., internal friction); and the third term, $\mathbf{f}(\mathbf{x},t)$,
represents any external force per unit volume acting on the
fluid. The most common force of this kind is that associated
with gravity, $\mathbf{g}$, i.e., $\mathbf{f} = \rho\mathbf{g}$. This is by no means
the only possibility; for example, in an electrically conducting
fluid in which a current density $\mathbf{j}$ flows across a
magnetic field $\mathbf{B}$, the force per unit volume is the Lorentz
force given by the vector product $\mathbf{f} = \mathbf{j} \times \mathbf{B}$.

It might appear from the above that a further equation
is needed to determine the pressure distribution
$p(\mathbf{x},t)$. However, this equation is already implied by
the above; it may be obtained by taking the divergence
of the momentum equation to give a Poisson equation
for $p$. In the simplest case of incompressible flow of
constant density $\rho_0$, this Poisson equation takes the
form

\[ \nabla^2 p = -\rho_0 \nabla \cdot [(\mathbf{u} \cdot \nabla)\mathbf{u}] = -\rho_0 \frac{\partial u_i}{\partial x_j}\frac{\partial u_j}{\partial x_i}, \]

using suffix notation and the summation convention
(and, from incompressibility, $\partial u_i/\partial x_i = 0$).
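
The intermediate steps can be written out as follows. This is a sketch under two stated assumptions that go slightly beyond the text: the viscosity $\mu$ is constant, and the external force is divergence-free ($\nabla \cdot \mathbf{f} = 0$, as for uniform gravity $\mathbf{f} = \rho_0 \mathbf{g}$); otherwise a term $\nabla \cdot \mathbf{f}$ would remain on the right-hand side.

```latex
\begin{align*}
\rho_0 \frac{\partial}{\partial t}(\nabla \cdot \mathbf{u})
  + \rho_0 \nabla \cdot [(\mathbf{u} \cdot \nabla)\mathbf{u}]
  &= -\nabla^2 p + \mu \nabla^2 (\nabla \cdot \mathbf{u}) + \nabla \cdot \mathbf{f}
  && \text{(divergence of the momentum equation)} \\
\nabla^2 p &= -\rho_0 \nabla \cdot [(\mathbf{u} \cdot \nabla)\mathbf{u}]
  && (\nabla \cdot \mathbf{u} = 0,\ \nabla \cdot \mathbf{f} = 0) \\
\nabla \cdot [(\mathbf{u} \cdot \nabla)\mathbf{u}]
  &= \frac{\partial}{\partial x_i}\left( u_j \frac{\partial u_i}{\partial x_j} \right)
   = \frac{\partial u_j}{\partial x_i}\frac{\partial u_i}{\partial x_j}
     + u_j \frac{\partial}{\partial x_j}\left( \frac{\partial u_i}{\partial x_i} \right)
   = \frac{\partial u_i}{\partial x_j}\frac{\partial u_j}{\partial x_i}
\end{align*}
```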

In the inviscid limit ($\mu = 0$), the above equations
were derived in 1758 by Euler, and they are known as
the (incompressible) Euler equations [III.11]. The viscous
equations (with $\mu > 0$) are named after Claude-Louis
Navier, who obtained them in 1822 assuming
a particular “atomic” model to take account of internal
friction, and George Gabriel Stokes, who in 1845
developed the more general continuum treatment that
is still normally used today. A full derivation of these
Navier-Stokes equations may be found in Batchelor’s
An Introduction to Fluid Dynamics.


Further Reading

Batchelor, G. K. 1967. An Introduction to Fluid Dynamics. 
Cambridge: Cambridge University Press. 


III.24 The Painlevé Equations 

Peter A. Clarkson 


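The six nonlinear second-order ordinary differential
equations that follow are called the Painlevé equations:

\begin{align*}
w'' &= 6w^2 + z, \\
w'' &= 2w^3 + zw + \alpha, \\
w'' &= \frac{(w')^2}{w} - \frac{w'}{z} + \frac{\alpha w^2 + \beta}{z} + \gamma w^3 + \frac{\delta}{w}, \\
w'' &= \frac{(w')^2}{2w} + \frac{3}{2}w^3 + 4zw^2 + 2(z^2 - \alpha)w + \frac{\beta}{w}, \\
w'' &= \left(\frac{1}{2w} + \frac{1}{w-1}\right)(w')^2 - \frac{w'}{z} + \frac{(w-1)^2}{z^2}\left(\alpha w + \frac{\beta}{w}\right) + \frac{\gamma w}{z} + \frac{\delta w(w+1)}{w-1}, \\
w'' &= \frac{1}{2}\left(\frac{1}{w} + \frac{1}{w-1} + \frac{1}{w-z}\right)(w')^2 - \left(\frac{1}{z} + \frac{1}{z-1} + \frac{1}{w-z}\right)w' \\
    &\quad + \frac{w(w-1)(w-z)}{z^2(z-1)^2}\left\{\alpha + \frac{\beta z}{w^2} + \frac{\gamma(z-1)}{(w-1)^2} + \frac{\delta z(z-1)}{(w-z)^2}\right\},
\end{align*}

where $w = w(z)$; a prime denotes differentiation with
respect to $z$; and $\alpha$, $\beta$, $\gamma$, and $\delta$ are arbitrary constants.
The equations are commonly referred to as $P_{\rm I}$-$P_{\rm VI}$ in the
literature, a convention that we will adhere to here.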

These equations were discovered about a hundred
years ago by Painlevé, Gambier, and their colleagues
while studying second-order ordinary differential equations
of the form

\[ w'' = F(z; w, w'), \tag{1} \]

where $F$ is rational in $w'$ and $w$ and locally analytic
in $z$. In general, the singularities of the solutions are
movable in the sense that their location depends on
the constants of integration associated with the initial
or boundary conditions. An equation is said to have the
Painlevé property if all its solutions are free from movable
branch points, i.e., the locations of multivalued singularities
of any of its solutions are independent of the
particular solution chosen and so are dependent only
on the equation; the solutions may have movable poles
or movable isolated essential singularities.

Painlevé et al. showed that there are 50 canonical
equations of the form (1) that have this property, up
to a Möbius (bilinear rational) transformation. Of these
50 equations, 44 can be reduced to linear equations,
solved in terms of elliptic functions, or are reducible to
one of six new nonlinear ordinary differential equations
that define new transcendental functions.

Although first discovered as a result of mathematical
study, the Painlevé equations have arisen in a
variety of applications including statistical mechanics
(correlation functions of the XY model and the Ising
model), random-matrix theory, topological field theory,
plasma physics, nonlinear waves (resonant oscillations
in shallow water, convective flows with viscous dissipation,
Görtler vortices in boundary layers, and Hele-Shaw
problems), quantum gravity, quantum field theory, general
relativity, nonlinear and fiber optics, polyelectrolytes,
Bose-Einstein condensation, and stimulated
Raman scattering. The Painlevé equations also arise
as symmetry reductions of the soliton equations, such
as the Korteweg-de Vries and nonlinear Schrödinger
equations, which are solvable by inverse scattering.

The Painleve equations may be thought of as nonlin- 
ear analogs of the classical special functions. They have 
a large number of interesting properties, some of which 
we summarize below. 


(1) For arbitrary values of the parameters $\alpha$, $\beta$, $\gamma$, and
$\delta$, the general solutions of $P_{\rm I}$-$P_{\rm VI}$ are transcendental,
i.e., they cannot be expressed in terms of closed-form
elementary functions.


(2) All Painlevé equations can be expressed as the compatibility
condition of a linear system: the isomonodromy
problem or Lax pair. Suppose that

\[ \frac{\partial \Psi}{\partial \lambda} = A(z; \lambda)\Psi, \qquad \frac{\partial \Psi}{\partial z} = B(z; \lambda)\Psi \]

is a linear system in which $\Psi$ is a vector, $A$ and $B$ are
matrices, and $\lambda$ is independent of $z$. Then the equation

\[ \frac{\partial^2 \Psi}{\partial z\,\partial \lambda} = \frac{\partial^2 \Psi}{\partial \lambda\,\partial z} \]

is satisfied provided that

\[ \frac{\partial A}{\partial z} - \frac{\partial B}{\partial \lambda} + AB - BA = 0, \]

which is the compatibility condition.


(3) Each of the Painlevé equations can be written as a
Hamiltonian system,

\[ \frac{dq}{dz} = \frac{\partial \mathcal{H}_J}{\partial p}, \qquad \frac{dp}{dz} = -\frac{\partial \mathcal{H}_J}{\partial q}, \]

for suitable (nonautonomous) Hamiltonian functions
$\mathcal{H}_J(q,p,z)$, $J = {\rm I}, {\rm II}, \ldots, {\rm VI}$. Furthermore, the function
$\sigma = \mathcal{H}_J(q,p,z)$ satisfies a second-order, second-degree
equation. For example, the Hamiltonian for $P_{\rm I}$ is

\[ \mathcal{H}_{\rm I}(q,p,z) = \tfrac{1}{2}p^2 - 2q^3 - zq, \]

and so

\[ q' = p, \qquad p' = 6q^2 + z, \]

and the function $\sigma = \mathcal{H}_{\rm I}(q,p,z)$ satisfies

\[ (\sigma'')^2 + 4(\sigma')^3 + 2z\sigma' - 2\sigma = 0. \]


(4) Equations $P_{\rm II}$-$P_{\rm VI}$ possess Bäcklund transformations,
which relate one solution to another solution of the
same equation, with different values of the parameters,
or to another equation. For example, if $w = w(z; \alpha)$ is
a solution of $P_{\rm II}$, then so are

\[ w(z; \alpha \pm 1) = -w - \frac{2\alpha \pm 1}{2w^2 \pm 2w' + z}, \]

provided that $\alpha \neq \mp\tfrac{1}{2}$.


(5) For certain values of the parameters, $P_{\rm II}$-$P_{\rm VI}$ possess
rational solutions, algebraic solutions, and solutions
expressible in terms of classical special functions
(Airy functions for $P_{\rm II}$, Bessel functions for $P_{\rm III}$,
parabolic cylinder functions for $P_{\rm IV}$, confluent hypergeometric
functions for $P_{\rm V}$, and hypergeometric functions
for $P_{\rm VI}$). These solutions, which are known as
“classical solutions,” can often be expressed in the form
of determinants. For example, $P_{\rm II}$ has rational solutions
if $\alpha = n \in \mathbb{Z}$ and solutions in terms of Airy functions
if $\alpha = n + \tfrac{1}{2}$, with $n \in \mathbb{Z}$.


(6) The asymptotic behavior of their solutions, together
with the associated connection formulas that
relate the asymptotic behaviors of the solutions as
$|z| \to \infty$ in different regions of the complex plane,
plays an important role in the application of the Painlevé
equations.


(7) The Painlevé equations possess a coalescence cascade,
in that $P_{\rm I}$-$P_{\rm V}$ can be obtained from $P_{\rm VI}$ by the
cascade

\[ \begin{array}{ccccccc} P_{\rm VI} & \to & P_{\rm V} & \to & P_{\rm IV} & & \\ & & \downarrow & & \downarrow & & \\ & & P_{\rm III} & \to & P_{\rm II} & \to & P_{\rm I} \end{array} \]

For example, if we make the transformation

\[ w(z; \alpha) = \epsilon u(\zeta) + \epsilon^{-5}, \qquad z = \epsilon^2\zeta - 6\epsilon^{-10}, \qquad \alpha = 4\epsilon^{-15} \]

in $P_{\rm II}$, then

\[ \frac{d^2 u}{d\zeta^2} = 6u^2 + \zeta + \epsilon^6(2u^3 + \zeta u). \]

So in the limit as $\epsilon \to 0$, $u(\zeta)$ satisfies $P_{\rm I}$.
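
The Bäcklund transformation of property (4) can be tried out numerically. The sketch below starts from the trivial solution w = 0 of the second Painlevé equation with alpha = 0, applies the upward transformation twice, and checks by finite differences that the results satisfy w'' = 2w^3 + zw + alpha with alpha = 1 and alpha = 2. The function names, the finite-difference step sizes, and the test point z = 1.5 are assumptions made for illustration; derivatives are approximated numerically rather than symbolically.

```python
def derivative(f, z, h=1e-4):
    """Central-difference approximation to f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


def second_derivative(f, z, h=1e-3):
    """Central-difference approximation to f''(z)."""
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / (h * h)


def backlund_up(w, alpha):
    """Map a solution w of P_II with parameter alpha to a solution with
    parameter alpha + 1 (the upper signs in the transformation)."""

    def w_next(z):
        denominator = 2.0 * w(z) ** 2 + 2.0 * derivative(w, z) + z
        return -w(z) - (2.0 * alpha + 1.0) / denominator

    return w_next, alpha + 1.0


def p2_residual(w, alpha, z):
    """Value of w'' - (2 w^3 + z w + alpha); near zero for a solution."""
    return second_derivative(w, z) - (2.0 * w(z) ** 3 + z * w(z) + alpha)


def w0(z):
    """Trivial solution of P_II for alpha = 0."""
    return 0.0


w1, alpha1 = backlund_up(w0, 0.0)     # reproduces -1/z
w2, alpha2 = backlund_up(w1, alpha1)  # reproduces 1/z - 3 z^2 / (z^3 + 4)

z = 1.5
print(w1(z), p2_residual(w1, alpha1, z))
print(w2(z), p2_residual(w2, alpha2, z))
```

The two functions produced are the rational solutions for alpha = 1 and alpha = 2, in line with property (5).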


Further Reading 


III.25 The Riccati Equation 

Alan J. Laub 


1 History 

The name Riccati equation is given to a wide vari- 
ety of algebraic and differential (or difference) equa- 
tions characterized by the quadratic appearance of the 
unknown X. Such equations are named after the Italian 
mathematician Count Jacopo Francesco Riccati (1676- 
1754). The variable X was a scalar in the original writ- 
ings of Riccati, but in modern applications X is often a 
matrix, and that is the focus here. 

2 The Algebraic Riccati Equation 

The simplest form of the algebraic Riccati equation
(ARE) arises in the so-called continuous-time linear-quadratic
theory of control. Suppose we have a linear
differential equation (with initial conditions)

\[ \frac{dx}{dt} = Ax(t) + Bu(t), \]

where $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$ ($n \ge m$) and the pair of
matrices $(A, B)$ is controllable, that is, $\operatorname{rank}(A - \lambda I, B) = n$
for all $\lambda \in \Lambda(A)$, where $\Lambda(A)$ denotes the spectrum of
$A$. Suppose, further, that only a subset of the state variables
$x$ can be measured, namely the outputs $y = Cx$,
where $C \in \mathbb{R}^{q \times n}$ ($q \le n$), and that the pair of matrices
$(C, A)$ is observable, that is,
$\operatorname{rank}\begin{bmatrix} A - \lambda I \\ C \end{bmatrix} = n$ for all
$\lambda \in \Lambda(A)$. The control $u(t)$ is to be chosen to minimize
the quadratic functional

\[ \int_0^{\infty} (y^TQy + u^TRu)\,dt, \]

where $Q$ and $R$ are given weighting matrices that
are symmetric positive-semidefinite and symmetric
positive-definite, respectively. It turns out that the optimal
$u(t)$ is given by $u(t) = -R^{-1}B^TXx(t)$, where the
symmetric positive-definite matrix $X$ solves the ARE

\[ A^TX + XA - XBR^{-1}B^TX + C^TQC = 0. \tag{1} \]

The closed-loop matrix $A - BR^{-1}B^TX$ (formed by substituting
the feedback control $u(t)$ above into the differential
equation) is asymptotically stable; that is, its
eigenvalues lie in the open left half-plane.

The assumptions made in the linear-quadratic problem
are sufficient to guarantee that the $2n \times 2n$ system
matrix associated with the problem, namely

\[ H = \begin{bmatrix} A & -BR^{-1}B^T \\ -C^TQC & -A^T \end{bmatrix}, \]

has no pure imaginary eigenvalues. There is a long history
of the association between the system matrix $H$,
its invariant subspaces, and solutions $X$ of the ARE.

It is easily shown that the matrix $H$ is Hamiltonian,
that is, $JH$ is symmetric where
$J = \begin{bmatrix} 0 & I_n \\ -I_n & 0 \end{bmatrix}$. The Hamiltonian
structure has the consequence that if $\lambda$ is an
eigenvalue of $H$ then so is $-\lambda$ with the same multiplicity.
Thus, by our linear-quadratic assumptions, $H$ has
precisely $n$ eigenvalues in the open left half-plane and
$n$ eigenvalues in the open right half-plane. Hence we
can find an orthogonal matrix $U \in \mathbb{R}^{2n \times 2n}$ with $n \times n$
blocks $U_{ij}$ (with $U_{11}$ nonsingular) that transforms $H$ to
upper quasi-triangular real Schur form [IV.10 §5.5]

\[ U^THU = \begin{bmatrix} S_{11} & S_{12} \\ 0 & S_{22} \end{bmatrix}, \]

where $\Lambda(S_{11})$ is contained in the open left half-plane.
Setting $X = U_{21}U_{11}^{-1}$, it can then be verified (after a
modest amount of matrix algebra) that $X$ solves (1)
and is symmetric positive-definite and that the closed-loop
eigenvalues are asymptotically stable, since
$\Lambda(A - BR^{-1}B^TX) = \Lambda(S_{11})$. Note that the $n$ columns of
$\begin{bmatrix} U_{11} \\ U_{21} \end{bmatrix}$
are the Schur vectors of $H$ that form a basis for the
stable invariant subspace (corresponding to the stable
left half-plane eigenvalues of $S_{11}$). Other orderings of
the eigenvalues give rise to other solutions (even nonsymmetric
ones) of the ARE. The ARE can have infinitely
many solutions (think of the Riccati equation $X^2 = I$).
There are also results in terms of the singular value
decomposition [II.32] of $U_{11}$ that explicitly exhibit the
symmetry of the solution $X$.
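
The invariant-subspace construction can be followed by hand in the scalar case $n = m = 1$ with $C = 1$. The sketch below is a simplified illustration, not the Schur method itself: for scalars the stable invariant subspace of the 2 x 2 Hamiltonian matrix is spanned by a single eigenvector (u1, u2), and the solution is the ratio u2/u1. The function name and the example coefficients are assumptions.

```python
import math


def scalar_are(a, b, q, r):
    """Stabilizing solution of the scalar ARE 2 a x - (b^2 / r) x^2 + q = 0.

    The Hamiltonian matrix is H = [[a, -b^2/r], [-q, -a]], with
    eigenvalues +s and -s, where s = sqrt(a^2 + b^2 q / r).
    Returns (x, closed_loop), where closed_loop = a - b^2 x / r.
    """
    s = math.sqrt(a * a + b * b * q / r)
    # Eigenvector of H for the stable eigenvalue -s:
    # the first row gives (a + s) u1 - (b^2 / r) u2 = 0.
    u1 = b * b / r
    u2 = a + s
    x = u2 / u1
    closed_loop = a - (b * b / r) * x
    return x, closed_loop


x, closed_loop = scalar_are(a=1.0, b=1.0, q=3.0, r=1.0)
print(x, closed_loop)
```

For these coefficients the open-loop system is unstable ($a = 1$), while the closed-loop value equals the stable eigenvalue $-s$ of the Hamiltonian matrix, mirroring the matrix statement $\Lambda(A - BR^{-1}B^TX) = \Lambda(S_{11})$.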

The foregoing describes the case for the simplest 
linear-quadratic problem. Many of the assumptions can 
be weakened considerably. For example, controllabil- 
ity can be weakened to stabilizability, while observ- 
ability can be weakened to detectability. In this case, 
X is still symmetric and stabilizing but may only be 
positive-semidefinite. 

The connection between the $2n \times 2n$ Hamiltonian 
matrix H and the ARE has been known for well over a 
hundred years. The use of Schur vectors, as opposed 
to eigenvectors, as a basis for the stable invariant 
subspace is crucial for reliable numerical solution. 
This (and related topics) has been a fertile subject of 
research. Perhaps surprisingly, the Schur method was 
not published until the late 1970s, but since then meth- 
ods based on Schur vectors have found numerous other 
applications in systems theory, control, and beyond. 

In the discrete-time problem, the differential equation
is replaced by the difference equation (with initial
conditions)

\[ x_{k+1} = Ax_k + Bu_k, \]

where $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times m}$. As before, $y_k = Cx_k$,
but we will work with $C = I$ for convenience.
The integral performance constraint is replaced by an
appropriate summation, and (one form of) the resulting
discrete-time ARE is given by

\[ A^TXA - X - A^TXB(R + B^TXB)^{-1}B^TXA + Q = 0, \]

where the closed-loop matrix $A - B(R + B^TXB)^{-1}B^TXA$
is asymptotically stable (which now means that its
eigenvalues lie inside the unit circle in the complex
plane). Provided $A$ is nonsingular, the role of the
$2n \times 2n$ Hamiltonian matrix $H$ is taken here by the
$2n \times 2n$ matrix

\[ S = \begin{bmatrix} A + BR^{-1}B^TA^{-T}Q & -BR^{-1}B^TA^{-T} \\ -A^{-T}Q & A^{-T} \end{bmatrix}, \]

which is symplectic; that is, $S^TJS = J$. The symplectic
matrix $S$ has a $\lambda \leftrightarrow 1/\lambda$ symmetry to its eigenvalues,
and appropriate assumptions guarantee that there are
no eigenvalues on the unit circle. If $A$ is singular, it
turns out to be much better to work with the $2n \times 2n$
symplectic pencil

\[ \begin{bmatrix} A & 0 \\ -Q & I \end{bmatrix} - \lambda \begin{bmatrix} I & BR^{-1}B^T \\ 0 & A^T \end{bmatrix}, \]

and this gives rise to a large number of new methods
based on the Schur vectors of the corresponding generalized
eigenproblem. This time the role of the invariant
subspace is taken by a generalization for matrix pencils
called the deflating subspace, but the essence of
the method remains the same.

3 Extensions of the ARE 

There are many generalizations of the linear-quadratic 
problem and the Riccati equation, such as 

• extended (i.e., (2 n + m) x (2 n + m)) pencils, in 
which even the R matrix may be singular; 

• versions with a cross-performance term in the 
integral (or a sum in the discrete-time case); 

• versions for which the system constraint (the dif- 
ferential equation or the difference equation) is 
given in so-called descriptor form with a matrix 
E (which may or may not be singular) multiplying 
the left-hand side; 

• versions associated with the so-called $H_\infty$ problem 
that are Riccati equations but with different sets of 
assumptions on the coefficient matrices; and 

• AREs with specific structure on the matrices that 
then gives rise to other features of the solution 
(such as componentwise nonnegativity). 

Many other solution techniques have been developed, 
including 

• structured methods that try to preserve the under- 
lying Hamiltonian or symplectic nature of the 
matrices; 

• methods based on the matrix sign function 
[II.14] that exploit the connection with an appropriate 
invariant subspace; 

• methods that are iterative in nature, such as 
Newton’s method [II.28], which involves solving 
a sequence of much-easier-to-solve Lyapunov 
[III.28] (or Sylvester) equations at each step; and 

• doubling methods, which constitute another large 
class of iterative methods. 

4 The Riccati Differential Equation 

The ARE can be thought of as a stationary point of the
associated Riccati differential equation. For the algebraic
equation above, its differential equation takes the
form (with initial conditions)

\[ \frac{d}{dt}X(t) = F(X(t)), \]

where $F(X(t))$ has the form of the left-hand side of
the ARE (1) and where each of the coefficient matrices
may also vary with time. Symmetry is not necessary in
the sense that so-called nonsymmetric equations of the
form

\[ \frac{d}{dt}X(t) = A_1 + A_2X(t) + X(t)A_3 - X(t)A_4X(t) \]

can be studied, where $X(t) \in \mathbb{R}^{m \times n}$ and the coefficient
matrices $A_i$ are general. In all cases, the defining property
of equations of Riccati type is the characteristic
appearance of the unknown matrix in quadratic form.
Of course, there are also algebraic versions of these
nonsymmetric equations.
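
The relationship between the differential and algebraic equations can be seen in a scalar sketch. It assumes constant scalar coefficients $a$, $b$, $q$, $r$ with $C = 1$, so that $F(x) = 2ax - (b^2/r)x^2 + q$, and integrates forward in time from $x(0) = 0$ with a Runge-Kutta step. The function names, step size, and coefficients are illustrative assumptions.

```python
def riccati_rhs(x, a, b, q, r):
    """Scalar form of F(X): the left-hand side of the ARE with C = 1."""
    return 2.0 * a * x - (b * b / r) * x * x + q


def integrate_rde(x0, a, b, q, r, dt=1e-3, steps=20000):
    """Integrate dx/dt = F(x) with classical RK4 and return x at the final time."""
    x = x0
    for _ in range(steps):
        k1 = riccati_rhs(x, a, b, q, r)
        k2 = riccati_rhs(x + 0.5 * dt * k1, a, b, q, r)
        k3 = riccati_rhs(x + 0.5 * dt * k2, a, b, q, r)
        k4 = riccati_rhs(x + dt * k3, a, b, q, r)
        x += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x


# With a = b = r = 1 and q = 3, F(x) = -(x - 3)(x + 1), so the stationary
# points are x = 3 and x = -1; the trajectory from x(0) = 0 approaches x = 3.
x_final = integrate_rde(0.0, a=1.0, b=1.0, q=3.0, r=1.0)
print(x_final)
```

In this example the limit is the positive root of the scalar ARE, which is also the stabilizing one.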


Further Reading


III.26 Schrödinger’s Equation

In its full generality, Schrödinger’s equation describes
the quantum mechanical evolution under time $t$ of
a vector (often called a ket) $|\Psi\rangle$ in a Hilbert space
[I.2 §19.4], according to a self-adjoint operator $H$ called
the Hamiltonian operator,

\[ i\hbar\,\partial_t|\Psi\rangle = H|\Psi\rangle, \tag{1} \]

where $i$ is the imaginary unit and $\hbar$ is a constant, known
as Planck’s constant or the quantum of action. In SI
units, $\hbar \approx 1.05 \times 10^{-34}\ \mathrm{J\,s}$.

$|\Psi\rangle$ represents the time-dependent total state of the
quantum system, and $H$ determines the energy of the
system. If $H$ is independent of $t$ and $|\Psi\rangle = e^{-iEt/\hbar}|E\rangle$
for $|E\rangle$ an eigenvector of $H$ with energy eigenvalue
$E$, then (1) can be written in time-independent form,
$H|E\rangle = E|E\rangle$.

When the quantum system consists of a single particle
of mass $m$ (disregarding effects of quantum spin
and electromagnetic interactions), the Hilbert space
may be chosen to be the space of square-integrable
complex-valued functions of $t$ and position $\mathbf{r}$ in Euclidean
configuration space, and the quantum state is
described by a wave function $\Psi(\mathbf{r}, t)$. If the particle is
subject to a potential $V(\mathbf{r}, t)$, then Schrödinger’s equation
takes the form of the partial differential equation

\[ i\hbar\frac{\partial}{\partial t}\Psi(\mathbf{r},t) = -\frac{\hbar^2}{2m}\nabla^2\Psi(\mathbf{r},t) + V(\mathbf{r},t)\Psi(\mathbf{r},t), \]

where $\nabla^2$ denotes the Laplace operator [III.18]. In the
nonlinear Schrödinger equation, the potential is proportional
to the modulus squared of the wave function:
$V = k|\Psi|^2$ for some positive or negative constant $k$.

Study of the evolution of Schrödinger’s equation is
the subject of quantum mechanics [IV.23], and it
emerges naturally from Hamilton’s approach to classical
mechanics [IV.19].


III.27 The Shallow-Water Equations 

P. A. Martin 


As one might expect, the shallow-water equations are 
appropriate when considering the motion of a liquid 
occupying a layer. Major applications concern ocean 
waves, so we use appropriate terminology. We there- 
fore identify the bottom of the layer with the sea floor 
at z = -b(x,y), where b(x,y ) is given and positive, 
and x, y, z are Cartesian coordinates with z pointing 
upward. The top of the layer is the moving free surface, 
z = q(x,y, t), where t is time. The total depth of the 
water is h(x,y, t ) = b(x,y) + q(x,y, t). The water is 
assumed to be incompressible (with constant density
$\rho$) and inviscid (viscous effects are ignored). Thus, the
governing partial differential equations (PDEs) in the
water are the euler equations [III.11],

$$ \frac{\partial \boldsymbol{u}}{\partial t} + (\boldsymbol{u}\cdot\nabla)\boldsymbol{u} + \frac{1}{\rho}\nabla p = -g\hat{\boldsymbol{z}}, \tag{1} $$

and the continuity equation, $\nabla\cdot\boldsymbol{u} = 0$, where
$\boldsymbol{u} = (u, v, w)$ is the fluid velocity, $p$ is the pressure, $g$ is the
acceleration due to gravity, and $\hat{\boldsymbol{z}}$ is a unit vector in the
$z$-direction. In addition, there are boundary conditions
at the free surface and at the bottom. They are

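$$ p = 0 \quad\text{and}\quad \frac{\partial q}{\partial t} + u\frac{\partial q}{\partial x} + v\frac{\partial q}{\partial y} = w \quad\text{at } z = q $$

and

$$ u\frac{\partial b}{\partial x} + v\frac{\partial b}{\partial y} + w = 0 \quad\text{at } z = -b. $$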


Integrating $\nabla\cdot\boldsymbol{u} = 0$ with respect to $z$ gives

$$ \frac{\partial q}{\partial t} + \frac{\partial}{\partial x}\int_{-b}^{q} u\,\mathrm{d}z + \frac{\partial}{\partial y}\int_{-b}^{q} v\,\mathrm{d}z = 0, \tag{2} $$

where the boundary conditions at $z = q$ and $z = -b$
have been used.

So far, we have not made any approximations. Next,
we assume that the waves generated are much longer
than the water depth: this is what is implied by the
“shallow-water” terminology. One consequence is that
we can neglect the vertical acceleration terms (involving
$w$) in the $z$-component of (1), giving
$p(x,y,z,t) = \rho g\{q(x,y,t) - z\}$ after integration with respect to $z$.

Finally, we assume further that the fluid velocity
is horizontal and that it does not vary with depth:
$\boldsymbol{u} = (u(x,y,t), v(x,y,t), 0)$. Then, using the horizontal
components of (1) together with (2), we obtain the
shallow-water equations:

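$$ \frac{\partial u}{\partial t} + u\frac{\partial u}{\partial x} + v\frac{\partial u}{\partial y} + g\frac{\partial q}{\partial x} = 0, $$

$$ \frac{\partial v}{\partial t} + u\frac{\partial v}{\partial x} + v\frac{\partial v}{\partial y} + g\frac{\partial q}{\partial y} = 0, $$

$$ \frac{\partial h}{\partial t} + \frac{\partial}{\partial x}(hu) + \frac{\partial}{\partial y}(hv) = 0. $$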


This is a nonlinear hyperbolic system of PDEs. It can 
be rewritten in other ways; see, for example, how 
it is presented in the article on tsunami modeling 
[V.19 §2]. 

If the motions are small, a linearized version of the
shallow-water equations can be derived. The result is a
two-dimensional wave equation [III.31] for $q(x,y,t)$:

$$ \frac{\partial^2 q}{\partial t^2} = g\,\nabla\cdot(b\,\nabla q). $$



The basic nonlinear shallow-water equations can be 
augmented to include other effects. For example, if 
we want to model global phenomena, we should take 
account of the rotation of the Earth. 
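
For constant depth b, linearizing the shallow-water equations leads to the classical wave equation with speed c = √(gb), which gives a quick estimate of how fast long waves cross an ocean. The following Python sketch evaluates that speed and a corresponding travel time. The function names, the default g = 9.81 m/s², and the sample depth and distance are illustrative assumptions, not values from the article.

```python
import math


def long_wave_speed(depth_m, g=9.81):
    """Speed c = sqrt(g*b) of small-amplitude long waves over constant depth b (m/s)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return math.sqrt(g * depth_m)


def travel_time_hours(distance_km, depth_m, g=9.81):
    """Time for a long wave to cover distance_km over water of constant depth."""
    return distance_km * 1000.0 / long_wave_speed(depth_m, g) / 3600.0


speed_deep_ocean = long_wave_speed(4000.0)         # roughly 198 m/s
hours_per_1000_km = travel_time_hours(1000.0, 4000.0)
```

The estimate ignores variable depth, nonlinearity, and the Earth's rotation, all of which the full equations or their extensions can account for.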


Further Reading 

Vallis, G. K. 2006. Atmospheric and Oceanic Fluid Dynamics. 
Cambridge: Cambridge University Press. 


III.28 The Sylvester and Lyapunov 
Equations 

Nicholas J. Higham 


The Sylvester equation (named after James Joseph
Sylvester, who introduced it in 1884) is the linear matrix
equation

$$ AX + XB = C, $$

where $A \in \mathbb{C}^{m\times m}$, $B \in \mathbb{C}^{n\times n}$, and $C \in \mathbb{C}^{m\times n}$ are given
and $X \in \mathbb{C}^{m\times n}$ is to be determined. It has a unique
solution provided that $A$ and $-B$ have no eigenvalues in
common. One interesting property is that if the integral
$\int_0^\infty e^{At} C e^{Bt}\,\mathrm{d}t$ exists, then minus that integral is a solution
of the Sylvester equation. Another is that the block
upper triangular matrix $\begin{bmatrix} A & C \\ 0 & -B \end{bmatrix}$ can be reduced by a
similarity transformation to block-diagonal form $\begin{bmatrix} A & 0 \\ 0 & -B \end{bmatrix}$ if
and only if the Sylvester equation has a solution.

The Sylvester equation can be generalized by including
coefficient matrices on both sides of $X$ and increasing
the number of terms:

$$ \sum_{i=1}^{k} A_i X B_i = C, \quad A_i \in \mathbb{C}^{m\times m}, \quad B_i \in \mathbb{C}^{n\times n}. $$

When $m = n$, numerical methods are available that
solve the system with $k = 2$ terms in $O(n^3)$ operations,
but for $k > 2$ the best available methods require $O(n^6)$
operations. In applications in which the coefficient
matrices are large, sparse, and possibly highly structured,
much recent research has focused on computing
inexpensive approximations to the solution using
iterative methods, with good low-rank approximations
being possible in some cases.

The Lyapunov equation is the special case of the
Sylvester equation with $B = A^* \in \mathbb{C}^{n\times n}$:

$$ AX + XA^* = C. $$

It is common in control and systems theory. Usually, $C$
is Hermitian, in which case the solution $X$ is Hermitian
when the equation has a unique solution, which is the
case when $\lambda_i + \bar{\lambda}_j \neq 0$ for all eigenvalues $\lambda_i$ and $\lambda_j$ of $A$.

A classic theorem says that for any given Hermitian
positive-definite $C$ the Lyapunov equation has a unique
Hermitian negative-definite solution if and only if all
the eigenvalues of $A$ lie in the open left half-plane. The
latter condition is equivalent to the asymptotic stability
of the linear system of differential equations $\dot{x}(t) = Ax(t)$,
and the Lyapunov equation is correspondingly
also known as the continuous-time Lyapunov equation.

The discrete-time Lyapunov equation (also known as
the Stein equation) is

$$ X - A^* X A = C. $$

Given any Hermitian positive-definite matrix $C$, there is
a unique Hermitian positive-definite solution $X$ if and
only if all the eigenvalues of $A$ lie within the unit circle,
and the latter condition is equivalent to the asymptotic
stability of the discrete system $x_{k+1} = Ax_k$.
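
For very small matrices, the Sylvester equation AX + XB = C can be solved by rewriting it as an ordinary linear system of mn equations for the entries of X. The Python sketch below does this with plain Gaussian elimination. It is an illustration under the assumption of tiny dense matrices, not one of the efficient methods mentioned in this article, and the function names are hypothetical.

```python
def solve_linear(M, rhs):
    """Solve M y = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(complex, row)) + [complex(b)] for row, b in zip(M, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-14:
            raise ValueError("singular (or nearly singular) linear system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0j] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - s) / aug[r][r]
    return y


def solve_sylvester(A, B, C):
    """Solve AX + XB = C for the m-by-n matrix X via an mn-by-mn linear system."""
    m, n = len(A), len(B)
    K = [[0j] * (m * n) for _ in range(m * n)]
    for i in range(m):
        for j in range(n):
            row = i + j * m                     # equation for entry (i, j)
            for p in range(m):
                K[row][p + j * m] += A[i][p]    # contribution of (AX)_ij
            for q in range(n):
                K[row][i + q * m] += B[q][j]    # contribution of (XB)_ij
    rhs = [C[i][j] for j in range(n) for i in range(m)]
    x = solve_linear(K, rhs)
    return [[x[i + j * m] for j in range(n)] for i in range(m)]
```

This dense approach costs on the order of (mn)^3 operations, so it is only sensible for tiny problems; the elimination fails exactly when A and -B share an eigenvalue, in which case no unique solution exists.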

The Sylvester and Lyapunov equations both arise
when Newton’s method is used to solve algebraic
riccati equations [III.25].

III.29 The Thin-Film Equation

Andrew J. Bernoff 


The thin-film equation (TFE),

$$ h_t = -(Q(h)\,h_{xxx})_x, $$

is a nonlinear fourth-order partial differential equation 
describing the flow of thin viscous films with strong 
surface tension. It has received a great deal of atten- 
tion over the last two decades, both as a tractable model 
of thin fluid films and as a paradigm of challenges in 
the study of partial differential equations. Here, h(x, t) 
represents the nonnegative height of the film’s free surface
(as a function of position $x$ and time $t$ (see figure 1)),
and $Q(h)$ is a mobility function that is degenerate
($Q(0) = 0$) and increasing ($Q'(h) > 0$ for $h > 0$).
The degeneracy allows compactly supported solutions
(with $h = 0$ except on some compact subset of the real
line), and the increasing mobility, which is necessary for
the problem to be well-posed, leads to solutions that
are infinitely smooth within their support. The most
common choice for the mobility is $Q(h) = h^n$ for some
$n > 0$.

1 Physical Origins 

The TFE is an example of a lubrication theory, whereby 
a problem is considered in the asymptotic limit where 
horizontal variation occurs on a scale much longer than 
the film thickness. Lubrication theory applied to a thin 
viscous film on a no-slip substrate yields the TFE with
$Q(h) = h^3$; the introduction of slip on the boundary
yields $Q(h) = \beta h^2 + h^3$ for some $\beta > 0$.

2 Mathematical Structure 

A mathematical theory for the existence of compactly
supported weak solutions emerged in the 1990s. The
TFE conserves mass,

$$ M = \int h\,\mathrm{d}x \quad\Rightarrow\quad \frac{\mathrm{d}M}{\mathrm{d}t} = 0, $$

and dissipates surface energy (a lubrication approximation
of the free surface’s arc length),

$$ E = \frac{1}{2}\int (h_x)^2\,\mathrm{d}x \quad\Rightarrow\quad \frac{\mathrm{d}E}{\mathrm{d}t} = -\int Q(h)(h_{xxx})^2\,\mathrm{d}x \leq 0, $$

where integrals are over the support of $h(x, t)$.

Figure 1 A typical compactly supported configuration for
a thin film of height $h(x, t)$. Variations in surface tension
forces (proportional to $h_{xx}$ and indicated by the vertical
gray arrow) create a pressure gradient that drives fluid
motion (horizontal black arrows).

It is believed that for $n \geq 3$ the support of the solution
cannot increase; this means that the contact line
(where the solution vanishes) cannot move for a no-slip 
boundary, which reflects the well-known contact line 
paradox in fluid mechanics (where we observe contact 
lines physically moving whenever a fluid wets a solid 
surface, but mathematically it is known that a contact 
line cannot move for the Navier-Stokes equation with 
a no-slip boundary condition). However, for 0 < n < 3 
one can find moving contact line solutions, which allow 
one to model spreading drops. Physically, this reflects 
the fact that the addition of slip to the boundary con- 
dition allows contact line motion, a modeling strategy 
that also resolves the contact line paradox for the full 
fluid equations. 

The question of whether a film can rupture, leaving 
a dry spot (where h = 0), is only partially resolved; 
numerical results suggest that films may rupture for
$n < \tfrac{3}{2}$ and will not rupture for greater $n$, but rigorous
analytical results for this problem are lacking. 

For 0 < n < 3 one can find self-similar spreading 
drop solutions that are analogous to the Barenblatt 
solution of the porous media equation. 
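
The conservation of mass and the dissipation of energy can be observed in a simple numerical experiment. The Python sketch below advances the one-dimensional TFE with mobility Q(h) = h^n by an explicit, conservative finite-difference step. It assumes a strictly positive film on a periodic domain (so there is no contact line), and the grid size, time step, and initial profile are illustrative choices made for stability, not values from the article.

```python
import math


def tfe_step(h, dx, dt, n=3.0):
    """One explicit conservative step of h_t = -(h^n h_xxx)_x, periodic in x."""
    N = len(h)
    flux = []  # flux[i] approximates Q(h) h_xxx at the midpoint between i and i+1
    for i in range(N):
        hm, h0, h1, h2 = h[i - 1], h[i], h[(i + 1) % N], h[(i + 2) % N]
        hxxx = (h2 - 3.0 * h1 + 3.0 * h0 - hm) / dx**3
        mobility = (0.5 * (h0 + h1)) ** n
        flux.append(mobility * hxxx)
    return [h[i] - dt * (flux[i] - flux[i - 1]) / dx for i in range(N)]


def mass(h, dx):
    """Discrete version of M, the integral of h."""
    return sum(h) * dx


def energy(h, dx):
    """Discrete version of E, half the integral of (h_x)^2."""
    N = len(h)
    return 0.5 * sum(((h[(i + 1) % N] - h[i]) / dx) ** 2 for i in range(N)) * dx


N = 32
dx = 2 * math.pi / N
h_init = [1.0 + 0.1 * math.cos(i * dx) for i in range(N)]
h = list(h_init)
for _ in range(200):
    h = tfe_step(h, dx, dt=1e-5)
```

Because the update is written as a difference of fluxes, the discrete mass is conserved up to rounding, while the discrete energy decreases as the perturbation is smoothed. The explicit step is only stable for time steps of order dx^4, which is why practical computations usually use implicit schemes.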

3 Higher Dimensions and Generalizations 

The TFE generalizes naturally to $N$ dimensions; the film
height $h(x, t)$, $x \in \mathbb{R}^N$, satisfies

$$ \frac{\partial h}{\partial t} = -\nabla\cdot[Q(h)\nabla(\Delta h)], $$

and the existence theory for weak solutions, energy dissipation,
contact line motion, axisymmetric spreading
droplets, and film rupture remains largely unchanged
from the one-dimensional TFE.

Variants of the TFE incorporating gravity, inertia,
substrate topography, variable surface tension due to 
heat or surfactants (known as Marangoni effects), van 
der Waals forces, evaporation, and many other physi- 
cal effects have been highly successful in modeling thin 
films; the common theme in these models is that the 
additional terms are lower order, and the fourth-order 
thin-film term ensures well-posedness. 

Further Reading 

Bertozzi, A. L. 1998. The mathematics of moving con- 
tact lines in thin liquid films. Notices of the American 
Mathematical Society 45(6):689-97. 

Oron, A., S. H. Davis, and S. G. Bankoff. 1997. Long-scale 
evolution of thin liquid films. Reviews of Modern Physics 
69:931-80. 


III.30 The Tricomi Equation

Gui-Qiang G. Chen 


The Tricomi equation is a second-order partial differential
equation of mixed elliptic-hyperbolic type. It takes
the following form for an unknown function $u(x,y)$:

$$ u_{xx} + x u_{yy} = 0. $$

The Tricomi equation was first analyzed in 1923 by
Francesco Giacomo Tricomi when he was studying the
well-posedness of a boundary-value problem. The equation
is hyperbolic in the half-plane $x < 0$, elliptic in the
half-plane $x > 0$, and degenerates on the line $x = 0$. Its
characteristic equation is

$$ \mathrm{d}y^2 + x\,\mathrm{d}x^2 = 0, $$

whose solutions are

$$ y \pm \tfrac{2}{3}(-x)^{3/2} = C $$

for any constant $C$; the solutions are real for $x < 0$.
The characteristics constitute two families of semicubical
parabolas lying in the half-plane $x < 0$, with cusps
on the line $x = 0$. This shows that the Tricomi equation
is of hyperbolic degeneracy in the domain $x \leq 0$,
for which the two characteristic families coincide,
perpendicularly to the line $x = 0$.
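
The change of type and the characteristic curves can be checked numerically. The Python sketch below classifies the Tricomi equation by the sign of x and verifies, by a central difference, that the curves y ± (2/3)(-x)^{3/2} = C satisfy (dy/dx)^2 + x = 0, which is the characteristic equation divided by dx^2. The function names and sample points are hypothetical, chosen only for illustration.

```python
def tricomi_type(x):
    """Type of u_xx + x*u_yy = 0 at abscissa x, from the sign of the coefficient x."""
    if x > 0:
        return "elliptic"
    if x < 0:
        return "hyperbolic"
    return "degenerate"


def characteristic_y(x, C, sign=+1):
    """y on the characteristic y + sign*(2/3)*(-x)**1.5 = C, defined for x <= 0."""
    if x > 0:
        raise ValueError("real characteristics exist only for x <= 0")
    return C - sign * (2.0 / 3.0) * (-x) ** 1.5


def slope_residual(x, C=0.0, sign=+1, dx=1e-6):
    """(dy/dx)^2 + x along a characteristic; it should vanish for x < 0."""
    upper = characteristic_y(x + dx, C, sign)
    lower = characteristic_y(x - dx, C, sign)
    dydx = (upper - lower) / (2 * dx)
    return dydx**2 + x
```

The slope of each characteristic tends to zero as x approaches 0 from the left, which is the perpendicular approach to the line x = 0 described above.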

For $\pm x > 0$, set $\tau = \tfrac{2}{3}(\pm x)^{3/2}$. The Tricomi equation
then becomes the classical elliptic or hyperbolic
Euler-Poisson-Darboux equation:

$$ u_{\tau\tau} \pm u_{yy} + \frac{\beta}{\tau}u_{\tau} = 0. $$

The index $\beta = \tfrac{1}{3}$ determines the singularity of solutions
near $\tau = 0$ (or, equivalently, near $x = 0$).


Many important problems in fluid mechanics and dif- 
ferential geometry can be reduced to corresponding 
problems for the Tricomi equation, particularly tran- 
sonic flow problems and isometric embedding problems. 
The Tricomi equation is a prototype of the generalized 
Tricomi equation: 

$$ u_{xx} + K(x)u_{yy} = 0, $$

where $K(x)$ is a given function with $xK(x) > 0$ for
$x \neq 0$. For a steady-state transonic flow in $\mathbb{R}^2$, $u(x,y)$
is the stream function of the flow, $x$ is a function of the
velocity (which is positive at subsonic speeds and negative
at supersonic speeds), and $y$ is the angle of inclination
of the velocity. The solutions $u(x,y)$ also serve
as entropy generators for entropy pairs of the potential
flow system for the velocity. For the isometric embedding
problem of two-dimensional Riemannian manifolds
into $\mathbb{R}^3$, the function $K(x)$ has the same sign as
the Gaussian curvature.

A closely related partial differential equation is the 
Keldysh equation: 

$$ x u_{xx} + u_{yy} = 0. $$

It is hyperbolic when $x < 0$, elliptic when $x > 0$, and
degenerates on the line $x = 0$. Its characteristics are
given by

$$ y \pm 2(-x)^{1/2} = C $$

for any constant C; the characteristics are real for 
x < 0. The two characteristic families are (quadratic) 
parabolas lying in the half-plane x < 0 ; they coin- 
cide tangentially to the degenerate line x = 0. This 
shows that the Keldysh equation is of parabolic degeneracy.
For $\pm x > 0$, the Keldysh equation becomes
the elliptic or hyperbolic Euler-Poisson-Darboux equation
with index $\beta = -1$ by setting $\tau = 2(\pm x)^{1/2}$.
Many important problems in continuum mechanics 
can also be reduced to corresponding problems for 
the Keldysh equation, particularly shock reflection- 
diffraction problems in gas dynamics. 

Further Reading 

Bitsadze, A. V. 1964. Equations of the Mixed Type. New York: 
Pergamon. 

Chen, G.-Q., and M. Feldman. 2015. Shock Reflection-


III.31 The Wave Equation


The one-dimensional wave equation is a partial differential
equation [IV.3] for $u(x, t)$,

$$ \frac{\partial^2 u}{\partial x^2} = \frac{1}{c^2}\frac{\partial^2 u}{\partial t^2}, \tag{1} $$

where $c$ is a constant. It is classified as a linear second-order
homogeneous hyperbolic equation. Unusually, its
general solution (d’Alembert’s solution) is known:

$$ u(x,t) = f(x - ct) + g(x + ct). \tag{2} $$

Here, f and g are arbitrary twice-differentiable func- 
tions of one variable. 

Small motions of a stretched string are governed by
(1). Suppose that, in equilibrium, the string is along the
$x$-axis. Then $u(x,t)$ gives the lateral displacement at
position $x$ and time $t$. The solution $u(x,t) = f(x - ct)$
represents a wave traveling to the right ($x$ increasing)
as $t$ increases: a photograph at $t = 0$ would show the
string to have displacement $u(x, 0) = f(x)$; a second
photograph at time $1/c$ would show the same displacement
but moved to the right by one unit, $u(x, 1/c) = f(x - 1)$.
Equation (2) shows that the general solution
consists of two waves, one moving to the right and one
to the left, both moving at speed $c$.
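
That any function of the form (2) solves the wave equation can be checked numerically. The Python sketch below picks two smooth profiles f and g and a wave speed (all of them arbitrary illustrative choices, not taken from the article) and compares second differences in x and t.

```python
import math

C_SPEED = 2.0  # hypothetical wave speed c


def f(s):
    """Right-moving profile (an arbitrary smooth choice)."""
    return math.exp(-s * s)


def g(s):
    """Left-moving profile (an arbitrary smooth choice)."""
    return math.sin(s)


def u(x, t, c=C_SPEED):
    """d'Alembert's solution u = f(x - ct) + g(x + ct)."""
    return f(x - c * t) + g(x + c * t)


def wave_residual(x, t, c=C_SPEED, step=1e-3):
    """u_xx - u_tt / c^2, with both derivatives replaced by central differences."""
    u_xx = (u(x + step, t, c) - 2 * u(x, t, c) + u(x - step, t, c)) / step**2
    u_tt = (u(x, t + step, c) - 2 * u(x, t, c) + u(x, t - step, c)) / step**2
    return u_xx - u_tt / c**2
```

The residual is small but not zero, because the derivatives are only approximated; it shrinks like the square of `step` until rounding error takes over.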




Part IV 

Areas of Applied Mathematics 


IV.1 Complex Analysis

P. A. Martin 


1 Introduction 

All calculus textbooks start with $f(x)$: $f$ is a function
of one (real) variable $x$. Topics covered include
limits, continuity, differentiation, and integration, with
the associated notation, such as $\mathrm{d}f/\mathrm{d}x = f'(x)$ and
$\int_a^b f(x)\,\mathrm{d}x$. It is also usual to include a discussion of
infinite sequences and series. The rigorous treatment
of all these topics constitutes real analysis.

Complex analysis starts with the following question.
What happens if we replace $x$ by $z = x + iy$, where
$x$ and $y$ are two independent real variables and $i = \sqrt{-1}$?
Answering this question leads to rich new fields
of mathematics; we shall be concerned with those parts
that are used in applied mathematics.

Let us begin with basic terminology and concepts. We
call $z = x + iy$ a complex variable. The imaginary unit
$i$ should be treated as a symbol that obeys all the usual
laws of algebra together with $i^2 = -1$. We call $x = \operatorname{Re} z$
the real part of $z$ and $y = \operatorname{Im} z$ the imaginary part of $z$.
We can identify $z = x + iy$ with a point in the $xy$-plane
(known as the $z$-plane or the complex plane).

The complex conjugate of $z$ is $\bar{z} = x - iy$; complex
conjugation is reflection in the $x$-axis. The absolute
value (or modulus or magnitude) of $z$ is $|z| = +\sqrt{x^2 + y^2}$,
the distance from $z$ to the origin. Given
$w = u + iv$, we define $z + w = (x + u) + i(y + v)$.
Addition of complex quantities is therefore equivalent
to addition of two-dimensional vectors. For multiplication,
$zw = xu - yv + i(xv + yu)$. Putting $z = w$ shows
that $\operatorname{Re} z^2 = x^2 - y^2 \neq x^2$ unless $z$ is real ($y = 0$). Also,
$z\bar{z} = |z|^2$ and $z/w = z\bar{w}/|w|^2$ when $w \neq 0$.

Introducing plane polar coordinates, $r$ and $\theta$, we
have $z = r\cos\theta + ir\sin\theta = re^{i\theta}$ by Euler’s formula.
Thus, $r = |z|$. The angle $\theta$ is called an argument of $z$,
denoted by $\arg z$ or $\operatorname{ph} z$ (for phase). Notice that $\arg z$
is not unique, as we can always add any integer multiple
of $2\pi$; this nonuniqueness is sometimes useful and
sometimes a nuisance.
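
These rules are built into most programming languages. As a brief illustration, the Python sketch below uses the built-in complex type and the standard `cmath` module to reproduce the product, quotient, and polar-form formulas; the sample values of z and w are arbitrary choices.

```python
import cmath

z = complex(3, 4)    # z = x + iy with x = 3, y = 4
w = complex(1, -2)   # w = u + iv with u = 1, v = -2

modulus = abs(z)                                   # |z| = sqrt(x^2 + y^2)
product = z * w                                    # (xu - yv) + i(xv + yu)
quotient_direct = z / w
quotient_via_conjugate = z * w.conjugate() / abs(w) ** 2

# Polar form z = r e^{i theta}; cmath.polar returns theta in [-pi, pi].
r, theta = cmath.polar(z)
rebuilt = cmath.rect(r, theta)

real_part_of_square = (z * z).real                 # x^2 - y^2, not x^2
```

Note that `cmath.polar` returns one particular argument; adding any integer multiple of 2π to `theta` gives the same point z.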

If we let $r \to \infty$, the point $z$ recedes to infinity. It is
usual to state that there is a single “point at infinity,”
denoted by $z = \infty$, that is reached by letting $r \to \infty$
in any direction, $\theta$. Alternatively, we can state that the
formula $z = 1/w$ takes the point $w = 0$ to the point
$z = \infty$.

2 Functions 

A function of a complex variable, $f(z)$, is a rule; given
$z = x + iy$ in some set (the domain of $f$), the rule provides
a unique complex number denoted by $f(z) = u + iv$,
say, where $u = \operatorname{Re} f$ and $v = \operatorname{Im} f$ are real.
We write $f(z) = u(x,y) + iv(x,y)$ to emphasize the
dependence on $x$ and $y$.

Simple examples of functions are $f(z) = z^2$ and
$f(z) = \bar{z}$. Elementary functions are defined “naturally”;
for example, $e^z = e^{x+iy} = e^x e^{iy}$ and
$\cos z = \tfrac{1}{2}(e^{iz} + e^{-iz})$. For powers and logarithms, we have the
formulas $z^\alpha = r^\alpha e^{i\alpha\theta}$ ($\alpha$ is real) and
$\log z = \log(re^{i\theta}) = \log r + i\theta$. Strictly, these do not define functions because
of the nonuniqueness of $\theta$; changing $\theta$ by $2\pi$ does not
change $z$ but it does change the values of $z^\alpha$ (unless
$\alpha$ is an integer) and $\log z$. One response to this phenomenon
is to say that $\log z$, for example, is a multivalued
“function”; increasing $\theta$ by $2\pi$ takes us onto
another branch or Riemann sheet of $\log z$. However, in
practice, it is usually better to introduce a branch cut,
which, for $\log z$, is any line from $z = 0$ to $z = \infty$. This cut
is regarded as an artificial barrier; we must not cross it.
Its presence prevents us from increasing $\theta$ by $2\pi$. For
example, we could restrict $\theta$ to satisfy $-\pi < \theta < \pi$
and put the cut on the negative $x$-axis. Once we have
restricted $\theta$ to lie in some interval of length $2\pi$, $\log z$
and $z^\alpha$ become single-valued; they are now functions.
We shall have more to say about branches in section 4.

There are many other ways to define functions. For
example, $f(z) = \sum_{n=0}^{\infty} c_n z^n$ is a function provided the
power series converges at $z$; characteristically, power
series converge in disks, $|z| < R$, for some $R > 0$ ($R$ is
the radius of convergence). The prototype power series
is the geometric series; it converges inside the unit disk
where its sum is known:

$$ \sum_{n=0}^{\infty} z^n = \frac{1}{1 - z}, \quad |z| < 1. \tag{1} $$


For another example, take

$$ f(z) = \int_0^\infty g(t)\,e^{-zt}\,\mathrm{d}t. \tag{2} $$


This defines the Laplace transform of g. Typically, such 
integrals converge for Re z > A, where A is a constant 
that depends on g. Another function defined by an 
integral is Euler’s gamma function: 


$$ \Gamma(z) = \int_0^\infty t^{z-1}e^{-t}\,\mathrm{d}t, \quad \operatorname{Re} z > 0. \tag{3} $$

Much is known about the properties of $\Gamma$. For example,
$\Gamma(n) = (n - 1)!$ when $n$ is any positive integer. There
is more on this in section 13 below.
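
Two of these statements are easy to test numerically. The Python sketch below sums the geometric series at a sample point inside the unit disk and compares the result with 1/(1 - z), and it checks Γ(n) = (n - 1)! with the standard-library function `math.gamma` (which accepts real arguments only). The sample point and the number of terms are arbitrary choices.

```python
import math


def geometric_partial_sum(z, terms):
    """Sum of z**n for n = 0, ..., terms - 1."""
    total, power = 0j, 1 + 0j
    for _ in range(terms):
        total += power
        power *= z
    return total


z = 0.5 + 0.3j                      # a point with |z| < 1
approx = geometric_partial_sum(z, 200)
exact = 1 / (1 - z)

# Gamma(n) = (n - 1)! for positive integers n.
gamma_matches_factorial = all(
    abs(math.gamma(n) - math.factorial(n - 1)) < 1e-9 for n in range(1, 11)
)
```

For |z| ≥ 1 the partial sums no longer settle down, even though 1/(1 - z) remains well defined away from z = 1.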


3 Analytic Functions 


The notions of limit, continuity, and derivative are
defined exactly as in real-variable calculus. In particular,
the derivative of $f$ at $z$ is defined by

$$ f'(z) = \frac{\mathrm{d}f}{\mathrm{d}z} = \lim_{h\to 0}\frac{f(z + h) - f(z)}{h}, \tag{4} $$

provided the limit exists. Here, $h$ is allowed to be complex;
the point $z + h$ must be able to approach the
point $z$ in any direction, and the limit must be the
same. As a consequence, if $f(z) = u(x,y) + iv(x,y)$
has a derivative, $f'(z)$, at $z$, then $u$ and $v$ satisfy the
Cauchy-Riemann equations:

$$ \frac{\partial u}{\partial x} = \frac{\partial v}{\partial y} \quad\text{and}\quad \frac{\partial u}{\partial y} = -\frac{\partial v}{\partial x}. \tag{5} $$

If these are not both satisfied, then $f'(z)$ does not
exist. Two examples: $f(z) = \bar{z}$ is not differentiable for
any $z$; and any real-valued function $f(z) = u(x,y)$
is not differentiable unless $u$ is a constant. If both
Cauchy-Riemann equations (5) are satisfied and the
partial derivatives in (5) are continuous functions, then
$f'$ exists.
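
The Cauchy-Riemann equations can be tested numerically for a given function. The Python sketch below estimates the partial derivatives of u and v by central differences and returns the two residuals of (5); the function name and the step size are illustrative assumptions.

```python
def cauchy_riemann_residuals(f, z, step=1e-6):
    """Return (u_x - v_y, u_y + v_x) at z, using central differences.

    Both residuals vanish (up to discretization error) where f is differentiable.
    """
    d_dx = (f(z + step) - f(z - step)) / (2 * step)            # u_x + i v_x
    d_dy = (f(z + 1j * step) - f(z - 1j * step)) / (2 * step)  # u_y + i v_y
    u_x, v_x = d_dx.real, d_dx.imag
    u_y, v_y = d_dy.real, d_dy.imag
    return u_x - v_y, u_y + v_x


z0 = 1 + 2j
square_residuals = cauchy_riemann_residuals(lambda z: z * z, z0)
conjugate_residuals = cauchy_riemann_residuals(lambda z: z.conjugate(), z0)
```

For f(z) = z^2 both residuals are essentially zero, whereas for the conjugate function u = x and v = -y, so the first residual equals 2 at every point.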

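Using (5), if $f'$ exists, then

$$ f'(z) = \frac{\partial u}{\partial x} + i\frac{\partial v}{\partial x} = \frac{\partial v}{\partial y} - i\frac{\partial u}{\partial y} = \frac{\partial u}{\partial x} - i\frac{\partial u}{\partial y} = \frac{\partial v}{\partial y} + i\frac{\partial v}{\partial x}. $$

The first equality follows by taking $h$ to be real in (4) and
the second by taking $h$ to be purely imaginary. The four
formulas for $f'$ show that we can calculate $f'$ from $\operatorname{Re} f$
or $\operatorname{Im} f$, or by differentiating with respect to $x$ or $y$.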

Differentiability is a local property, defined at a point 
z. Usually, we are interested in functions that are differ- 
entiable at all points in their domains. Such functions 
are called analytic or holomorphic. Points at which a 
function is not differentiable are called singularities. 

Derivatives of higher order (such as $f''(z)$) are
defined in the natural way. One surprising fact is that a 
differentiable function can be differentiated any num- 
ber of times; once differentiable implies infinitely dif- 
ferentiable (see (13) below for an indication of a proof). 
This result is certainly not true for real functions. 

If we eliminate $v$ from (5), we obtain

$$ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0. \tag{6} $$

Thus, the real part of an analytic function, $u(x,y)$,
satisfies laplace’s equation [III.18] (6). The imaginary
part, $v(x,y)$, satisfies the same partial differential
equation (PDE). This reveals a close connection
between analytic functions and solutions of one partic- 
ular PDE. As Laplace’s equation arises in the modeling 
of many physical phenomena, this connection has been 
exploited extensively. 


4 More on Branches 

Let us return to $\log$, which we can define as a (single-valued)
function by

$$ \log z = \log r + i\theta, \quad r > 0, \quad -\pi < \theta < \pi, \tag{7} $$

with $z = re^{i\theta}$. There is a branch cut along the negative
$x$-axis, with a branch point at $z = 0$ and a branch
point at $z = \infty$. Thus, our domain of definition for
$\log z$ is the cut plane, i.e., the whole complex plane with
the cut removed. Then, $\log z$ is analytic; it is differentiable
at all points in its domain of definition. Moreover,
$(\mathrm{d}/\mathrm{d}z)\log z = z^{-1}$.

According to (7), $\log z$ is not defined on the negative
$x$-axis. Some authors regard this as unacceptable, and
so they replace the (open) interval $-\pi < \theta < \pi$ in
(7) by $-\pi < \theta \leq \pi$ or $-\pi \leq \theta < \pi$. The first choice
gives, for example, $\log(-1) = i\pi$ and the second gives
$\log(-1) = -i\pi$. Either choice enlarges the domain of
definition to the whole plane with $z = 0$ removed. However,
we lose analyticity; $\log z$ is not differentiable on
the line $\theta = \pi$ (first choice) because points on that
line are not accessible in all directions (as they must
be if one wants to compute limits, as in the definition
of derivative) without leaving the domain of definition.
For some applications this may be acceptable, but, in
practice, it is usual to simply move the cut. We can
therefore replace $-\pi < \theta < \pi$ in (7) by another open
interval, $\theta_0 < \theta < 2\pi + \theta_0$, implying a cut along the
straight half-line $\theta = \theta_0$, $r \geq 0$. (In fact, the cut need
not be straight; any line connecting the branch point at
$z = 0$ to $z = \infty$ may be used.) Then $\log z$ is analytic in
a new cut plane.

Once we define a function such as $\log z$ or $z^{1/2}$ with
a specified range for $\theta$, we can say that we have defined
a principal value of that function. Certain choices (such
as $-\pi < \theta < \pi$ or $-\pi < \theta \leq \pi$) are common, but the
reader should not overlook the option of moving cuts
when it is convenient to do so.

There is another consequence of insisting on having
(single-valued) functions: some standard identities,
such as $\log(z^2) = 2\log z$, may no longer hold. For
example, with $z = -1 + i$ and the definition (7), we find
$\log(z^2) = \log 2 - \tfrac{1}{2}i\pi$ but $2\log z = \log 2 + \tfrac{3}{2}i\pi$.
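
The same discrepancy appears in numerical software. Python's `cmath.log` places its branch cut along the negative real axis and returns an imaginary part between -π and π, so for points off the cut it agrees with definition (7). The sketch below reproduces the example with z = -1 + i; the variable names are illustrative.

```python
import cmath
import math

z = complex(-1, 1)

log_of_square = cmath.log(z * z)     # z^2 = -2i, so theta = -pi/2
twice_log = 2 * cmath.log(z)         # theta = 3*pi/4, doubled to 3*pi/2

difference = twice_log - log_of_square   # the two values differ by 2*pi*i
```

The two results share the same real part, log 2, and their imaginary parts differ by exactly 2π, which is the jump incurred by crossing onto another branch.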

Summarizing, functions with branches are very common
(for another example, see the article on the Lambert
W-function [III.17]) but their presence often
leads to complications, subtle difficulties, and calcula- 
tional errors; care is always required. 


5 Infinite Series

A power series about the point $z_0$ has the form

$$ \sum_{n=0}^{\infty} c_n (z - z_0)^n, \tag{8} $$

where the coefficients $c_n$ are complex numbers. The
series (8) converges for $|z - z_0| < R$ and diverges for
$|z - z_0| > R$, where the radius of convergence, $R$, may
be finite or infinite. (It may happen that (8) converges
at $z = z_0$ only, with sum $c_0$.) When the series does converge,
we denote its sum by $S(z)$. For an example, see
the geometric series (1).

The sum $S(z)$ is analytic for $|z - z_0| < R$; power series
define analytic functions.

Now we turn this around. We take an analytic function,
$f(z)$, and we try to write it as a power series. Doing
this is familiar from calculus, and the result is Taylor’s
theorem:

$$ f(z) = \sum_{n=0}^{\infty} \frac{f^{(n)}(z_0)}{n!}(z - z_0)^n, $$

where $f^{(n)}$ is the $n$th derivative of $f$. The series is
known as the Taylor expansion of $f(z)$ about $z_0$. It converges
for $|z - z_0| < R$, where $R$ is the distance from
$z_0$ to the nearest singularity of $f(z)$. A Taylor expansion
about the origin ($z_0 = 0$) is known as a Maclaurin
expansion. All these expansions are the same as those
occurring in the calculus of functions of one real variable.
For example, (1) gives the Maclaurin expansion
of $1/(1 - z)$. Another familiar Maclaurin expansion is
$e^z = \sum_{n=0}^{\infty} z^n/n!$, which is convergent for all $z$.

A generalization of Taylor’s theorem, Laurent’s theorem,
will be given in section 8.

Not all infinite series are power series. A famous
series is the riemann zeta function [IV.7 §4], which
is defined by

$$ \zeta(z) = \sum_{n=1}^{\infty} \frac{1}{n^z} \quad\text{for } \operatorname{Re} z > 1, \tag{9} $$

and which is intimately connected with the distribution of
the prime numbers.

It is possible to develop the theory of analytic functions
by starting with power series; this approach,
which goes back to Weierstrass, has a constructive flavor.
We started with the notion of differentiability; this
approach, which goes back to Riemann and Cauchy, is
closer to real-variable calculus. The two approaches are
equivalent; power series define analytic functions and
analytic functions have power-series expansions.


6 Contour Integrals


In the calculus of functions of two real variables $x$ and
$y$, double integrals over regions of the $xy$-plane and
line integrals along curves in the $xy$-plane are defined.
In complex analysis, we are mainly concerned with integrals
along curves in the $z$-plane. They are defined similarly
to line integrals. Thus, suppose that points on a
curve $C$ are located by a parametrization,

$$ C:\ z(t) = x(t) + iy(t), \quad a \leq t \leq b, $$

where $a$ and $b$ are constants and $x(t)$ and $y(t)$ are real
functions of the real variable $t$. As $t$ increases from $a$ to
$b$, $z(t)$ moves from $z(a)$ to $z(b)$; the parametrization
induces a direction or orientation on $C$. The curve $C$
is smooth if $x'(t)$ and $y'(t)$ exist and are continuous.
Then, if $f(z)$ is defined for all points $z$ on a smooth
curve $C$,

$$ \int_C f(z)\,\mathrm{d}z = \int_a^b f(z(t))\,z'(t)\,\mathrm{d}t, \tag{10} $$

where $z'(t) = x'(t) + iy'(t)$. In (10), the right-hand
side defines the expression on the left-hand side as an
integration with respect to the parameter $t$. More generally,
suppose that $C$ is a contour, defined as a continuous
curve made from smooth pieces joined at corners.
Then, to define $\int_C f\,\mathrm{d}z$, we parametrize each smooth
piece separately, and then sum the contributions from
each piece, ensuring that the parametrizations are such
that $z$ moves continuously along $C$.

If $C_0$ is the same curve as $C$ but traversed in the opposite
direction (from $z(b)$ to $z(a)$), then
$\int_{C_0} f(z)\,\mathrm{d}z = -\int_C f(z)\,\mathrm{d}z$; changing the direction changes the sign.
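
Definition (10) translates directly into a numerical recipe. The Python sketch below approximates a contour integral by summing f at the midpoint of each small piece of the curve times the corresponding increment of z, which is a discrete stand-in for f(z(t)) z'(t) dt. The function name, the number of steps, and the sample curve (a straight segment from 0 to 1 + i) are illustrative assumptions.

```python
def contour_integral(f, z_of_t, a, b, steps=2000):
    """Approximate the integral of f along the curve z(t), a <= t <= b."""
    total = 0j
    dt = (b - a) / steps
    for k in range(steps):
        t0 = a + k * dt
        t1 = t0 + dt
        # f at the parameter midpoint times the increment z(t1) - z(t0)
        total += f(z_of_t(0.5 * (t0 + t1))) * (z_of_t(t1) - z_of_t(t0))
    return total


def segment(t):
    """Straight line from 0 to 1 + i, for 0 <= t <= 1."""
    return (1 + 1j) * t


def reversed_segment(t):
    """The same line traversed from 1 + i back to 0."""
    return (1 + 1j) * (1 - t)


forward = contour_integral(lambda z: z * z, segment, 0.0, 1.0)
backward = contour_integral(lambda z: z * z, reversed_segment, 0.0, 1.0)
exact = (1 + 1j) ** 3 / 3   # from the antiderivative z^3/3 at the endpoints
```

Because only increments of z are used, the same routine handles contours with corners when it is applied piece by piece, and reversing the direction flips the sign of the result.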


7 Cauchy’s Theorem 


A contour $C$ is closed if $z(a) = z(b)$, and it is simple
if it has no self-intersections. Cauchy’s theorem can be
stated as follows. Suppose that $f(z)$ is analytic inside
a simple closed contour $C$ and continuous on $C$. Then

$$ \int_C f(z)\,\mathrm{d}z = 0. \tag{11} $$

It is worth emphasizing the hypotheses. First, we do
not need to know anything about $f(z)$ outside $C$;
Cauchy assumed stronger conditions, but these were
later weakened by Goursat. Second, $C$ is a contour, so
corners are allowed. Third, by requiring that “$f$ is analytic
inside $C$,” we mean that $f(z)$ must be differentiable
at all points $z$ inside $C$; singularities (including
branch points) are not allowed (although they may be
present outside $C$).

There are many consequences of Cauchy’s theorem.
One is known as deforming the contour. Suppose that
$C_1$ and $C_2$ are simple closed contours, both traversed
in the same direction, with $C_1$ enclosed by $C_2$. Suppose
that $f(z)$ is analytic in the region between $C_1$ and $C_2$
and that it is continuous on $C_1$ and $C_2$. (Note that $f$
may have singularities inside the smaller contour $C_1$
or outside the larger contour $C_2$.) Then,
$\int_{C_1} f(z)\,\mathrm{d}z = \int_{C_2} f(z)\,\mathrm{d}z$; one contour can be deformed into another
without changing the value of the integral, provided the
integrand is analytic between the contours. The same
result is true when $C_1$ and $C_2$ are two contours with the
same endpoints, provided $f$ is analytic between $C_1$ and
$C_2$. These results are useful because they may allow us
to deform a complicated contour into a simpler contour
(such as a circle or a straight line).

Another consequence of Cauchy’s theorem is the
Cauchy integral formula. Under the same conditions,
we have

$$ f(z_0) = \frac{1}{2\pi i}\int_C \frac{f(z)}{z - z_0}\,\mathrm{d}z, \tag{12} $$

where $z_0$ is an arbitrary point inside $C$, and $C$ is
traversed counterclockwise. This shows that we can
recover the values of an analytic function inside $C$ from
its values on $C$.
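
Both Cauchy's theorem and the Cauchy integral formula are easy to observe numerically when C is a circle. The Python sketch below parametrizes the circle as z(t) = center + radius·e^{it} and applies the trapezoidal rule in t; the choice of f = exp, the point z_0, and the number of sample points are illustrative assumptions.

```python
import cmath
import math


def circle_integral(g, center, radius, steps=256):
    """Integral of g counterclockwise around the circle |z - center| = radius.

    Uses z(t) = center + radius*exp(it) and z'(t) = i*radius*exp(it),
    with the trapezoidal rule in t over [0, 2*pi].
    """
    total = 0j
    for k in range(steps):
        e = cmath.exp(2j * math.pi * k / steps)
        total += g(center + radius * e) * (1j * radius * e)
    return total * (2 * math.pi / steps)


f = cmath.exp            # an entire function, so analytic inside any contour
z0 = 0.3 + 0.2j          # a point inside the unit circle

cauchy_theorem_value = circle_integral(f, 0.0, 1.0)   # should be close to 0
cauchy_formula_value = circle_integral(lambda z: f(z) / (z - z0), 0.0, 1.0) / (2j * math.pi)
```

The trapezoidal rule is unusually accurate here because the integrand is a smooth periodic function of t, so a few hundred sample points already recover exp(z_0) to near machine precision.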


More generally, and again under the same conditions,
we have

$$ f^{(n)}(z_0) = \frac{n!}{2\pi i}\int_C \frac{f(z)}{(z - z_0)^{n+1}}\,\mathrm{d}z. \tag{13} $$

Formally, this can be seen as the $n$th derivative of the
Cauchy integral formula (12), but it is deeper; it can
be used to prove the existence of $f^{(n)}$, for $n = 2, 3, \ldots$,
assuming that $f'$ exists. This is done using an inductive
argument. We have (compare with (4))

$$ f^{(n+1)}(z_0) = \lim_{h\to 0}\frac{f^{(n)}(z_0 + h) - f^{(n)}(z_0)}{h}, $$

provided the limit exists. Now, on the right-hand side,
use (13) twice; the limit can then be taken.

Formula (13) with $n = 1$ can be used to prove Liouville’s
theorem. Suppose that $f(z)$ is analytic everywhere
in the $z$-plane (that is, there are no singularities);
such a function is called entire. Suppose further
that $|f(z)| < M$ for some constant $M$ and for all $z$; we
say that $f$ is bounded. Liouville’s theorem states that
a bounded entire function is necessarily constant. In
other words, (nonconstant) entire functions must be
large somewhere in the complex plane. For example,

$$ \begin{aligned} |\cos z|^2 = \cos z\,\cos\bar{z} &= \tfrac{1}{4}(e^{iz} + e^{-iz})(e^{i\bar{z}} + e^{-i\bar{z}}) \\ &= \tfrac{1}{4}(e^{2ix} + e^{-2ix} + e^{2y} + e^{-2y}) \\ &= \tfrac{1}{2}(\cos 2x + \cosh 2y) = \cos^2 x + \cosh^2 y - 1, \end{aligned} $$

using $z + \bar{z} = 2x$ and $z - \bar{z} = 2iy$. Thus, $|\cos z|$ grows
rapidly as we move away from the real axis (where $y = 0$
and $\cosh 0 = 1$).
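
The closed form for |cos z|^2 can be compared with a direct evaluation. The Python sketch below does so with the standard `cmath` module; the function name and sample points are arbitrary choices.

```python
import cmath
import math


def cos_modulus_squared(z):
    """|cos z|^2 from the closed form cos^2 x + cosh^2 y - 1, with z = x + iy."""
    x, y = z.real, z.imag
    return math.cos(x) ** 2 + math.cosh(y) ** 2 - 1.0


z = 0.7 + 2.5j
direct = abs(cmath.cos(z)) ** 2
closed_form = cos_modulus_squared(z)

# On the real axis |cos z| <= 1, but it grows quickly along the imaginary axis.
size_on_imaginary_axis = abs(cmath.cos(10j))   # equals cosh(10), about 1.1e4
```

This growth away from the real axis is what Liouville's theorem requires of the nonconstant entire function cos z.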


8 Laurent’s Theorem 

Suppose that /(z) is analytic inside an annulus, a < 
\z- zq\ < b, centered at zo. We say nothing about f(z ) 
when z is in the “hole” of radius a (|z - Zol < a) or 
when z is outside the annulus ( | z — zo I > b). We then 
have the Laurent expansion 
00 

/(z) = Y. Cn(z-Z 0 ) n (14) 

U —-00 

for all z in the annulus, where the coefficients are given 
by contour integrals, 

Cn = X— f f(Z) dz, (15) 

27T1 Jc (z - Zo) n+l 

in which C is a simple closed contour in the annulus 
that encircles the hole (once) in the counterclockwise 
direction. Note that the sum in (14) is over all n. It is 
often convenient to split the sum, giving, for all z in the 
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annulus, 

00 CO J 

f(z) = X a n (z-zo) n + X )„ ■ (!6) 

n=0 n = l 1 

where = c n for n = 0,1,2,... and b n = C- n for 
n = 1,2 In particular, 

bi = -^\ f(z)dz. (17) 

2tti Jc 

Suppose that f(z) is also analytic in the hole, so
that f(z) is analytic in the disk $|z - z_0| < b$. Then
$b_n = 0$ (n = 1, 2, ...) by Cauchy’s theorem and
$a_n = f^{(n)}(z_0)/n!$ by (13); Taylor’s theorem is recovered. Note
that, in general, when f does have singularities in
the hole, we cannot use (13) to evaluate the contour
integrals defining $a_n$.
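
As a rough numerical illustration of (15), the coefficients $c_n$ can be approximated by applying the trapezoidal rule to the contour integral on a circle around $z_0$. The Python sketch below is not from the original article: the function name `laurent_coefficient`, the test function $e^z/z^2$, the radius, and the number of sample points are all illustrative assumptions.

```python
import cmath
import math


def laurent_coefficient(f, n, z0=0j, radius=1.0, samples=256):
    """Approximate c_n in (15) on the circle |z - z0| = radius.

    With z = z0 + w and w = radius*exp(i*theta), dz = i*w*dtheta, so
    c_n = (1/(2*pi)) * integral of f(z) * w**(-n) dtheta,
    which the trapezoidal rule turns into a plain average.
    """
    total = 0j
    for k in range(samples):
        theta = 2.0 * math.pi * k / samples
        w = radius * cmath.exp(1j * theta)
        total += f(z0 + w) * w ** (-n)
    return total / samples


# Hypothetical test function: e^z / z^2 = z^-2 + z^-1 + 1/2 + z/6 + ...
f = lambda z: cmath.exp(z) / z**2
coeffs = {n: laurent_coefficient(f, n) for n in range(-3, 3)}
```

Here `coeffs[-1]` plays the role of $b_1$ in (17), and the entries with n ≥ 0 are the Taylor-like coefficients $a_n$ in (16). The trapezoidal rule is chosen only because it converges very quickly for smooth periodic integrands; any other quadrature rule could be substituted.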


9 Singularities 


A singularity is a point at which a function is not differentiable.
There are several kinds of singularities. A point $z_0$ is called an
isolated singularity if there is an annulus $0 < |z - z_0| < b$
(a “punctured disk”) in which there are no other singularities. In this
annulus, we have a Laurent expansion, (16). The first part (the sum over
$a_n$) is a power series, and so it defines an analytic function on the
whole disk. The singular behavior resides in the second sum (over $b_n$);
it is called the principal part, P(z). In practice, P(z) often has a
finite number of terms,


$$ P(z) = \frac{b_1}{z - z_0} + \frac{b_2}{(z - z_0)^2} + \cdots + \frac{b_m}{(z - z_0)^m}, \tag{18} $$


with $b_n = 0$ for all n > m and $b_m \neq 0$. In this situation,
we say that f has a pole of order m at $z_0$. A pole of
order 1 is called a simple pole and a pole of order 2 is
called a double pole. For example, all the following have
simple poles at z = 0:

$$ \frac{1}{z}, \quad \frac{1+z}{z}, \quad \frac{e^z}{z}, \quad \frac{\sin z}{z^2}, \quad \frac{\pi}{\sin \pi z}; \tag{19} $$

the last in this list also has simple poles at z = ±1, ±2, ....
All the following have double poles at z = 0:


1 + z 


1 


. . — H — ■ (20) 

z~ z c zsinz z z sin ttz 
If the principal part of the Laurent expansion contains an infinite
number of nonzero terms, $z_0$ is called an isolated essential
singularity. For example, $e^{1/z}$ has such a singularity at z = 0.

The coefficient $b_1$ (given by (17)) will play a special
role later; it is called the residue of f at the isolated
singularity, $z_0$, and it is denoted by Res[f; $z_0$].


There are also nonisolated singularities. The most 
common of these occur at branch points. For example, 
$f(z) = z^{1/2}$ has a branch-point singularity at z = 0.
Note that any disk centered at z = 0 will include a piece 
of the branch cut emanating from the branch point; f is 
discontinuous across the cut, so it is certainly not dif- 
ferentiable there, implying that z = 0 is not an isolated 
singularity. 

10 Cauchy’s Residue Theorem 

If we want to evaluate $I = \int_C f(z)\,dz$, the basic method
is to parametrize each smooth piece of C and then use
the definition (10). In principle, this works for any f
and for any C. However, in practice, C is often closed
and f is analytic apart from some singularities. In these
happy situations, we can calculate I efficiently by using
Cauchy’s residue theorem. Thus, suppose that f(z) is
analytic inside the simple closed contour C (and continuous
on C) apart from isolated singularities at $z_j$,
j = 1, 2, ..., n. (Note that f may have other singularities,
including branch points, outside C, but these are
of no interest here.) Then

$$ \int_C f(z)\,dz = 2\pi i \sum_{j=1}^{n} \mathrm{Res}[f; z_j], \tag{21} $$

where C is traversed counterclockwise. This important
result is remembered as “2πi times the sum of the residues
at the isolated singularities inside the contour.”
If there are no singularities inside, we recover Cauchy’s
theorem (11).

To prove the theorem, we start with the case n = 1.
There is a Laurent expansion about the sole singularity
$z_1$, convergent in a punctured disk $0 < |z - z_1| < b$. We
deform C into a smaller contour (enclosing $z_1$) that is
inside the disk. Then we use (17). In the general case,
we deform C and “pinch off,” giving a sum of n contour
integrals, each one containing one singularity.

To understand the pinching-off process, suppose
that n = 2 and deform C into a dumbbell-shaped contour,
with two circles joined by two parallel straight
lines, $L_1$ and $L_2$, traversed in opposite directions. The
contributions from $L_1$ and $L_2$ cancel in the limit as
the lines go together, leaving the contributions from
disjoint closed contours around each singularity. This
process is readily extended to any (finite) number of
isolated singularities.

In order to exploit the residue theorem, we need
efficient methods for computing residues. Recall that
Res[f; $z_0$] is the coefficient $b_1$ in the Laurent expansion
about $z_0$ (see (18)). For simple poles, $b_1$ is the only
nontrivial coefficient in the principal part; thus, at a simple
pole $z_0$,

$$ \mathrm{Res}[f; z_0] = \lim_{z \to z_0} \{(z - z_0) f(z)\}. \tag{22} $$

Often, simple poles are characterized by writing f(z) =
p(z)/q(z) with $q(z_0) = 0$, $p(z_0) \neq 0$, and $q'(z_0) \neq 0$.
Then

$$ \mathrm{Res}[f; z_0] = p(z_0)/q'(z_0). \tag{23} $$

For a pole of order m, we can use

$$ \mathrm{Res}[f; z_0] = \frac{1}{(m-1)!} \lim_{z \to z_0} \frac{d^{m-1}}{dz^{m-1}} \{(z - z_0)^m f(z)\}. $$

However, it is sometimes quicker to construct the Laurent
expansion directly and then to pick off $b_1$, the coefficient
of $1/(z - z_0)$. Thus, almost by inspection, all the
simple-pole examples in (19) have Res[f; 0] = 1. The
five double-pole examples in (20) have Res[f; 0] = 0, 1,
0, 0, and 0, respectively.
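
The residue formulas above can be cross-checked numerically by evaluating (17) on a small circle around the pole. The following Python sketch is an illustration only, not part of the original text: the helper name `residue_by_contour`, the circle radius, and the sample count are assumptions, and the three test functions are chosen simply because their residues are easy to find by hand.

```python
import cmath
import math


def residue_by_contour(f, z0, radius=0.5, samples=400):
    """Approximate Res[f; z0] = (1/(2*pi*i)) * integral of f over |z - z0| = radius.

    With z = z0 + w, dz = i*w*dtheta, so the integral becomes the average
    of f(z0 + w) * w over equally spaced points on the circle.
    The circle must enclose no singularity other than z0.
    """
    total = 0j
    for k in range(samples):
        w = radius * cmath.exp(2j * math.pi * k / samples)
        total += f(z0 + w) * w
    return total / samples


# Simple pole at 0: pi / sin(pi z), residue 1.
simple = residue_by_contour(lambda z: math.pi / cmath.sin(math.pi * z), 0j)

# Double pole at 0: (1 + z) / z^2, residue 1 (coefficient of 1/z).
double = residue_by_contour(lambda z: (1 + z) / z**2, 0j)

# Simple pole of 1/(z^4 + 1) at exp(i*pi/4); formula (23) gives 1/(4 z0^3).
z0 = cmath.exp(1j * math.pi / 4)
quartic = residue_by_contour(lambda z: 1 / (z**4 + 1), z0)
quartic_formula = 1 / (4 * z0**3)
```

Comparing `quartic` with `quartic_formula` checks (23) with p = 1 and $q = z^4 + 1$ against a direct contour integral.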

11 Evaluation of Integrals

Cauchy’s residue theorem gives a powerful method for 
evaluating integrals. We give a few examples. 

Let C be a circle of radius a centered at the origin
and traversed counterclockwise. The function $e^z/z$ has
a simple pole at z = 0 with residue 1. Hence

$$ \int_C \frac{e^z}{z}\,dz = 2\pi i. \tag{24} $$

This result could have been obtained from the Cauchy
integral formula (12) with $f(z) = e^z$ and $z_0 = 0$. Note
also that the value of the integral does not depend
on a; this is not surprising because we know that we
can deform C into a concentric circular contour (for
example) without changing the value of the integral.

Now, starting from the result (24), suppose we parametrize
C and then use (10); a suitable parametrization
is $z(t) = a e^{it}$, $-\pi \le t \le \pi$. As $z'(t) = i a e^{it} = i z(t)$, we
obtain

$$ \int_{-\pi}^{\pi} \exp(a e^{it})\,dt = 2\pi. $$

By Euler’s formula, $e^{i\theta} = \cos\theta + i \sin\theta$, the integrand
is $e^{a\cos t} \cos(a \sin t) + i e^{a\cos t} \sin(a \sin t)$. The second
term is an odd function of t and so it integrates to zero,
leaving

$$ \int_0^{\pi} e^{a\cos t} \cos(a \sin t)\,dt = \pi. \tag{25} $$

Thus, from the known value of a fairly simple contour
integral, (24), we obtained the value of a complicated
real integral. Notice that the formula (25) was derived
by assuming that the parameter a is real and positive.
In fact, it is valid for arbitrary complex a; this is an
example of analytic continuation (see section 13).

We now consider doing the opposite: evaluating inte- 
grals by converting them into contour integrals, fol- 
lowed by use of the residue theorem. 

For trigonometric integrals such as

$$ I_1 = \int_0^{2\pi} \frac{d\theta}{5 + 4\cos\theta}, $$

the substitution $z = e^{i\theta}$ will convert $I_1$ into a contour
integral around the unit circle, |z| = 1. Using $d\theta/dz = 1/(iz)$
and $\cos\theta = \tfrac12 (z + z^{-1})$, we obtain

$$ I_1 = \frac{1}{2i} \int_{|z|=1} \frac{dz}{(z + 2)(z + \tfrac12)}. $$

The integrand is analytic apart from simple poles at
z = -2 and z = -1/2. The latter is inside the contour; its
residue is 2/3 (use (22)). Cauchy’s residue theorem (21)
therefore gives $I_1 = \tfrac23 \pi$.

The method just described requires that the range of
integration for θ have length 2π and that the resulting
integrand have only isolated singularities (not branch
points) inside |z| = 1.
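
The value $I_1 = 2\pi/3$ can be checked in a few lines. The Python sketch below (an illustration added here, with hypothetical names `trapezoid_periodic`, `numeric`, and `by_residues`) compares direct quadrature of the real integral with the residue calculation just described.

```python
import math


def trapezoid_periodic(g, n=200):
    """Trapezoidal rule for a 2*pi-periodic integrand over [0, 2*pi]."""
    h = 2.0 * math.pi / n
    return h * sum(g(k * h) for k in range(n))


# Direct quadrature of the real integral I_1.
numeric = trapezoid_periodic(lambda theta: 1.0 / (5.0 + 4.0 * math.cos(theta)))

# Residue of 1/((z + 2)(z + 1/2)) at the interior pole z = -1/2, from (22).
residue = 1.0 / (-0.5 + 2.0)

# I_1 = (1/(2i)) * 2*pi*i * residue, from (21).
by_residues = (1 / 2j) * (2j * math.pi) * residue
```

Both routes give $2\pi/3 \approx 2.0944$; the imaginary part of `by_residues` is zero, as it must be for a real integral.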

For a second example, consider

$$ I_2 = \int_{-\infty}^{\infty} f(x)\,dx \quad \text{with} \quad f(x) = \frac{1}{x^4 + 1}. $$

In order to use the residue theorem, we need a closed
contour C, so we try $\int_C f(z)\,dz$ with C consisting of
a piece of the real axis from z = -R to z = R and a
semicircle $C_R$ in the upper half-plane of radius R and
centered at z = 0. Then

$$ \int_{-R}^{R} f(x)\,dx + \int_{C_R} f(z)\,dz = 2\pi i \times (\text{sum of residues at poles inside } C). \tag{26} $$

After calculation of the residues, we let R → ∞, so that
the first integral → $I_2$. We will see in a moment that the
second integral → 0 as R → ∞.

Now, $z^4 + 1 = 0$ at $z = z_n = \exp(i(2n + 1)\pi/4)$, n =
0, 1, 2, 3. These are simple poles of f(z) with residue
$1/(4 z_n^3)$ (use (23) with p = 1, $q = z^4 + 1$). The poles $z_0$
and $z_1$ are in the upper half-plane. Hence the right-hand
side of (26) is $\pi/\sqrt{2}$, and this is $I_2$.

For $z \in C_R$ we parametrize using $z(t) = R e^{it}$, $0 \le t \le \pi$.
We see that f(z) decays as $R^{-4}$, whereas the length
of $C_R$, πR, increases; overall, $\int_{C_R} f\,dz$ decays as $R^{-3}$.
This rough argument can be made precise.
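
The residue sum in (26) is short enough to evaluate directly. The following Python sketch is an added illustration (the names `poles`, `upper`, `I2`, and `arc_bound` are assumptions); it also includes the elementary bound $\pi R/(R^4 - 1)$ on the semicircle contribution, which follows from $|z^4 + 1| \ge R^4 - 1$ on $C_R$ for R > 1 and is one way of making the rough decay argument precise.

```python
import cmath
import math

# The four simple poles z_n = exp(i(2n+1)pi/4) of f(z) = 1/(z^4 + 1).
poles = [cmath.exp(1j * (2 * n + 1) * math.pi / 4) for n in range(4)]

# Only the poles in the upper half-plane lie inside the closed contour C.
upper = [z for z in poles if z.imag > 0]

# Residue at each simple pole is 1/(4 z^3), from (23).
residue_sum = sum(1 / (4 * z**3) for z in upper)
I2 = 2j * math.pi * residue_sum


def arc_bound(R):
    """Bound on |integral of f over C_R|: arc length pi*R times max |f| <= 1/(R^4 - 1)."""
    return math.pi * R / (R**4 - 1)
```

Here `I2` agrees with $\pi/\sqrt{2} \approx 2.2214$ up to rounding, and `arc_bound(R)` decays like $R^{-3}$, consistent with the estimate in the text.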

If we replace f(z) by $e^{ikz} f(z)$, we can evaluate
Fourier transforms such as

$$ I_3 = \int_{-\infty}^{\infty} e^{ikx} f(x)\,dx \quad \text{with} \quad f(x) = \frac{1}{x^2 + 1}, $$

where k is a real parameter. However, some care is
needed; as $e^{ikz} = e^{ikx} e^{-ky}$, we have exponential decay
as y → ∞ when k > 0 but exponential growth when
k < 0. Therefore, we use $C_R$ when k > 0, but we close
using a semicircle in the lower half-plane when k < 0.
We find that $I_3 = \pi e^{-|k|}$.

Laplace transforms (2) can be inverted using

$$ g(t) = \frac{1}{2\pi i} \int_{c - i\infty}^{c + i\infty} f(z)\, e^{zt}\,dz. $$

The contour (called the Bromwich contour) is parallel
to the y-axis in the z-plane. The constant c is chosen
so that all the singularities of f(z) are to the left of
the contour. If f(z) has poles only, the integral can be
evaluated by closing the contour using a large semicircle
on the left.

There are many other applications of contour-inte- 
gral methods to the evaluation of integrals. They can 
also be used to find the sums of infinite series. Inte- 
grands containing branch points can also be consid- 
ered. In all cases, one may need some ingenuity in 
selecting an appropriate closed contour and/or the 
function f(z).

12 Conformal Mapping 

Suppose that f(z) is analytic for $z \in D$. We can regard
f as a mapping, taking points z = x + iy to points
w = f(z) = u + iv; denote the set of such points in the
w-plane for all points $z \in D$ by R. Given f and D, we
can determine R. More interestingly, given the regions
D and R, can we find an analytic function f that maps
D onto R? The Riemann mapping theorem asserts that
any simply connected region D can be mapped to the
unit disk |w| < 1. (A region bounded by a simple closed
curve is simply connected if it does not contain any
holes.) The analytic function f effecting the mapping
is called a conformal mapping [II.5]; two small lines
meeting at a point $z_0 \in D$ will be mapped into two
small lines meeting at a point $w_0 = f(z_0) \in R$, and the
angles between the two pairs of lines will be equal. The
conformality property holds for all $z_0 \in D$ except for
critical points (where $f'(z_0) = 0$ or ∞). Many conformal
mappings are known (there are dictionaries of them),
but constructing them for regions D with complicated
shapes or holes remains a challenge. Once a conformal
mapping is available, it can be used to solve boundary-value
problems for Laplace’s equation (6), for example.


13 Analytic Continuation 

Return to the geometric series (1). Denote the infinite
series on the left-hand side by f(z), with domain D
(|z| < 1). Denote the sum on the right-hand side by
g(z) = 1/(1 - z), with domain D′ ($z \neq 1$). We observe
that f(z) is analytic in D whereas g(z) is analytic in the
larger region D′. As f(z) = g(z) for $z \in D$, we say that
g is the analytic continuation of f into D′. In practice,
we do not usually distinguish between f and g; we just
say that g(z) is analytic for $z \in D'$ and that it can be
defined for $z \in D \subset D'$ using f(z). This point of view
is surprisingly powerful.

There are several aspects to this, and it raises several 
questions. To begin with, suppose we are given / and 
D and we want to find g outside D. There are analyti- 
cal and numerical methods available for doing so. For 
example, we could use a chain of overlapping disks with 
a Taylor expansion about each center. The result will
be locally unique (each step in the chain gives a unique 
result) but, if g has a branch point, we could step onto 
another branch and thus lose global uniqueness. 

Often, we do not know D′; typical analytic continuations
will have singularities. For example, the gamma
function, Γ(z), is defined by the integral (3) for Re z > 0;
in this half-plane, Γ is analytic. If we continue Γ(z) into
Re z ≤ 0, we find that there are simple poles at z = -N,
N = 0, 1, 2, ... (so that D′ is the whole complex plane
with the points z = -N removed). Explicitly, we can use
Hankel’s loop integral:

$$ \Gamma(z) = \frac{1}{2i \sin(\pi z)} \int_C t^{z-1} e^{t}\,dt. $$

This is a contour integral in the complex t-plane. There
is a cut along the negative real-t axis. The branch of $t^z$ is
chosen so that $t^z = e^{z \log t}$ when t is real and positive.
The contour starts at Re t = -∞, below the cut, goes
once around t = 0, and then returns to Re t = -∞ above
the cut.

There are also loop integrals for the Riemann zeta
function, ζ(z), defined initially for Re z > 1 by the
series (9). Thus, it turns out that ζ(z) can be analytically
continued into the whole z-plane apart from a
simple pole at z = 1.

14 Differential Equations 

We usually think of a differential equation as being
something to be solved for a real function of a real variable.
However, it can be advantageous to “complexify”
the problem. One good reason is that we may be able to
construct solutions using a power-series expansion (8),
and we know that the convergence of such a series is
governed by singularity locations. (More generally, we
could use the “method of Frobenius.”) For example, one
solution of Airy’s equation, $w''(z) = z\,w(z)$, is

$$ w(z) = 1 + \frac{1}{3!} z^3 + \frac{1 \cdot 4}{6!} z^6 + \frac{1 \cdot 4 \cdot 7}{9!} z^9 + \cdots, $$

which defines an entire function of z.
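
The series coefficients can be generated from the recurrence obtained by substituting $w = \sum_m c_m z^m$ into $w'' = zw$, namely $(m+2)(m+3)\,c_{m+3} = c_m$. The Python sketch below is an added illustration (function names are hypothetical) that checks, in exact rational arithmetic, that this recurrence reproduces the coefficients $1 \cdot 4 \cdots (3k-2)/(3k)!$ displayed above.

```python
from fractions import Fraction
from math import factorial


def airy_series_coefficients(terms):
    """Coefficients of z^0, z^3, z^6, ... from (m+2)(m+3) c_{m+3} = c_m with c_0 = 1."""
    coeffs = [Fraction(1)]
    for k in range(1, terms):
        m = 3 * (k - 1)  # power of z belonging to the previous coefficient
        coeffs.append(coeffs[-1] / ((m + 2) * (m + 3)))
    return coeffs


def closed_form(k):
    """The displayed coefficient of z^(3k): 1*4*7*...*(3k-2) / (3k)!."""
    numerator = 1
    for j in range(1, k + 1):
        numerator *= 3 * j - 2
    return Fraction(numerator, factorial(3 * k))


from_recurrence = airy_series_coefficients(6)
from_formula = [closed_form(k) for k in range(6)]
```

Because consecutive coefficients differ by a factor of 1/((3k-1)(3k)), the ratio test shows that the series converges for every z, in line with the statement that w is entire.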

We may be able to write solutions as contour integrals,
which then offers possibilities for further analysis.
For example, solutions of Airy’s equation can be
written (or sought) in the form

$$ w(z) = \int_C e^{-zt + t^3/3}\,dt, $$

where C is a carefully chosen contour in the complex
t-plane.

The study of linear differential equations is a well-established
branch of complex analysis, especially in
the context of the classical special functions [IV.7]
(e.g., Bessel functions and hypergeometric functions).
Nonlinear differential equations and their associated
special functions are also of interest. For example, there
are the six Painlevé equations [III.24], the simplest
being $w''(z) = 6w^2 + z$; their solutions, known as
Painlevé transcendents, have a variety of physical applications,
but their properties are not well understood.


15 Cauchy Integrals

Let C be a simple closed smooth contour. Denote the
interior of C by $D_+$ and the exterior by $D_-$. Define a
function F(z) by the Cauchy integral

$$ F(z) = \frac{1}{2\pi i} \int_C \frac{g(\tau)}{\tau - z}\,d\tau, \quad z \notin C, \tag{27} $$

where g(t) is defined for $t \in C$. For example, if g(t) =
1, $t \in C$, then

$$ \frac{1}{2\pi i} \int_C \frac{d\tau}{\tau - z} = \begin{cases} 1, & z \in D_+, \\ 0, & z \in D_-. \end{cases} \tag{28} $$

The integral in (27) is similar to that which appears
in Cauchy’s integral formula (12), except we are not
given any information about g(t) when $t \notin C$. Nevertheless,
under mild conditions on g, F(z) is analytic
for $z \in D_+ \cup D_-$, and F(z) → 0 as z → ∞. What
are the values of F on C? The example (28) suggests
that we should expect F(z) to be discontinuous as z
crosses C. Therefore, we consider the limits of F(z) as
z approaches C (if they exist), and write

$$ F_\pm(t) = \lim_{z \to t} F(z) \quad \text{with } z \in D_\pm \text{ and } t \in C. \tag{29} $$

For the example (28), $F_+(t) = 1$ and $F_-(t) = 0$.
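
The jump in (28) is easy to observe numerically. The Python sketch below is an added illustration that assumes C is the unit circle (the text allows any simple closed smooth contour) and uses a hypothetical helper `cauchy_integral_of_one`; it evaluates the Cauchy integral with g = 1 at one point inside and one point outside C.

```python
import cmath
import math


def cauchy_integral_of_one(z, samples=400):
    """Approximate (1/(2*pi*i)) * integral over |tau| = 1 of d tau / (tau - z).

    With tau = exp(i*theta), d tau = i*tau*dtheta, so the integral is the
    average of tau / (tau - z) over equally spaced points on the circle.
    The point z must not lie on the circle.
    """
    total = 0j
    for k in range(samples):
        tau = cmath.exp(2j * math.pi * k / samples)
        total += tau / (tau - z)
    return total / samples


inside = cauchy_integral_of_one(0.5 + 0.2j)   # z in D_+
outside = cauchy_integral_of_one(2.0 - 1.0j)  # z in D_-
```

The two values are 1 and 0 to rounding error, so F is discontinuous across C even though g is perfectly smooth.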


Notice that we cannot simply put $z = t \in C$ on the
right-hand side of (27); the resulting integral diverges.
However, if g is differentiable at t (in fact, Hölder
continuity is sufficient), we can define the Cauchy
principal-value integral

$$ \mathrm{p.v.}\!\int_C \frac{g(\tau)}{\tau - t}\,d\tau = \lim_{\varepsilon \to 0} \int_{C_\varepsilon} \frac{g(\tau)}{\tau - t}\,d\tau, \quad t \in C, $$

where $C_\varepsilon$ is obtained from C as follows: draw a little
circle of radius ε, centered at $t \in C$, and then remove
the piece of C inside the circle.

Using this definition, define

$$ F(t) = \frac{1}{2\pi i}\, \mathrm{p.v.}\!\int_C \frac{g(\tau)}{\tau - t}\,d\tau, \quad t \in C. $$

This function is related to $F_\pm(t)$, defined by (29), by the
Sokhotski-Plemelj formula:

$$ F_\pm(t) = \pm\tfrac12 g(t) + F(t), \quad t \in C. \tag{30} $$

This describes the “jump behavior” of the Cauchy
integral F(z) as z crosses C. In particular,

$$ F_+(t) - F_-(t) = g(t), \quad t \in C. \tag{31} $$


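One elegant consequence of (30) is that the solution,
w, of the singular integral equation

$$ \frac{1}{\pi i}\, \mathrm{p.v.}\!\int_C \frac{w(\tau)}{\tau - t}\,d\tau = g(t), \quad t \in C, $$

is given by the formula

$$ w(t) = \frac{1}{\pi i}\, \mathrm{p.v.}\!\int_C \frac{g(\tau)}{\tau - t}\,d\tau, \quad t \in C. $$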


16 The Riemann-Hilbert Problem 

Let $D_\pm$ and C be as in section 15. Suppose that two
functions, G(t) and g(t), are given for $t \in C$. Then, the
basic Riemann-Hilbert problem is to find two functions
$\Phi_+(z)$ and $\Phi_-(z)$, with $\Phi_\pm$ analytic in $D_\pm$, that satisfy

$$ \Phi_+(t) = G(t)\,\Phi_-(t) + g(t), \quad t \in C, \tag{32} $$

where $\Phi_+(t)$ and $\Phi_-(t)$ are defined as in (29); conditions
on $\Phi_-(z)$ as z → ∞ are usually imposed too.
(There is a variant where C is not closed; in this case, the
behavior near the endpoints of C plays a major role.)

When G ≡ 1, we can solve (32) using a Cauchy integral
and (31). When G ≢ 1, we start with the following
homogeneous problem (g ≡ 0).

Find functions $K_+$ and $K_-$, with $K_\pm$ analytic in $D_\pm$,
that satisfy

$$ K_+(t) = G(t)\,K_-(t), \quad t \in C. \tag{33} $$

Suppose we can find such functions and that they do
not vanish. Then, eliminating G from (32) gives

$$ \frac{\Phi_+(t)}{K_+(t)} - \frac{\Phi_-(t)}{K_-(t)} = \frac{g(t)}{K_+(t)}, $$

which, again, we can solve using a Cauchy integral
and (31).

The problem of finding $K_\pm$ is more delicate. At
first sight, we could take the logarithm of (33), giving
$\log K_+ - \log K_- = \log G$. This looks similar to (31), but it
usually happens that log G(t) is not continuous for all
$t \in C$, which means that we cannot use (30). However,
this difficulty can be overcome.

The problem of finding $K_\pm$ such that (33) is satisfied
is also the key step in the Wiener-Hopf technique (a
method for solving linear PDEs with mixed boundary
conditions and semi-infinite geometries). In that context,
a typical problem would be: factor a given function
L(z) as $L(z) = L_+(z) L_-(z)$, where $L_+(z)$ is analytic
in an upper half-plane, Im z > a, $L_-(z)$ is analytic
in a lower half-plane, Im z < b, and a < b so that the
two half-planes overlap. There are also related problems
where L is a 2 × 2 or 3 × 3 matrix; it is not currently
known how to solve such matrix Wiener-Hopf
problems except in some special cases.

17 Closing Remarks

Complex analysis is a rich, deep, and broad subject 
with a history going back to Cauchy in the 1820s. 
Inevitably, we have omitted some important topics, 
such as approximation theory in the complex plane and 
analytic number theory. There are numerous fine text- 
books, a few of which are listed below. However, do not 
get the impression that complex analysis is a dead sub- 
ject; it is not. In this article we have tried to cover the 
basics, with some indications of where problems and 
opportunities remain.

IV.2 Ordinary Differential Equations

James D. Meiss 


1 Introduction 

Differential equations are near-universal models in applied
mathematics. They encapsulate the idea that
change occurs incrementally but at rates that may
depend upon the state of the system. A system of ordinary
differential equations (ODEs) prescribes the rate
of change of a set of functions, $y(t) = (y_1(t), y_2(t), \ldots, y_k(t))$,
that depend upon a single variable t, which
may be real or complex. The functions $y_j$ are the dependent
variables of the system, and t is the independent
variable. (If there is more than one independent variable,
then the system becomes a partial differential
equation [IV.3] (PDE).) Perhaps the most famous ODE
is Newton’s second law of motion,

$$ m\ddot{y} = F(y, \dot{y}, t), $$

which relates the acceleration of the center of mass
$y \in \mathbb{R}^3$ of a body of mass m to an externally applied
force F. This force commonly depends upon the position
of the body, y; its velocity, $\dot{y}$ (e.g., electromagnetic
or damping forces); and perhaps upon time, t
(e.g., time-varying external control). The force may also
depend upon positions of other bodies; a prominent
example is the n-body problem [VI.16] of gravitation.
We will follow the convention of denoting the first
derivative by $\dot{y}$ or $y'$, the second by $\ddot{y}$ or $y''$, and, in
general, the kth by $y^{(k)}$.

Newton’s law is a system of second-order ODEs. More
generally, an ODE system is of nth order if it involves
the first n derivatives of a k-dimensional vector y;
formally, therefore, it is a relation of the form

$$ G(y, y^{(1)}, y^{(2)}, \ldots, y^{(n)}; t) = 0. \tag{1} $$

An example is Clairaut’s differential equation for a
scalar function y(t):

$$ -y + t\dot{y} + g(\dot{y}) = 0. \tag{2} $$

This is a first-order ODE since it involves only the
first derivative of y. Equations like this are implicit,
since they can be viewed as implicitly defining the
highest derivative as a function of y and its lower
derivatives. Clairaut’s equation depends explicitly on
the independent variable; such ODEs are said to be
nonautonomous.

Clairaut showed in 1734 that (2) has a particularly
simple family of solutions: y(t) = ct + g(c) for any
$c \in \mathbb{R}$. While it might not be completely obvious how to
find this solution, it is easy to verify that it does solve
(2) by simple substitution since, on the proposed solution,
$\dot{y} = c$, we have $-y + t\dot{y} + g(\dot{y}) = -(ct + g(c)) + tc + g(c) = 0$.
More generally, a solution of the ODE (1) on an
interval (a, b) is a function y(t) that makes (1) identically
zero for all $t \in (a, b)$. More specifically, a solution
may be required to solve an initial-value problem (IVP)
or a boundary-value problem (see section 5). For the
first-order case, the former means finding a function
with a given value, $y(t_0) = y_0$, at a given “initial” time
$t_0$. For example, the family of solutions to (2) satisfies
the initial condition $y(0) = y_0$ so long as there is a c
such that $y_0 = g(c)$, i.e., $y_0$ is in the range of g. A family
y(t; c) that satisfies an IVP for a domain of initial
values is known as a general solution.

Apart from these families of solutions, implicit
ODEs can also have singular solutions. For example,
(2) also has the solution defined parametrically by
$(t(s), y(s)) = (-g'(s),\, g(s) - s g'(s))$. Again, it is easy
to verify that this is a solution to (2) by substitution
(and implicit differentiation), but it is perhaps not obvious
how to find it. Lagrange showed that some singular
solutions of an implicit ODE can be found as envelopes
of the general solutions, but the general theory
was developed later by Cayley and Darboux.
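
Both kinds of solution of Clairaut’s equation (2) can be verified by substitution. The Python sketch below is an added illustration under the assumption $g(p) = p^2$ (any smooth g with $g'' \neq 0$ would do); the names `clairaut_residual`, `g`, and `g_prime` are hypothetical. For the singular solution it uses the fact that $dy/dt = s$ along the parametrized curve, which follows from differentiating $t(s)$ and $y(s)$ with respect to s.

```python
def clairaut_residual(t, y, ydot, g):
    """Left-hand side of Clairaut's equation (2): -y + t*ydot + g(ydot)."""
    return -y + t * ydot + g(ydot)


def g(p):
    # Assumed example nonlinearity; the source leaves g general.
    return p * p


def g_prime(p):
    return 2.0 * p


# General solution y = c*t + g(c), with slope ydot = c.
general = [
    clairaut_residual(t, c * t + g(c), c, g)
    for c in (-2.0, 0.5, 3.0)
    for t in (-1.0, 0.0, 2.5)
]

# Singular solution (t, y) = (-g'(s), g(s) - s*g'(s)), with slope dy/dt = s.
singular = [
    clairaut_residual(-g_prime(s), g(s) - s * g_prime(s), s, g)
    for s in (-1.5, 0.25, 2.0)
]
```

With this choice of g the singular solution is the parabola $y = -t^2/4$, which is the envelope of the straight lines y = ct + c².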

The classical theory of ODEs, originating with Newton
in his Method of Fluxions in 1671, has as its goal
the construction of the general and singular solutions
of an ODE in terms of elementary functions. However,
in most cases, ODEs do not have such explicit solutions.
Indeed most of the well-known special functions
[IV.7] of mathematics are defined as solutions
of differential equations. For example, the Bessel function
[III.2] $J_n(x)$ is defined to be the unique solution of
the second-order, explicit, nonautonomous, scalar IVP

$$ x^2 y'' + x y' + (x^2 - n^2) y = 0, \qquad y(0) = \delta_{n,0}, \quad y'(0) = \tfrac12 \delta_{n,1}, \tag{3} $$

where $\delta_{i,j}$, the Kronecker delta, is nonzero only when
i = j and $\delta_{i,i} = 1$. This equation arises from a number
of PDEs through separation of variables. Many
of the properties of $J_n$ (e.g., its power-series expansion,
asymptotic behavior, etc.) are obtained by direct
manipulation of this ODE.

In most applications, the ODE (1) can be written in
the explicit form

$$ \frac{d^n}{dt^n} y = H(y, \dot{y}, \ddot{y}, \ldots, y^{(n-1)}, t). \tag{4} $$

Such systems can always be converted into a system
of first-order ODEs. For example, if we let
$x = (y, \dot{y}, \ldots, y^{(n-1)}, t)$ denote a list of d = nk + 1 variables,
it is then easy to see that (4) can be rewritten as
the autonomous first-order system

$$ \dot{x} = f(x) \tag{5} $$

for a suitable f. Every coupled set of k, nth-order,
explicit ODEs can be written in the form (5).¹ The ODE
(5) is a common form in applications, e.g., in population
models of ecology or in Hamiltonian dynamics. In
general, $x \in M$, where M is a d-dimensional manifold
called the phase space. For example, the phase space
of the planar pendulum is a cylinder with $x = (\theta, p_\theta)$,
where $\theta$ and $p_\theta$ are the angle and angular momentum,
respectively.

In general, the function f in (5) gives a velocity vector
(an element of the tangent space TM) for each point in
the manifold M; thus $f : M \to TM$. Such a function is a
vector field. A solution $\varphi : (a, b) \to M$ of (5) is a differentiable
curve $x(t) = \varphi(t)$ in M with velocity $f(\varphi(t))$;
it is everywhere tangent to f. Given such a curve it is
trivial to check to see if it solves (5); by contrast, the
construction of solutions is a highly nontrivial task. A
general solution of (5) has the form $x(t) = \varphi(t; c)$.
Here, $c \in \mathbb{R}^d$ is a set of d parameters such that, for
each $t_0 \in (a, b)$ and each initial condition $x_0 \in M$, the
equation $\varphi(t_0; c) = x_0$ can be solved for c. The search
for explicit, general solutions of (5) is, in most cases, quixotic.

2 First-Order Differential Equations 

Many techniques were developed through the first
half of the eighteenth century for obtaining analytical
solutions of first-order ODEs. The equations were
often motivated by mechanical problems such as the
isochrone (find a pendulum whose period is independent
of amplitude), which was solved by James
Bernoulli in 1690, and da Vinci's catenary (find the
shape of a suspended cable), which was solved by John
Bernoulli in 1691.

1. Though (5) is autonomous, the study of nonautonomous equations
per se is not without merit. For example, stability of periodic
orbits is most fruitfully studied as a nonautonomous linear problem.

Figure 1 Solutions of the logistic ODE (7). The equilibria
N = 0 and N = K are the horizontal lines.

During this period a number of methods were devised
that can be applied to general categories of systems.
In 1691 Leibniz formulated the method of separation
of variables: the formal solution of the ODE
dy/dx = g(x)h(y) has the implicit form

$$ \int \frac{dy}{h(y)} = \int g(x)\,dx. \tag{6} $$

Any autonomous first-order ODE is separable. For
example, for the logistic population model [III.19]

$$ \dot{N} = rN(1 - N/K), \tag{7} $$

the integrals can be performed and the result solved
for N to obtain

$$ N(t) = \frac{N_0 K}{N_0 + (K - N_0) e^{-rt}}. \tag{8} $$


Representative solutions are sketched in figure 1. Note
that, whenever $N_0 > 0$, this solution tends, as t → ∞,
to K, the “carrying capacity” of the environment. This
value and N = 0 are the two equilibria of (7), since the
vector field vanishes at these points.
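
The closed form (8) can be compared with a direct numerical integration of (7). The Python sketch below is an added illustration: the parameter values, the function names, and the choice of a classical fourth-order Runge-Kutta step are all assumptions made for the example, not part of the original text.

```python
import math


def logistic_exact(t, n0, r, cap):
    """Closed-form solution (8) with initial value n0 and carrying capacity cap."""
    return n0 * cap / (n0 + (cap - n0) * math.exp(-r * t))


def logistic_rk4(t_end, n0, r, cap, steps=1000):
    """Integrate dN/dt = r*N*(1 - N/cap) from t = 0 to t_end with classical RK4."""
    f = lambda n: r * n * (1.0 - n / cap)
    h = t_end / steps
    n = n0
    for _ in range(steps):
        k1 = f(n)
        k2 = f(n + 0.5 * h * k1)
        k3 = f(n + 0.5 * h * k2)
        k4 = f(n + h * k3)
        n += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return n


# Hypothetical parameters: r = 1, K = 10, N_0 = 1.
numeric = logistic_rk4(5.0, 1.0, 1.0, 10.0)
exact = logistic_exact(5.0, 1.0, 1.0, 10.0)
long_time = logistic_exact(50.0, 1.0, 1.0, 10.0)
```

The values `numeric` and `exact` agree closely, and `long_time` is indistinguishable from the carrying capacity K = 10, illustrating the approach to the equilibrium N = K.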

The differential equation N(x,y)y′ + M(x,y) = 0
can be formally rewritten as the vanishing of a differential
one-form, M(x,y) dx + N(x,y) dy = 0. In
1739 Clairaut solved such equations when this one-form
is exact, that is (for $\mathbb{R}^2$), when $\partial M/\partial y = \partial N/\partial x$.
In 1734 Euler had already developed the more general
method of integrating factors: if one can devise a function
F(x,y) such that the form F(M dx + N dy) equals
the total differential

$$ dH = \frac{\partial H}{\partial x}\,dx + \frac{\partial H}{\partial y}\,dy $$

of a function H(x,y), then the solutions to dH = 0
lie on contours of H(x,y). As an example, the integral
curves of a Hamiltonian system with one degree of
freedom,

$$ \dot{x} = \frac{\partial H}{\partial y}(x, y), \qquad \dot{y} = -\frac{\partial H}{\partial x}(x, y), \tag{9} $$

are those curves that are everywhere tangent to the
velocity; equivalently, they are orthogonal to the gradient
vector $\nabla H = (\partial H/\partial x, \partial H/\partial y)$. Denoting the
infinitesimal tangent vector by (dx, dy), this requirement
becomes the exact one-form $(dx, dy) \cdot \nabla H = 0$.
Its solutions lie on contours H(x,y) = E with constant
“energy.” This method gives the phase curves
or trajectories of the planar system but does not provide
the time-dependent functions x(t) and y(t). However,
using the constancy of H, the ODE for x, say,
becomes $\dot{x} = \partial_y H(x, y(x; E))$, a separable first-order
equation whose solution can be obtained up to the
quadrature (6).

The technique of substitution was also used to solve
many special cases (just as for integrating factors,
there is no general prescription for finding an appropriate
substitution). For example, James Bernoulli’s
nonautonomous, first-order ODE

$$ y' = P(x)\,y + Q(x)\,y^n $$

is linearized by the change of variables $z = y^{1-n}$. Similarly,
Leibniz showed that the degree-zero, homogeneous
equation y′ = G(y/x) becomes separable with
the substitution y(x) = x v(x).

Discussion of these and other methods can be found 
in various classic texts, such as Ince (1956). 


3 Linear ODEs 


In 1743 Euler showed how to solve the nth-order linear
constant-coefficient equation

$$ \sum_{j=0}^{n} a_j \frac{d^j y}{dt^j} = 0 \tag{10} $$

by using the exponential ansatz, $y = e^{rt}$, to reduce the
ODE to the nth-degree characteristic equation
$p(r) = \sum_{j=0}^{n} a_j r^j = 0$. Each root, $r_k$, of p provides a solution
$y(t) = e^{r_k t}$. Linearity implies that a superposition of
these solutions, $y(t) = \sum_k c_k e^{r_k t}$, is also a solution
for any constant coefficients $c_k$. When p(r) has a root
$r^*$ of multiplicity m > 1, Euler’s reduction-of-order
method suggests the further ansatz $y(t) = e^{r^* t} u(t)$.
This provides new solutions when u satisfies $u^{(m)} = 0$,
which has as its general solution a degree-(m - 1) polynomial
in t. The general solution therefore becomes a
superposition of n linearly independent functions of
the form $t^{\ell} e^{r_k t}$. Even when the ODE is real, the roots
$r_k = \alpha_k + i\beta_k$ may be complex. In this case, the conjugate
root can be used to construct real solutions of
the form $t^{\ell} e^{\alpha_k t} \cos(\beta_k t)$ and $t^{\ell} e^{\alpha_k t} \sin(\beta_k t)$ for
$\ell = 0, \ldots, m - 1$. A superposition of these real solutions has
n arbitrary real constants $c_k$, and since the functions
are independent, there is a choice of these constants
that solves the IVP $y^{(k)}(t_0) = b_k$, k = 0, ..., n - 1, for
arbitrarily specified values $b_k$.

More generally, when (5) is linear, it reduces to

$$ \dot{x} = Ax \tag{11} $$

for a constant, n × n matrix A. Formally, the general
solution of this system can be written as the matrix
exponential: $x(t) = e^{tA} x(0)$. As for more general functions
of matrices [II.14], we can view this as defining
the symbol $e^{tA}$ as the solution of the ODE. More explicitly,
this exponential is defined by the same convergent
Maclaurin series as $e^{at}$ for scalar a. If A is semisimple
(i.e., if it has a complete eigenvector basis), then A is
diagonalized by the matrix P whose columns are eigenvectors:
$A = P \Lambda P^{-1}$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the
diagonal matrix of eigenvalues. In this case,

$$ e^{tA} = P\, \mathrm{diag}(e^{\lambda_1 t}, \ldots, e^{\lambda_n t})\, P^{-1}. $$

More generally, $e^{tA}$ also contains powers of t, generalizing
the simpler, scalar situation.
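
The two descriptions of $e^{tA}$ (Maclaurin series and diagonalization) can be compared on a small example. The Python sketch below is an added illustration using a hypothetical 2 × 2 matrix with eigenvalues -1 and -2; the helper names and the truncation at 40 terms are assumptions, and the eigenvector matrix P and its inverse were worked out by hand for this particular A.

```python
import math


def matmul(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def expm_series(A, t, terms=40):
    """Truncated Maclaurin series I + tA + (tA)^2/2! + ... for a 2x2 matrix A."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    tA = [[t * a for a in row] for row in A]
    for k in range(1, terms):
        # term now holds (tA)^k / k!
        term = [[x / k for x in row] for row in matmul(term, tA)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    return result


def expm_diagonalized(t):
    """e^{tA} = P diag(e^{-t}, e^{-2t}) P^{-1} for the example matrix A below."""
    P = [[1.0, 1.0], [-1.0, -2.0]]       # columns: eigenvectors for -1 and -2
    P_inv = [[2.0, 1.0], [-1.0, -1.0]]
    D = [[math.exp(-t), 0.0], [0.0, math.exp(-2.0 * t)]]
    return matmul(matmul(P, D), P_inv)


A = [[0.0, 1.0], [-2.0, -3.0]]  # companion matrix of y'' + 3y' + 2y = 0
series = expm_series(A, 1.5)
diagonal = expm_diagonalized(1.5)
```

The matrix A is the first-order form of the scalar equation y″ + 3y′ + 2y = 0, whose characteristic roots -1 and -2 are exactly the eigenvalues used in the diagonalization, tying (11) back to (10).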

The nonhomogeneous linear system

$$ \dot{x} = Ax + g(t) $$

with forcing function $g \in \mathbb{R}^n$ can be solved by
Lagrange’s method of variation of parameters. The idea
is to replace the parameters x(0) in the homogeneous
solution by functions u(t). Substitution of $x(t) = e^{tA} u(t)$
into the ODE permits the unknown functions
to be isolated and yields the integral form

$$ x(t) = e^{tA} x(0) + \int_0^t e^{(t-s)A} g(s)\,ds. $$
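
In the scalar case the variation-of-parameters formula can be checked against a solution found by hand. The Python sketch below is an added illustration that assumes the example $\dot{x} = -x + \cos t$ with x(0) = 2 and evaluates the integral by composite Simpson quadrature; the function names and parameter values are hypothetical.

```python
import math


def simpson(f, a, b, n=200):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0


def forced_solution(t, a, x0, g):
    """Scalar form of x(t) = e^{ta} x(0) + integral_0^t e^{(t-s)a} g(s) ds."""
    homogeneous = math.exp(a * t) * x0
    particular = simpson(lambda s: math.exp((t - s) * a) * g(s), 0.0, t)
    return homogeneous + particular


t, a, x0 = 3.0, -1.0, 2.0
numeric = forced_solution(t, a, x0, math.cos)

# Hand-derived solution of x' = -x + cos t with x(0) = x0.
exact = x0 * math.exp(-t) + 0.5 * (math.cos(t) + math.sin(t) - math.exp(-t))
```

For a matrix A the same structure applies with the scalar exponential replaced by the matrix exponential $e^{(t-s)A}$.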

Solution of linear, nonautonomous ODEs, $\dot{x} = A(t)x$,
is much more difficult. The source of the difficulty
is that A(t) does not generally commute with A(s)
when $t \neq s$. Indeed, $e^A e^B \neq e^{A+B}$ unless the matrices
do commute (the Baker-Campbell-Hausdorff theorem
from Lie theory gives a series expansion for the product).
The special case of a time-periodic family of matrices
A(t) = A(t + T) can be solved. Floquet showed
that the general solution for this case takes the form
$x(t) = P(t) e^{tB} x(0)$, where B is a real, constant matrix
and P(t) is a periodic matrix with period 2T. One much-studied
example of this form is Mathieu’s equation
[III.21].

More generally, finding a transformation to a set of 
coordinates in which the effective matrix is constant is 
called the reducibility problem ; even for a quasiperiodic 
dependence on time, this is nontrivial. 

4 Singular Points 

Consider the linear ODE (10), now allowing the coefficients
$a_j(t)$ to be analytic functions of $t \in \mathbb{C}$ so
that it is nonautonomous. Cauchy showed that, if the
coefficients are analytic in a neighborhood of $t_0$ and
if $a_n(t_0) \neq 0$, this ODE has $n$ independent analytic
solutions. The coefficients of the power series of $y$
can be determined from a recursion relation upon
substitution of a series for $y$ into the ODE.

A point at which some of the ratios $a_j(t)/a_n(t)$ are
singular is a (fixed) singular point of the ODE, and the
solution need not be analytic at $t_0$. There are two distinct
cases. A singular point is regular if $a_{n-j}(t)/a_n(t)$
has at most a $j$th-order pole for each $j = 1, \ldots, n$. In
this case, there is an $r \in \mathbb{C}$ such that there is at least
one solution of the form $y(t) = (t - t_0)^r\phi(t)$ with $\phi$
analytic at $t_0$. Additional solutions may also have logarithmic
singularities. An ODE for which all singular
points are regular is called Fuchsian.

Most of the special functions [IV.7] of mathematical
physics are defined as solutions of second-order linear
ODEs with regular singular points. Many are special
cases of the hypergeometric equation

$$z(1 - z)w'' + (\gamma - (\alpha + \beta + 1)z)w' - \alpha\beta w = 0 \qquad (12)$$

for a complex-valued function $w(z)$. This ODE has regular
singular points at $z = 0$, $1$, and $\infty$ (the latter
obtained upon transforming the independent variable
to $u = 1/z$). For the singular point at 0, following Frobenius,
we make the ansatz that the solution has the form
of a series:

$$w(z) = z^r\sum_{j=0}^{\infty} c_j z^j.$$

Substitution into (12) yields $c_0\,p(r)z^{r-1} + O(z^r) = 0$,
and if this is to vanish with $c_0 \neq 0$, then $r$ must satisfy
the indicial equation $p(r) = r^2 + (\gamma - 1)r = 0$, with
roots $r_1 = 0$ and $r_2 = 1 - \gamma$. A recursion relation for the
$c_j$, $j > 0$, is obtained from the terms of order $z^{r+j-1}$.
For $r_1 = 0$, this yields the Gauss hypergeometric
function [IV.7 §5]

$$F(\alpha, \beta, \gamma; z) = \frac{\Gamma(\gamma)}{\Gamma(\alpha)\Gamma(\beta)}\sum_{j=0}^{\infty}\frac{\Gamma(\alpha + j)\Gamma(\beta + j)}{j!\,\Gamma(\gamma + j)}z^j,$$

where the gamma function $\Gamma$ generalizes the factorial.
When $\gamma \notin \mathbb{Z}$, the second solution turns out to be
$z^{1-\gamma}F(\alpha - \gamma + 1, \beta - \gamma + 1, 2 - \gamma; z)$, which is not analytic
at $z = 0$.
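The recursion for the coefficients can be made concrete. The sketch below is illustrative rather than part of the article: it sums the series for $F$ using the ratio of successive coefficients, $c_{j+1}/c_j = (\alpha + j)(\beta + j)/((j + 1)(\gamma + j))$ with $c_0 = 1$, which follows from the gamma-function form of the series. The name `hyp2f1_series` and the truncation length are assumptions, and the partial sum is only meaningful for $|z| < 1$.

```python
import math

def hyp2f1_series(alpha, beta, gamma, z, terms=200):
    """Partial sum of F(alpha, beta, gamma; z) for |z| < 1.

    Assumes gamma is not zero or a negative integer, so that no
    denominator (gamma + j) vanishes.
    """
    c = 1.0        # c_0 = 1
    power = 1.0    # z**j
    total = 0.0
    for j in range(terms):
        total += c * power
        # ratio of successive coefficients from the gamma-function form
        c *= (alpha + j) * (beta + j) / ((j + 1) * (gamma + j))
        power *= z
    return total

z = 0.3
# Two classical special cases, used here only as sanity checks:
geometric = hyp2f1_series(1.0, 1.0, 1.0, z)   # equals 1 / (1 - z)
log_case = hyp2f1_series(1.0, 1.0, 2.0, z)    # equals -ln(1 - z) / z
```

Working with the coefficient ratio avoids evaluating gamma functions of large arguments, which would overflow long before the series has converged.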

When the indicial equation has roots that differ by an
integer, a second solution can be found by the method
of reduction of order. For example, for the second-order
case, suppose $r_1 - r_2 \in \mathbb{N}$ and let $w_1(z) = z^{r_1}\phi_1(z)$
be the solution for $r_1$. Substitution of the
ansatz $w(z) = w_1(z)\int v(z)\,dz$ shows that $v$ satisfies a
first-order ODE with a regular singular point at 0 whose
indicial equation has the negative integer root $r_2 - r_1 - 1$.
If the power series for $v$ has $O(z^{-1})$ terms, then $w(z)$
has logarithmic singularities; ultimately, the second
solution has the form $w(z) = z^{r_2}\phi_2(z) + c\,w_1(z)\ln z$,
where $\phi_2$ is analytic at 0 and $c$ might be zero. Thus, for
example, the second hypergeometric solution for integral
$\gamma$, where the roots of the indicial equation differ
by an integer, has logarithmic singularities.

Near an irregular singular point, the solution may
have essential singularities. For example, the first-order
ODE $z^2w' = w$ has an irregular singular point at 0; its
solution, $w(z) = ce^{-1/z}$, has an essential singularity
there. Similarly, Bessel’s equation (3) has an irregular
singular point at $\infty$.

Singular points of nonlinear ODEs can be fixed (i.e.,
determined by singularities of the vector field) or movable.
In the latter case, the position of the singularity
depends upon initial conditions. The study of equations
whose only movable singularities are poles leads
to the theory of Painlevé transcendents [III.24].


5 Boundary-Value Problems 

So far we have considered IVPs for systems of the 
form (5), that is, when the imposed values occur at one 
point, $t = t_0$. Another common formulation is that of a
boundary-value problem (BVP), where properties of the 
solution are specified at two distinct points. Such prob- 
lems commonly occur for ODEs that arise by separa- 
tion of variables from PDEs. They also occur in control 
theory, where constraints may be applied at different 
times. 


A classical BVP is the Sturm-Liouville equation:

$$-(p(x)y')' + q(x)y = \lambda r(x)y,$$
$$\alpha_1 y(a) + \alpha_2 y'(a) = 0, \qquad (13)$$
$$\beta_1 y(b) + \beta_2 y'(b) = 0.$$

Here, $\lambda$ is a parameter, e.g., the separation constant
for the PDE case, $p \in C^1[a, b]$, $q, r \in C^0[a, b]$, $p$
and the weight function $r$ are assumed to be positive,
and $\alpha_1\alpha_2, \beta_1\beta_2 \neq 0$. For example, Bessel’s equation
(3) takes this form, with $p(x) = -q(x) = x$ and
$r(x) = 1/x$, if appropriate boundary conditions are
imposed.

The Sturm-Liouville problem has (unique) solutions
$y_n(x) \in C^2[a, b]$ only for a discrete set $\lambda_n$, $n \in \mathbb{N}$,
of values of the separation constant. Moreover,
these “eigenfunctions” and their corresponding “eigenvalues”
have a number of remarkable properties.

Ordering: $\lambda_1 < \lambda_2 < \cdots < \lambda_n < \cdots$.

Oscillation: $y_n(x)$ has $n - 1$ simple zeros in $(a, b)$.

Growth: $\lambda_n \to \infty$ as $n \to \infty$.

Orthogonality: $\int_a^b r(x)y_n(x)y_m(x)\,dx = \delta_{m,n}$.

Completeness: the set $y_n$ is a basis for the space $L^2(a, b)$.


Perhaps the simplest such problem is $y'' = -\lambda y$
with $y(0) = y(1) = 0$. Here, the eigenvalues are $\lambda_n = (n\pi)^2$
and the eigenfunctions are $y_n = \sin(n\pi x)$.
The completeness of these functions in $L^2(0, 1)$ is
the expression of the convergence of the Fourier sine
series. A more interesting problem is the quantum harmonic
oscillator, which, when nondimensionalized, is
governed by the Schrödinger equation

$$-\psi'' + x^2\psi = \lambda\psi. \qquad (14)$$

Here, $\lambda$ is related to the energy $E = \frac{1}{2}\lambda\hbar\omega$ for classical
frequency $\omega$. This is a Sturm-Liouville problem for
$\psi \in L^2(-\infty, \infty)$. The solutions are most easily obtained by
the substitution $\psi(x) = e^{-x^2/2}y(x)$ that transforms
(14) to the Hermite equation $y'' - 2xy' + (\lambda - 1)y = 0$.
This ODE has degree-$(n - 1)$ polynomial solutions
when $\lambda_n = 2n - 1$, $n \in \mathbb{N}$; otherwise, the wave function
$\psi$ is not square integrable. The first five orthonormal
eigenstates of (14) are shown in figure 2.
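The discreteness of the eigenvalues can be seen numerically. The sketch below is an illustration, not the article's method: it applies a shooting method to the simplest problem, $y'' = -\lambda y$ with $y(0) = y(1) = 0$, integrating from $x = 0$ with $y'(0) = 1$ by a fourth-order Runge-Kutta scheme and bisecting on $\lambda$ until $y(1) = 0$. The function names, step counts, and bracketing intervals are assumptions chosen for the example.

```python
import math

def shoot(lam, steps=400):
    """Integrate y'' = -lam*y with y(0)=0, y'(0)=1 to x=1 (RK4); return y(1)."""
    h = 1.0 / steps
    y, v = 0.0, 1.0   # v stands for y'
    for _ in range(steps):
        k1y, k1v = v, -lam * y
        k2y, k2v = v + 0.5 * h * k1v, -lam * (y + 0.5 * h * k1y)
        k3y, k3v = v + 0.5 * h * k2v, -lam * (y + 0.5 * h * k2y)
        k4y, k4v = v + h * k3v, -lam * (y + h * k3y)
        y += h * (k1y + 2 * k2y + 2 * k3y + k4y) / 6
        v += h * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
    return y

def eigenvalue(lo, hi, tol=1e-10):
    """Bisect on lam for y(1; lam) = 0, assuming a sign change on [lo, hi]."""
    f_lo = shoot(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = shoot(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Brackets chosen by hand so that each contains exactly one eigenvalue.
lam1 = eigenvalue(5.0, 15.0)    # to be compared with pi**2
lam2 = eigenvalue(30.0, 50.0)   # to be compared with (2*pi)**2
```

For values of $\lambda$ between the eigenvalues the shot misses the boundary condition at $x = 1$, which is the numerical counterpart of the statement that solutions exist only for a discrete set $\lambda_n$.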


6 Equilibria and Stability 

Apart from the solution of linear PDEs, linear systems
of ODEs find their primary application in the study of
the stability of equilibria of the nonlinear system (5). A
point $x^*$ is an equilibrium if $f(x^*) = 0$. If $f \in C^1$, then
the dynamics of a nearby point $x(t) = x^* + \delta x(t)$ may
be approximated by $\delta\dot{x} = Df(x^*)\,\delta x$, where $(Df)_{ij} = \partial f_i/\partial x_j$
is the Jacobian matrix of the vector field.

Figure 2 The first five eigenstates of
the Sturm-Liouville problem (14).

Figure 3 A Lyapunov-stable equilibrium $x^*$.

An equilibrium is Lyapunov stable if, for every neighborhood
$U$, there is a neighborhood $V \subset U$ such that if
$x(0) \in V$ then $x(t) \in U$ for all $t > 0$ (see figure 3). For
the case in which $A = Df(x^*)$ is a hyperbolic matrix
(its spectrum does not intersect the imaginary axis),
the stability of $x^*$ can be decided by the eigenvalues
of $A$. Indeed, the Hartman-Grobman theorem states
that in this case there is a neighborhood $U$ of $x^*$ such
that there is a coordinate change (a homeomorphism)
that takes the dynamics of (5) in $U$ to that of (11). In
this case we say that the two dynamical systems are
topologically conjugate on $U$.

An equilibrium is stable if all of the eigenvalues of
$A$ are in the left half of the complex plane, $\mathrm{Re}(\lambda) < 0$.
Indeed, in this case it is asymptotically stable: there is a
neighborhood $U$ such that every solution that starts in
$U$ remains in $U$ and converges to $x^*$ as $t \to \infty$. In this
case, $x^*$ is a stable node. When there are eigenvalues
with both positive and negative real parts, then $x^*$ is a
saddle. The case of complex eigenvalues deserves special
mention, since the solution of the linear system
then involves trigonometric functions and there are
solutions in $\mathbb{R}^d$ that are infinite spirals. This is not necessarily
the case when nonlinear terms are added, however:
the homeomorphism that conjugates the system
in $U$ may unwrap the spirals.

As an example consider the damped Duffing oscillator²

$$\dot{x} = y, \quad \dot{y} = -\mu y + x(1 - x^2), \qquad (15)$$

with the phase portrait shown in figure 4 when $\mu = \frac{1}{2}$.
There are three equilibria: $(0, 0)$ and $(\pm 1, 0)$. The
Jacobian at the origin is

$$Df(0, 0) = \begin{pmatrix} 0 & 1 \\ 1 & -\frac{1}{2} \end{pmatrix},$$

with eigenvalues $\lambda_{1,2} = -\frac{1}{4}(1 \pm \sqrt{17})$. Since these are
real and of opposite signs, the origin is a saddle. By
contrast, the Jacobian of the other fixed points is

$$Df(\pm 1, 0) = \begin{pmatrix} 0 & 1 \\ -2 & -\frac{1}{2} \end{pmatrix},$$

with the complex eigenvalues $\lambda_{1,2} = -\frac{1}{4}(1 \pm i\sqrt{31})$.
Since the real parts are negative, these points are both
attracting foci. They are still foci in the nonlinear system,
as illustrated in figure 4, since trajectories that
approach them cross the line $y = 0$ infinitely many
times. Apart from the saddle and its stable manifold
(the dotted curve in the figure), every other trajectory
is asymptotic to one of the foci; these are attractors
whose basins of attraction are separated by the stable
manifold of the saddle.
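The classification of the three Duffing equilibria can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the damping is fixed at $\frac{1}{2}$ (the value used for figure 4), the helper names are hypothetical, and the eigenvalues of a $2 \times 2$ matrix are obtained from its trace and determinant rather than from a general eigenvalue routine.

```python
import cmath

MU = 0.5  # damping value used for figure 4

def jacobian(x, y, mu=MU):
    """Jacobian of (15), f(x, y) = (y, -mu*y + x*(1 - x**2)), at (x, y)."""
    return [[0.0, 1.0], [1.0 - 3.0 * x * x, -mu]]

def eig2(J):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def classify(J):
    """Name a hyperbolic planar equilibrium (no check for zero real parts)."""
    l1, l2 = eig2(J)
    if l1.real * l2.real < 0:
        return "saddle"
    kind = "focus" if abs(l1.imag) > 0 else "node"
    stability = "attracting" if max(l1.real, l2.real) < 0 else "repelling"
    return stability + " " + kind

origin_type = classify(jacobian(0.0, 0.0))
side_type = classify(jacobian(1.0, 0.0))   # same Jacobian at (-1, 0)
origin_eigs = eig2(jacobian(0.0, 0.0))
side_eigs = eig2(jacobian(1.0, 0.0))
```

Because the Jacobian depends on $x$ only through $x^2$, the two equilibria at $(\pm 1, 0)$ share the same linearization, which is why a single matrix covers both.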

Figure 4 The phase portrait of (15) for $\mu = \frac{1}{2}$. Arrows
depict the vector field, and dots depict the three equilibria.
The unstable (dashed) and stable (dotted) manifolds of
the saddle are shown.

2. Georg Duffing studied the periodically forced version of (15) in
1918.

The stability of a nonhyperbolic equilibrium (when
$A$ has eigenvalues on the imaginary axis) is delicate
and depends in detail on the nonlinear terms, i.e., the
$O(\delta x^2)$ terms in the expansion of $f$ about $x^*$. For
example, the system

$$\dot{x} = -y + ax(x^2 + y^2), \quad \dot{y} = x + ay(x^2 + y^2) \qquad (16)$$

has only one equilibrium, $(0, 0)$. The Jacobian at the
origin has eigenvalues $\lambda = \pm i$; its dynamics are that of
a center. Nevertheless, the dynamics of (16) near $(0, 0)$
depend upon the value of $a$. This can be easily seen
by transforming to polar coordinates using $(x, y) = (r\cos\theta, r\sin\theta)$:

$$\dot{r} = \frac{1}{r}(x\dot{x} + y\dot{y}) = ar^3,$$
$$\dot{\theta} = \frac{1}{r^2}(x\dot{y} - y\dot{x}) = 1.$$

Thus if $a < 0$, the origin is a global attractor: every trajectory
limits to the origin as $t \to \infty$. If $a > 0$, the origin
is a repellor. The study of nonhyperbolic equilibria is
the first step in bifurcation theory [IV.21].
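The radial equation for (16) can be checked against a direct simulation. The sketch below is illustrative only: it integrates (16) in Cartesian coordinates with a classical Runge-Kutta step and compares the resulting radius with $r(t) = r_0/\sqrt{1 - 2ar_0^2t}$, the solution of $\dot{r} = ar^3$ obtained by separating variables. The parameter values, step count, and function names are assumptions.

```python
import math

def rk4_step(f, state, h):
    """One classical Runge-Kutta step for state' = f(state)."""
    k1 = f(state)
    k2 = f([s + 0.5 * h * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * h * k for s, k in zip(state, k2)])
    k4 = f([s + h * k for s, k in zip(state, k3)])
    return [s + h * (p + 2 * q + 2 * u + v) / 6
            for s, p, q, u, v in zip(state, k1, k2, k3, k4)]

def field(a):
    """Vector field of system (16) for a given parameter a."""
    def f(state):
        x, y = state
        r2 = x * x + y * y
        return [-y + a * x * r2, x + a * y * r2]
    return f

def radius_after(a, r0, t_end, steps=4000):
    """Radius at time t_end for the trajectory of (16) starting at (r0, 0)."""
    state = [r0, 0.0]
    h = t_end / steps
    f = field(a)
    for _ in range(steps):
        state = rk4_step(f, state, h)
    return math.hypot(state[0], state[1])

a, r0, t_end = -1.0, 1.0, 2.0
r_numeric = radius_after(a, r0, t_end)
# Closed form of r' = a r^3; for a > 0 it is valid only while 1 - 2 a r0^2 t > 0.
r_exact = r0 / math.sqrt(1.0 - 2.0 * a * r0 ** 2 * t_end)
```

For $a < 0$ the closed form shows that the approach to the origin is only algebraic, like $t^{-1/2}$, in contrast to the exponential decay near a hyperbolic sink.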

7 Existence and Uniqueness 

Before one attempts to find solutions to an ODE, it is 
important to know whether solutions exist, and if they 
exist whether there is more than one solution to a given 
IVP. There are two types of problems that can occur. 

The first is that the velocity $f$ may be unbounded on
$M$; in this case, a solution might exist but only over a
finite interval of time. For example, the system $\dot{x} = x^2$
for $x \in \mathbb{R}$ has the general solution $x(t) = x_0/(1 - tx_0)$.
Note that $|x| \to \infty$ as $t \to 1/x_0$, even though $f$ is a
“nice” function: it is smooth, and moreover, it is analytic.
The problem is, however, that as $|x|$ increases, the
velocity increases even more rapidly, leading to infinite
speed in finite time. The existence theorem deals with
this problem by being local; it guarantees existence only
on a compact interval.

The second problem is that $f$ may not be smooth
enough to guarantee a unique solution. One might
expect that it is sufficient that $f$ be continuous. However,
the simple system $\dot{x} = \sqrt{|x|}$ for $x \in \mathbb{R}$ has
infinitely many solutions that satisfy the initial condition
$x(0) = 0$. The obvious solution is $x(t) = 0$,
but $x(t) = \frac{1}{4}\,\mathrm{sgn}(t)\,t^2$ is also a solution. Moreover, any
function $x(t)$ that is zero up to an arbitrary time $t_0 > 0$
and then connects to the parabola $\frac{1}{4}(t - t_0)^2$ also solves
the IVP. Elimination of this problem requires assuming
that $f$ is more than continuous; it must be at least Lipschitz.
A function $f: M \to \mathbb{R}^d$ is Lipschitz on $M \subset \mathbb{R}^m$
if there is a constant $K$ such that for all $x, y \in M$,
$\|f(x) - f(y)\| \leq K\|x - y\|$.

With this concept, we can state a theorem of existence
and uniqueness. Let $B_r(x)$ denote the closed ball
of radius $r$ about $x$.

Theorem 1 (Picard-Lindelöf). Suppose that for $x_0 \in \mathbb{R}^d$
there exists $b > 0$ such that $f: B_b(x_0) \to \mathbb{R}^d$ is Lipschitz.
Then the IVP (5) with $x(t_0) = x_0$ has a unique
solution $x: [t_0 - a, t_0 + a] \to B_b(x_0)$ with $a = b/V$,
where $V = \max_{x \in B_b(x_0)}\|f(x)\|$.

This theorem can be proved iteratively (e.g., by Picard 
iteration), but the most elegant proof uses the contrac- 
tion mapping theorem. 
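Picard iteration is easy to carry out exactly for a simple right-hand side. The sketch below is an illustration under assumptions that are not in the article: it takes the scalar example $\dot{x} = x$ with $x(0) = 1$ and $t_0 = 0$, represents each iterate as a polynomial with exact rational coefficients, and applies $x_{k+1}(t) = x_0 + \int_0^t x_k(s)\,ds$. The function name `picard_step` is hypothetical.

```python
from fractions import Fraction
import math

def picard_step(coeffs, x0):
    """One Picard iterate for dx/dt = x with x(0) = x0.

    Polynomials are coefficient lists, lowest degree first. The integral
    from 0 to t of sum c_i s^i is sum c_i t^(i+1) / (i+1).
    """
    integral = [Fraction(0)] + [c / (i + 1) for i, c in enumerate(coeffs)]
    integral[0] += x0
    return integral

x0 = Fraction(1)
iterate = [x0]               # zeroth iterate: the constant function x0
for _ in range(5):
    iterate = picard_step(iterate, x0)

# Each pass adds one more term of the Taylor series of exp(t).
taylor = [Fraction(1, math.factorial(k)) for k in range(6)]
```

Each iteration reproduces one more Taylor coefficient of $e^t$, which illustrates in the simplest setting how the iterates converge to the unique solution.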

8 Flows 

When the vector field of (5) satisfies the conditions of
the Picard-Lindelöf theorem, the solution is necessarily
a $C^1$ function of time. It is also a Lipschitz function of
the initial condition. Suppose now that $f \in C^1(\mathbb{R}^d, \mathbb{R}^d)$
and is locally Lipschitz. Though the theorem guarantees
the existence only on a (perhaps small) interval
$t \in [t_0 - a, t_0 + a]$, this solution can be uniquely extended to
a maximal open interval $J = (\alpha, \beta)$ such that the solution
is unbounded as $t$ approaches $\alpha$ or $\beta$ when they
are finite. As noted in section 7, unbounded solutions
may arise even for “nice” vector fields; however, if $f$ is
bounded or globally Lipschitz, then $J = \mathbb{R}$.

If $\varphi_t(x_0)$ denotes the maximally extended solution,
then $\varphi: J \times \mathbb{R}^d \to \mathbb{R}^d$ satisfies a number of conditions:

• $\varphi \in C^1$,

• $\varphi_0(x) = x$, and

• $\varphi_t \circ \varphi_s = \varphi_{t+s}$ whenever $t$, $s$, and $t + s \in J$.

The last condition encapsulates the idea of autonomy:
flowing from the point $\varphi_s(x)$ for a time $t$ is the same
as flowing for time $t + s$ from $x$; the origin of time is a
matter of convention.
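The composition property can be checked directly for a flow known in closed form. As a small sketch, the code below uses the blow-up example from section 7, $\dot{x} = x^2$, whose solution gives $\varphi_t(x) = x/(1 - tx)$; the function name and the sample values are illustrative assumptions, and the guard enforces that $t$ stays inside the maximal interval for the given $x$.

```python
def flow(t, x):
    """Flow of dx/dt = x**2: phi_t(x) = x / (1 - t*x).

    The formula holds only while 1 - t*x > 0; beyond that the solution
    has already escaped to infinity.
    """
    if 1.0 - t * x <= 0.0:
        raise ValueError("t lies outside the maximal interval J for this x")
    return x / (1.0 - t * x)

x, s, t = 0.5, 0.4, 0.9
composed = flow(t, flow(s, x))   # flow for time s, then for time t
direct = flow(t + s, x)          # flow for time t + s in one go
identity = flow(0.0, x)          # phi_0 is the identity map
```

Here the maximal interval depends on the initial condition ($J = (-\infty, 1/x)$ for $x > 0$), so the composition rule is only asserted when $t$, $s$, and $t + s$ all lie in $J$.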

For example, (8), the solution of the logistic ODE,
gives such a function if it is rewritten as $\varphi_t(N_0) = N(t)$.
In this case (for $r$ and $K$ positive), $J = \mathbb{R}$ if $0 \leq N_0 \leq K$,
and $J = (\alpha, \infty)$ with

$$\alpha = -\frac{1}{r}\ln\left(\frac{N_0}{N_0 - K}\right)$$

if $N_0 > K$. Indeed, it is apparent from figure 1 that solutions
with initial conditions above the carrying capacity
$K$ grow rapidly for decreasing $t$; the theory implies that
$\varphi_t(N_0) \to \infty$ as $t \downarrow \alpha$.

More generally, any function satisfying these conditions
is called a flow. The flow is complete if $J = \mathbb{R}$, and
it is a semiflow if $\alpha$ is finite but $\beta$ is still $\infty$. It is not hard
to see that every flow is the solution of a differential
equation (5) for some $C^0$ vector field. Flows form one
fundamental part of the theory of dynamical systems
[IV.20].

9 Phase-Plane Analysis 

A system of two differential equations

$$\dot{x} = P(x, y),$$
$$\dot{y} = Q(x, y)$$

can be qualitatively analyzed by considering a few simple
properties of $P$ and $Q$. The goals of such an analysis
include determining the asymptotic behavior for
$t \to \pm\infty$ and the stability of any equilibria or periodic
orbits.

The nullclines are the sets

$$N_h = \{(x, y): Q(x, y) = 0\},$$
$$N_v = \{(x, y): P(x, y) = 0\}.$$

Typically these are curves on which the instantaneous
motion is horizontal or vertical, respectively. The set of
equilibria is precisely the intersection of the nullclines,
$E = N_h \cap N_v$. The web of nullclines divides the phase
plane into sectors in which the velocity vector lies in
one of the four quadrants.

For example, the Lotka-Volterra system

$$\dot{x} = bx(1 - x - 2y),$$
$$\dot{y} = cy(1 - 2x - y) \qquad (18)$$

can be thought of as a model of competition between
two species with normalized populations $x \geq 0$ and
$y \geq 0$. The species have per capita birth rates $b$ and
$c$, respectively, when their populations are small, but
these decrease if either or both of $x$ and $y$ grows
because of competition for the same resource. In the
absence of competition, the environment has a carrying
capacity of one population unit. The nullclines are
pairs of lines $N_h = \{y = 0\} \cup \{y = 1 - 2x\}$ and
$N_v = \{x = 0\} \cup \{y = \frac{1}{2}(1 - x)\}$. Consequently, there
are four equilibria: $(0, 0)$, $(1, 0)$, $(0, 1)$, and $(\frac{1}{3}, \frac{1}{3})$. The
nullclines divide the biologically relevant domain into
four regions within which the velocity lies in one of
the four quadrants, as shown in figure 5. In particular,
when both $x$ and $y$ are large (e.g., bigger than the carrying
capacity), the velocity must be in the third quadrant
since both $\dot{x} < 0$ and $\dot{y} < 0$. Since a component
of the velocity can reverse only upon crossing a nullcline
(and in this case does reverse), the remainder of
the qualitative behavior is then determined.

Figure 5 The phase portrait for (18) for $2c = 3b > 0$. Representative
velocity vectors are shown in each sector defined
by the nullclines, as are several numerically generated
trajectories.

From this simple observation one can conclude that
the origin is a source, i.e., every nearby trajectory
approaches the origin as $t \to -\infty$. By contrast, the two
equilibria on the axes are sinks since all nearby trajectories
approach them as $t \to +\infty$. The remaining
equilibrium is a saddle since there are approaching and
diverging solutions nearby. Moreover, every trajectory
in the positive quadrant is bounded, and almost all trajectories
asymptotically approach one of the two sinks.
The only exceptions are a pair of trajectories that are
on the stable manifold of the saddle. Details that are
not determined by this analysis include the timescale
over which this behavior occurs and the curvature of
the solution curves, which depends upon the ratio $b/c$.
This model demonstrates the ecological phenomenon
of competitive exclusion; typically, only one species
survives.
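The conclusions of the nullcline argument can be cross-checked by linearizing at each equilibrium. The following sketch is a supplement, not the phase-plane method itself: the rates $b = 2$ and $c = 3$ are hypothetical values chosen to satisfy $2c = 3b$ as in figure 5, the Jacobian is computed by hand from the competition model, and the helper names are assumptions.

```python
import cmath

B, C = 2.0, 3.0  # hypothetical rates with 2c = 3b, the case drawn in figure 5

def velocity(x, y, b=B, c=C):
    """Right-hand side of the competition model."""
    return (b * x * (1 - x - 2 * y), c * y * (1 - 2 * x - y))

def jacobian(x, y, b=B, c=C):
    """Partial derivatives of the velocity with respect to x and y."""
    return [[b * (1 - 2 * x - 2 * y), -2 * b * x],
            [-2 * c * y, c * (1 - 2 * x - 2 * y)]]

def eigenvalues(J):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def classify(J):
    l1, l2 = eigenvalues(J)
    if l1.real > 0 and l2.real > 0:
        return "source"
    if l1.real < 0 and l2.real < 0:
        return "sink"
    if l1.real * l2.real < 0:
        return "saddle"
    return "nonhyperbolic"

equilibria = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0 / 3.0, 1.0 / 3.0)]
kinds = {p: classify(jacobian(*p)) for p in equilibria}
```

The linearization confirms the types but, like the nullcline sketch, says nothing global; the statement that almost every trajectory reaches one of the sinks still relies on the sector argument.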
10 Limit Cycles

If the simplest solutions of ODE systems are equilibria,
periodic orbits form the second class. A solution
$\Gamma = \{x(t): 0 \leq t < T\}$ of an autonomous ODE is periodic
with (minimal) period $T$ if $\Gamma$ is a simple closed
loop in the phase space. Indeed, uniqueness of solutions
implies that, if $x(T) = x(0)$, then $x(nT) = x(0)$
for all $n \in \mathbb{Z}$.

A one-dimensional autonomous ODE (5) cannot have
any periodic solutions. Indeed, every solution of such
a system is a monotone function of $t$. Periodic trajectories
are common in two dimensions. For example, each
planar Hamiltonian system (9) has periodic trajectories
on every closed nondegenerate ($\nabla H \neq 0$) contour
$H(x, y) = E$. These periodic trajectories are not isolated.
An isolated periodic orbit is called a limit cycle.
More generally, a limit cycle is a periodic orbit that is the
forward ($\omega$) or backward ($\alpha$) limit of another trajectory.

The van der Pol oscillator,

$$\dot{x} = y, \quad \dot{y} = -x + 2\mu y - x^2y, \qquad (19)$$

was introduced in 1922 as a model of a nonlinear circuit
with a triode tube. Here, $x$ represents the current
through the circuit, and $y$ represents the voltage drop
across an inductor. The parameter $\mu$ corresponds to the
“negative” resistance of the triode passing a small current.
This system has a unique periodic solution when
$\mu > 0$ (see figure 6). The creation of this limit cycle at
$\mu = 0$ follows from the Hopf bifurcation [IV.21 §2]
theorem. Its uniqueness is a consequence of a more
general theorem due to Liénard.
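A numerical experiment makes the attracting limit cycle visible. The sketch below is illustrative only: it integrates (19) with a classical Runge-Kutta step from one initial condition near the origin and one far outside, then records the largest $|x|$ over a late time window. The parameter value 0.2 matches figure 6; the step size, integration time, window length, and function names are assumptions.

```python
def rk4_step(f, state, h):
    """One classical Runge-Kutta step for state' = f(state)."""
    k1 = f(state)
    k2 = f([s + 0.5 * h * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * h * k for s, k in zip(state, k2)])
    k4 = f([s + h * k for s, k in zip(state, k3)])
    return [s + h * (p + 2 * q + 2 * u + v) / 6
            for s, p, q, u, v in zip(state, k1, k2, k3, k4)]

def van_der_pol(mu):
    """Vector field of (19) for a given parameter mu."""
    def f(state):
        x, y = state
        return [y, -x + 2.0 * mu * y - x * x * y]
    return f

def late_amplitude(x0, mu=0.2, t_end=200.0, h=0.01, window=20.0):
    """Largest |x| over the final `window` time units, starting at (x0, 0)."""
    f = van_der_pol(mu)
    state = [x0, 0.0]
    steps = int(round(t_end / h))
    start = steps - int(round(window / h))
    amp = 0.0
    for n in range(steps):
        state = rk4_step(f, state, h)
        if n >= start:
            amp = max(amp, abs(state[0]))
    return amp

amp_inside = late_amplitude(0.1)    # starts inside the cycle
amp_outside = late_amplitude(3.0)   # starts outside the cycle
```

Agreement of the two amplitudes is consistent with a single attracting periodic orbit, although a simulation of two trajectories is of course no substitute for Liénard's uniqueness theorem.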

Planar vector fields can therefore have equilibria and
periodic orbits. Are there more complicated trajectories,
e.g., quasiperiodic or chaotic orbits? The negation
of this speculation is contained in the theorem proposed
by Poincaré and proved later by Bendixson: the
set of limit points of any bounded trajectory in the
plane can contain only equilibria and periodic orbits.
There is therefore no chaos [II.3] in two dimensions!
From the point of view of finding periodic trajectories,
this theorem implies the following.

Theorem 2 (Poincaré-Bendixson). Suppose that $A \subset \mathbb{R}^2$
is bounded and positively invariant and that $\varphi$
is a complete semiflow in $A$. Then, if $A$ contains no
equilibria, it must contain a periodic orbit.

Figure 6 The phase portrait of the van der Pol
oscillator (19) for $\mu = 0.2$.

For example, consider the system

$$\dot{x} = y, \quad \dot{y} = -x + yh(r),$$

where $r = \sqrt{x^2 + y^2}$, and let $A$ be the annulus $\{(x, y): a < r < b\}$.
Thus $A$ contains no equilibria for any $0 < a < b$.
Converting to polar coordinates gives, for the
radial equation,

$$\dot{r} = \frac{y^2}{r}h(r).$$

Now suppose that there exist $0 < a < b$ such that
$h(b) < 0 < h(a)$. On the circle $r = a$ we then
have $\dot{r} \geq 0$, implying that trajectories cannot leave $A$
through its inner boundary. Similarly, trajectories cannot
leave through $r = b$ because $\dot{r} \leq 0$ on this circle.
By the Poincaré-Bendixson theorem, therefore, there is
a periodic orbit in $A$.
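The trapping argument can be illustrated with a specific choice of $h$. In the sketch below, $h(r) = 1 - r^2$ is a hypothetical example (it is not given in the article) for which $h(b) < 0 < h(a)$ holds with $a = 0.5$ and $b = 2$, and for which the circle $r = 1$ is itself the periodic orbit. The code tracks the smallest, largest, and final radius along a trajectory; all names and numerical settings are assumptions.

```python
import math

def rk4_step(f, state, h):
    """One classical Runge-Kutta step for state' = f(state)."""
    k1 = f(state)
    k2 = f([s + 0.5 * h * k for s, k in zip(state, k1)])
    k3 = f([s + 0.5 * h * k for s, k in zip(state, k2)])
    k4 = f([s + h * k for s, k in zip(state, k3)])
    return [s + h * (p + 2 * q + 2 * u + v) / 6
            for s, p, q, u, v in zip(state, k1, k2, k3, k4)]

def radii_along_orbit(h_func, x0, y0, t_end=40.0, dt=0.01):
    """Return (min r, max r, final r) for x' = y, y' = -x + y*h(r)."""
    def f(state):
        x, y = state
        return [y, -x + y * h_func(math.hypot(x, y))]
    state = [x0, y0]
    r = math.hypot(x0, y0)
    r_min = r_max = r
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(f, state, dt)
        r = math.hypot(state[0], state[1])
        r_min, r_max = min(r_min, r), max(r_max, r)
    return r_min, r_max, r

def h_example(r):
    """Hypothetical h with h(0.5) > 0 > h(2); it vanishes on r = 1."""
    return 1.0 - r * r

a, b = 0.5, 2.0
r_min, r_max, r_final = radii_along_orbit(h_example, 0.6, 0.0)
```

With this $h$ the trajectory never leaves the annulus and settles onto the unit circle; for a general $h$ the theorem guarantees a periodic orbit in $A$ without identifying it.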

11 Heteroclinic Orbits

Suppose that $\varphi$ is a complete $C^{r+1}$ flow that has a saddle
equilibrium at $x^*$. The $k$ generalized eigenvectors
of $Df(x^*)$ corresponding to the stable eigenvalues,
$\mathrm{Re}(\lambda_j) < 0$, define a $k$-dimensional tangent plane $E^s$
at $x^*$. This linear plane can be extended to form a
set of trajectories of the nonlinear flow whose forward
evolution converges to $x^*$:

$$W^s(x^*) = \{x \in M \setminus \{x^*\}: \lim_{t \to \infty}\varphi_t(x) = x^*\}.$$

The stable manifold theorem [IV.20] implies that
this set is a $k$-dimensional, $C^r$, immersed manifold
that is tangent to $E^s$ at $x^*$. Similarly, a saddle has an
unstable manifold

$$W^u(x^*) = \{x \in M \setminus \{x^*\}: \lim_{t \to -\infty}\varphi_t(x) = x^*\}$$

that is tangent to the $(n - k)$-dimensional plane
spanned by the unstable eigenvectors of $x^*$. It is important
to note that this set is defined by its backward
asymptotic behavior and not by the idea that it escapes
from $x^*$. These concepts can also be generalized to
hyperbolic invariant sets.

Poincaré realized that intersections of stable and
unstable manifolds can give rise to complicated orbits.
He called an orbit $\Gamma$ homoclinic if $\Gamma \subset W^u(x^*) \cap W^s(x^*)$.
Similarly, an orbit is heteroclinic if
$\Gamma \subset W^u(a) \cap W^s(b)$ for distinct saddles $a$ and $b$.

Planar Hamiltonian systems often have homoclinic
or heteroclinic orbits. For example, the conservative
Duffing oscillator, (15) with $\mu = 0$, has Hamiltonian
$H(x, y) = \frac{1}{2}(y^2 - x^2 + \frac{1}{2}x^4)$. This function has a critical
level set $H = 0$ that is a figure-eight intersecting
the saddle equilibrium at $(0, 0)$. Since energy is conserved,
trajectories remain on each level set; in particular,
every trajectory on the figure-eight is biasymptotic
to the origin (these are homoclinic trajectories). For this
case, the stable and unstable manifolds coincide, and
we say that there is a homoclinic connection. This set is
also called a separatrix since it separates motion that
encircles each center from that enclosing both centers.
Such a homoclinic connection is fragile; for example,
it is destroyed whenever $\mu \neq 0$ in (15). More generally,
a homoclinic bifurcation [IV.21] corresponds to
the creation/destruction of a homoclinic orbit from a
periodic one.

If, however, the intersection of $W^u(a)$ with $W^s(b)$ is
transverse, it cannot be destroyed by a small perturbation.
A transversal intersection of two submanifolds is
one for which the union of their tangent spaces at an
intersection point spans $TM$:

$$T_xW^u(a) \oplus T_xW^s(b) = T_xM.$$

Note that for this to be the case, we must have
$\dim(W^u) + \dim(W^s) \geq \dim(M)$. Every such intersection
point lies on a heteroclinic orbit that is structurally
stable. Poincaré realized that in certain cases the existence
of such a transversal heteroclinic orbit implies
infinite complexity. This idea was formalized by Steve
Smale in his construction of the Smale horseshoe. The
existence of a transversal heteroclinic orbit implies a
chaotic invariant set.


12 Other Techniques and Concepts 

Differential equations often have discrete or continu- 
ous symmetries [IV.22], and these are useful in con- 
structing new solutions and reducing the order of the 
system. Given sufficiently many symmetries and invari- 
ants, a system of ODEs can be effectively solved, that 
is, it is integrable. 

One often finds that no analytical method leads to
explicit solutions of an ODE. In this case, numerical
solution [IV.12] techniques are invaluable.
IV.3 Partial Differential Equations

Lawrence C. Evans 


1 Overview 

This article is an extremely rapid survey of the modern 
theory of partial differential equations (PDEs). Sources 
of PDEs are legion: mathematical physics, geometry, 
probability theory, continuum mechanics, optimization 
theory, etc. Indeed, most of the fundamental laws of the 
physical sciences are partial differential equations and 
most papers published in applied mathematics concern 
PDEs. 

The following discussion is consequently very broad 
but also very shallow, and it will certainly be inadequate 
for any given PDE the reader may care about. The goal 
is rather to highlight some of the many key insights and 
unifying principles across the entire subject. 

1.1 Confronting PDEs 

Among the greatest accomplishments of the physical
and other sciences are the discoveries of fundamental
laws, which are usually PDEs. The great problems
for mathematicians, both pure and applied, are
then to understand the solutions of these equations,
using theoretical analysis, numerical simulations, perturbation
theory, and whatever other tools they can
find.

But this very success in physics, namely that some fairly
simple-looking PDEs, for example the Euler equations
for fluid mechanics (see (11) below), model complicated
and diverse physical phenomena, causes all sorts of
mathematical difficulties. Whatever general assertion
we try to show mathematically must apply to all sorts
of solutions with extremely disparate behavior.

It is therefore a really major undertaking to un- 
derstand solutions of partial differential equations, 
and there are at least three primary mathematical 
approaches for doing so: 

• discovering analytical formulas for solutions, ei- 
ther exact or approximate, 

• devising accurate and fast numerical methods, 
and 

• developing rigorous theory. 

In other words, we can aspire to actually solve the PDE 
more or less explicitly, to compute solutions, or to indi- 
rectly deduce properties of the solutions (without rely- 
ing upon formulas or numerics). This article surveys 
these viewpoints, with particular emphasis on the last. 

Terminology 

A partial differential equation is an equation involving 
an unknown function u of more than one variable and 
certain of its partial derivatives. The order of a PDE is 
the order of the highest-order partial derivative of the 
unknown appearing within it. 

A system of PDEs comprises several equations involv- 
ing an unknown vector-valued function u and its partial 
derivatives. 

A PDE is linear if it corresponds to a linear opera- 
tor acting on the unknown and its partial derivatives; 
otherwise, the PDE is nonlinear. 

Notation 

Hereafter, $u$ usually denotes the real-valued solution
of a given PDE and is usually a function of points
$x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, typically denoting a position in
space, and sometimes also a function of $t \in \mathbb{R}$, denoting
time. We write $u_{x_k} = \partial u/\partial x_k$ to denote the partial
derivative of $u$ with respect to $x_k$, and $u_t = \partial u/\partial t$,
$u_{x_kx_l} = \partial^2u/\partial x_k\partial x_l$, etc., for higher partial derivatives.
The gradient of $u$ in the variable $x$ is

$$\nabla u = (u_{x_1}, \ldots, u_{x_n}).$$

(In this article, $\nabla u$ always denotes the gradient in the
variables $x_1, \ldots, x_n$, even if $u$ also depends on $t$.) We
write the divergence of a vector field $\mathbf{F} = (F^1, \ldots, F^n)$
as $\operatorname{div}\mathbf{F} = \sum_{i=1}^{n} F^i_{x_i}$.

The Laplacian of $u$ is the divergence of its gradient:

$$\Delta u = \nabla^2u = \sum_{k=1}^{n} u_{x_kx_k}. \qquad (1)$$

Let us also write $\mathbf{u} = (u^1, \ldots, u^m)$ to display the
components of a vector-valued function. We always use
boldface for vector-valued mappings.

The solid $n$-dimensional ball with center $x$ and radius
$r$ is denoted by $B(x, r)$, and $\partial B(x, r)$ is its boundary,
a sphere. More generally, $\partial U$ means the boundary of a
set $U \subset \mathbb{R}^n$, and we denote by

$$\int_{\partial U} f\,dS$$

the integral of a function $f$ over the boundary, with
respect to $(n - 1)$-dimensional surface area.

1.2 Some Important PDEs 

A list of some of the most commonly studied PDEs fol- 
lows. To streamline and clarify the presentation, we 
have mostly set various physical parameters to unity 
in these equations. 

First-Order PDEs 

First-order PDEs appear in many physical theories,
mostly in dynamics, continuum mechanics, and optics.
For example, in the scalar conservation law

$$u_t + \operatorname{div}\mathbf{F}(u) = 0, \qquad (2)$$

the unknown $u$ is the density of some physically interesting
quantity and the vector field $\mathbf{F}(u)$, its flux,
depends nonlinearly on $u$.

Another important first-order PDE, the Hamilton-Jacobi
equation

$$u_t + H(\nabla u, x) = 0, \qquad (3)$$

appears in classical mechanics and in optimal control
theory. In these contexts, $H$ is called the Hamiltonian.

Second-Order PDEs 

Second-order PDEs model a significantly wider variety
of physical phenomena than do first-order equations.
For example, among its many other interpretations,
Laplace’s equation [III.18]

$$\Delta u = 0 \qquad (4)$$

records diffusion effects in equilibrium. Its time-dependent
analogue is the heat equation [III.8]

$$u_t - \Delta u = 0, \qquad (5)$$

which is also known as the diffusion equation.

The wave equation [III.31]

$$u_{tt} - c^2\Delta u = 0 \qquad (6)$$

superficially somewhat resembles the heat equation,
but as the name suggests, it supports solutions with
utterly different behavior.

Schrödinger’s equation [III.26]

$$iu_t + \Delta u = 0, \qquad (7)$$

for which solutions $u$ are complex-valued, is the quantum
mechanics analogue of the wave equation.

Systems of PDEs

In a system of conservation laws [II.6]

$$\mathbf{u}_t + \operatorname{div}\mathbf{F}(\mathbf{u}) = 0, \qquad (8)$$

each component of $\mathbf{u} = (u^1, \ldots, u^m)$ typically represents
a mass, momentum, or energy density.

A reaction-diffusion system of PDEs has the form

$$\mathbf{u}_t - \Delta\mathbf{u} = \mathbf{f}(\mathbf{u}). \qquad (9)$$

Here, the components of $\mathbf{u}$ typically represent densities
of, say, different chemicals, whose interactions are
modeled by the nonlinear term $\mathbf{f}$.

The simplest form of Maxwell’s equations [III.22]
reads

$$\mathbf{E}_t = \operatorname{curl}\mathbf{B},$$
$$\mathbf{B}_t = -\operatorname{curl}\mathbf{E}, \qquad (10)$$
$$\operatorname{div}\mathbf{E} = \operatorname{div}\mathbf{B} = 0,$$

in which $\mathbf{E}$ is the electric field and $\mathbf{B}$ the magnetic field.

Fluid mechanics provides some of the most complicated
and fascinating systems of PDEs in applied mathematics.
The most important are Euler’s equations
[III.11] for incompressible, inviscid fluid flow,

$$\mathbf{u}_t + \mathbf{u}\cdot\nabla\mathbf{u} = -\nabla p,$$
$$\operatorname{div}\mathbf{u} = 0, \qquad (11)$$

and the Navier-Stokes equations [III.23] for incompressible,
viscous flow,

$$\mathbf{u}_t + \mathbf{u}\cdot\nabla\mathbf{u} - \Delta\mathbf{u} = -\nabla p,$$
$$\operatorname{div}\mathbf{u} = 0. \qquad (12)$$

In these systems $\mathbf{u}$ denotes the fluid velocity and $p$ the
pressure.




Higher-Order PDEs 

Equations of order greater than two are much less com- 
mon. Generally speaking, such higher-order PDEs do 
not represent fundamental physical laws but are rather 
derived from such. 

For instance, we can sometimes rewrite a system of
two second-order equations as a single fourth-order
PDE. In this way, the biharmonic equation

$$\Delta^2u = 0 \qquad (13)$$

comes up in linear elasticity theory.

The Korteweg-de Vries (KdV) equation [III.16]

$$u_t + auu_x + bu_{xxx} = 0, \qquad (14)$$

a model of shallow-water waves, similarly appears
when we combine a complicated system of lower-order
equations appearing in appropriate asymptotic
expansions.

1.3 Boundary and Initial Conditions 

Partial differential equations very rarely appear alone;
most problems require us to solve the PDEs subject to
appropriate boundary and/or initial conditions. If, for
instance, we are to study a solution $u = u(x)$, defined
for points $x$ lying in some region $U \subset \mathbb{R}^n$, we usually
also prescribe something about how $u$ behaves on
the boundary $\partial U$. The most common prescriptions are
Dirichlet’s boundary condition

$$u = 0 \quad \text{on } \partial U \qquad (15)$$

and Neumann’s boundary condition

$$\frac{\partial u}{\partial\nu} = 0 \quad \text{on } \partial U, \qquad (16)$$

where $\nu$ denotes the outward-pointing unit normal to
the boundary and $\partial u/\partial\nu := \nabla u\cdot\nu$ is the outer normal
derivative. If, say, $u$ represents a temperature, then (15)
specifies that the temperature is held constant on the
boundary and (16) specifies that the heat flux through
the boundary is zero.

Imposing initial conditions is usually appropriate
for time-dependent PDEs, for which we require of the
solution $u = u(x,t)$ that

$$u(\cdot, 0) = g, \tag{17}$$

where $g = g(x)$ is a given function, comprising the
initial data. For PDEs that are second order in time, such
as the wave equation (6), it is usually appropriate to also
specify

$$u_t(\cdot, 0) = h. \tag{18}$$
2 Understanding PDEs

In this section we explore several general procedures 
for understanding PDEs and their solutions. 


2.1 Exact Solutions 


The most effective approach is, of course, just to solve 
the PDE outright, if we can. For instance, the boundary-value
problem

$$\Delta u = 0 \quad \text{in } B(0,1), \qquad u = g \quad \text{on } \partial B(0,1)$$

is solved by Poisson’s formula,

$$u(x) = \frac{1 - |x|^2}{n\alpha(n)} \int_{\partial B(0,1)} \frac{g(y)}{|x - y|^n}\, dS,$$

where $\alpha(n)$ denotes the volume of the unit ball in $\mathbb{R}^n$.

The solution of the initial-value problem for the wave
equation in one space dimension,

$$u_{tt} - c^2 u_{xx} = 0 \quad \text{in } \mathbb{R} \times (0,\infty), \qquad u = g,\ u_t = h \quad \text{on } \mathbb{R} \times \{t = 0\},$$

is provided by d’Alembert’s formula:

$$u(x,t) = \frac{g(x + ct) + g(x - ct)}{2} + \frac{1}{2c} \int_{x - ct}^{x + ct} h(y)\, dy. \tag{19}$$
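
As a quick numerical illustration of d’Alembert’s formula (19), the following Python sketch evaluates the formula and checks by finite differences that the result nearly satisfies the wave equation. The initial data $g$ and $h$, the speed $c = 2$, the Simpson quadrature, and all function names are assumptions chosen for illustration; they are not taken from the text.

```python
import math

def dalembert(g, h, c, x, t, n_quad=200):
    """Evaluate d'Alembert's formula (19) at (x, t).

    The integral of h over [x - c t, x + c t] is approximated with the
    composite Simpson rule (n_quad must be even).
    """
    a, b = x - c * t, x + c * t
    average = 0.5 * (g(b) + g(a))
    if t == 0.0:
        return average
    step = (b - a) / n_quad
    total = h(a) + h(b)
    for k in range(1, n_quad):
        total += (4 if k % 2 else 2) * h(a + k * step)
    integral = total * step / 3.0
    return average + integral / (2.0 * c)

# Illustrative initial data (an assumption, not from the text).
g = lambda x: math.exp(-x * x)
h = lambda x: x * math.exp(-x * x)
c = 2.0

def wave_residual(x, t, d=1e-3):
    """Centered-difference approximation of u_tt - c^2 u_xx at (x, t)."""
    u = lambda xx, tt: dalembert(g, h, c, xx, tt)
    u_tt = (u(x, t + d) - 2 * u(x, t) + u(x, t - d)) / d**2
    u_xx = (u(x + d, t) - 2 * u(x, t) + u(x - d, t)) / d**2
    return u_tt - c**2 * u_xx

residual = wave_residual(0.3, 0.7)
initial_error = abs(dalembert(g, h, c, 0.3, 0.0) - g(0.3))
```

The residual is not exactly zero because both the quadrature and the difference quotients introduce small errors; it should shrink as the step `d` is reduced (until rounding error takes over).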

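The wave equation can also be solved in higher dimensions,
but the formulas become increasingly complicated.
For example, Kirchhoff’s formula,

$$u(x,t) = \frac{1}{4\pi c^2 t} \int_{\partial B(x,ct)} h\, dS + \frac{\partial}{\partial t}\left( \frac{1}{4\pi c^2 t} \int_{\partial B(x,ct)} g\, dS \right), \tag{20}$$

satisfies this initial-value problem for the wave equation
in three space dimensions:

$$u_{tt} - c^2 \Delta u = 0 \quad \text{in } \mathbb{R}^3 \times (0,\infty), \qquad u = g,\ u_t = h \quad \text{on } \mathbb{R}^3 \times \{t = 0\}. \tag{21}$$

The initial-value problem for the heat equation,

$$u_t - \Delta u = 0 \quad \text{in } \mathbb{R}^n \times (0,\infty), \qquad u = g \quad \text{on } \mathbb{R}^n \times \{t = 0\}, \tag{22}$$

has for all dimensions the explicit solution

$$u(x,t) = \frac{1}{(4\pi t)^{n/2}} \int_{\mathbb{R}^n} e^{-|x - y|^2/4t}\, g(y)\, dy. \tag{23}$$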
Certain nonlinear PDEs, including the KdV equation 
(14), are also exactly solvable; discovering these so- 
called integrable partial differential equations is a very 
important undertaking. 

It is, however, a fundamental truth that we cannot solve
most PDEs, if by “solve” we mean to come up with a
more or less explicit formula for the answer.




2.2 Approximate Solutions and Perturbation 
Methods 

It is consequently important to realize that we can often 
deduce properties of solutions without actually solving 
the PDE, either explicitly or numerically. 

One such approach develops systematic perturba- 
tion schemes to build small “corrections” to a known 
solution. There is a vast repertoire of such techniques. 
Given a PDE depending on a small parameter e, the 
idea is to posit some form for the corrections and 
to plug this guess into the differential equation, try- 
ing then to fine-tune the form of the perturbations to 
make the error as small as possible. These procedures 
do not usually amount to proofs but rather construct 
self-consistent guesses. 

Multiple Scales 

Homogenization [II.17] problems entail PDEs whose
solutions act quantitatively differently on different spatial
or temporal scales, say of respective orders 1 and
$\varepsilon$. Often, a goal is to derive simpler effective PDEs
that yield good approximations. We guess the form
of the effective equations by supposing an asymptotic
expansion of the form

$$u^\varepsilon(x) \sim \sum_{k=0}^{\infty} \varepsilon^k u^k(x, x/\varepsilon)$$

and showing that the leading term $u^0$ is a function of
$x$ alone, solving some kind of simpler equation.

This example illustrates the insight that simpler 
behavior often appears in asymptotic limits. 

Asymptotic Matching 

Solutions of PDEs sometimes have quite different prop- 
erties in different subregions. When this happens we 
can try to fashion an approximate solution by (a) con- 
structing simpler approximate solutions in each sub- 
region and then (b) appropriately matching these solu- 
tions across areas of overlap. 

A common such application is to boundary layers.
The outer expansion for the solution within some region
often has a form like

$$u^\varepsilon(x) \sim \sum_{k=0}^{\infty} \varepsilon^k u^k(x). \tag{24}$$

Suppose we expect different behavior near the boundary,
which we take for simplicity to be the plane $\{x_n = 0\}$.
We can then introduce the stretched variables
$y_n = x_n/\varepsilon^\alpha$, $y_i = x_i$ ($i = 1, \dots, n - 1$) and define
$\bar{u}^\varepsilon(y) = u^\varepsilon(x)$. We then look for an inner expansion:

$$\bar{u}^\varepsilon(y) \sim \sum_{k=0}^{\infty} \varepsilon^k \bar{u}^k(y). \tag{25}$$

The idea now is to match terms in the outer expansion
(24) in the limit $x_n \to 0$ with terms in the inner
expansion (25) in the limit $y_n \to \infty$. Working this out
determines, for instance, the value of $\alpha$ in the scaling.

2.3 Numerical Analysis of PDEs 

Devising effective computer algorithms for PDEs is a 
vast enterprise— far beyond the scope of this article — 
and great ingenuity has gone into the design and 
implementation of such methods. 

Among the most popular are the finite-difference
methods [IV.13 §3] (which approximate functions by
values at grid points), the method of lines (which discretizes
all but the time variable), the finite-element
method [II.12] and spectral methods (which represent
functions using carefully designed basis functions),
multigrid methods [IV.13 §3] (which employ
discretizations across different spatial scales), and the
level set method [II.24] (which represents free boundaries
as a level set of a function).

The design and analysis of such useful numerical 
methods, especially for nonlinear equations, depends 
on a good theoretical understanding of the underlying 
PDE. 
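
To make the finite-difference idea concrete, here is a minimal Python sketch of the simplest explicit scheme for the one-dimensional heat equation $u_t = u_{xx}$ on $[0, 1]$ with zero Dirichlet data. The test problem (initial data $\sin \pi x$, whose exact solution is $e^{-\pi^2 t} \sin \pi x$), the grid size, and the function name `heat_explicit` are assumptions made for illustration. The restriction on the time step is exactly the kind of fact that comes from understanding the underlying PDE and its discretization.

```python
import math

def heat_explicit(g, n_cells, t_final, cfl=0.4):
    """March u_t = u_xx on [0, 1] with u = 0 at both ends.

    Uses forward differences in time and centered differences in space.
    The time step is chosen so that dt / dx^2 is about cfl; this explicit
    scheme is stable only when dt / dx^2 <= 1/2.
    """
    dx = 1.0 / n_cells
    n_steps = max(1, math.ceil(t_final / (cfl * dx * dx)))
    dt = t_final / n_steps
    lam = dt / (dx * dx)
    u = [g(i * dx) for i in range(n_cells + 1)]
    u[0] = u[-1] = 0.0                      # Dirichlet boundary values
    for _ in range(n_steps):
        new = u[:]
        for i in range(1, n_cells):
            new[i] = u[i] + lam * (u[i + 1] - 2.0 * u[i] + u[i - 1])
        u = new
    return u

g = lambda x: math.sin(math.pi * x)
t_final = 0.1
u = heat_explicit(g, n_cells=50, t_final=t_final)
exact = [math.exp(-math.pi**2 * t_final) * math.sin(math.pi * i / 50)
         for i in range(51)]
max_error = max(abs(a - b) for a, b in zip(u, exact))
```

Raising `cfl` above 0.5 in this sketch makes the computed values oscillate and grow, which illustrates why stability analysis is part of the design of such methods.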

2.4 Theory and the Importance of Estimates 

The fully rigorous theory of PDEs focuses largely on 
the foundational issues of the existence, smoothness, 
and, where appropriate, uniqueness of solutions. Once 
these issues are resolved, at least provisionally, theo- 
rists turn their attention to understanding the behavior 
of solutions. 

A key point is the availability, or not, of strong analytic
estimates. Many physically relevant PDEs predict that 
various quantities are conserved, but these identities 
are usually not strong enough to be useful, especially 
in three dimensions. For nonlinear PDEs the higher 
derivatives solve increasingly complicated, and thus 
intractable, equations. And so a major dynamic in 
modern theory is the interplay between (a) deriving 
“hard” analytic estimates for PDEs and (b) devising 
“soft” mathematical tools to exploit these estimates. 
In the remainder of this article we present for many 
important PDEs the key estimates upon which rigorous 
mathematical theory is built. 


3 The Behavior of Solutions 

Since PDEs model so vast a range of physical and other 
phenomena, their solutions display an even vaster 
range of behaviors. But some of these are more preva- 
lent than others. 

3.1 Waves 

Many PDEs of interest in applied mathematics support 
at least some solutions displaying “wavelike” behavior. 

The Wave Equation 

The wave equation is, of course, an example, as is most 
easily seen in one space dimension from d’Alembert’s 
formula (19). This dictates that the solution has the
general form $u(x,t) = F(x + ct) + G(x - ct)$ and is
consequently the sum of right- and left-moving waves
with speed c. The wavelike behavior encoded within 
Kirchhoff's formula (20) in three space dimensions is 
somewhat less obvious. 

Traveling Waves 

A solution $u$ of a PDE involving time $t$ and the single
space variable $x \in \mathbb{R}$ is a traveling wave if it has the
form

$$u(x,t) = v(x - \sigma t) \tag{26}$$

for some speed $\sigma$. More generally, a solution $u$ of a PDE
in more space variables having the form

$$u(x,t) = v(y \cdot x - \sigma t)$$

is a plane wave. An extremely useful first step for studying
a PDE is to look for solutions with these special
structures.

Dispersion 

It is often informative to look for plane-wave solutions
of the complex form

$$u(x,t) = e^{i(y \cdot x - \sigma t)}, \tag{27}$$

where $\sigma \in \mathbb{C}$ and $y \in \mathbb{R}^n$. We plug the guess (27) into
some given linear PDE in order to discover the so-called
dispersion relationship between $y$ and $\sigma = \sigma(y)$ forced
by the algebraic structure.

For example, inserting (27) into the Klein-Gordon
equation,

$$u_{tt} - \Delta u + m^2 u = 0, \tag{28}$$

gives $\sigma = \pm(|y|^2 + m^2)^{1/2}$. Hence the speed $\sigma/|y|$ of
propagation depends nonlinearly on the frequency of
the initial data $e^{i y \cdot x}$. So waves of different frequencies
propagate at different speeds; this is dispersion.
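
The dependence of the speed on the frequency can be tabulated directly from the dispersion relation for (28). The following Python sketch does so; the mass $m = 1$, the sample wave vectors, and the function names are illustrative assumptions.

```python
import math

def klein_gordon_sigma(y, m):
    """Positive branch of the dispersion relation sigma(y) for (28)."""
    norm_sq = sum(component * component for component in y)
    return math.sqrt(norm_sq + m * m)

def phase_speed(y, m):
    """Speed sigma / |y| of the plane wave exp(i (y.x - sigma t))."""
    norm = math.sqrt(sum(component * component for component in y))
    return klein_gordon_sigma(y, m) / norm

m = 1.0
speeds = {k: phase_speed((k, 0.0, 0.0), m) for k in (0.5, 1.0, 2.0, 8.0)}
for k, speed in speeds.items():
    print(f"|y| = {k:4.1f}   sigma/|y| = {speed:.4f}")
```

With $m = 0$ the same functions return speed 1 for every wave vector, which reflects the fact that the wave equation itself (with $c = 1$) is not dispersive.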


Solitons 

As a nonlinear example, putting (26) into the KdV equation
(14) with $a = 6$ and $b = 1$ leads to the ordinary
differential equation

$$-\sigma v' + 6 v v' + v''' = 0,$$

a solution of which is the explicit profile

$$v(s) = \frac{\sigma}{2} \operatorname{sech}^2\!\left( \frac{\sqrt{\sigma}}{2}\, s \right)$$

for each speed $\sigma$. The corresponding traveling wave,
$u(x,t) = v(x - \sigma t)$, is called a soliton.
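
A reader who wishes to check the soliton profile without doing the calculus by hand can do so numerically. The Python sketch below approximates the derivatives of the profile by centered differences and evaluates the left-hand side of the traveling-wave equation $-\sigma v' + 6vv' + v''' = 0$. The step size, the sample points, and the function names are assumptions for illustration.

```python
import math

def soliton_profile(s, sigma):
    """Profile v(s) = (sigma/2) sech^2(sqrt(sigma) s / 2)."""
    return 0.5 * sigma / math.cosh(0.5 * math.sqrt(sigma) * s) ** 2

def ode_residual(s, sigma, d=1e-3):
    """Finite-difference value of -sigma v' + 6 v v' + v''' at s."""
    v = lambda r: soliton_profile(r, sigma)
    v1 = (v(s + d) - v(s - d)) / (2 * d)
    v3 = (v(s + 2 * d) - 2 * v(s + d) + 2 * v(s - d) - v(s - 2 * d)) / (2 * d**3)
    return -sigma * v1 + 6 * v(s) * v1 + v3

worst = max(abs(ode_residual(s, sigma))
            for sigma in (1.0, 4.0)
            for s in (-1.0, 0.3, 2.0))
```

Note that the height of the profile, $\sigma/2$, is tied to its speed $\sigma$: taller solitons travel faster.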

3.2 Diffusion and Smoothing 

We can read off a lot of interesting quantitative infor- 
mation about the solution u of the initial-value problem 
(22) for the heat equation from the explicit formula (23). 

In particular, notice from (23) that, if the initial 
data function g is merely integrable, the solution u is 
infinitely differentiable in both the variables x and t at 
later times. So the heat equation instantly smooths its 
initial data; this observation makes sense as the PDE
models diffusive effects. 
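
The smoothing effect can be seen in a simple case. The Python sketch below takes, as an illustrative assumption, the discontinuous one-dimensional initial data $g = 1$ for $y > 0$ and $g = 0$ for $y < 0$, evaluates (23) by numerical quadrature, and compares the result with the equivalent closed form $\frac{1}{2}(1 + \operatorname{erf}(x/2\sqrt{t}))$. The quadrature rule, cutoff, and function names are likewise assumptions.

```python
import math

def heat_solution_step(x, t, n_quad=4000, cutoff=12.0):
    """Evaluate (23) in one dimension for g = 1 on y > 0, g = 0 on y < 0.

    The integral over y > 0 is truncated to the window where the kernel
    is not negligible and approximated by the midpoint rule.
    """
    width = cutoff * math.sqrt(2.0 * t)
    lo, hi = max(0.0, x - width), max(0.0, x + width)
    if hi == lo:
        return 0.0
    dy = (hi - lo) / n_quad
    total = 0.0
    for k in range(n_quad):
        y = lo + (k + 0.5) * dy
        total += math.exp(-(x - y) ** 2 / (4.0 * t))
    return total * dy / math.sqrt(4.0 * math.pi * t)

def closed_form(x, t):
    """The same integral written with the error function."""
    return 0.5 * (1.0 + math.erf(x / (2.0 * math.sqrt(t))))

t = 0.01
gap = max(abs(heat_solution_step(x, t) - closed_form(x, t))
          for x in (-0.3, -0.05, 0.0, 0.05, 0.3))
max_slope = 1.0 / math.sqrt(4.0 * math.pi * t)   # u_x(0, t), finite for t > 0
```

The jump in the initial data is replaced at every positive time by a smooth transition whose steepest slope, $1/\sqrt{4\pi t}$, is finite and decreases as $t$ grows.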


3.3 Propagation Speeds 


It is also easy to deduce from (23) that, if $u$ solves the
heat equation, then values of the initial data $g(y)$ at all
points $y \in \mathbb{R}^n$ contribute to determining the solution
at $(x,t)$ for times $t > 0$. We can interpret this as an
“infinite propagation speed” phenomenon.

By contrast, for many time-dependent PDEs we have 
instead “finite propagation speed.” This means that 
some parts of the initial data do not affect the solu- 
tion at a given point in space until enough time has 
passed. This is so for first-order PDEs in general, for the 
wave equation, and remarkably also for some nonlinear 
diffusion PDEs, such as the porous medium equation 

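$$u_t - \Delta(u^\gamma) = 0 \tag{29}$$

with $\gamma > 1$. The particular explicit solution

$$u(x,t) = \frac{1}{t^\alpha} \left( b - \frac{\gamma - 1}{2\gamma}\, \beta\, \frac{|x|^2}{t^{2\beta}} \right)_+^{1/(\gamma - 1)} \tag{30}$$

for

$$\alpha = \frac{n}{n(\gamma - 1) + 2}, \qquad \beta = \frac{1}{n(\gamma - 1) + 2},$$

and $x_+ = \max\{x, 0\}$ shows clearly that the region of
positivity moves outward at finite speed.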


3.4 Pattern Formation 

The interplay between diffusion and nonlinear terms
can create interesting effects. For example, let
$\Phi(z) = \frac{1}{2}(z^2 - 1)^2$ denote a “two-well” potential, having minima
at $z = \pm 1$. Look now at this scalar reaction-diffusion
problem in which $\varepsilon > 0$ is a small parameter:

$$u^\varepsilon_t - \Delta u^\varepsilon = -\frac{1}{\varepsilon^2} \Phi'(u^\varepsilon) \quad \text{in } \mathbb{R}^2 \times (0,\infty), \qquad u^\varepsilon = g^\varepsilon \quad \text{on } \mathbb{R}^2 \times \{t = 0\}.$$

For suitably designed initial data functions $g^\varepsilon$, it turns
out that

$$\lim_{\varepsilon \to 0} u^\varepsilon(x,t) = \pm 1;$$

so the solution asymptotically goes to one or the other
of the two minima of $\Phi$. We can informally think of
these regions as being colored black and white.

For each time $t \geq 0$, denote by $\Gamma(t)$ the curve between
the regions $\{u^\varepsilon(\cdot, t) \to 1\}$ and $\{u^\varepsilon(\cdot, t) \to -1\}$. Asymptotic
matching methods reveal that the normal velocity
of $\Gamma(t)$ equals its curvature. This is a geometric law of
motion for the evolving black/white patterns emerging
in the asymptotic limit as $\varepsilon \to 0$.

Much more complex pattern formation effects can be 
modeled by systems of reaction-diffusion PDEs of the 
general form (9): see the article on pattern formation 
[IV.27] elsewhere in this volume. 

3.5 Blow-up 

Solutions of time-dependent PDEs may or may not exist 
for all future times, even if their initial conditions at 
time $t = 0$ are well behaved. Note, for example, that
among solutions of the nonlinear heat equation

$$u_t - \Delta u = u^2, \tag{31}$$

subject to Neumann boundary conditions (16), are
those solutions $u = u(t)$ that do not depend on $x$ and
consequently that solve the ordinary differential equation
$u_t = u^2$. It is not hard to show that solutions of this
equation go to infinity (“blow up”) at a finite positive
time, if $u(0) > 0$.
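
The blow-up of the ordinary differential equation $u_t = u^2$ can be made explicit: separating variables gives $u(t) = u_0/(1 - u_0 t)$ for initial value $u_0 > 0$, which becomes infinite at time $1/u_0$. The Python sketch below compares this formula with a standard Runge-Kutta integration as the blow-up time is approached. The choice $u_0 = 2$, the step count, and the function names are illustrative assumptions.

```python
def exact_solution(u0, t):
    """Solution of u' = u^2, u(0) = u0, valid for t < 1/u0."""
    return u0 / (1.0 - u0 * t)

def rk4_until(u0, t_end, n_steps=20000):
    """Integrate u' = u^2 from 0 to t_end with the classical RK4 scheme."""
    f = lambda u: u * u
    dt = t_end / n_steps
    u = u0
    for _ in range(n_steps):
        k1 = f(u)
        k2 = f(u + 0.5 * dt * k1)
        k3 = f(u + 0.5 * dt * k2)
        k4 = f(u + dt * k3)
        u += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return u

u0 = 2.0
blow_up_time = 1.0 / u0
for fraction in (0.5, 0.9, 0.99):
    t = fraction * blow_up_time
    print(t, rk4_until(u0, t), exact_solution(u0, t))
```

Larger initial values blow up sooner, since the blow-up time is $1/u_0$.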

For more general initial data, there is an interesting
competition between the diffusive, and therefore
stabilizing, term $\Delta u$ and the destabilizing term $u^2$.

3.6 Shocks 

As we have just seen, a solution of a time-dependent 
PDE can fail to exist for large times since its maximum 
may explode to infinity in finite time. But there are other 
mechanisms whereby a solution may cease to exist; it is 
possible, for example, that a solution remains bounded,
but its gradient becomes singular in finite time. 

This effect occurs for conservation laws (2). Consider,
for example, the following initial-value problem for the
Burgers equation [III.4]:

$$u_t + \tfrac{1}{2}(u^2)_x = 0 \quad \text{in } \mathbb{R} \times (0,\infty), \qquad u = g \quad \text{on } \mathbb{R} \times \{t = 0\}. \tag{32}$$

Assume we have a smooth solution $u$ and define the
characteristic curve $x(t)$ to solve the ordinary differential
equation

$$\dot{x}(t) = u(x(t), t) \quad (t \geq 0), \qquad x(0) = x_0.$$

Then

$$\frac{d}{dt} u(x(t), t) = u_x(x(t), t)\, \dot{x}(t) + u_t(x(t), t) = u_x(x(t), t)\, u(x(t), t) + u_t(x(t), t) = 0,$$

according to the PDE (32). Consequently, $u(x(t), t) = g(x_0)$,
and also the characteristic “curve” $x(t)$ is in fact
a straight line.

So far, so good; and yet the foregoing often implies
that the PDE does not in fact possess a smooth solution
existing for all times. To see this, notice that we can
easily build initial data $g$ for which the characteristic
lines emanating from two distinct initial points cross
at some later time, say at $(x,t)$. If we then use these
two different characteristics to compute $u(x,t)$, we will
get different answers. This seeming paradox is resolved
once we understand that the Burgers equation with the
initial data $g$ simply does not have a smooth solution
existing until the time $t$.
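
Since each characteristic is the straight line $x(t) = x_0 + g(x_0)\,t$, the crossing time of two characteristics can be computed by elementary algebra. The Python sketch below does this for an assumed smooth initial profile $g(x) = e^{-x^2}$ and two assumed starting points; the function names are also illustrative.

```python
import math

def characteristic(x0, g, t):
    """Position at time t of the characteristic starting at x0."""
    return x0 + g(x0) * t

def crossing_time(x0, x1, g):
    """Time at which the characteristics from x0 < x1 meet, or None.

    They meet only if the left one carries the larger value of g.
    """
    speed_gap = g(x0) - g(x1)
    if speed_gap <= 0.0:
        return None
    return (x1 - x0) / speed_gap

g = lambda x: math.exp(-x * x)        # illustrative smooth initial data
x0, x1 = 0.5, 1.5
t_cross = crossing_time(x0, x1, g)
meeting_point = characteristic(x0, g, t_cross)
values_carried = (g(x0), g(x1))       # two different candidate values of u
```

At the meeting point the two characteristics deliver the two different values in `values_carried`, which is the contradiction described above. Where $g$ is increasing, the characteristics spread apart and `crossing_time` returns `None`.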

A major task for the rigorous analysis of the Burgers 
equation and related conservation laws is characterizing
surfaces of discontinuity (called shocks [II.30]) for
appropriately defined generalized solutions. 

3.7 Free Boundaries 

Some very difficult problems require more than just 
finding the solution of some PDE: one must also find 
the region within which that PDE holds. Consider, for 
example, the Stefan problem, which asks us to deter- 
mine the temperature within some body of water sur- 
rounded by ice. The temperature distribution solves the 
heat equation inside a region whose shape changes in 
time as the ice melts and/or the water freezes. The 
unknowns are therefore both the temperature profile 
and the so-called free boundary of the water. 



There are in general two sorts of such free boundary 
problems that occur in PDE theory: those for which the 
free boundary is explicit, such as the Stefan problem, 
and those for which it is implicit. An example of the
latter is the obstacle problem:

$$\min\{u, -\Delta u - f\} = 0.$$

The free boundary is

$$\Gamma = \partial\{u > 0\},$$

along which the solution satisfies the overdetermined
boundary conditions $u = 0$, $\partial u/\partial \nu = 0$. Many important
physical and engineering free boundary problems can
be cast as obstacle problems.

Much more complicated free boundary problems
occur in fluid mechanics, in which the unknown velocity
$u$ satisfies differing sorts of PDE within the sonic and
subsonic regions. We say that the equations change type
across the free boundary.


4 Some Technical Methods 

So vast is the field of PDEs that no small handful of 
procedures can possibly handle them all. Rather, math- 
ematicians have discovered over the years, and con- 
tinue to discover, all sorts of useful technical devices 
and tricks. This section provides a selection of some of 
the most important. 


4.1 Transform Methods 


A panoply of integral transforms is available to convert
linear, constant-coefficient PDEs into algebraic equations.
The most important is the Fourier transform
[II.19]:

$$\hat{u}(y) := \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{-i x \cdot y} u(x)\, dx.$$

Consider, as an example, the equation

$$-\Delta u + u = f \quad \text{in } \mathbb{R}^n. \tag{33}$$

We apply the Fourier transform and learn that
$(1 + |y|^2)\hat{u} = \hat{f}$. This algebraic equation lets us easily find
$\hat{u}$, after which a somewhat tricky inversion yields the
formula

$$u(x) = \frac{1}{(4\pi)^{n/2}} \int_0^\infty \int_{\mathbb{R}^n} \frac{e^{-s - |x - y|^2/4s}}{s^{n/2}}\, f(y)\, dy\, ds.$$

Strongly related are Fourier series methods, which
represent solutions of certain PDEs on bounded domains
as infinite sums entailing sines and cosines.
Another favorite is the Laplace transform [II.19],
which for PDEs is mostly useful as a transform in the
time variable.
4.2 Energy Methods and the Functional Analytic
Framework 

For many PDEs, various sorts of “energy estimates” are 
valid, where we use this term loosely to mean integral 
expressions involving squared quantities. 

Integration by Parts 

Important for what follows is the integration by parts
formula:

$$\int_U u_{x_i} v\, dx = -\int_U u\, v_{x_i}\, dx + \int_{\partial U} u v\, \nu^i\, dS$$

for each $i = 1, \dots, n$. Here, $\nu$ denotes the outward-pointing
unit normal to the boundary. This is a form of
the divergence theorem [I.2 §24] from multivariable
calculus.


Energy Estimates 


Assume that $u$ solves Poisson’s equation:

$$-\Delta u = f \quad \text{in } \mathbb{R}^n. \tag{34}$$

Then, assuming that $u$ goes to zero as $|x| \to \infty$ fast
enough to justify the integration by parts, we compute
that

$$\int_{\mathbb{R}^n} f^2\, dx = \int_{\mathbb{R}^n} (\Delta u)^2\, dx = \sum_{i,j=1}^{n} \int_{\mathbb{R}^n} u_{x_i x_i} u_{x_j x_j}\, dx = \sum_{i,j=1}^{n} \int_{\mathbb{R}^n} u_{x_i x_j} u_{x_i x_j}\, dx.$$

This identity implies something remarkable: if the
Laplacian $\Delta u$ (which is the sum of the pure second
derivatives $u_{x_i x_i}$ for $i = 1, \dots, n$) is square integrable,
then each individual second derivative $u_{x_i x_j}$ for $i, j =
1, \dots, n$ is square integrable, even those mixed second
derivatives that do not even appear in (34).

This is an example of regularity theory, which aims
to deduce the higher integrability and/or smoothness
properties of solutions.


Time-Dependent Energy Estimates 

As a next example suppose that $u = u(x,t)$ solves the
wave equation (6), and define the energy at time $t$:

$$e(t) := \frac{1}{2} \int_{\mathbb{R}^n} (u_t^2 + c^2 |\nabla u|^2)\, dx.$$

Then, assuming that $u$ goes to zero as $|x| \to \infty$ fast
enough, we have

$$\dot{e}(t) = \int_{\mathbb{R}^n} (u_t u_{tt} + c^2 \nabla u \cdot \nabla u_t)\, dx = \int_{\mathbb{R}^n} u_t (u_{tt} - c^2 \Delta u)\, dx = 0.$$

This demonstrates conservation of energy.

For a nonlinear wave equation of the form

$$u_{tt} - \Delta u + f(u) = 0, \tag{35}$$

a similar calculation works for the modified energy

$$e(t) := \int_{\mathbb{R}^n} \left( \tfrac{1}{2} u_t^2 + \tfrac{1}{2} |\nabla u|^2 + F(u) \right) dx,$$

where $f = F'$.

4.3 Variational Problems 

By far the most successful of the nonlinear theories is 
the calculus of variations [IV.6]; indeed, a funda- 
mental question to ask of any given PDE is whether or 
not it is variational, meaning that it appears as follows. 

Given the Lagrangian density function $L = L(v,z,x)$,
we introduce the functional

$$I[u] := \int_U L(\nabla u, u, x)\, dx,$$

defined for functions $u \colon U \to \mathbb{R}$, subject to given
boundary conditions that are not specified here. Suppose
hereafter that $u$ is a minimizer of $I[\cdot]$.

We will show that $u$ automatically solves an appropriate
PDE. To see this, put $i(t) := I[u + tv]$, where $v$
vanishes near $\partial U$. Since $i$ has a minimum at $t = 0$, we
can use the chain rule to compute

$$0 = i'(0) = \int_U (\nabla_v L \cdot \nabla v + L_z v)\, dx;$$

and so, after an integration by parts,

$$0 = \int_U (-\operatorname{div}(\nabla_v L) + L_z)\, v\, dx,$$

in which $L$ is evaluated at $(\nabla u, u, x)$. Here we write
$\nabla_v L = (L_{v_1}, \dots, L_{v_n})$.

This integral identity is valid for all functions $v$
vanishing on $\partial U$, and from this the Euler-Lagrange
equation [III.12] follows:

$$-\operatorname{div}(\nabla_v L(\nabla u, u, x)) + L_z(\nabla u, u, x) = 0. \tag{36}$$

The Nonlinear Poisson Equation

For example, the Euler-Lagrange equation for

$$I[u] = \int_U \tfrac{1}{2} |\nabla u|^2 - F(u)\, dx$$

is the nonlinear Poisson equation

$$-\Delta u = f(u), \tag{37}$$

where $f = F'$.
Minimal Surfaces

The surface area of the graph of a function $u$ is

$$I[u] = \int_U (1 + |\nabla u|^2)^{1/2}\, dx,$$

and the corresponding Euler-Lagrange equation is the
minimal surface equation

$$\operatorname{div}\left( \frac{\nabla u}{(1 + |\nabla u|^2)^{1/2}} \right) = 0. \tag{38}$$

The expression on the left-hand side is ($n$ times) the
mean curvature of the surface; and consequently, a
minimal surface has zero mean curvature.

4.4 Maximum Principles

The integral energy methods just discussed can be augmented
for certain PDEs with pointwise maximum principle
techniques. These are predicated upon the elementary
observation that, if the function $u$ attains its
maximum at an interior point $x_0$, then

$$u_{x_k}(x_0) = 0, \quad k = 1, \dots, n, \tag{39}$$

and

$$\sum_{k,l=1}^{n} u_{x_k x_l}(x_0)\, \xi_k \xi_l \leq 0, \quad \xi \in \mathbb{R}^n. \tag{40}$$

Linear Elliptic Equations

Such insights are essential for understanding the general
second-order linear elliptic equation

$$Lu = 0, \tag{41}$$

where

$$Lu = -\sum_{i,j=1}^{n} a^{ij}(x) u_{x_i x_j} + \sum_{i=1}^{n} b^i(x) u_{x_i} + c(x) u.$$

We say $L$ is elliptic provided the symmetric matrix
$((a^{ij}(x)))$ is positive-definite. In usual applications $u$
represents the density of some quantity. The second-order
term $\sum_{i,j=1}^{n} a^{ij} u_{x_i x_j}$ records diffusion, the first-order
term $\sum_{i=1}^{n} b^i u_{x_i}$ represents transport, and the
zeroth-order term $cu$ describes the local increase or
depletion.

We use the maximum principle to show, for instance,
that if $c > 0$ then $u$ cannot attain a positive maximum
at an interior point. Indeed, if $u$ took on a positive maximum
at some point $x_0$, then the first term of $Lu$ at $x_0$
would be nonnegative (according to (40)), the next term
would be zero (according to (39)), and the last would be
positive. But this is a contradiction, since $Lu(x_0) = 0$.

Nonlinear Elliptic Equations

Maximum principle techniques also apply to many
highly nonlinear equations, such as the Hamilton-Jacobi-Bellman
equation:

$$\max_{k=1,\dots,m} \{L^k u\} = 0. \tag{42}$$

This is an important equation in stochastic optimization
theory, in which each elliptic operator $L^k$ is the infinitesimal
generator of a different stochastic process. We
leave it to the reader to use the maximum principle to
show that a solution of (42) cannot attain an interior
maximum or minimum.

Related, but much more sophisticated, maximum
principle arguments can reveal many of the subtle
properties of solutions to the linear elliptic equation
(41) and the nonlinear equation (42).

4.5 Differential Inequalities

Since solutions of PDEs depend on many variables,
another useful trick is to design appropriate integral
expressions over all but one of these variables, in the
hope that these expressions will satisfy interesting
differential inequalities in the remaining variable.

Dissipation Estimates and Gradient Flows

For example, let $u = u(x,t)$ solve the nonlinear gradient
flow equation

$$u_t - \operatorname{div}(\nabla L(\nabla u)) = 0 \tag{43}$$

in $\mathbb{R}^n \times (0,\infty)$. Put

$$e(t) := \int_{\mathbb{R}^n} L(\nabla u)\, dx.$$

Then, assuming that $u$ goes to zero rapidly as $|x| \to \infty$,
we have

$$\dot{e}(t) = \int_{\mathbb{R}^n} \nabla L(\nabla u) \cdot \nabla u_t\, dx = -\int_{\mathbb{R}^n} (\operatorname{div} \nabla L(\nabla u))\, u_t\, dx = -\int_{\mathbb{R}^n} (u_t)^2\, dx \leq 0.$$

This is a dynamic dissipation inequality.


Entropy Estimates 

Related to dissipation inequalities are entropy estimates
for conservation laws. To illustrate these, assume
that $u^\varepsilon = u^\varepsilon(x,t)$ solves the viscous conservation
law

$$u^\varepsilon_t + F(u^\varepsilon)_x = \varepsilon u^\varepsilon_{xx} \tag{44}$$

for $\varepsilon > 0$. Suppose $\Phi$ is a convex function and put

$$e(t) := \int_{\mathbb{R}} \Phi(u^\varepsilon)\, dx.$$

Then

$$\dot{e}(t) = \int_{\mathbb{R}} \Phi' u^\varepsilon_t\, dx = \int_{\mathbb{R}} \Phi' (-F_x + \varepsilon u^\varepsilon_{xx})\, dx = -\int_{\mathbb{R}} (\Psi(u^\varepsilon)_x + \varepsilon \Phi'' (u^\varepsilon_x)^2)\, dx = -\int_{\mathbb{R}} \varepsilon \Phi'' (u^\varepsilon_x)^2\, dx \leq 0,$$

where $\Psi$ satisfies $\Psi' = \Phi' F'$. What is important is that
we have found not just one but rather a large collection
of dissipation inequalities, corresponding to each pair
of entropy/entropy flux functions $(\Phi, \Psi)$.

Finding and utilizing entropy/entropy flux pairs for 
systems of conservation laws of the form (8) is a major 
challenge. 


Monotonicity Formulas


For monotonicity formulas we try to find interesting 
expressions to integrate over balls B(0, r), with center 
0, say, and radius r. The hope is that these integral 
quantities will solve useful differential inequalities as 
functions of r. 

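As an example, consider the system

$$-\Delta u = |Du|^2 u, \qquad |u|^2 = 1 \tag{45}$$

for the unknown $u = (u^1, \dots, u^m)$, where we write
$|Du|^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} (u^j_{x_i})^2$. A solution $u$ is called a harmonic
map into the unit sphere. It is a challenging
exercise to derive from (45) the differential inequality

$$\frac{d}{dr} \left( \frac{1}{r^{n-2}} \int_{B(0,r)} |Du|^2\, dx \right) = \frac{2}{r^n} \int_{\partial B(0,r)} \sum_{i,j=1}^{n} x_i x_j\, u_{x_i} \cdot u_{x_j}\, dS \geq 0,$$

from which we deduce that

$$\frac{1}{r^{n-2}} \int_{B(0,r)} |Du|^2\, dx \leq \frac{1}{R^{n-2}} \int_{B(0,R)} |Du|^2\, dx$$

if $0 < r < R$. This inequality is often useful, as it lets
us deduce fine information at small scales $r$ from that
at larger scales $R$.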


5 Theory and Application 

The foregoing listing of mathematical viewpoints and 
technical tricks provides at best a glimpse into the 
immensity of modern PDE theory, both pure and 
applied. 


5.1 Well-Posed Problems 

A common goal of most of these procedures is to 
understand a given PDE (plus appropriate boundary 
and/or initial conditions) as a well-posed problem, 
meaning that (a) the solution exists, (b) it is unique, 
and (c) it depends continuously on the given data for 
the problem. This is usually the beginning of wisdom, 
as well-posed problems provide the starting point for 
further theoretical inquiry, for numerical analysis, and 
for construction of approximate solutions. 

5.2 Generalized Solutions 

A central theoretical problem is therefore to fashion 
for any given PDE problem an appropriate notion of
solution for which the problem is well-posed. For lin- 
ear PDEs the concept of “distributional solutions” is 
usually the best, but for nonlinear problems there are 
many, including “viscosity solutions,” “entropy solu- 
tions,” “renormalized solutions,” etc. 

For example, the unique entropy solution of the 
initial-value problem (2) for a scalar conservation law 
exists for all positive times, but it may support lines 
of discontinuities across so-called shock waves. Simi- 
larly, the unique viscosity solution of the initial-value 
problem for the Hamilton-Jacobi equation (3) gener- 
ally supports surfaces of discontinuity for its gradient. 
The explicit solution (30) for the porous medium equa- 
tion is, likewise, not smooth everywhere and so needs 
suitable interpretation as a valid generalized solution. 

The research literature teems with many such no- 
tions, and some of the deepest insights in the field 
are uniqueness theorems for appropriate generalized 
solutions. 

5.3 Learning More 

As promised, this article is a wide-ranging survey that 
actually explains precious little in any detail. 

To learn more, interested readers should definitely 
consult other articles in this volume as well as the fol- 
lowing suggested reading. Markowich (2007) is a nice 
introduction to the subject, with lots of pictures, and 
Strauss (2008) is a very good undergraduate text, con- 
taining derivations of the various formulas cited here. 
The survey article by Klainerman (2008) is extensive 
and provides some different viewpoints. My graduate- 
level textbook (Evans 2010) carefully builds up much 
of the modern theory of PDEs, but it is aimed at 
mathematically advanced students. 
IV.4 Integral Equations

Rainer Kress 


1 Introduction 

Some forty years ago when I was working on my thesis, 
I fell in love with integral equations, one of the most 
beautiful topics in both pure and applied analysis. This 
article is intended to stimulate the reader to share this 
love with me. 

The term integral equation was first used by Paul
du Bois-Reymond in 1888 for equations in which an
unknown function occurs under an integral. Typical
examples of such integral equations are

$$\int_0^1 K(x,y)\, \varphi(y)\, dy = f(x) \tag{1}$$

and

$$\varphi(x) + \int_0^1 K(x,y)\, \varphi(y)\, dy = f(x). \tag{2}$$

In these equations the function $\varphi$ is the unknown, and
the kernel $K$ and the right-hand side $f$ are given functions.
Solving one of these integral equations amounts
to determining a function $\varphi$ such that the equation is
satisfied for all $x$ with $0 \leq x \leq 1$. Equations (1) and (2)
carry the name of Ivar Fredholm and are called Fredholm
integral equations of the first and second kind,
respectively. In the first equation the unknown function
only occurs under the integral, whereas in the second
equation it also appears outside the integral. Later on
we will show that this is more than just a formal difference
between the two types of equations. A first impression
of the difference can be obtained by considering
the special case of a constant kernel $K(x,y) = c \neq 0$
for all $x, y \in [0,1]$. On the one hand, it is easily seen
that the equation of the second kind (2) has a unique
solution given by

$$\varphi(x) = f(x) - \frac{c}{1 + c} \int_0^1 f(y)\, dy$$

provided $c \neq -1$. If $c = -1$ then (2) is solvable if and
only if $\int_0^1 f(y)\, dy = 0$, and the general solution is given
by $\varphi = f + \gamma$ with an arbitrary constant $\gamma$. On the other
hand, the equation of the first kind (1) is solvable if and
only if $f$ is a constant: $f(x) = \gamma$ for all $x$, say. In this
case every function $\varphi$ with mean value $\gamma/c$ is a solution.
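
The constant-kernel case is simple enough to check numerically. The Python sketch below builds the solution of the second-kind equation (2) with $K(x,y) = c$ from the formula above and verifies that it satisfies the equation at a few sample points. The right-hand side $f$, the value $c = 2.5$, the Simpson quadrature, and the function names are assumptions for illustration.

```python
import math

def integrate01(func, n=1000):
    """Composite Simpson rule on [0, 1] (n must be even)."""
    step = 1.0 / n
    total = func(0.0) + func(1.0)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * func(k * step)
    return total * step / 3.0

def solve_second_kind_constant_kernel(f, c):
    """Return phi solving phi(x) + c * int_0^1 phi(y) dy = f(x), c != -1."""
    if c == -1.0:
        raise ValueError("c = -1 is the exceptional case discussed in the text")
    shift = c / (1.0 + c) * integrate01(f)
    return lambda x: f(x) - shift

f = lambda x: math.cos(3.0 * x) + x       # illustrative right-hand side
c = 2.5
phi = solve_second_kind_constant_kernel(f, c)
mean_phi = integrate01(phi)
residual = max(abs(phi(x) + c * mean_phi - f(x)) for x in (0.0, 0.25, 0.5, 1.0))
```

No comparable construction is possible for the first-kind equation (1) with this right-hand side, since $f$ here is not constant.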

The integration domains in (1) and (2) are not 
restricted to the interval [0,1]. In particular, the inte- 
gration domain can be multidimensional, and for the 
integral equation of the first kind, the domain in which 
the equation is required to be satisfied need not coin- 
cide with the integration domain. 

The first aim of this article is to guide the reader 
through part of the historical development of the 
theory and the applications of these equations. In par- 
ticular, we discuss their close connection to partial dif- 
ferential equations and emphasize their fundamental 
role in the early years of the development of functional 
analysis as the appropriate abstract framework for 
studying integral (and differential) equations. Then, in 
the second part of the article, we will illustrate how inte- 
gral equations play an important role in current math- 
ematical research on inverse and ill-posed problems 
in areas such as medical imaging and nondestructive 
evaluation. 

Two mathematical problems are said to be inverse 
to each other if the formulation of the first problem 
contains the solution of the second problem, and vice 
versa. According to this definition, at first glance it 
seems arbitrary to distinguish one of the two problems 
as an inverse problem. However, in general, one of the 
two problems is easier and more intensively studied, 
while the other is more difficult and less explored. The 
first problem is then denoted as the direct problem, and 
the second as the inverse problem. 

A wealth of inverse problems arise in the mathe- 
matical modeling of noninvasive evaluation and imag- 
ing methods in science, medicine, and technology. For 
imaging devices such as conventional X-rays or X-ray 
tomography, the direct problem consists of determining
the images, i.e., two-dimensional projections of
the known density distribution on planar photographic 
films in conventional X-ray devices and projections 
along all lines through the object measured via inten- 
sity losses in X-ray tomography (for the latter see also 
section 8). Conversely, the inverse problem demands 
that we reconstruct the density from the images. More 
generally, inverse problems answer questions about the 
cause of a given effect, whereas in the corresponding 
direct problem the cause is known and the effect is 
to be determined. A common feature of such inverse 
problems is their ill-posedness, or instability, i.e., small 
changes in the measured effect may result in large 
changes in the estimated cause. 

Equations (1) and (2) are linear equations since 
the unknown function $\varphi$ appears in a linear fashion.
Though nonlinear integral equations also constitute 
an important part of the mathematical theory and the 
applications of integral equations, we do not consider 
them here. 


2 Abel’s Integral Equation 


As an appetizer we consider Abel’s integral equation. It 
was one of the first integral equations in mathematical 
history. A tautochrone is a planar curve for which the 
time taken by an object sliding without friction in uni- 
form gravity to reach its lowest point is independent of 
its starting point. The problem of identifying this curve 
was solved by Christiaan Huygens in 1659, who, using 
geometrical tools, established that the tautochrone is a 
cycloid. 

In 1823 Niels Henrik Abel attacked the more general
problem of determining a planar curve such that the
time of descent for a given starting height $y$ coincides
with the value $f(y)$ of a given function $f$. The
tautochrone then reduces to the special case when $f$
is a constant. Following Abel we describe the curve by
$x = \psi(y)$ (with $\psi(0) = 0$) and, using the principle of
conservation of energy, obtain

$$f(y) = \int_0^y \frac{\varphi(\eta)}{\sqrt{y - \eta}}\,d\eta, \quad y > 0, \tag{3}$$

for the total time $f(y)$ required for the object to fall
from $P = (\psi(y), y)$ to $P_0 = (0,0)$, where

$$\varphi := \sqrt{\frac{1 + [\psi']^2}{2g}}$$


and $g$ denotes the acceleration due to gravity. Equation (3)
is known as Abel’s integral equation. Given the shape
$\varphi$, the falling time $f$ is obtained by simply
evaluating the integral on the right-hand side of (3).
However, the solution of the generalized tautochrone
problem requires the solution of the inverse problem; that
is, given the function $f$, the solution $\varphi$ of the
integral equation (3) has to be found, which is certainly a
more challenging task. This solution can be shown to be
given by

$$\varphi(y) = \frac{1}{\pi}\,\frac{d}{dy}\int_0^y \frac{f(\eta)}{\sqrt{y - \eta}}\,d\eta, \quad y > 0. \tag{4}$$

For the special case of a constant $f = \pi\sqrt{a/2g}$ with
$a > 0$ one obtains from (4), after some calculations,
that $[\psi'(y)]^2 = (a/y) - 1$, and it can be seen that the
solution of this equation is given by the cycloid with
parametric representation

$$(x(t), y(t)) = \tfrac{1}{2}a\,(t + \sin t,\; 1 - \cos t), \quad 0 \le t \le \pi.$$
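
As a plausibility check of (3), the following Python sketch evaluates the descent time numerically. The substitution $\eta = y\sin^2\theta$ (which removes the singularity of the integrand at $\eta = y$), the midpoint rule, the numerical value of $g$, and the names descent_time, phi_cycloid, and phi_line are choices made here for illustration and are not taken from Abel’s treatment. For the cycloid, $\varphi(\eta) = \sqrt{a/(2g\eta)}$ follows from $[\psi']^2 = (a/\eta) - 1$.

```python
import math

G = 9.81  # acceleration due to gravity in m/s^2 (assumed value)

def descent_time(phi, y, n=200):
    """Approximate f(y) = int_0^y phi(eta) / sqrt(y - eta) d eta, see (3).

    With eta = y*sin(theta)**2 the integral becomes
    int_0^{pi/2} 2*sqrt(y)*sin(theta)*phi(y*sin(theta)**2) d theta,
    which is evaluated here by the composite midpoint rule.
    """
    h = (math.pi / 2) / n
    total = 0.0
    for k in range(n):
        s = math.sin((k + 0.5) * h)
        total += 2.0 * math.sqrt(y) * s * phi(y * s * s)
    return h * total

def phi_cycloid(a):
    """phi for the cycloid: (1 + psi'^2) / (2g) = a / (2 g eta)."""
    return lambda eta: math.sqrt(a / (2.0 * G * eta))

def phi_line(c):
    """phi for the straight line x = c*y (an inclined plane)."""
    return lambda eta: math.sqrt((1.0 + c * c) / (2.0 * G))

if __name__ == "__main__":
    a = 2.0
    for y0 in (0.1, 0.5, 1.0):
        # The cycloid gives the same time for every starting height.
        print(y0, descent_time(phi_cycloid(a), y0), descent_time(phi_line(1.0), y0))
```

For the cycloid the computed times agree with $\pi\sqrt{a/2g}$ for every starting height $0 < y \le a$, whereas for the inclined plane the time grows like $\sqrt{y}$.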


3 The Early Years 

We proceed by giving a brief account of the close 
connections between the early development of inte- 
gral equations and potential theory. For the sake of 
simplicity we confine the presentation to the two- 
dimensional case as a model for the practically relevant 
three-dimensional case. In what follows, $x = (x_1, x_2)$
and $y = (y_1, y_2)$ stand for points or vectors in the
Euclidean space $\mathbb{R}^2$. Twice continuously
differentiable solutions $u$ of Laplace’s equation

$$\frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2} = 0$$

are called harmonic functions [III.18 §1]. They model
time-independent temperature distributions, potentials
of electrostatic and magnetostatic fields, and
velocity potentials of incompressible irrotational fluid 
flows. 

For a simply connected bounded domain $D$ in $\mathbb{R}^2$
with smooth boundary $\Gamma := \partial D$ the Dirichlet problem
of potential theory consists of finding a harmonic function
$u$ in $D$ that is continuous up to the boundary and
assumes boundary values $u = f$ on $\Gamma$ for a given
continuous function $f$ on $\Gamma$. A first approach to this
problem, developed in the early nineteenth century, was to
create a so-called single-layer potential by distributing
point sources with a density $\varphi$ on the boundary curve
$\Gamma$, i.e., by looking for a solution in the form

$$u(x) = \int_\Gamma \varphi(y)\ln|x - y|\,ds(y), \quad x \in D. \tag{5}$$

Here, $|\cdot|$ denotes the Euclidean norm, i.e., $|x - y|$
represents the distance between the two points $x$ in $D$ and
$y$ on $\Gamma$. Since $\ln|x - y|$ satisfies Laplace’s equation if
$x \ne y$, the function $u$ given by (5) is harmonic and in
order to satisfy the boundary condition it suffices to
choose the unknown function $\varphi$ as a solution of the
integral equation

$$\int_\Gamma \varphi(y)\ln|x - y|\,ds(y) = f(x), \quad x \in \Gamma, \tag{6}$$

which is known as Symm’s integral equation. However, 
the analysis available at that time did not allow a suc- 
cessful treatment of this integral equation of the first 
kind. Actually, only in the second half of the twenti- 
eth century was a satisfying analysis of (6) achieved. 
Therefore, it represented a major breakthrough when, 
in 1856, August Beer proposed to place dipoles on the 
boundary curve, i.e., to look for a solution in the form 
of a double-layer potential: 

$$u(x) = \int_\Gamma \varphi(y)\,\frac{\partial \ln|x - y|}{\partial \nu(y)}\,ds(y), \quad x \in D, \tag{7}$$

where $\nu$ denotes the unit normal vector to the boundary
curve $\Gamma$ directed into the exterior of $D$. Now the
so-called jump relations from potential theory require that

$$\varphi(x) + \frac{1}{\pi}\int_\Gamma \varphi(y)\,\frac{\partial \ln|x - y|}{\partial \nu(y)}\,ds(y) = \frac{1}{\pi}\,f(x) \tag{8}$$

is satisfied for $x \in \Gamma$ in order to fulfill the boundary
condition. This is an integral equation of the second
kind and as such, in principle, was accessible to
the method of successive approximations. However, 
in order to achieve convergence for the case of con- 
vex domains, in 1877 Carl Neumann had to modify 
the successive approximations into what he called the 
method of arithmetic means and what we would call a 
relaxation method in modern terms. 

For the general case, establishing the existence of a 
solution to (8) had to wait until the pioneering results 
of Fredholm that were published in final form in 1903 
in the journal Acta Mathematica with the title “Sur une
classe d’équations fonctionnelles.” Fredholm considered
equations of the form (2) with a general kernel K and 
assumed all the functions involved to be continuous 
and real-valued. His approach was to consider the inte- 
gral equation as the limiting case of a system of linear 
algebraic equations by approximating the integral by 
Riemann sums. Using Cramer’s rule for this linear
system, Fredholm passed to the limit by using Koch’s
theory of infinite determinants from 1896 and Hada- 
mard’s inequality for determinants from 1893. The idea 
of viewing integral equations as the limiting case of 
linear systems had already been proposed by Volterra 
in 1896, but it was Fredholm who followed it through 
successfully. 


In addition to equation (2), Fredholm’s results also 
contain the adjoint integral equation that is obtained by 
interchanging the variables in the kernel function. They 
can be summarized in the following theorem, which is 
known as the Fredholm alternative. Note that all four 
of equations (9)-(12) in Theorem 1 are required to be 
satisfied for all $0 \le x \le 1$.

Theorem 1. Either the homogeneous integral equations

$$\varphi(x) + \int_0^1 K(x,y)\varphi(y)\,dy = 0 \tag{9}$$

and

$$\psi(x) + \int_0^1 K(y,x)\psi(y)\,dy = 0 \tag{10}$$

only have the trivial solutions $\varphi = 0$ and $\psi = 0$, and
the inhomogeneous integral equations

$$\varphi(x) + \int_0^1 K(x,y)\varphi(y)\,dy = f(x) \tag{11}$$

and

$$\psi(x) + \int_0^1 K(y,x)\psi(y)\,dy = g(x) \tag{12}$$

have unique continuous solutions $\varphi$ and $\psi$ for each
continuous right-hand side $f$ and $g$, respectively, or
the homogeneous equations (9) and (10) have the same
finite number of linearly independent solutions and the
inhomogeneous integral equations are solvable if and
only if the right-hand sides satisfy $\int_0^1 f(x)\psi(x)\,dx = 0$
for all solutions $\psi$ to the homogeneous adjoint
equation (10) and $\int_0^1 \varphi(x)g(x)\,dx = 0$ for all solutions
$\varphi$ to the homogeneous equation (9).

We explicitly note that this theorem implies that 
for the first of the two alternatives each one of the 
four properties implies the three others. Hence, in 
particular, uniqueness for the homogeneous equation 
(9) implies existence of a solution to the inhomogen- 
eous equation (11) for each right-hand side. This is 
notable, since it is almost always much simpler to 
prove uniqueness for a linear problem than to prove 
existence. 

Fredholm's existence results also clarify the exis- 
tence of a solution to the boundary integral equation 
(8) for the Dirichlet problem for the Laplace equation. 
By inserting a parametrization of the boundary curve 
$\Gamma$ we can transform (8) into the form (2) with a continuous
kernel for which the homogeneous equation only
allows the trivial solution. 
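
To make the parametrized form of (8) concrete, the following Python sketch solves the interior Dirichlet problem on an ellipse. Everything specific in it is an assumption made for illustration: the ellipse with semi-axes 2 and 1, its counterclockwise parametrization $z(t)$, the trapezoidal rule with 64 points (a quadrature-based discretization of the kind described in section 5), the harmonic test function $e^{x_1}\cos x_2$, and all function names. With $y = z(\tau)$ and outward normal $\nu$, the quantity $\partial\ln|x - y|/\partial\nu(y)\,|z'(\tau)|$ equals $[z_2'(\tau)(z_1(\tau) - x_1) - z_1'(\tau)(z_2(\tau) - x_2)]/|x - z(\tau)|^2$, and for $x = z(t)$ it has the finite limit $(z_1' z_2'' - z_2' z_1'')/(2|z'|^2)$ as $\tau \to t$, which is why the kernel is continuous.

```python
import math

A_AXIS, B_AXIS = 2.0, 1.0  # semi-axes of the test ellipse (assumed geometry)

def z(t):
    return (A_AXIS * math.cos(t), B_AXIS * math.sin(t))

def dz(t):
    return (-A_AXIS * math.sin(t), B_AXIS * math.cos(t))

def ddz(t):
    return (-A_AXIS * math.cos(t), -B_AXIS * math.sin(t))

def kernel(x, tau):
    """Normal derivative of ln|x - y| at y = z(tau), multiplied by |z'(tau)|."""
    y, d = z(tau), dz(tau)
    dx, dy = y[0] - x[0], y[1] - x[1]
    return (d[1] * dx - d[0] * dy) / (dx * dx + dy * dy)

def kernel_diagonal(t):
    """Limit of kernel(z(t), tau) as tau -> t."""
    d, dd = dz(t), ddz(t)
    return (d[0] * dd[1] - d[1] * dd[0]) / (2.0 * (d[0] ** 2 + d[1] ** 2))

def solve_linear_system(M, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (A[i][n] - s) / A[i][i]
    return x

def solve_dirichlet(boundary_values, n=64):
    """Discretize (8) by the n-point trapezoidal rule; return a function
    that evaluates the double-layer potential (7) at points inside D."""
    h = 2.0 * math.pi / n
    nodes = [k * h for k in range(n)]
    M = [[(1.0 if j == k else 0.0)
          + (h / math.pi) * (kernel_diagonal(nodes[j]) if j == k
                             else kernel(z(nodes[j]), nodes[k]))
          for k in range(n)] for j in range(n)]
    rhs = [boundary_values(z(t)) / math.pi for t in nodes]
    density = solve_linear_system(M, rhs)

    def u(x):
        return h * sum(kernel(x, t) * p for t, p in zip(nodes, density))
    return u

def u_exact(x):
    """Harmonic test function (assumed), used for the boundary data."""
    return math.exp(x[0]) * math.cos(x[1])

u_approx = solve_dirichlet(u_exact)

if __name__ == "__main__":
    point = (0.5, 0.2)  # a point inside the ellipse
    print(u_approx(point), u_exact(point))
```

Because the boundary data come from a function that is harmonic in the whole plane, the computed potential can be compared directly with u_exact at interior points. Close to the boundary the plain trapezoidal evaluation of (7) loses accuracy, which this sketch does not address.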

Over the last century this boundary integral equation
approach via potentials of the form (5) and (7)
has been successfully extended to almost all boundary
and initial-boundary-value problems for second-order
partial differential equations with constant coefficients,
such as the time-dependent heat equation,
the time-dependent and the time-harmonic wave equa- 
tions, and Maxwell’s equations, among many others. 
In addition to settling existence of solutions, bound- 
ary integral equations provide an excellent tool for 
obtaining approximate solutions of the boundary and 
initial-boundary-value problems by solving the integral 
equations numerically (see section 5). These so-called 
boundary element methods compete well with finite- 
element methods. It is an important part of current 
research on integral equations to develop, implement, 
and theoretically justify new efficient algorithms for 
boundary integral equations for very complex geome- 
tries in three dimensions that arise in real applications. 

4 Impact on Functional Analysis 

Fredholm’s results on the integral equation (2) initiated 
the development of modern functional analysis in the 
1920s. The almost literal agreement of the Fredholm 
alternative for linear integral equations as formulated 
in Theorem 1 with the corresponding alternative for lin- 
ear systems soon gave rise to research into a broader 
and more abstract form of the Fredholm alternative. 
This in turn also allowed extensions of the integral 
equation theory under weaker regularity requirements 
on the kernel and solution functions. In addition, many 
years later it was found that more insight was achieved 
into the structure of Fredholm integral equations by 
abandoning the initially very fruitful analogy between 
integral equations and linear systems altogether. 

Frigyes Riesz was the first to find an answer to the 
search for a general formulation of the Fredholm alter- 
native. In his work from 1916 he interpreted the inte- 
gral equation as a special case of an equation of the 
second kind,

$$\varphi + A\varphi = f,$$

with a compact linear operator $A\colon X \to X$ mapping a
normed space X into itself. The notion of a normed 
space that is common in today’s mathematics was not 
yet available in 1916. 

Riesz set his work up in the function space of contin- 
uous real-valued functions on the interval [0, 1] — what 
we would call the space C[0, 1] in today’s terminology. 
He called the maximum of the absolute value of a function
$f$ on [0, 1] the norm of $f$ and confirmed its properties
that we now know as the standard norm axioms.


Riesz used only these axioms, not the special meaning 
as the maximum norm. 

The concept of a compact operator was not yet 
available in 1916 either. However, using the notion of 
compact sets as introduced by Frechet in 1906, Riesz 
proved that the integral operator A defined by 

$$(A\varphi)(x) := \int_0^1 K(x,y)\varphi(y)\,dy, \quad x \in [0,1], \tag{13}$$

on the space C[0, 1] maps bounded sets into rela- 
tively compact sets, i.e., in today's terminology, A is 
a compact operator. 

What is fascinating about the work of Riesz is that his 
proofs are still usable and can be transferred, almost 
unchanged, from the case of an integral operator in the 
space of continuous functions to the general case of a 
compact operator in a normed space. Riesz knew about 
the generality of his method, explicitly noting that the 
restriction to continuous functions was not relevant. 

Summarizing the results of Riesz gives us the follow- 
ing theorem, in which I denotes the identity operator. 

Theorem 2. For a compact linear operator $A\colon X \to X$ in
a normed space $X$, either $I + A$ is injective and surjective
and has a bounded inverse or the null space
$N(I + A) := \{\varphi : \varphi + A\varphi = 0\}$ has nonzero finite
dimension and the image space $(I + A)(X)$ is a proper
subspace of $X$.

The central and most valuable part of Riesz’s theory 
is again the equivalence of injectivity and surjectivity. 
Theorem 2 does not yet completely contain the alterna- 
tive of Theorem 1 for Fredholm integral equations since 
a link with an adjoint equation and the characterization 
of the image space in the second case of the alternative 
are missing. This gap was closed by results of Schauder 
from 1929 and by more recent developments from the 
1960s. 

The following theorem is simply a consequence of 
the fact that the identity operator on a normed space 
is compact if and only if X has finite dimension, which 
we refrain from discussing here. It explains why the 
difference between the two integral equations (1) and 
(2) is more than just formal. 

Theorem 3. Let $X$ and $Y$ be normed spaces and let
$A\colon X \to Y$ be a compact linear operator. Then $A$ cannot
have a bounded inverse if $X$ is infinite dimensional.

5 Numerical Solution 

The idea for the numerical solution of integral equations
of the second kind that is conceptually the most
straightforward dates back to Nyström in 1930 and
consists of replacing the integral in (2) by numerical
integration. Using a quadrature formula

$$\int_0^1 g(x)\,dx \approx \sum_{k=1}^n a_k\,g(x_k)$$

with quadrature points $x_1, \dots, x_n \in [0,1]$ and
quadrature weights $a_1, \dots, a_n \in \mathbb{R}$, such as the composite
trapezoidal or composite Simpson rule, we approximate
the integral operator (13) by the numerical integration
operator

$$(A_n\varphi)(x) := \sum_{k=1}^n a_k K(x, x_k)\varphi(x_k) \tag{14}$$

for $x \in [0,1]$, i.e., we apply the quadrature formula for
$g = K(x, \cdot)\varphi$. The solution to the integral equation of
the second kind, $\varphi + A\varphi = f$, is then approximated
by the solution of $\varphi_n + A_n\varphi_n = f$, which reduces to
solving a finite-dimensional linear system as follows. If
$\varphi_n$ is a solution of

$$\varphi_n(x) + \sum_{k=1}^n a_k K(x, x_k)\varphi_n(x_k) = f(x) \tag{15}$$

for $x \in [0,1]$, then clearly the values $\varphi_{n,j} := \varphi_n(x_j)$
at the quadrature points satisfy the linear system

$$\varphi_{n,j} + \sum_{k=1}^n a_k K(x_j, x_k)\varphi_{n,k} = f(x_j) \tag{16}$$

for $j = 1, \dots, n$. Conversely, if $\varphi_{n,j}$ is a solution of the
system (16), the function $\varphi_n$ defined by

$$\varphi_n(x) := f(x) - \sum_{k=1}^n a_k K(x, x_k)\varphi_{n,k} \tag{17}$$

for $x \in [0,1]$ can be seen to solve equation (15). Under
appropriate assumptions on the kernel $K$ and the
right-hand side $f$ for a convergent sequence of quadrature
rules, it can be shown that the corresponding sequence
$(\varphi_n)$ of approximate solutions converges uniformly to
the solution $\varphi$ of the integral equation as $n \to \infty$.
Furthermore, it can be established that the error estimates
for the quadrature rules carry over to error estimates
for the Nyström approximations.

We conclude this short discussion of the numeri- 
cal solution of integral equations by pointing out that 
in addition to the Nyström method many other methods
are available, such as collocation and Galerkin
methods. 

6 Ill-Posed Problems 

In 1923 Hadamard postulated three requirements for 
problems in mathematical physics: a solution should 


exist, the solution should be unique, and the solution 
should depend continuously on the data. The third pos- 
tulate is motivated by the fact that the data will be 
measured quantities in applications and will therefore 
always be contaminated by errors. A problem satisfying 
all three requirements is called well-posed. Otherwise, it 
is called ill-posed. If $A\colon X \to Y$ is a bounded linear
operator mapping a normed space $X$ into a normed space $Y$,
then the equation $A\varphi = f$ is well-posed if $A$ is bijective
and the inverse operator $A^{-1}\colon Y \to X$ is bounded, i.e.,
continuous. Otherwise it is ill-posed. The main concern
with ill-posed problems is instability, where the solution
$\varphi$ of $A\varphi = f$ does not depend continuously on the
data $f$.

As an example of an ill-posed problem we present 
backward heat conduction. Consider the forward heat 
equation 

$$\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2}$$

for the time-dependent temperature $u$ in a rectangle
$[0,1] \times [0,T]$ subject to the homogeneous boundary
conditions

$$u(0,t) = u(1,t) = 0, \quad 0 \le t \le T,$$

and the initial condition

$$u(x,0) = \varphi(x), \quad 0 \le x \le 1,$$

where $\varphi$ is a given initial temperature. By separation of
variables the solution can be obtained in the form

$$u(x,t) = \sum_{n=1}^\infty a_n e^{-n^2\pi^2 t}\sin n\pi x, \tag{18}$$

with the Fourier coefficients

$$a_n = 2\int_0^1 \varphi(y)\sin n\pi y\,dy \tag{19}$$

of the given initial values. This initial-value problem is
well-posed: the final temperature $f := u(\cdot, T)$ clearly
depends continuously on the initial temperature $\varphi$
because of the exponentially decreasing factors in the
series

$$f(x) = \sum_{n=1}^\infty a_n e^{-n^2\pi^2 T}\sin n\pi x. \tag{20}$$

However, the corresponding inverse problem, i.e.,
determination of the initial temperature $\varphi$ from
knowledge of the final temperature $f$, is ill-posed. From (20)
we deduce

$$\varphi(x) = \sum_{n=1}^\infty b_n e^{n^2\pi^2 T}\sin n\pi x, \tag{21}$$

with the Fourier coefficients $b_n$ of the final temperature
$f$. Changes in the final temperature will be drastically
amplified by the exponentially increasing factors in the
series (21).

Inserting (19) into (18), we see that this example can 
be put in the form of an integral equation of the first 
kind (1) with the kernel given by 

$$K(x,y) = 2\sum_{n=1}^\infty e^{-n^2\pi^2 T}\sin n\pi x\,\sin n\pi y.$$

In general, integral equations of the first kind that 
have continuous kernels provide typical examples of 
ill-posed problems as a consequence of Theorem 3. 

Of course, the ill-posed nature of an equation has 
consequences for its numerical solution. The fact that 
an operator does not have a bounded inverse means 
that the condition numbers of its finite-dimensional 
approximations grow with the quality of the approx- 
imation. Hence, a careless discretization of ill-posed 
problems leads to a numerical behavior that at first 
glance seems to be paradoxical: increasing the degree 
of discretization (that is, increasing the accuracy of 
the approximation for the operator) will cause the 
approximate solution to the equation to become more 
unreliable. 
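
For the backward heat conduction example this growth can be made explicit. If the forward map from the initial to the final temperature is restricted to the first $N$ sine modes, it acts diagonally with the factors $e^{-n^2\pi^2 T}$, so the condition number of this finite-dimensional approximation is the ratio of the largest to the smallest factor. The following Python sketch tabulates it; the value $T = 0.1$ and the function names are assumptions made for illustration.

```python
import math

def damping_factor(n, T):
    """Factor exp(-n^2 pi^2 T) multiplying the n-th sine mode in (20)."""
    return math.exp(-(n * math.pi) ** 2 * T)

def condition_number(N, T):
    """Condition number of the forward heat map restricted to the first
    N sine modes: largest damping factor divided by the smallest."""
    return damping_factor(1, T) / damping_factor(N, T)

if __name__ == "__main__":
    T = 0.1  # final time (assumed for illustration)
    for N in (1, 2, 4, 8):
        print(N, condition_number(N, T))
```

The condition number equals $e^{(N^2 - 1)\pi^2 T}$, so every refinement of the discretization makes the linear system dramatically more sensitive to data errors.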

7 Regularization 

Methods for obtaining a stable approximate solution of 
an ill-posed problem are called regularization methods. 
It is our aim to describe a few ideas about regularization
concepts for equations of the first kind with a compact
linear operator $A\colon X \to Y$ between two normed
spaces, $X$ and $Y$. We wish to approximate the solution
$\varphi$ to the equation $A\varphi = f$ from a perturbed right-hand
side $f^\delta$ with a known error level $\|f^\delta - f\| \le \delta$. Using
the erroneous data $f^\delta$, we want to construct a reasonable
approximation $\varphi^\delta$ to the exact solution $\varphi$ of the
unperturbed equation $A\varphi = f$. Of course, we want this
approximation to be stable, i.e., we want $\varphi^\delta$ to depend
continuously on the actual data $f^\delta$. Therefore, assuming
without major loss of generality that $A$ is injective,
our task is to find an approximation of the unbounded
inverse operator $A^{-1}\colon A(X) \to X$ by a bounded linear
operator $R\colon Y \to X$. With this in mind, a family of
bounded linear operators $R_\alpha\colon Y \to X$, $\alpha > 0$, with the
property of pointwise convergence

$$\lim_{\alpha \to 0} R_\alpha A\varphi = \varphi \tag{22}$$

for all $\varphi \in X$, is called a regularization scheme for the
operator $A$. The parameter $\alpha$ is called the regularization
parameter.


The regularization scheme approximates the solution
$\varphi$ of $A\varphi = f$ by the regularized solution
$\varphi_\alpha^\delta := R_\alpha f^\delta$. For the total approximation error by the
triangle inequality we then have the estimate

$$\|\varphi_\alpha^\delta - \varphi\| \le \delta\,\|R_\alpha\| + \|R_\alpha A\varphi - \varphi\|.$$

This decomposition shows that the error consists of
two parts: the first term reflects the influence of the
incorrect data, and the second term is due to the
approximation error between $R_\alpha$ and $A^{-1}$. Assuming
that $X$ is infinite dimensional, $R_\alpha$ cannot be uniformly
bounded, since otherwise $A$ would have a bounded
inverse. Consequently, the first term will be increasing
as $\alpha \to 0$, whereas the second term will be decreasing
as $\alpha \to 0$ according to (22).

Every regularization scheme requires a strategy for
choosing the parameter $\alpha$, depending on the error level
$\delta$ and the data $f^\delta$, so as to achieve an acceptable total
error for the regularized solution. On the one hand, the
accuracy of the approximation requires a small error
$\|R_\alpha A\varphi - \varphi\|$; this implies a small parameter $\alpha$. On the
other hand, stability requires a small value of $\|R_\alpha\|$; this
implies a large parameter $\alpha$. A popular strategy is given
by the discrepancy principle. Its motivation is based on
the consideration that, in general, for erroneous data
the residual $\|A\varphi_\alpha^\delta - f^\delta\|$ should not be smaller than
the accuracy of the measurements of $f$, i.e., the
regularization parameter $\alpha$ should be chosen such that
$\|AR_\alpha f^\delta - f^\delta\| \approx \delta$.

We now assume that $X$ and $Y$ are Hilbert spaces and
denote their inner products by $(\cdot\,,\cdot)$, with the space
$L^2[0,1]$ of Lebesgue square-integrable complex-valued
functions on [0, 1] as a typical example. Each bounded
linear operator $A\colon X \to Y$ has a unique adjoint operator
$A^*\colon Y \to X$ with the property $(A\varphi, g) = (\varphi, A^*g)$ for
all $\varphi \in X$ and $g \in Y$. If $A$ is compact, then $A^*$ is also
compact. The adjoint of the compact integral operator
$A\colon L^2[0,1] \to L^2[0,1]$ defined by (13) is given by the
integral operator with the kernel $\overline{K(y,x)}$, where the bar
indicates the complex conjugate.

Extending the singular value decomposition (SVD)
for matrices from linear algebra, for each compact
linear operator $A\colon X \to Y$ there exists a singular
system consisting of a monotonically decreasing null
sequence $(\mu_n)$ of positive numbers and two orthonormal
sequences $(\varphi_n)$ in $X$ and $(g_n)$ in $Y$ such that

$$A\varphi_n = \mu_n g_n, \quad A^* g_n = \mu_n \varphi_n, \quad n \in \mathbb{N}.$$

For each $\varphi \in X$ we have the SVD

$$\varphi = \sum_{n=1}^\infty (\varphi, \varphi_n)\varphi_n + P\varphi,$$

where $P\colon X \to N(A)$ is the orthogonal projection
operator onto the null space of $A$ and

$$A\varphi = \sum_{n=1}^\infty \mu_n (\varphi, \varphi_n)\,g_n.$$

From the SVD it can be readily deduced that the equation
of the first kind, $A\varphi = f$, is solvable if and only if
$f$ is orthogonal to the null space of $A^*$ and satisfies

$$\sum_{n=1}^\infty \frac{1}{\mu_n^2}\,|(f, g_n)|^2 < \infty. \tag{23}$$

If (23) is fulfilled, a solution is given by

$$\varphi = \sum_{n=1}^\infty \frac{1}{\mu_n}\,(f, g_n)\,\varphi_n. \tag{24}$$

The solution (24) clearly demonstrates the ill-posed
nature of the equation $A\varphi = f$. If we perturb the
right-hand side $f$ to $f^\delta = f + \delta g_n$, we obtain the solution
$\varphi^\delta = \varphi + \delta\mu_n^{-1}\varphi_n$. Hence, the ratio

$$\frac{\|\varphi^\delta - \varphi\|}{\|f^\delta - f\|} = \frac{1}{\mu_n}$$

can be made arbitrarily large due to the fact that the
singular values tend to zero. This observation suggests
that we regularize by damping the influence of the factor
$1/\mu_n$ in the solution formula (24). In the Tikhonov
regularization this is achieved by choosing

$$R_\alpha f := \sum_{n=1}^\infty \frac{\mu_n}{\alpha + \mu_n^2}\,(f, g_n)\,\varphi_n. \tag{25}$$

Computing $R_\alpha f$ does not require the singular system
to be used since for injective $A$ it can be shown that

$$R_\alpha = (\alpha I + A^*A)^{-1}A^*.$$

Hence $\varphi_\alpha := R_\alpha f$ can be obtained as the unique
solution of the well-posed equation of the second kind:

$$\alpha\varphi_\alpha + A^*A\varphi_\alpha = A^*f.$$
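
The interplay of (24), (25), and the discrepancy principle can be tried out on the backward heat conduction problem of section 6, for which the singular values are $\mu_n = e^{-n^2\pi^2 T}$ and both orthonormal sequences consist of the functions $\sqrt{2}\sin n\pi x$. The following Python sketch works directly with the expansion coefficients. The final time, the number of modes, the error level, the exact coefficients $1/n^2$, the deterministic perturbation, and the bisection on a logarithmic scale are all assumptions made for illustration.

```python
import math

T = 0.1        # final time (assumed)
N = 6          # number of sine modes kept (assumed)
DELTA = 1e-3   # error level of the data (assumed)

# Singular values of the map from initial to final temperature.
mu = [math.exp(-(n * math.pi) ** 2 * T) for n in range(1, N + 1)]
# Coefficients (phi, phi_n) of an assumed exact initial temperature.
a_exact = [1.0 / n ** 2 for n in range(1, N + 1)]
# Perturbed data coefficients (f^delta, g_n); the perturbation has norm DELTA.
b_noisy = [m * a + DELTA * (-1) ** n / math.sqrt(N)
           for n, (m, a) in enumerate(zip(mu, a_exact), start=1)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def tikhonov(b, alpha):
    """Coefficients of R_alpha f according to (25)."""
    return [m / (alpha + m * m) * x for m, x in zip(mu, b)]

def residual(b, alpha):
    """Norm of A R_alpha f - f, computed in coefficient space."""
    return norm([m * c - x for m, c, x in zip(mu, tikhonov(b, alpha), b)])

def discrepancy_alpha(b, delta, lo=1e-16, hi=1e2, steps=200):
    """Find alpha with residual(b, alpha) = delta by bisection on a
    logarithmic scale; the residual increases monotonically with alpha."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)
        if residual(b, mid) < delta:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

alpha = discrepancy_alpha(b_noisy, DELTA)
naive = [x / m for m, x in zip(mu, b_noisy)]  # unregularized formula (24)
err_naive = norm([c - a for c, a in zip(naive, a_exact)])
err_tikhonov = norm([c - a for c, a in zip(tikhonov(b_noisy, alpha), a_exact)])

if __name__ == "__main__":
    print("alpha from discrepancy principle:", alpha)
    print("error without regularization:   ", err_naive)
    print("error with Tikhonov:            ", err_tikhonov)
```

With these assumed values the unregularized reconstruction is dominated by the amplified data error in the highest mode, whereas the Tikhonov solution stays within a moderate distance of the exact coefficients. The price is that the higher modes are damped almost completely, which is the approximation error $\|R_\alpha A\varphi - \varphi\|$ of the error estimate.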

8 Computerized Tomography 

In transmission computerized tomography [VII.19]
a cross section of an object is scanned by a thin X-ray
beam whose intensity loss is recorded by a detector
and processed to produce an image. Denote by $f$ the
space-dependent attenuation coefficient within a
two-dimensional medium. The relative intensity loss of an
X-ray along a straight line $L$ is given by $dI = -If\,ds$,
and by integration, it follows that

$$I_{\text{detector}} = I_{\text{source}}\exp\left(-\int_L f\,ds\right),$$

i.e., in principle, the scanning process provides the line
integrals over all lines traversing the scanned cross
section. The transform that maps a function in $\mathbb{R}^2$
onto its line integrals is called the Radon transform,
and the inverse problem of computerized tomogra- 
phy requires its inversion. Radon had already given 
an explicit inversion formula in 1917, but it is not 
immediately applicable for practical computations. 

For the formal description of the Radon transform
it is convenient to parametrize the line $L$ by its unit
normal vector $\theta$ and its signed distance $s$ from the
origin in the form $L = \{s\theta + t\theta^\perp : t \in \mathbb{R}\}$, where $\theta^\perp$ is
obtained by rotating $\theta$ counterclockwise by 90°. Now,
the two-dimensional Radon transform $R$ is defined by

$$(Rf)(\theta, s) := \int_{-\infty}^{\infty} f(s\theta + t\theta^\perp)\,dt, \quad \theta \in S^1,\ s \in \mathbb{R},$$

and it maps $L^1(\mathbb{R}^2)$ into $L^1(S^1 \times \mathbb{R})$, where $S^1$ is the
unit circle. Given the measured line integrals $g$, the
inverse problem of computerized tomography consists
of solving

$$Rf = g \tag{26}$$

for $f$. Although it is not of the conventional form (1) or
(2), equation (26) can clearly be viewed as an integral
equation. Its solution can be obtained using Radon’s
inversion formula

$$f = \frac{1}{4\pi}\,R^* H \frac{\partial}{\partial s} Rf \tag{27}$$

with the Hilbert transform

$$(Hg)(s) := \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{g(t)}{s - t}\,dt, \quad s \in \mathbb{R},$$

applied with respect to the second variable in $Rf$. The
operator $R^*$ is the adjoint of $R$ with respect to the $L^2$
inner products on $\mathbb{R}^2$ and $S^1 \times \mathbb{R}$, which is given by

$$(R^*g)(x) = \int_{S^1} g(\theta, x \cdot \theta)\,d\theta, \quad x \in \mathbb{R}^2,$$

i.e., it can be considered as an integration over all lines
through $x$ and is therefore called the back-projection
operator. Because of the occurrence of the Hilbert
transform in (27), inverting the Radon transform is not
local; i.e., the line integrals through a neighborhood
of the point $x$ do not suffice for the reconstruction
of $f(x)$. Due to the derivative appearing in (27), the
inverse problem of reconstructing the function $f$ from
its line integrals is ill-posed.

In practice, the integrals can be measured only for 
a finite number of lines and, correspondingly, a dis- 
crete version of the Radon transform has to be inverted. 
The most widely used inversion algorithm is the filtered
back-projection algorithm, which may be considered as
an implementation of Radon’s inversion formula with
the middle part $H(\partial/\partial s)$ replaced by a convolution,
i.e., a filter in the terminology of image processing.
However, so-called algebraic reconstruction techniques
are also used where the function $f$ is decomposed
into pixels, i.e., where it is approximated by piecewise
constants on a grid of little squares. The resulting
sparse linear system for the pixel values is then solved
iteratively, by Kaczmarz’s method, for example.
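
The algebraic approach can be illustrated on a toy problem. The Python sketch below applies Kaczmarz’s method, in its basic cyclic form without relaxation, to a hypothetical 2 × 2 image whose six "rays" are the two row sums, the two column sums, and the two diagonal sums. The image values, the ray geometry, and the number of sweeps are assumptions made for illustration; real systems are far larger and contain measurement errors.

```python
def kaczmarz(rows, rhs, sweeps=100):
    """Cyclic Kaczmarz iteration: project the current iterate successively
    onto the hyperplanes <a_i, x> = b_i, starting from x = 0."""
    x = [0.0] * len(rows[0])
    for _ in range(sweeps):
        for a, b in zip(rows, rhs):
            r = b - sum(ai * xi for ai, xi in zip(a, x))
            scale = r / sum(ai * ai for ai in a)
            x = [xi + scale * ai for ai, xi in zip(a, x)]
    return x

# Pixel values of the assumed test image, ordered row by row.
true_image = [1.0, 2.0, 4.0, 3.0]

# Each ray lists the pixels it crosses (unit path length per pixel).
rays = [
    [1, 1, 0, 0],  # first row
    [0, 0, 1, 1],  # second row
    [1, 0, 1, 0],  # first column
    [0, 1, 0, 1],  # second column
    [1, 0, 0, 1],  # main diagonal
    [0, 1, 1, 0],  # antidiagonal
]
# Simulated exact measurements (line sums of the test image).
data = [sum(a * p for a, p in zip(ray, true_image)) for ray in rays]

image = kaczmarz(rays, data)

if __name__ == "__main__":
    print(image)
```

Each update touches only the pixels crossed by one ray, which is why the method suits the sparse systems that arise when each ray meets only a small fraction of the pixels.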

For the case of a radially symmetric function $f$ (that
is, when $f(x) = f_0(|x|)$), $Rf$ clearly does not depend
on $\theta$ (that is, $(Rf)(\theta, s) = g_0(s)$), where

$$g_0(s) = 2\int_0^\infty f_0\bigl(\sqrt{s^2 + t^2}\bigr)\,dt, \quad s \ge 0.$$

Substituting $t = \sqrt{r^2 - s^2}$, this transforms into

$$g_0(s) = 2\int_s^\infty f_0(r)\,\frac{r}{\sqrt{r^2 - s^2}}\,dr, \quad s \ge 0,$$

which is an Abel-type integral equation again. Its solution
is given by

$$f_0(r) = -\frac{1}{\pi}\int_r^\infty g_0'(s)\,\frac{1}{\sqrt{s^2 - r^2}}\,ds, \quad r \ge 0.$$

This approach can be extended to a full inversion formula
by expanding both $f$ and $g = Rf$ as Fourier
series with respect to the polar angle. The Fourier
coefficients of $f$ and $g$ are then related by Abel-type integral
equations involving Chebyshev polynomials.
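
The radially symmetric formulas can be checked numerically on a Gaussian. For the assumed test function $f_0(r) = e^{-r^2}$ one has $g_0(s) = \sqrt{\pi}\,e^{-s^2}$, and the Python sketch below confirms both this and the inversion formula. The truncation of the infinite integrals at $t = 8$, the midpoint rule, the substitution $s = \sqrt{r^2 + t^2}$ in the inversion integral (which removes the singularity at $s = r$), and the function names are choices made here for illustration.

```python
import math

def radon_radial(f0, s, t_max=8.0, n=4000):
    """g0(s) = 2 * int_0^inf f0(sqrt(s^2 + t^2)) dt, truncated at t_max
    and evaluated by the composite midpoint rule."""
    h = t_max / n
    return 2.0 * h * sum(f0(math.sqrt(s * s + ((k + 0.5) * h) ** 2))
                         for k in range(n))

def abel_inverse(dg0, r, t_max=8.0, n=4000):
    """f0(r) = -(1/pi) * int_r^inf g0'(s) / sqrt(s^2 - r^2) ds. With
    s = sqrt(r^2 + t^2) this becomes -(1/pi) * int_0^inf g0'(s) / s dt."""
    h = t_max / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        s = math.sqrt(r * r + t * t)
        total += dg0(s) / s
    return -h * total / math.pi

def gaussian(r):
    """Assumed radially symmetric test function f0."""
    return math.exp(-r * r)

def gaussian_projection_derivative(s):
    """Derivative of g0(s) = sqrt(pi) * exp(-s^2) for the Gaussian."""
    return -2.0 * s * math.sqrt(math.pi) * math.exp(-s * s)

if __name__ == "__main__":
    for v in (0.0, 0.7, 1.5):
        print(v, radon_radial(gaussian, v),
              abel_inverse(gaussian_projection_derivative, v), gaussian(v))
```

In this check the derivative $g_0'$ is supplied in closed form. With measured data it would have to be approximated by numerical differentiation, which is where the ill-posedness of the inversion enters.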

X-ray tomography was first suggested and studied by 
the physicist Allan Cormack in 1963, and due to the 
efforts of the electrical engineer Godfrey Hounsfield, 
it was introduced into medical practice in the 1970s. 
For their contributions to X-ray tomography Cormack 
and Hounsfield were awarded the 1979 Nobel Prize for 
Medicine. 

9 Inverse Scattering 

Scattering theory is concerned with the effects that 
obstacles and inhomogeneities have on the propaga- 
tion of waves and particularly time-harmonic waves. 
Inverse scattering provides the mathematical tools for 
fields such as radar, sonar, medical imaging, and non- 
destructive testing. 

For time-harmonic waves the time dependence is represented
in the form $U(x,t) = \operatorname{Re}\{u(x)e^{-i\omega t}\}$ with a
positive frequency $\omega$; i.e., the complex-valued
space-dependent part $u$ represents the real-valued amplitude
and phase of the wave and satisfies the Helmholtz equation
$\Delta u + k^2 u = 0$ with a positive wave number $k$.
For a unit vector $d \in \mathbb{R}^3$, the function $e^{ikx \cdot d}$ satisfies
the Helmholtz equation for all $x \in \mathbb{R}^3$. It is called a
plane wave, since $e^{i(kx \cdot d - \omega t)}$ is constant on the planes
$kx \cdot d - \omega t = \text{const}$. Assume that an incident field is
given by $u^{\text{in}}(x) = e^{ikx \cdot d}$. Then the simplest obstacle
scattering problem is to find the scattered field $u^{\text{sc}}$ as
a solution to the Helmholtz equation in the exterior of
a bounded scatterer $D \subset \mathbb{R}^3$ such that the total field
$u = u^{\text{in}} + u^{\text{sc}}$ satisfies the Dirichlet boundary condition
$u = 0$ on $\partial D$ modeling a sound-soft obstacle or a
perfect conductor. In addition, to ensure that the scattered
wave is outgoing, it has to satisfy the Sommerfeld
radiation condition

$$\lim_{r \to \infty} r\left(\frac{\partial u^{\text{sc}}}{\partial r} - ik\,u^{\text{sc}}\right) = 0, \tag{28}$$

where $r = |x|$ and the limit holds uniformly in all directions
$x/|x|$. This ensures uniqueness of the solution to
the exterior Dirichlet problem for the Helmholtz equation.
Existence of the solution was established in the
1950s by Vekua, Weyl, and Müller via boundary integral
equations in the spirit of section 3.

The radiation condition (28) can be shown to be
equivalent to the asymptotic behavior

$$u^{\text{sc}}(x) = \frac{e^{ik|x|}}{|x|}\left\{u_\infty(\hat{x}) + O\!\left(\frac{1}{|x|}\right)\right\}, \quad |x| \to \infty,$$

uniformly for all directions $\hat{x} = x/|x|$, where the function
$u_\infty$ defined on the unit sphere $S^2$ is known as
the far-field pattern of the scattered wave. We indicate
its dependence on the incident direction $d$ and the
observation direction $\hat{x}$ by writing $u_\infty = u_\infty(\hat{x}, d)$. The
inverse scattering problem now consists of determining
the scattering obstacle $D$ from a knowledge of $u_\infty$.
As an example of an application, we could think of the
problem of determining from the shape of the water
waves arriving at the shore whether a ball or a cube
was thrown into the water in the middle of a lake. We
note that this inverse problem is nonlinear since the
scattered wave depends nonlinearly on the scatterer
$D$, and it is ill-posed since the far-field pattern $u_\infty$ is
an analytic function on $S^2$ with respect to $\hat{x}$.

Roughly speaking, one can distinguish between three 
groups of methods for solving the inverse obstacle 
scattering problem: iterative methods, decomposition 
methods, and sampling methods. Iterative methods 
interpret the inverse problem as a nonlinear ill-posed 
operator equation that is solved by methods such as 
regularized Newton-type iterations. The main idea of 
decomposition methods is to break up the inverse 
scattering problem into two parts: the first part deals 
with the ill-posedness by constructing the scattered 
wave $u^{\text{sc}}$ from its far-field pattern $u_\infty$, and the second
part deals with the nonlinearity by determining the
unknown boundary $\partial D$ of the scatterer as the set of
points where the boundary condition for the total field 
is satisfied. Since boundary integral equations play an 
essential role in the existence analysis and numerical 
solution of the direct scattering problem, it is not sur- 
prising that they are also an efficient tool within these 
two groups of methods for solving the inverse problem. 

Sampling methods are based on choosing an appropriate
indicator function $f$ on $\mathbb{R}^3$ such that its value
$f(z)$ indicates whether $z$ lies inside or outside the
scatterer $D$. In contrast to iterative and decomposition
methods, sampling methods do not need any a priori
information on the geometry of the obstacle. However,
they do require knowledge of the far-field pattern for a
large number of incident waves, whereas the iterative
and decomposition methods, in principle, work with
just one incident field.

For two of the sampling methods, the so-called linear
sampling method proposed by Colton and Kirsch and the
factorization method proposed by Kirsch, the indicator
functions are defined using ill-posed linear integral
equations of the first kind involving the integral operator
$F : L^2(S^2) \to L^2(S^2)$ with kernel $u_\infty(\hat{x}, d)$
given by

\[
(Fg)(\hat{x}) := \int_{S^2} u_\infty(\hat{x}, d)\, g(d)\, ds(d), \quad \hat{x} \in S^2.
\]

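With the far-field pattern

\[
\Phi_\infty(\hat{x}, z) = \frac{1}{4\pi}\, e^{-ik\,\hat{x}\cdot z}
\]

of the fundamental solution

\[
\Phi(x, z) := \frac{e^{ik|x-z|}}{4\pi|x-z|}, \quad x \neq z,
\]

of the Helmholtz equation with source point at $z \in \mathbb{R}^3$,
the linear sampling method is based on the ill-posed
equation

\[
F g_z = \Phi_\infty(\cdot, z), \tag{29}
\]

whereas the factorization method is based on

\[
(F^* F)^{1/4} g_z = \Phi_\infty(\cdot, z). \tag{30}
\]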

An essential tool in the linear sampling method is the
Herglotz wave function with kernel $g$, defined as the
superposition of plane waves given by

\[
v_g(x) := \int_{S^2} e^{ik\,x\cdot d}\, g(d)\, ds(d), \quad x \in \mathbb{R}^3.
\]

It can be shown that, if $z \in D$, then the value of the
Herglotz wave function $v_{g_{z,\alpha}}(z)$ with the kernel
$g_{z,\alpha}$ given by the solution of (29) obtained by Tikhonov
regularization with parameter $\alpha$ remains bounded as
$\alpha \to 0$, whereas it is unbounded if $z \notin D$. Evaluating
$v_{g_{z,\alpha}}(z)$ on a sufficiently fine grid of points $z$,
the scatterer $D$ can be visualized as the set of those points
where $v_{g_{z,\alpha}}(z)$ is small. The main feature of the
factorization method is the fact that equation (30) is solvable
in $L^2(S^2)$ if and only if $z \in D$. With the aid of the
solvability condition (23) in terms of a singular system of $F$,
this can be utilized to visualize the scatterer $D$ as the set
of points $z$ from a grid where the series (23), applied to the
equation (30), converges, that is, where its approximation by
a finite sum remains small.

Three of the items in the further reading list below 
are evidence of my enduring love of integral equations. 

Further Reading 

Colton, D., and R. Kress. 2013. Inverse Acoustic and
Electromagnetic Scattering Theory, 3rd edn. New York: Springer.

Ivanyshyn, O., R. Kress, and P. Serranho. 2010. Huygens’
principle and iterative methods in inverse obstacle
scattering. Advances in Computational Mathematics 33:413-29.

Kirsch, A., and N. Grinberg. 2008. The Factorization Method
for Inverse Problems. Oxford: Oxford University Press.

Kress, R. 2014. Linear Integral Equations, 3rd edn. New
York: Springer.

Natterer, F. 2001. The Mathematics of Computerized
Tomography. Philadelphia, PA: SIAM.


IV.5 Perturbation Theory and Asymptotics

Peter D. Miller 


1 Introduction 

Perturbation theory is a tool for dealing with certain 
kinds of physical or mathematical problems involv- 
ing parameters. For example, the behavior of many 
problems of turbulent fluid mechanics is influenced 
by the value of the Reynolds number, a dimensionless 
parameter that measures the relative strength of forces 
applied to the fluid compared with viscous damping 
forces. Likewise, in quantum theory the Planck constant
$\hbar$ is a parameter in the Schrödinger equation that
governs dynamics.

The key idea of perturbation theory is to try to take 
advantage of special values of parameters for which the 
problem of interest can be solved easily to get informa- 
tion about the solution for nearby values of the param- 
eters. As the parameters are perturbed from their orig- 
inal simplifying values, one expects that the solution 
will also be correspondingly perturbed. Perturbation 
methods allow one to compute the way in which the
solution changes under perturbation, and perturbation
theory explains how the resulting computation is to be 
properly understood. 

Clearly the potential for perturbation theory suc- 
cess is tied to the possibility that a parameter can 
be regarded as being tunable. Tunability is sometimes 
rather obvious; for instance, the Reynolds number is 
tuned in fluid experiments either by changing the 
applied forces (this changes the numerator) or by using 
different fluids with different viscosities (this changes 
the denominator). In quantum mechanics, however, it 
is not reasonable to tune Planck’s constant, as it takes
a fixed value: $\hbar \approx 1.05 \times 10^{-34}\ \mathrm{kg\,m^2\,s^{-1}}$.
Here, tunability can be recovered by nondimensionalizing the
problem; one should introduce units $M$ of mass, $L$ of
length, and $T$ of time that are characteristic of the
problem at hand and then consider the dimensionless
ratio $H := T\hbar/(ML^2)$ instead of $\hbar$. The dimensionless
parameter $H$ then becomes tunable via $M$, $L$, and $T$.
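As a rough illustration of how the dimensionless ratio responds to the choice of characteristic scales, the following sketch evaluates $H = T\hbar/(ML^2)$ for two sets of scales. The function name and the particular scales (an electron at atomic length and time scales, and a gram-centimeter-second system) are hypothetical choices made here for illustration; they are not taken from the text.

```python
# Reduced Planck constant in kg m^2 / s (standard CODATA value).
HBAR = 1.054571817e-34


def dimensionless_planck(mass_kg, length_m, time_s):
    """Return H = T*hbar / (M*L^2) for characteristic scales M, L, T."""
    return time_s * HBAR / (mass_kg * length_m ** 2)


# Hypothetical atomic scales: electron mass, one angstrom, one femtosecond.
H_atomic = dimensionless_planck(9.109e-31, 1e-10, 1e-15)

# Hypothetical macroscopic scales: one gram, one centimeter, one second.
H_macro = dimensionless_planck(1e-3, 1e-2, 1.0)

print("H (atomic scales):     ", H_atomic)
print("H (macroscopic scales):", H_macro)
```

With scales of the first kind $H$ is of order one or larger, whereas with macroscopic scales it is extremely small, which is the regime in which treating $H$ as a small perturbation parameter becomes plausible.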

In this introduction we have followed the overwhelm- 
ing majority of the literature and used the terms “per- 
turbation theory” and “perturbation methods” almost 
interchangeably. However, we will try to be more pre- 
cise from now on, referring to perturbation meth- 
ods when describing the mechanical construction of 
approximate solutions of perturbed problems, while 
describing the mathematical analysis of the approxi- 
mations obtained and their convergence properties as 
perturbation theory. 

1.1 A Basic Example 

As a first example of perturbation methods, suppose
we want to find real solutions $x$ of the polynomial
equation

\[
x^5 + a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 = 0. \tag{1}
\]

Here, the real coefficients $a_j$ are the parameters of the
problem. There is no explicit formula for the roots $x$ of
a quintic polynomial in general, but for certain values
of the parameters the situation is obviously much better.
For example, if $a_4 = a_3 = a_2 = a_1 = 0$ and $a_0 = -1$,
then (1) reduces to the problem $x^5 = 1$, which clearly
has a unique real solution, $x = 1$. To perturb from
this exactly solvable situation, make the replacements
$a_j \to \varepsilon a_j$ for $j = 1, 2, 3, 4$ and
$a_0 \to -1 + \varepsilon(a_0 + 1)$. Then,
when $\varepsilon = 0$ we have the simple exactly solvable case,
and when $\varepsilon = 1$ we recover the general case. It is
traditional for the Greek letter $\varepsilon$ to denote a small quantity
in perturbation problems. Our problem can therefore
be written in the form

\[
P_0(x) + \varepsilon P_1(x) = 0, \tag{2}
\]

where $\varepsilon \in [0, 1]$ is our tunable parameter, and

\[
P_0(x) := x^5 - 1, \qquad
P_1(x) := a_4 x^4 + a_3 x^3 + a_2 x^2 + a_1 x + a_0 + 1.
\]

At this point, we have redefined the original problem
somewhat, as the goal is now to understand how
the known roots of (2) for $\varepsilon = 0$ begin to change for
small nonzero $\varepsilon$. (It will be clear that while perturbation
methods suffice to solve this redefined problem,
they may not be sufficiently powerful to describe the
solutions of the original problem (1) as it may not be
possible to allow $\varepsilon$ to be as large as $\varepsilon = 1$.) In this case,
perturbation theory amounts to the invocation of the
implicit function theorem, which guarantees that, since
$P_0'(1) = 5 \neq 0$, there is a unique solution $x = x(\varepsilon)$ of
(2) that satisfies $x(0) = 1$ and that can be expanded
in a convergent (for $|\varepsilon|$ sufficiently small) power series
in powers of $\varepsilon$. On the other hand, perturbation methods
are concerned with the effective construction of the
series itself. We do not attempt to find a closed-form
expression for the general term in the series; rather,
we find the terms iteratively by the following simple
procedure amounting to an algorithm to compute the
first $N + 1$ terms. We write

\[
x(\varepsilon) = \sum_{n=0}^{N} x_n \varepsilon^n + R_N(\varepsilon),
\]

where the remainder term in the Taylor expansion satisfies
$\varepsilon^{-N} R_N(\varepsilon) \to 0$ as $\varepsilon \to 0$. By substituting this
expression into (2) and expanding out the multinomials
$x(\varepsilon)^p$ that occur there, one rewrites (2) in the form

\[
\sum_{n=0}^{N} p_n \varepsilon^n + Q_N(\varepsilon) = 0, \tag{3}
\]

where the $p_n$ are certain well-defined expressions in
terms of $\{x_0, \dots, x_N\}$ and $Q_N(\varepsilon)$ is a remainder that,
like $R_N(\varepsilon)$, satisfies $\varepsilon^{-N} Q_N(\varepsilon) \to 0$ as $\varepsilon \to 0$. Since (3)
should hold for all sufficiently small $\varepsilon$, it is easy to show
that we must have $p_n = 0$ for all $n = 0, \dots, N$. This
is a system of equations for the unknown coefficients
$\{x_0, \dots, x_N\}$. The first few values of $p_n$ are

\begin{align*}
p_0 &:= P_0(x_0), \\
p_1 &:= P_0'(x_0)\, x_1 + P_1(x_0), \\
p_2 &:= P_0'(x_0)\, x_2 + \tfrac{1}{2} P_0''(x_0)\, x_1^2 + P_1'(x_0)\, x_1.
\end{align*}

These display a useful triangular structure, in that $p_n$
depends only on $\{x_0, \dots, x_n\}$, and we also note that
$p_n$ is linear in $x_n$. These features actually hold for all
$n$, and they allow the construction of the series coefficients
in $x(\varepsilon)$ in a completely systematic fashion once
the unperturbed solution $x_0 = 1$ is specified. Indeed,
considering the equations $p_1 = 0$, $p_2 = 0$, and so on in
order, we see that

\begin{align*}
x_1 &= -\frac{P_1(x_0)}{P_0'(x_0)}, \\
x_2 &= -\frac{P_1'(x_0)\, x_1}{P_0'(x_0)} - \frac{P_0''(x_0)\, x_1^2}{2 P_0'(x_0)}
     = \frac{P_1(x_0) P_1'(x_0)}{P_0'(x_0)^2} - \frac{P_1(x_0)^2 P_0''(x_0)}{2 P_0'(x_0)^3},
\end{align*}

and so on. Note that the denominators are nonzero
under exactly the same condition that the implicit function
theorem applies. In this way, the perturbation
series coefficients $x_n$ are systematically determined
one after the other. Note also that if we were interested
in complex roots $x$ of the quintic, we could equally well
have started developing the perturbation series from
any of the five complex roots of unity $x_0 = e^{2\pi i k/5}$ for
$k = 0, 1, 2, 3, 4$.
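The recursive construction just described can be sketched numerically. The following is a minimal illustration, not a general-purpose implementation: it evaluates the formulas for $x_1$ and $x_2$ at $x_0 = 1$ and compares the resulting partial sums with a root of (2) found by Newton's method. The function names and the sample coefficients are hypothetical choices made for this sketch.

```python
def series_coefficients(a):
    """Return (x0, x1, x2) for the root of P0 + eps*P1 near x0 = 1.

    a = [a0, a1, a2, a3, a4] are the coefficients appearing in P1.
    """
    a0, a1, a2, a3, a4 = a
    x0 = 1.0
    dP0 = 5.0 * x0 ** 4            # P0'(x0)
    d2P0 = 20.0 * x0 ** 3          # P0''(x0)
    P1 = a4 * x0 ** 4 + a3 * x0 ** 3 + a2 * x0 ** 2 + a1 * x0 + a0 + 1.0
    dP1 = 4.0 * a4 * x0 ** 3 + 3.0 * a3 * x0 ** 2 + 2.0 * a2 * x0 + a1
    x1 = -P1 / dP0                                        # from p1 = 0
    x2 = -dP1 * x1 / dP0 - d2P0 * x1 ** 2 / (2.0 * dP0)   # from p2 = 0
    return x0, x1, x2


def perturbed_root(a, eps, tol=1e-14, max_iter=100):
    """Newton iteration for P0(x) + eps*P1(x) = 0, started at x = 1."""
    a0, a1, a2, a3, a4 = a
    x = 1.0
    for _ in range(max_iter):
        value = x ** 5 - 1.0 + eps * (
            a4 * x ** 4 + a3 * x ** 3 + a2 * x ** 2 + a1 * x + a0 + 1.0)
        slope = 5.0 * x ** 4 + eps * (
            4.0 * a4 * x ** 3 + 3.0 * a3 * x ** 2 + 2.0 * a2 * x + a1)
        step = value / slope
        x -= step
        if abs(step) < tol:
            break
    return x


# Hypothetical sample coefficients [a0, a1, a2, a3, a4].
coeffs = [0.5, -1.0, 2.0, 0.3, 1.0]
x0, x1, x2 = series_coefficients(coeffs)
for eps in (0.04, 0.02, 0.01):
    root = perturbed_root(coeffs, eps)
    print(eps,
          abs(root - x0),
          abs(root - (x0 + eps * x1)),
          abs(root - (x0 + eps * x1 + eps ** 2 * x2)))
```

For small $\varepsilon$ the three printed errors should shrink roughly like $\varepsilon$, $\varepsilon^2$, and $\varepsilon^3$, respectively, in line with the remainder estimate for the Taylor expansion.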

This example shows several of the most elementary 
features of perturbation methods: 

(i) The effect of perturbation of the parameter $\varepsilon$ from
the special value $\varepsilon = 0$ is to introduce corrections
to the unperturbed solution $x_0$ in the form
of an infinite perturbation series of corrections of
higher and higher order.

(ii) Once the leading term of the series has been
obtained by solving the reduced problem with
$\varepsilon = 0$, the subsequent terms of the perturbation
series are all obtained by solving inhomogeneous
linear equations of the form $P_0'(x_0) u = f$,
where $f$ is given in terms of previously calculated
terms and $P_0'(x_0)$ denotes the linearization of the
unperturbed problem about its exact solution $x_0$.

The latter feature makes perturbation methods an 
attractive way to attack nonlinear problems, as the 
procedure for calculating corrections always involves 
solving only linear problems. 

2 Asymptotic Expansions 
2.1 Motivation 

Consider the second-order differential equation for
$y(x)$:

\[
-\varepsilon x^3 y''(x) + y(x) = x^2. \tag{4}
\]

Let us try to solve (4) for $x > 0$. This equation generally
has no elementary solutions, but we may notice
that when $\varepsilon = 0$ it is obvious that $y(x) = x^2$. Taking
a perturbative approach to include the effect of the
neglected term, we may seek a solution in the form of
a power series in $\varepsilon$:

\[
y(x) \sim \sum_{n=0}^{\infty} y_n(x) \varepsilon^n, \qquad y_0(x) := x^2. \tag{5}
\]

The notation “$\sim$” will be properly explained below in
section 2.2; for now the reader may think of it as “=”.
Substituting this series into (4) and equating the terms
with corresponding powers of $\varepsilon$ gives the recurrence
relation

\[
y_n(x) = x^3 y_{n-1}''(x), \qquad n > 0. \tag{6}
\]

This recurrence is easily solved, and one finds that
$y_n(x) = (n+1)!\, n!\, x^{n+2}$ for all $n \ge 0$, and therefore
the series (5) becomes

\[
y(x) \sim \sum_{n=0}^{\infty} (n+1)!\, n!\, x^{n+2} \varepsilon^n. \tag{7}
\]


Now let us consider carefully the meaning of the
power series in $\varepsilon$ on the right-hand side of (7). The
absolute value of the ratio of successive terms in the series
is

\[
\left| \frac{y_{n+1}(x)\, \varepsilon^{n+1}}{y_n(x)\, \varepsilon^n} \right|
= (n+2)(n+1)\, x\, |\varepsilon|,
\]

and this ratio blows up as $n \to \infty$ regardless of the
value of $x > 0$, unless of course $\varepsilon = 0$. Therefore, by
the ratio test, the series on the right-hand side of (7)
diverges (has no finite sum) for every value of $x > 0$
unless $\varepsilon = 0$. Another way of saying the same thing is
that the error or remainder $R_N(x, \varepsilon)$ upon truncating
the series after the term proportional to $\varepsilon^N$,

\[
R_N(x, \varepsilon) := y(x) - \sum_{n=0}^{N} (n+1)!\, n!\, x^{n+2} \varepsilon^n,
\]

does not tend to zero (or, for that matter, any finite
limit) as $N \to \infty$, no matter what values we choose for
$x$ and $\varepsilon$. It is simply not possible to use partial sums of
the series (7) to get better and better approximations to
$y(x)$ by including more and more terms in the partial
sum.
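The divergence can be made concrete by tabulating the terms of (7). The sketch below, with hypothetical helper names and an arbitrarily chosen sample point $x = 1$, $\varepsilon = 0.01$, locates the index at which the terms stop decreasing; it illustrates the ratio formula above and makes no claim about the true solution $y(x)$.

```python
from math import factorial


def series_term(n, x, eps):
    """n-th term (n+1)! n! x^(n+2) eps^n of the formal series (7)."""
    return factorial(n + 1) * factorial(n) * x ** (n + 2) * eps ** n


def smallest_term_index(x, eps, n_max=60):
    """Index n in 0..n_max at which the term magnitude is smallest."""
    terms = [abs(series_term(n, x, eps)) for n in range(n_max + 1)]
    return min(range(n_max + 1), key=terms.__getitem__)


x, eps = 1.0, 0.01          # arbitrary sample values
n_star = smallest_term_index(x, eps)
print("terms decrease up to n =", n_star)
for n in (0, n_star, 2 * n_star, 4 * n_star):
    print(n, series_term(n, x, eps))
```

For these sample values the ratio $(n+2)(n+1)x|\varepsilon|$ first exceeds 1 when passing from $n = 9$ to $n = 10$, so the terms shrink up to $n = 9$ and grow without bound afterwards.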

On the other hand, there does indeed exist a particular
solution $y(x)$ of (4) for which the partial sums
are good approximations. The key idea is the following:
instead of trying to choose the parameter $N$ (the
number of terms) to make the remainder $R_N(x, \varepsilon)$ small
for fixed $\varepsilon > 0$, try to choose the parameter $\varepsilon$ small
enough that $R_N(x, \varepsilon)$ is less than some tolerance for
$N$ fixed. It turns out that there is a positive constant
$K_N(x)$ independent of $\varepsilon$ such that

\[
|R_N(x, \varepsilon)| \le K_N(x)\, \varepsilon^{N+1}. \tag{8}
\]

This fact shows that for each given $N$, the error in
approximating $y(x)$ by the $N$th partial sum of the
series (7) tends to zero with $\varepsilon$, and it does so at a rate
depending on the number of retained terms in the sum.
This is the property that makes the series on the
right-hand side of (7) an asymptotic expansion of $y(x)$. We
must understand that it makes no sense to add up all
of the terms in the series, but the partial sums are good
approximations of $y(x)$ when $\varepsilon$ is small, and the error
of approximation goes to zero faster with $\varepsilon$ the more
terms are kept in the partial sum.

2.2 Definitions and Notation 

We begin to formalize some of these ideas by introducing
some simple standard notation due to Edmund
Landau for estimates involving functions of $\varepsilon$. Let
$f(\varepsilon, p)$ be a function of $\varepsilon$ for $\varepsilon$ sufficiently small,
depending on some auxiliary parameter $p$. We write

\[
f(\varepsilon, p) = O(g(\varepsilon)), \quad \varepsilon \to 0,
\]

and say that “$f$ is big-oh of $g$” if there exists some
$K(p) > 0$ such that

\[
|f(\varepsilon, p)| \le K(p)\, |g(\varepsilon)|
\]

for each $p$ and for all $\varepsilon$ small enough. If $K(p)$ can be
chosen to be independent of $p$, then $f$ is big-oh of $g$
uniformly with respect to $p$. We write

\[
f(\varepsilon, p) = o(g(\varepsilon)), \quad \varepsilon \to 0,
\]

and say that “$f$ is little-oh of $g$” if for every $K > 0$ there
is some $\delta(p, K) > 0$ such that, for each $p$,

\[
|\varepsilon| \le \delta(p, K) \implies |f(\varepsilon, p)| \le K\, |g(\varepsilon)|.
\]

As with big-oh, if $\delta$ is independent of $p$, then $f$ is
little-oh of $g$ uniformly with respect to $p$. If $g$ is a function
that is nonzero for all sufficiently small $\varepsilon \neq 0$, then
$f = o(g)$ is the same thing as asserting that $f/g \to 0$
as $\varepsilon \to 0$. (This is often used in the special case when
$g(\varepsilon) = 1$.) Heuristically, $f = O(g)$ means that $f$ “is no
bigger than” $g$ in a neighborhood of $\varepsilon = 0$, while
$f = o(g)$ means that $f$ “is much smaller than” $g$ in the limit
$\varepsilon \to 0$. The convenience of Landau’s notation is that it
avoids reference to various constants that always occur
in estimates. For example, (8) could easily be written
without reference to the constant $K_N(x)$ in the form
$R_N(x, \varepsilon) = O(\varepsilon^{N+1})$ as $\varepsilon \to 0$.

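A sequence of functions $\{\phi_n(\varepsilon)\} = \{\phi_n(\varepsilon)\}_{n=0}^{\infty}$ is
called an asymptotic sequence in the limit $\varepsilon \to 0$ if, for
each $n$, $\phi_{n+1}(\varepsilon) = o(\phi_n(\varepsilon))$ as $\varepsilon \to 0$. Given an
asymptotic sequence $\{\phi_n(\varepsilon)\}$ and an arbitrary numerical
sequence $\{a_n\} = \{a_n\}_{n=0}^{\infty}$, the purely formal series

\[
\sum_{n=0}^{\infty} a_n \phi_n(\varepsilon)
\]

is called an asymptotic series. Such a series is said to be
an asymptotic expansion of a function $f(\varepsilon)$, written in
the form

\[
f(\varepsilon) \sim \sum_{n=0}^{\infty} a_n \phi_n(\varepsilon), \quad \varepsilon \to 0, \tag{9}
\]

if, for each $N = 0, 1, 2, \dots$,

\[
f(\varepsilon) - \sum_{n=0}^{N} a_n \phi_n(\varepsilon) = o(\phi_N(\varepsilon)), \quad \varepsilon \to 0. \tag{10}
\]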

From this relation it follows that, if $f(\varepsilon)$ has an asymptotic
expansion with respect to $\{\phi_n(\varepsilon)\}$, then the
coefficients $\{a_n\}$ are uniquely determined by the recursive
sequence of limits

\[
a_n := \lim_{\varepsilon \to 0} \frac{1}{\phi_n(\varepsilon)}
\left[ f(\varepsilon) - \sum_{k=0}^{n-1} a_k \phi_k(\varepsilon) \right].
\]

Indeed, the existence of each of these limits in turn is
equivalent to the assertion that $f$ has an asymptotic
expansion with respect to the sequence $\{\phi_n(\varepsilon)\}$. On
the other hand, the function $f$ is most certainly not
determined uniquely given the asymptotic sequence
$\{\phi_n(\varepsilon)\}$ and the coefficients $\{a_n\}$; given $f(\varepsilon)$ satisfying
(10), $f(\varepsilon) + g(\varepsilon)$ will also satisfy (10) if $g(\varepsilon) = o(\phi_n(\varepsilon))$
as $\varepsilon \to 0$ for all $n$. This condition by no means forces
$g(\varepsilon) \equiv 0$; such a function $g$ that is too small in the limit
as $\varepsilon \to 0$ to have any effect on the coefficients $\{a_n\}$ is
said to be beyond all orders with respect to $\{\phi_n(\varepsilon)\}$.

The simplest example of an asymptotic sequence,
and one that occurs frequently in applications, is the
sequence of integer powers $\phi_n(\varepsilon) := \varepsilon^n$. In this context,
a function $g$ that is beyond all orders is sometimes
called transcendentally small or exponentially
small; indeed, a particular example of such a function
is $g(\varepsilon) = \exp(-|\varepsilon|^{-1})$.
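A quick numerical check of the last claim can be made by tabulating the ratio $g(\varepsilon)/\varepsilon^n$ for a few powers $n$. This is only a sketch, with a hypothetical function name and arbitrarily chosen sample values of $\varepsilon$; it illustrates, but of course does not prove, that the ratio tends to zero for every $n$.

```python
from math import exp


def ratio_to_power(eps, n):
    """Return g(eps) / eps**n for g(eps) = exp(-1/|eps|)."""
    return exp(-1.0 / abs(eps)) / eps ** n


sample_eps = (0.1, 0.05, 0.02, 0.01)   # arbitrary decreasing sample values
for n in (1, 5, 10):
    print(n, [ratio_to_power(eps, n) for eps in sample_eps])
```

Each row decreases rapidly toward zero as $\varepsilon$ shrinks, even for $n = 10$, where the division by $\varepsilon^{10}$ initially produces large values.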

Now, the notation “$\sim$” used in (9) should be strongly
contrasted with the standard notation used for convergent
series:

\[
f(\varepsilon) = \sum_{n=0}^{\infty} a_n \phi_n(\varepsilon). \tag{11}
\]

The use of “=” here implies in particular that the
expressions on both sides are the same kind of object:
functions of $\varepsilon$ with well-defined values. Furthermore,
the only way to unambiguously assign a numerical
value to the infinite series on the right-hand side given
a value of $\varepsilon$ is to sum the series, that is, to compute
the limit of the sequence of partial sums. By contrast,
the meaning (10) given to the expression (9) in no way
implies that the series on the right-hand side can be
summed for any $\varepsilon$ at all. Therefore, we view the relation
(9) as defining an infinite hierarchy of approximations
to the function $f(\varepsilon)$ given by (well-defined) partial
sums of the formal series; each subsequent partial
sum is a better approximation than the preceding one
when the error is made small by letting $\varepsilon$ tend to zero,
precisely because (10) holds and the functions $\{\phi_n(\varepsilon)\}$
form an asymptotic sequence. However, it need not be
the case that the error in approximating $f(\varepsilon)$ by the
$N$th partial sum can be made small by fixing some $\varepsilon$
and letting $N$ increase. Only in the latter case can we
use the convergent series notation (11).

The subject of perturbation theory is largely concerned
with determining the nature of a series obtained
via formal perturbation methods. Some perturbation
series are both convergent (for sufficiently small $\varepsilon$) and
asymptotic as $\varepsilon \to 0$; this was the case in the example
of the perturbation of the unperturbed root $x_0 = 1$ in
the root-finding problem in section 1.1. On the other
hand, most perturbation series are divergent, as in the
case of the expansion considered in section 2.1, and
in such cases proving the validity of the perturbation
series requires establishing the existence of a true,
$\varepsilon$-dependent solution of the problem at hand to which
the perturbation series is asymptotic in the sense of
the definition (9), (10). This in turn usually amounts to
formulating a mathematical problem (e.g., a differential
equation with side conditions) satisfied by the remainder
and applying an appropriate fixed-point or iteration
argument.

Some related notation used in papers on the subject
includes the following. The notation $f \ll g$ is frequently
used in place of $f = o(g)$. Also, one sometimes
sees the notation $f \lesssim g$ for $f = O(g)$. It should also
be remarked that the symbol “$\sim$” often appears as a
relation between functions in the following two senses:

(i) $f(\varepsilon) \sim g(\varepsilon)$ may indicate that both $f(\varepsilon) =
O(g(\varepsilon))$ and also $g(\varepsilon) = O(f(\varepsilon))$ (that is, $f$ is
bounded both above and below by multiples of $g$)
as $\varepsilon \to 0$.

(ii) $f(\varepsilon) \sim g(\varepsilon)$ may indicate that $f(\varepsilon)/g(\varepsilon) \to 1$ as
$\varepsilon \to 0$, a special case of the above notation.

To avoid any confusion, we will use the symbol “$\sim$” only
in the sense defined by (9), (10).

The theory of asymptotic expansions applies in a
number of contexts beyond its application to perturbation
problems. As just one very important example,
it is the basis for a collection of very well-developed
methods for approximating certain types of integrals.
Key methods include Laplace’s method for the asymptotic
expansion of real integrals with exponential integrands,
Kelvin’s method of stationary phase for the
asymptotic expansion of oscillatory integrals, and the
method of steepest descent (or saddle-point method)
applying to integrals with analytic integrands. Readers
can find detailed information about these useful methods
in the books of Bleistein and Handelsman (1986),
Wong (2001), and Miller (2006).

3 Types of Perturbation Problems 

Perturbation problems are frequently categorized as 
being either regular or singular. The distinction is not a 
precise one, so there is not much point in giving careful 
definitions. However, the two kinds of problems often 
require different methods, so it is worth considering 
which type a given problem most resembles. 

3.1 Regular Perturbation Problems 

A regular perturbation problem is one in which the
perturbed problem ($\varepsilon \neq 0$) is of the same general “type” as
the unperturbed problem ($\varepsilon = 0$) that can be solved easily.
Regular perturbation problems often lead to series
that are both asymptotic as $\varepsilon \to 0$ and convergent for
sufficiently small $\varepsilon$.

One example of a regular perturbation problem is
that of finding the energy levels of a perturbed quantum
mechanical system. Consider a particle moving in
one space dimension subject to a force $F = -V'(x)$,
where $x$ is the position of the particle. The problem
is to find nontrivial square-integrable “bound states”
$\psi(x)$ and corresponding energy levels $E \in \mathbb{R}$ such
that Schrödinger’s equation $\mathcal{H}\psi = E\psi$ holds, where
$\mathcal{H} = \mathcal{H}_0 + \varepsilon \mathcal{H}_1$ and

\[
\mathcal{H}_0 \psi(x) := -\psi''(x) + V_0(x)\psi(x), \qquad
\mathcal{H}_1 \psi(x) := V_1(x)\psi(x).
\]

Here we have artificially separated the potential energy
function $V$ into two parts, $V = V_0 + \varepsilon V_1$; the idea is
to choose $V_0$ such that when $\varepsilon = 0$ it is easy to solve
the problem by finding a nonzero function $\psi_0 \in L^2(\mathbb{R})$
and a number $E_0$ that satisfy $\mathcal{H}_0 \psi_0 = E_0 \psi_0$. In this
one-dimensional setting it turns out that, given $E_0$,
all solutions of this equation are proportional to $\psi_0$
(which makes the energy level $E_0$ “nondegenerate” in
the language of quantum mechanics). By choosing an
appropriate scaling factor, we may assume that $\psi_0$ is
“normalized” to satisfy

\[
\int_{\mathbb{R}} \psi_0(x)^2 \, dx = 1. \tag{12}
\]

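Now, to calculate the effect of the perturbation, we
may suppose that both $\psi$ and $E$ are expandable in
asymptotic power series in $\varepsilon$:

\[
\psi \sim \sum_{n=0}^{\infty} \varepsilon^n \psi_n, \qquad
E \sim \sum_{n=0}^{\infty} \varepsilon^n E_n, \qquad \varepsilon \to 0. \tag{13}
\]

The coefficients then necessarily satisfy the hierarchy
of equations:

\[
\mathcal{H}_0 \psi_n - E_0 \psi_n = -\mathcal{H}_1 \psi_{n-1} + \sum_{j=1}^{n} E_j \psi_{n-j}. \tag{14}
\]

Denote the right-hand side of this equation by $f_n(x)$.
Then (14) has a solution $\psi_n(x)$ if and only if $f_n(x)$
satisfies a solvability condition (stemming from the
Fredholm alternative):

\[
\int_{\mathbb{R}} \psi_0(x) f_n(x) \, dx = 0.
\]

This solvability condition is in fact a recursive formula
for $E_n$ in disguise:

\[
E_n = \int_{\mathbb{R}} \psi_0(x)\, \mathcal{H}_1 \psi_{n-1}(x) \, dx
- \sum_{j=1}^{n-1} E_j \int_{\mathbb{R}} \psi_0(x) \psi_{n-j}(x) \, dx, \tag{15}
\]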

where we have used (12). Once $E_n$ is determined from
this relation, the equation (14) can be solved for $\psi_n$, but
the latter is only determined modulo multiples of $\psi_0$;
one typically chooses the correct multiple of $\psi_0$ to
add in order that $\psi_n$ be orthogonal to $\psi_0$ in the sense
that

\[
\int_{\mathbb{R}} \psi_0(x) \psi_n(x) \, dx = 0, \qquad n > 0. \tag{16}
\]

Subject to (15), (14) has a unique solution determined
by the auxiliary condition (16). Note that this condition
actually ensures that the sum on the right-hand side of
(15) equals zero.

The perturbation expansions (13) of the pair $(\psi, E)$
are known as Rayleigh-Schrödinger series. Under suitable
conditions on the operators $\mathcal{H}_0$ and $\mathcal{H}_1$ it can be
shown (by the method of Lyapunov-Schmidt reduction
to eliminate the eigenfunction) that the power series
(13) are actually convergent series, and hence “$\sim$” and
“=” can be used interchangeably in this context.
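To see the first step of the recursion at work, consider a concrete case chosen here purely for illustration (it does not appear in the text): $V_0(x) = x^2$ and $V_1(x) = x^2$. For $\mathcal{H}_0 = -d^2/dx^2 + x^2$ the normalized ground state is $\psi_0(x) = \pi^{-1/4} e^{-x^2/2}$ with $E_0 = 1$, and the perturbed operator $-d^2/dx^2 + (1+\varepsilon)x^2$ has the exact ground-state energy $\sqrt{1+\varepsilon}$. The sketch below evaluates (12) and the $n = 1$ case of (15) by simple quadrature; the helper names are hypothetical.

```python
from math import exp, pi, sqrt


def psi0(x):
    """Normalized ground state of H0 = -d^2/dx^2 + x^2 (energy E0 = 1)."""
    return pi ** -0.25 * exp(-0.5 * x * x)


def V1(x):
    """Perturbing potential, an assumption made for this illustration."""
    return x * x


def integrate(func, a=-10.0, b=10.0, n=4000):
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h


norm = integrate(lambda x: psi0(x) ** 2)               # condition (12)
E1 = integrate(lambda x: psi0(x) * V1(x) * psi0(x))    # (15) with n = 1

eps = 0.1
first_order = 1.0 + eps * E1      # E0 + eps * E1
exact = sqrt(1.0 + eps)           # exact energy for this special choice
print(norm, E1, first_order, exact)
```

For this choice the quadrature gives $E_1 \approx 1/2$, matching the linear term in the Taylor expansion of $\sqrt{1+\varepsilon}$, and the discrepancy between the first-order estimate and the exact energy is of order $\varepsilon^2$.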


3.2 Singular Perturbation Problems 

In a singular perturbation problem, the perturbed and
unperturbed problems are different in some essential
way. The most elementary examples involve root
finding. Consider the problem of finding the roots of
the polynomial $P(x) := \varepsilon x^3 - x + 1 = 0$ when $\varepsilon$ is small
and positive. The unperturbed problem (with $\varepsilon = 0$) is
to solve the linear equation $-x + 1 = 0$, which of course
has the unique solution $x = 1$. However, the perturbed
problem for $\varepsilon \neq 0$ is to find the roots of a cubic, and
by the fundamental theorem of algebra there are three
such roots. The perturbed and unperturbed problems
are of different types because setting $\varepsilon = 0$ changes the
degree of the equation.

Somehow, two of the roots disappear altogether from
the complex plane as $\varepsilon \to 0$. Where can they go? Some
intuition is obtained in this case simply by looking at
a graph of $P(x)$ when $\varepsilon$ is very small; while one of the
roots looks to be close to $x = 1$ (the unique root of the
unperturbed problem), the other two are very large in
magnitude and of opposite signs. So the answer is that
the two “extra” roots go to infinity as $\varepsilon \to 0$.

To completely solve this problem using perturbation
methods, we need to capture all three roots. The root
near $x = 1$ when $\varepsilon$ is small can be expanded in a power
series in $\varepsilon$ whose coefficients can be found recursively
using exactly the same methodology as in section 1.1.
Finding the remaining two roots requires another idea.

The key idea is to try to pull the two escaping roots
back to the finite complex plane by rescaling them by
an appropriate power of $\varepsilon$. Let $p > 0$ be given, and write
$x = y\varepsilon^{-p}$. This is a change of variables in our problem
that takes a large value of $x$, proportional to $\varepsilon^{-p}$, and
produces a value of $y$ that does not grow to infinity as
$\varepsilon \to 0$. In terms of $y$, the root-finding problem at hand
takes the form

\[
\varepsilon^{1-3p} y^3 - \varepsilon^{-p} y + 1 = 0. \tag{17}
\]

Now, $p > 0$ is undetermined so far, but we will choose
it (and hence determine the rate at which the two roots
are escaping to infinity) using the principle of dominant
balance.

By a balance we simply mean a pair of terms on
the left-hand side of (17) having the same power of $\varepsilon$
(by choice of $p > 0$). A balance is called dominant if
all other terms on the left-hand side are big-oh of the
terms involved in the balance. The principle of dominant
balance asserts that only dominant balances lead
to possible perturbation expansions. There are three
pairs of terms to choose from, and hence three possible
balances to consider:

(i) Balancing $\varepsilon^{1-3p} y^3$ with 1 requires choosing $p = \tfrac{1}{3}$.
The terms involved in the balance are then
both proportional to $\varepsilon^0$, while the remaining term
is proportional to $\varepsilon^{-1/3}$, so this balance is not
dominant.

(ii) Balancing $\varepsilon^{-p} y$ with 1 requires choosing $p = 0$.
The terms involved in the balance are then proportional
to $\varepsilon^0$, making the balance dominant over
the remaining term as $\varepsilon \to 0$. Since $p = 0$ this
rescaling has had no effect ($y = x$), and in fact
setting the sum of the dominant balance terms to
zero recovers the original unperturbed problem.
No new information is gained.

(iii) Balancing $\varepsilon^{1-3p} y^3$ with $\varepsilon^{-p} y$ requires choosing
$p = \tfrac{1}{2}$. The terms involved in the balance are then
both proportional to $\varepsilon^{-1/2}$, making the balance
dominant over the remaining term as $\varepsilon \to 0$. This
is a new dominant balance.

The new dominant balance will lead to perturbation
expansions of the two large roots of the original problem.
Indeed, with $p = \tfrac{1}{2}$, for $\varepsilon \neq 0$ our problem takes
the form

\[
y^3 - y + \varepsilon^{1/2} = 0, \tag{18}
\]

which now appears as a perturbation of the equation
$y^3 - y = 0$. The latter has three roots: $y = y_0$ with $y_0 = 0$
or $y_0 = \pm 1$. Obviously, if (18) has solutions $y$ that
are close to $y_0 = \pm 1$ when $\varepsilon$ is small, then the original
problem will have two corresponding solutions that are
roughly proportional to $\varepsilon^{-1/2}$. These will then be our
two missing roots. The expansion procedure for (18)
with $y = y_0 = \pm 1$ for $\varepsilon = 0$ is similar to that described
in section 1.1, except that $y(\varepsilon)$ will be a power series
in $\varepsilon^{1/2}$. The implicit function theorem applies to (18),
showing that the perturbation series for $y(\varepsilon)$ will be
convergent for $\varepsilon$ sufficiently small, in addition to being
asymptotic in the limit $\varepsilon \to 0$. Scaling $y$ by $\varepsilon^{-1/2}$ then
yields series representations for the two large roots $x$
of the original problem in the form

\[
x \sim \sum_{n=0}^{\infty} y_n \varepsilon^{(n-1)/2}, \quad \varepsilon \to 0,
\]

with $y_0 = \pm 1$. As already mentioned, in this case “$\sim$”
could be replaced by “=” if $\varepsilon$ is small enough.
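The three expansions can be checked against numerically computed roots. In the sketch below, the two-term approximations $x \approx 1 + \varepsilon$ for the regular root and $x \approx \pm\varepsilon^{-1/2} - \tfrac{1}{2}$ for the large roots come from carrying the procedures above one step beyond leading order (the coefficient $y_1 = -\tfrac{1}{2}$ follows from the order-$\varepsilon^{1/2}$ terms of (18)); these explicit coefficients and the helper names are worked out here for illustration and are not quoted from the text.

```python
def newton_root(eps, guess, tol=1e-14, max_iter=100):
    """Refine a root of P(x) = eps*x**3 - x + 1 by Newton's method."""
    x = guess
    for _ in range(max_iter):
        step = (eps * x ** 3 - x + 1.0) / (3.0 * eps * x ** 2 - 1.0)
        x -= step
        if abs(step) < tol * max(1.0, abs(x)):
            break
    return x


def asymptotic_guesses(eps):
    """Two-term approximations of the regular root and the two large roots."""
    regular = 1.0 + eps
    large_pos = eps ** -0.5 - 0.5
    large_neg = -(eps ** -0.5) - 0.5
    return regular, large_pos, large_neg


eps = 1e-4                       # arbitrary small sample value
guesses = asymptotic_guesses(eps)
roots = [newton_root(eps, g) for g in guesses]
errors = [abs(r - g) for r, g in zip(roots, guesses)]
for g, r, e in zip(guesses, roots, errors):
    print(g, r, e)
```

The error of the regular approximation should be of order $\varepsilon^2$, while the errors of the two large-root approximations should be of order $\varepsilon^{1/2}$, the size of the first neglected term in each series.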

Another major category of singular perturbation
problems consists of those involving differential equations in
which the small parameter $\varepsilon$ multiplies the highest-order
derivatives. In such a case, setting $\varepsilon = 0$ replaces
the differential equation by another one of strictly
lower order. This is clearly analogous to the algebraic
degeneracy described above in that the number of solutions
(this time counted in terms of the dimension of
some space of integration constants) is different for
the perturbed and unperturbed problems. We now discuss
some common perturbation methods that are generally
applicable to singular perturbation problems of
this latter sort.

3.2.1 WKB Methods and Generalizations 

WKB methods (named after Wentzel, Kramers, and Brillouin)
concern differential equations that are so singularly
perturbed that upon setting $\varepsilon$ to zero there are
no derivatives left at all. A sufficiently rich example
equation with this property is the Sturm-Liouville, or
stationary Schrödinger, equation

\[
\varepsilon^2 \psi''(x) + f(x)\psi(x) = 0, \tag{19}
\]

where $f(x)$ is a given coefficient and solutions $\psi =
\psi(x; \varepsilon)$ are desired for sufficiently small $\varepsilon > 0$. Obviously,
for the two terms on the left-hand side to sum to
zero when $f$ is independent of $\varepsilon$, derivatives of $\psi$ must
be very large compared with $\psi$ itself. The essence of
the WKB method is to take this into account by making
a substitution of the form

\[
\psi(x; \varepsilon) = \exp\left( \frac{1}{\varepsilon} \int^{x} u(\xi; \varepsilon) \, d\xi \right), \tag{20}
\]

and in terms of the new unknown $u$, (19) becomes a
first-order nonlinear equation of Riccati type:

\[
\varepsilon u'(x; \varepsilon) + u(x; \varepsilon)^2 + f(x) = 0. \tag{21}
\]

Unlike (19), this equation admits two nontrivial solutions
when $\varepsilon = 0$, namely $u(x; 0) = \pm(-f(x))^{1/2}$. One
can now develop a perturbation series for $u$ in powers
of $\varepsilon$ in the usual way, starting from each of these
two solutions. These series can be shown to be asymptotic
to certain true solutions of (21) in the sense of
(9), (10), but they are nearly always divergent. When partial
sums of these two distinct series are substituted
into (20), one obtains approximations to two linearly
independent solutions of (19).
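As a worked illustration of the first two steps of this procedure, the following derivation substitutes $u = u_0 + \varepsilon u_1 + O(\varepsilon^2)$ into (21) and collects powers of $\varepsilon$. It is a sketch valid only away from zeros of $f$; the constant $C$ is an arbitrary normalization introduced here.

```latex
% Orders 1 and \varepsilon of (21) with u = u_0 + \varepsilon u_1 + O(\varepsilon^2):
\begin{align*}
O(1):\quad & u_0^2 + f = 0
  && \Longrightarrow \quad u_0 = \pm(-f)^{1/2}, \\
O(\varepsilon):\quad & u_0' + 2 u_0 u_1 = 0
  && \Longrightarrow \quad u_1 = -\frac{u_0'}{2u_0} = -\frac{f'}{4f}.
\end{align*}
% Since \int^x u_1 \, d\xi = -\tfrac14 \log|f(x)| + \text{const}, inserting the
% two-term partial sum into (20) gives
\[
\psi(x;\varepsilon) \approx \frac{C}{|f(x)|^{1/4}}
\exp\left( \pm\frac{1}{\varepsilon} \int^{x} (-f(\xi))^{1/2} \, d\xi \right).
\]
```

The second step uses $2u_0 u_0' = -f'$, which follows from differentiating $u_0^2 = -f$. The algebraic prefactor $|f|^{-1/4}$ becomes unbounded wherever $f$ vanishes, which already hints at the difficulty with such points.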

The WKB method as described above always fails
near points $x$ where the coefficient $f$ vanishes. Such
$x$ are called turning points. What we mean when we
say that the method fails is that the little-oh relation
(10) does not hold uniformly on any open interval of
$x$ with a turning point as an endpoint and, moreover,
there is no solution that is accurately approximated by
partial sums of the WKB expansion over a neighborhood
of $x$ containing a turning point. An important
problem in the theory of equation (19) is the connection
problem, in which a solution described accurately
by a WKB expansion on one side of a turning point is to
be approximated on the other side of the turning point
by an appropriate linear combination of WKB expansions.
This problem can sometimes be solved within
the context of the WKB method by analytically continuing
a solution $\psi$ into the complex $x$-plane and around
a turning point. Such an approach always requires
analyticity of the coefficient function $f$ and in any case fails
to describe the solution $\psi$ in any neighborhood of the
turning point.

A more satisfactory way of dealing with turning 
points and solving connection problems is to general- 
ize the WKB method (that is, generalize the ansatz (20)). 
The following approach is due to Langer. 

By a simultaneous change of independent and dependent
variables of the form $y = g(x)$ and $\psi(x; \varepsilon) =
a(x)\phi(y; \varepsilon)$, respectively, one tries to choose $g$ and $a$
(as smooth functions independent of $\varepsilon$) such that (19)
becomes a perturbation of a model equation:

(i) $\varepsilon^2 \phi''(y) \pm \phi(y) = \varepsilon^2 \alpha_\pm(y)\phi$ in intervals of $x$
where $f(x) \neq 0$;

(ii) $\varepsilon^2 \phi''(y) - y^n \phi(y) = \varepsilon^2 \beta_n(y)\phi$ in intervals of $x$
where $f$ has a turning point (zero) of order $n$.

Here, $\alpha_\pm(y)$ and $\beta_n(y)$ are smooth functions of $y$
explicitly written in terms of $f$ and its derivatives, and
avoiding terms proportional to $\phi'(y)$ requires relating
$a$ and $g$ by $a(x) = |g'(x)|^{-1/2}$. In each case, the model
equation is obtained by neglecting the terms on the
right-hand side (which turn out to be “more” negligible
than the singular perturbation term $\varepsilon^2 \phi''(y)$).

To solve (19) on an interval without turning points, one may arrive
at case (i) via the (Liouville-Green) transformation

$$y = g(x) = \int_{x_0}^{x} \sqrt{\pm f(\xi)}\,d\xi.$$

This transformation is smooth and invertible in the absence of
turning points. It is easy to confirm that the use of the elementary
exponential solutions of the model equation
$\epsilon^2\phi''(y) \pm \phi(y) = 0$ alone immediately reproduces
the first two terms of the standard WKB expansion (the coefficient
$a(x)$ corresponds to the term in $u$ that is proportional to
$\epsilon$). Treating the error term
$\epsilon^2\alpha_\pm(y)\phi(y)$ perturbatively reproduces the rest
of the WKB series.


To solve (19) on an interval containing a simple turning point $x_0$,
one should take $n = 1$ in case (ii). Arriving at this target
requires choosing the (Langer) transformation

$$y = g(x) = -\operatorname{sgn}(f(x)) \left| \frac{3}{2} \int_{x_0}^{x} \sqrt{|f(\xi)|}\,d\xi \right|^{2/3}.$$

This transformation is smooth and invertible near $x_0$ precisely
because $x_0$ is a simple zero of $f$. In this case, the model
equation $\epsilon^2\phi''(y) - y\phi(y) = 0$ is not solvable by
elementary functions, but it is solvable in terms of special
functions known as Airy functions. Airy functions can be expressed as
certain contour integrals, and this is enough information to allow
the solution of a number of connection problems for simple turning
points without detouring into the complex $x$-plane. For double
turning points ($n = 2$) one needs a different transformation $g(x)$
that is again smooth, and instead of Airy functions one has Weber
functions, but again these can be written as integrals, allowing
connection problems to be solved. For $n \ge 3$ the solvability of
the model equation (in terms of useful special functions) becomes a
more serious issue. Nonetheless, when Langer’s generalization applies
it not only allows an alternative approach to connection problems but
also provides accurate asymptotic formulas for solutions of (19) in
full neighborhoods of turning points where standard WKB methodology
fails.

One famous formula that can be obtained with the use of the WKB
method and connection formulas for simple turning points is the
Bohr-Sommerfeld quantization rule of quantum mechanics. Consider the
case in which $f(x)$ takes the form $f(x) = E - V(x)$, where $V$ is a
potential energy function with $V''(x) > 0$, and therefore its graph
has the shape of a “potential well” whose minimum value we take to be
$V = 0$. Then (19) with $\epsilon^2 = \hbar^2/(2m)$ is the equation
satisfied by stationary quantum states for a particle of mass $m$ and
total energy $E$, and $E$ should take on a discrete spectrum of
values such that there exists a solution $\psi$ that is square
integrable on $\mathbb{R}$. If $E > 0$ then $f(x)$ will have exactly
two simple turning points, $x_-(E) < x_+(E)$, and one can use WKB
methods and connection formulas to obtain two nonzero solutions
$\psi_\pm(x)$ that exhibit rapid exponential decay as
$x \to \pm\infty$. $E$ is the energy of a stationary state if and
only if $\psi_-$ is proportional to $\psi_+$. By computing a
Wronskian of these two solutions in the region between the two
turning points and equating the result to zero, one obtains the
Bohr-Sommerfeld quantization rule as a condition for $E > 0$ to be an
energy eigenvalue:

$$\int_{x_-(E)}^{x_+(E)} \sqrt{E - V(x)}\,dx = \pi\epsilon\left(n + \tfrac{1}{2}\right) + O(\epsilon^2).$$
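As a rough numerical illustration of the quantization rule, the following sketch solves it for $E$ by bisection, dropping the $O(\epsilon^2)$ term. The choice $V(x) = x^2$, the restriction to nonnegative integers $n$, and all function names are assumptions made here for illustration; they are not taken from the text. For this particular potential the eigenvalues of $-\epsilon^2\psi'' + x^2\psi = E\psi$ are known to be $E_n = \epsilon(2n+1)$, which gives something to compare against.

```python
import math

def turning_points(V, E, span=50.0, tol=1e-12):
    """Locate x_- < 0 < x_+ with V(x_+-) = E by bisection.
    Assumes (illustrative) V(0) = 0 is the minimum and V increases away from 0."""
    def root(lo, hi):
        # invariant: V(lo) < E <= V(hi)
        while abs(hi - lo) > tol:
            mid = 0.5 * (lo + hi)
            if V(mid) < E:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)
    return root(0.0, -span), root(0.0, span)

def action(V, E, n_nodes=400):
    """Approximate the integral of sqrt(E - V) between the turning points.
    The substitution x = c + h sin(theta) removes the square-root endpoint behavior."""
    xm, xp = turning_points(V, E)
    c, h = 0.5 * (xp + xm), 0.5 * (xp - xm)
    total = 0.0
    for k in range(n_nodes):
        theta = -math.pi / 2 + (k + 0.5) * math.pi / n_nodes  # midpoint rule
        x = c + h * math.sin(theta)
        total += math.sqrt(max(E - V(x), 0.0)) * h * math.cos(theta)
    return total * math.pi / n_nodes

def bohr_sommerfeld_energy(V, eps, n, E_hi=100.0):
    """Solve action(E) = pi*eps*(n + 1/2) for E by bisection (O(eps^2) term dropped)."""
    target = math.pi * eps * (n + 0.5)
    lo, hi = 1e-12, E_hi
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if action(V, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    eps = 0.1
    for n in range(4):
        E = bohr_sommerfeld_energy(lambda x: x * x, eps, n)
        print(n, E, eps * (2 * n + 1))
```

For the quadratic well the rule happens to reproduce the eigenvalues without error; for other wells one should expect agreement only up to the stated $O(\epsilon^2)$ correction.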

3.2.2 Multiple-Scale Methods 

Another class of perturbation methods deals with singular
perturbation problems of a different type in which what makes the
problem singular is that an accurate solution of some differential
equation is required over a very large range of values of the
independent variables. Such problems often masquerade as regular
perturbation problems until the need for accuracy over long time
intervals or large distances is revealed and understood. A typical
example is the weakly anharmonic oscillator, whose displacement
$u = u(t;\epsilon)$ as a function of time $t$ is modeled by the
initial-value problem for

$$u'' + \omega_0^2 u = \epsilon u^3, \tag{22}$$

subject to the initial conditions

$$u(0;\epsilon) = A \quad\text{and}\quad u'(0;\epsilon) = \omega_0 B.$$

When $\epsilon = 0$, $u$ undergoes simple harmonic motion with
frequency $\omega_0$: $u(t;0) = A\cos(\omega_0 t) + B\sin(\omega_0 t)$.
The presence of the cubic perturbation term on the right-hand side
does not modify the order of the differential equation, so this
appears to be a regular perturbation problem (and indeed it is so for
bounded $t$).

To begin to see the difficulty, we should first observe that (22)
conserves the energy:

$$E = \tfrac{1}{2}u'(t;\epsilon)^2 + \tfrac{1}{2}\omega_0^2 u(t;\epsilon)^2 - \tfrac{1}{4}\epsilon u(t;\epsilon)^4.$$

As the energy function has a local minimum in the phase plane at
$(u,u') = (0,0)$, and as this is the global minimum when
$\epsilon = 0$, it follows that for each initial condition pair
$(A,B)$, if $|\epsilon|$ is sufficiently small the corresponding
solution is time-periodic, following a closed orbit in the phase
plane. Now consider solving (22) using a perturbation (power) series
in $\epsilon$. If the assumed series takes the form

$$u(t;\epsilon) \sim \sum_{n=0}^{\infty} \epsilon^n u_n(t), \quad \epsilon \to 0, \tag{23}$$

with $u_0(t) = u(t;0)$, then by substitution and collection of
coefficients of like powers of $\epsilon$ it follows in particular
that the first correction $u_1(t)$ solves

$$u_1'' + \omega_0^2 u_1 = u_0(t)^3, \qquad u_1(0) = u_1'(0) = 0. \tag{24}$$

The forcing function $u_0^3 = (A\cos(\omega_0 t) + B\sin(\omega_0 t))^3$
is known, and it contains terms proportional to the first and third
harmonics. Solving for $u_1(t)$ gives

$$\begin{aligned}
u_1(t) = {} & -\frac{A(A^2 - 3B^2)}{32\omega_0^2}\cos(3\omega_0 t) \\
& - \frac{B(3A^2 - B^2)}{32\omega_0^2}\sin(3\omega_0 t) \\
& + \frac{3(A^2 + B^2)}{8\omega_0}\,[At\sin(\omega_0 t) - Bt\cos(\omega_0 t)] \\
& + \frac{A(A^2 - 3B^2)}{32\omega_0^2}\cos(\omega_0 t) \\
& + \frac{3B(7A^2 + 3B^2)}{32\omega_0^2}\sin(\omega_0 t).
\end{aligned}$$

The terms on the first two lines here are the response to the third
harmonic forcing terms in $u_0(t)^3$, the terms on the third line are
the response to the first harmonic forcing terms in $u_0(t)^3$, and
the last two lines constitute a homogeneous solution necessary to
satisfy the initial conditions. This procedure can be easily
continued, and all of the functions $u_n(t)$ are therefore
systematically determined. The trouble with this procedure is that,
while we know that the true solution $u(t;\epsilon)$ is a periodic
function of $t$, we have already found terms in $u_1(t)$ that are
linearly growing in $t$, produced as a resonant response to forcing
at the fundamental frequency.

Terms growing in $t$ are not troublesome only because they are
nonperiodic; they also introduce nonuniformity with respect to $t$
into the asymptotic condition (10). Indeed, if $t$ becomes as large
as $\epsilon^{-1}$, then the term $\epsilon u_1(t)$ becomes
comparable to the leading term $u_0(t)$, and condition (10) is
violated. Such growing terms in an asymptotic expansion are called
secular terms, from the French word siècle, meaning century, because
they were first recognized in perturbation problems of celestial
mechanics where they would lead to difficulty over long time
intervals of about 100 years.

It is not hard to understand what has gone wrong in the particular
problem at hand. The point is that, while the exact solution is
indeed periodic, its fundamental frequency is not exactly $\omega_0$
but rather is slightly dependent on $\epsilon$. If we could know what
the frequency is in advance, then we might expect the perturbation
series to turn out to be a Fourier series in harmonics of the
fundamental $\epsilon$-dependent frequency.
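The effect of the secular term can be seen numerically. The sketch below is an illustration only: the parameter values, the step size, the finite-difference check, and all function names are choices made here. It integrates (22) with a classical fourth-order Runge-Kutta scheme and compares the result with (a) the two-term regular expansion $u_0 + \epsilon u_1$, where $u_1$ is the solution of (24) consisting of third-harmonic, resonant, and homogeneous terms, and (b) a sinusoid whose frequency is shifted to $\omega_0 - 3\epsilon(A^2+B^2)/(8\omega_0)$, the first-order frequency that this section obtains by the method of multiple scales.

```python
import math

def u0(t, A, B, w0):
    return A * math.cos(w0 * t) + B * math.sin(w0 * t)

def u1(t, A, B, w0):
    """First correction solving u1'' + w0^2 u1 = u0^3, u1(0) = u1'(0) = 0."""
    a3 = A * (A * A - 3 * B * B) / (32 * w0 ** 2)
    b3 = B * (3 * A * A - B * B) / (32 * w0 ** 2)
    sec = 3 * (A * A + B * B) / (8 * w0)          # coefficient of the secular terms
    hom = 3 * B * (7 * A * A + 3 * B * B) / (32 * w0 ** 2)
    return (-a3 * math.cos(3 * w0 * t) - b3 * math.sin(3 * w0 * t)
            + sec * (A * t * math.sin(w0 * t) - B * t * math.cos(w0 * t))
            + a3 * math.cos(w0 * t) + hom * math.sin(w0 * t))

def residual(t, A, B, w0, h=1e-3):
    """Finite-difference check that u1 satisfies its forced equation at time t."""
    d2 = (u1(t + h, A, B, w0) - 2 * u1(t, A, B, w0) + u1(t - h, A, B, w0)) / h ** 2
    return d2 + w0 ** 2 * u1(t, A, B, w0) - u0(t, A, B, w0) ** 3

def reference(eps, A, B, w0, t_end, dt=2e-3):
    """RK4 integration of u'' + w0^2 u = eps u^3; returns a list of (t, u)."""
    def rhs(u, v):
        return v, -w0 ** 2 * u + eps * u ** 3
    u, v = A, w0 * B
    out = [(0.0, u)]
    for i in range(round(t_end / dt)):
        k1u, k1v = rhs(u, v)
        k2u, k2v = rhs(u + 0.5 * dt * k1u, v + 0.5 * dt * k1v)
        k3u, k3v = rhs(u + 0.5 * dt * k2u, v + 0.5 * dt * k2v)
        k4u, k4v = rhs(u + dt * k3u, v + dt * k3v)
        u += dt * (k1u + 2 * k2u + 2 * k3u + k4u) / 6
        v += dt * (k1v + 2 * k2v + 2 * k3v + k4v) / 6
        out.append(((i + 1) * dt, u))
    return out

def max_errors(eps, A, B, w0, t_end):
    """Largest deviation from the RK4 reference for both approximations."""
    w = w0 - 3 * eps * (A * A + B * B) / (8 * w0)  # shifted frequency
    err_naive = err_shifted = 0.0
    for t, u in reference(eps, A, B, w0, t_end):
        naive = u0(t, A, B, w0) + eps * u1(t, A, B, w0)
        shifted = A * math.cos(w * t) + B * math.sin(w * t)
        err_naive = max(err_naive, abs(naive - u))
        err_shifted = max(err_shifted, abs(shifted - u))
    return err_naive, err_shifted

if __name__ == "__main__":
    eps = 0.05
    print(max_errors(eps, 1.0, 0.0, 1.0, t_end=4 / eps))
```

On a time interval of length several multiples of $1/\epsilon$, the regular expansion should drift visibly away from the reference while the frequency-shifted sinusoid stays close to it, which is the nonuniformity described above.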

The method of multiple scales is a systematic method of removing
secular terms from asymptotic expansions, and in the problem at hand
it leads automatically to a power-series expansion of the correct
frequency as a function of $\epsilon$. The “scales” in the name of
the method are multiples of the independent variable of the problem;
we introduce a number of variables of the form $T_k := \epsilon^k t$
for $k = 0, 1, 2, \ldots$. Given some finite $K$, we seek
$u(t;\epsilon)$ in the form
$u(t;\epsilon) = U(T_0, \ldots, T_K; \epsilon)$ so that, by the chain
rule, the ordinary differential equation (22) becomes a partial
differential equation:

$$\sum_{j=0}^{K} \sum_{k=0}^{K} \epsilon^{j+k} \frac{\partial^2 U}{\partial T_j\,\partial T_k} + \omega_0^2 U = \epsilon U^3.$$

Into this equation we now introduce an expansion analogous to (23):

$$U(T_0, \ldots, T_K; \epsilon) \sim \sum_{n=0}^{\infty} \epsilon^n U_n(T_0, \ldots, T_K), \quad \epsilon \to 0.$$

The terms proportional to $\epsilon^0$ are

$$\frac{\partial^2 U_0}{\partial T_0^2} + \omega_0^2 U_0 = 0,$$

the general solution of which is
$U_0 = A\cos(\omega_0 T_0) + B\sin(\omega_0 T_0)$, where $A$ and $B$
are undetermined functions of the “slow times” $T_1, \ldots, T_K$.
The terms proportional to $\epsilon^1$ are then

$$\frac{\partial^2 U_1}{\partial T_0^2} + \omega_0^2 U_1 = U_0^3 - 2\frac{\partial^2 U_0}{\partial T_0\,\partial T_1}. \tag{25}$$

Comparing with (24), the final term on the right-hand side is the new
contribution from the method of multiple scales. It contains only
first harmonics:

$$\frac{\partial^2 U_0}{\partial T_0\,\partial T_1} = -\omega_0 \frac{\partial A}{\partial T_1}\sin(\omega_0 T_0) + \omega_0 \frac{\partial B}{\partial T_1}\cos(\omega_0 T_0).$$

Therefore, if the dependence on $T_1$ of the coefficients $A$ and $B$
is selected so that

$$\begin{aligned}
2\omega_0 \frac{\partial A}{\partial T_1} + \frac{3}{4}B(A^2 + B^2) &= 0, \\
-2\omega_0 \frac{\partial B}{\partial T_1} + \frac{3}{4}A(A^2 + B^2) &= 0,
\end{aligned} \tag{26}$$

then the resonant forcing terms that are proportional to
$\cos(\omega_0 T_0)$ and $\sin(\omega_0 T_0)$ and that are
responsible for the secular response in $U_1$ will be removed from
the right-hand side of (25), and consequently $U_1$ will now be a
periodic function of $T_0$. The dependence of $A$ and $B$ on longer
timescales $T_2, T_3, \ldots, T_K$ is determined similarly, order by
order, so that $U_2, U_3, \ldots, U_K$ are periodic functions of
$T_0$. Once this series has been constructed, the dependence on $t$
may be restored by the substitution $T_k = \epsilon^k t$. This
procedure produces an asymptotic expansion that is uniformly
consistent with the ordering relation (10) for times $t$ that satisfy
$t = O(\epsilon^{-K})$ as $\epsilon \to 0$.

The interpretation of the system (26) is that, as expected, the
frequency of oscillation depends on the amplitude in nonlinear
systems. Indeed, from (26) it is easy to check that the squared
amplitude $A^2 + B^2$ is independent of $T_1$, and hence $A$ and $B$
undergo simple harmonic motion with respect to $T_1$ of
amplitude-dependent frequency
$\omega_1 := -3(A^2 + B^2)/(8\omega_0)$. By standard trigonometric
identities it then follows that $U_0$ is a sinusoidal oscillation of
frequency $\omega_0 + \epsilon\omega_1 + O(\epsilon^2)$, where
$\omega_1$ depends on the amplitude (determined from initial
conditions).

3.2.3 Matching of Asymptotic Expansions


Another set of methods for dealing with singularly perturbed
differential equations involves using the principle of dominant
balance to isolate different domains of the independent variables in
which different asymptotic expansions apply and then “matching” the
expansions together to produce an approximate solution that is
uniformly accurate with respect to the independent variable. A
typical context is a singularly perturbed boundary-value problem such
as

$$\epsilon u'' + u' + fu = 0, \qquad u(0;\epsilon) = u(1;\epsilon) = 1. \tag{27}$$

Here $f = f(x)$ is given, and $u(x;\epsilon)$ is desired for
$\epsilon$ sufficiently small (in which case one can prove that this
problem has a unique solution). For the differential equation itself,
one dominant balance is that between $u'$ and $fu$. The expansion
based on this balance is just the power-series expansion

$$u(x;\epsilon) \sim \sum_{n=0}^{\infty} \epsilon^n u_n(x), \quad \epsilon \to 0, \tag{28}$$

and the leading term $u_0(x)$ satisfies the limiting equation
$u_0'(x) + f(x)u_0(x) = 0$, which has the general solution

$$u_0(x) = C_0 \exp\left( \int_x^1 f(\xi)\,d\xi \right). \tag{29}$$

Given $u_0(x)$, the procedure can be continued in the usual way to
obtain successively the coefficients $u_n(x)$. At each order, one
additional arbitrary constant $C_n$ is generated.

The expansion (28) is insufficient to solve the boundary-value
problem (27) because at each order there are two boundary conditions
imposed on each of the functions $u_n(x)$ but there is only one
constant available to satisfy them. To seek additional expansions, we
can introduce scalings of the independent variable to suggest other
dominant balances. For example, if we set $x = \epsilon^p y$ and
write $v(y;\epsilon) = u(x;\epsilon)$, then the differential equation
becomes

$$\epsilon^{1-2p} v''(y;\epsilon) + \epsilon^{-p} v'(y;\epsilon) + f(\epsilon^p y)v(y;\epsilon) = 0.$$

The first two terms balance if $p = 1$, and this is a dominant
balance if $f$ is continuous on $[0,1]$ and hence bounded. Moreover,
if $f$ is smooth, $f(\epsilon y)$ can be expanded in Taylor series
for small $\epsilon$, and it becomes clear that this dominant balance
leads to an expansion of $v$ of the form

$$v(y;\epsilon) \sim \sum_{n=0}^{\infty} \epsilon^n v_n(y), \quad \epsilon \to 0. \tag{30}$$

Here the leading term $v_0(y)$ will satisfy the limiting equation
$v_0''(y) + v_0'(y) = 0$, which has the general solution

$$v_0(y) = A_0 e^{-y} + B_0, \tag{31}$$

and the standard perturbation procedure allows one to calculate
$v_n(y)$ order by order, introducing two new integration constants
each time.

The expansion (28) turns out to be a good approximation for
$u(x;\epsilon)$ for $x$ in intervals of the form $[\delta, 1]$ for
$\delta > 0$, as long as the constants are chosen to satisfy the
boundary condition at $x = 1$: $u_0(1) = 1$ and $u_n(1) = 0$ for
$n \ge 1$. This expansion therefore holds in “most” of the domain; in
fluid dynamics problems this corresponds to flow in regions away from
a problematic boundary, the so-called outer flow. Hence (28) is
called an outer expansion.

On the other hand, the expansion (30) provides a good approximation
for $u(x;\epsilon) = v(x/\epsilon;\epsilon)$ in the “boundary layer”
near $x = 0$ of thickness proportional to $\epsilon$, provided that

(i) the integration constants are chosen to satisfy the boundary
condition at $x = y = 0$, which forces $v_0(0) = 1$ and $v_n(0) = 0$
for $n \ge 1$; and

(ii) the remaining constant at each order is determined so that the
inner expansion (30) is compatible with the outer expansion (28) in
some common “overlap domain” of $x$-values.

Choosing the constants to satisfy compatibility of the expansions is
called matching of asymptotic expansions. In general, matching
involves choosing some intermediate scale on which $y$ is large while
$x$ is small as $\epsilon \to 0$; for example, fixing $z > 0$, we
could set $x = \epsilon^{1/2}z$ and then
$y = x/\epsilon = \epsilon^{-1/2}z$. With this substitution one
writes both inner and outer expansions in terms of the common
independent variable $z$ and re-expands both with respect to a
suitable asymptotic sequence of functions of $\epsilon$. Equating
these expansions term-by-term then yields relations among the
constants in the two expansions. The common expansion with $z$ fixed
is sometimes called an intermediate expansion.

In the problem at hand, to satisfy the boundary condition at $x = 1$
we should choose $C_0 = 1$ in (29), while to satisfy $v_0(0) = 1$ we
require $A_0 + B_0 = 1$ in (31). The matching condition at this
leading order reads $u_0(x = 0) = v_0(y = +\infty)$, which implies
that $B_0 = \exp\left(\int_0^1 f(\xi)\,d\xi\right)$. The constants
$A_0$, $B_0$, and $C_0$ have clearly thus been determined by a
combination of the two imposed boundary conditions and an asymptotic
matching condition. The procedure can be continued to arbitrarily
high order in $\epsilon$.

Successful matching of asymptotic expansions for 
boundary-layer problems yields two different expan- 
sions that are valid in different parts of the physical 
domain. For some purposes it is useful to have a single 
approximation that is uniformly valid over the whole 
domain. Matching again plays a role here, as the cor- 
rect formula for the uniformly valid approximation is 
the sum of corresponding partial sums of the inner and 
outer expansions, minus the corresponding terms of 
the intermediate expansion (which would otherwise be 
counted twice, it turns out). 
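The leading-order version of this recipe can be checked on a case with a closed-form solution. The sketch below assumes, purely for illustration, the constant coefficient $f \equiv 1$ in (27); the function names are likewise hypothetical. Then the outer term is $e^{1-x}$, matching gives $B_0 = e$ and $A_0 = 1 - e$, the common (intermediate) part is $B_0$, and the leading-order uniformly valid approximation is $e^{1-x} + (1-e)e^{-x/\epsilon}$. The exact solution is a combination of two exponentials whose rates are the roots of $\epsilon r^2 + r + 1 = 0$.

```python
import math

def exact(x, eps):
    """Exact solution of eps u'' + u' + u = 0 with u(0) = u(1) = 1 (f = 1 assumed)."""
    s = math.sqrt(1.0 - 4.0 * eps)
    r1 = (-1.0 + s) / (2.0 * eps)   # slow root, close to -1
    r2 = (-1.0 - s) / (2.0 * eps)   # fast root, close to -1/eps
    e1, e2 = math.exp(r1), math.exp(r2)
    c2 = (e1 - 1.0) / (e1 - e2)     # from c1 + c2 = 1 and c1 e1 + c2 e2 = 1
    c1 = 1.0 - c2
    return c1 * math.exp(r1 * x) + c2 * math.exp(r2 * x)

def composite(x, eps):
    """Outer term + inner term - common part, all at leading order."""
    outer = math.exp(1.0 - x)                       # u_0 with C_0 = 1
    B0 = math.e                                     # from matching
    inner = (1.0 - B0) * math.exp(-x / eps) + B0    # v_0 with A_0 = 1 - B_0
    return outer + inner - B0

def max_error(eps, n=2000):
    return max(abs(exact(i / n, eps) - composite(i / n, eps)) for i in range(n + 1))

if __name__ == "__main__":
    for eps in (0.01, 0.001):
        print(eps, max_error(eps))
```

The maximum error over $[0,1]$ should shrink roughly in proportion to $\epsilon$, as expected of a leading-order approximation that is uniformly valid across the boundary layer and the outer region.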

Another application of matched asymptotic expan- 
sions is to problems involving periodic behavior that 
is alternately dominated by “fast” and “slow” dynamics,
so-called relaxation oscillations. The slow parts of
the cycle correspond to outer asymptotic expansions, 
and the rapid parts of the cycle are analyzed by rescal- 
ing and dominant balance arguments and resemble 
the inner expansions from boundary-layer problems. 
Matching is required to enforce the periodicity of the 
solution. 
IV.6 Calculus of Variations

Irene Fonseca and Giovanni Leoni 


1 History 

The calculus of variations is a branch of mathematical analysis that
studies extrema and critical points of functionals (or energies).
Here, by functional we mean a mapping from a function space to the
real numbers.

One of the first questions that may be framed within 
this theory is Dido’s isoperimetric problem (see sec- 
tion 2.3), finding the shape of a curve of prescribed 
perimeter that maximizes the area enclosed. Dido was a 
Phoenician princess who emigrated to North Africa and 
upon arrival obtained from the native chief as much ter- 
ritory as she could enclose with an ox hide. She cut the 
hide into a long strip and used it to delineate the ter- 
ritory later known as Carthage, bounded by a straight 
coastal line and a semicircle. 

It is commonly accepted that the systematic develop- 
ment of the theory of the calculus of variations began 
with the brachistochrone curve problem proposed by 
Johann Bernoulli in 1696. Consider two points A and 
B on the same vertical plane but on different vertical 
lines. Assume that A is higher than B and that a parti- 
cle M is moving from A to B along a curve and under the 
action of gravity. The curve that minimizes the time it 
takes M to travel between A and B is called the brachis- 
tochrone. The solution to this problem required the 
use of infinitesimal calculus and was later found by 
Jacob Bernoulli, Newton, Leibniz, and de l’Hôpital. The
arguments thus developed led to the development of 
the foundations of the calculus of variations by Euler. 
Important contributions to the subject are attributed 
to Dirichlet, Hilbert, Lebesgue, Riemann, Tonelli, and 
Weierstrass, among many others. 

The common feature underlying Dido’s isoperimetric problem and the
brachistochrone curve problem is that one seeks to maximize or
minimize a functional over a class of competitors satisfying given
constraints. In both cases the functional is given by an integral of
a density that depends on an underlying field and some of its
derivatives, and this will be the prototype we will adopt in what
follows. To be precise, we consider a functional

$$u \in X \mapsto F(u) := \int_\Omega f(x, u(x), \nabla u(x))\,dx, \tag{1}$$

where $X$ is a function space (usually an $L^p$ space or a
Sobolev-type space), $u\colon \Omega \to \mathbb{R}^d$, with
$\Omega \subset \mathbb{R}^N$ an open set, $N$ and $d$ are positive
integers, and the density is a function $f(x,u,\xi)$, with
$(x,u,\xi) \in \Omega \times \mathbb{R}^d \times \mathbb{R}^{d\times N}$.
Here and in what follows, $\nabla u$ stands for the $d \times N$
matrix-valued distributional derivative of $u$.

The calculus of variations is a vast theory, so here we 
choose to highlight only some contemporary aspects of 
the field. We conclude the article by mentioning a few 


areas that are at the forefront of application and that 
are driving current research. 

2 Extrema 

In this section we address fundamental minimization problems and
relevant techniques in the calculus of variations. In geometry, the
simplest example is the problem of finding the curve of shortest
length connecting two points: a geodesic. A (continuous) curve
joining two points $A, B \in \mathbb{R}^d$ is represented by a
(continuous) function $\gamma\colon [0,1] \to \mathbb{R}^d$ such that
$\gamma(0) = A$, $\gamma(1) = B$, and its length is given by

$$L(\gamma) := \sup\left\{ \sum_{i=1}^{n} |\gamma(t_i) - \gamma(t_{i-1})| \right\},$$

where the supremum is taken over all partitions
$0 = t_0 < t_1 < \cdots < t_n = 1$, $n \in \mathbb{N}$, of the
interval $[0,1]$. If $\gamma$ is smooth, then
$L(\gamma) = \int_0^1 |\gamma'(t)|\,dt$. In the absence of
constraints, the geodesic is the straight segment with endpoints $A$
and $B$, and so $L(\gamma) = |A - B|$, where $|A - B|$ stands for the
magnitude (or length) of the vector $0A - 0B$ with $0$ being the
origin. In applications the curves are often restricted to lie on a
given manifold, e.g., a sphere (in this case, the geodesic is the
shortest great-circle arc joining $A$ and $B$).
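The definition of length as a supremum over partitions can be explored numerically. The sketch below, with hypothetical function names and example curves chosen here, evaluates the polygonal sum for uniform partitions only, so each value is a lower bound for $L(\gamma)$ that improves as the partition is refined.

```python
import math

def polygonal_length(gamma, n):
    """Sum of |gamma(t_i) - gamma(t_{i-1})| over the uniform partition t_i = i/n."""
    pts = [gamma(i / n) for i in range(n + 1)]
    return sum(math.dist(p, q) for p, q in zip(pts, pts[1:]))

def semicircle(t):
    """Unit semicircle in the plane from (1, 0) to (-1, 0); its length is pi."""
    return (math.cos(math.pi * t), math.sin(math.pi * t))

def segment(t):
    """Straight segment from A = (0, 0, 0) to B = (1, 2, 2); |A - B| = 3."""
    return (t, 2.0 * t, 2.0 * t)

if __name__ == "__main__":
    for n in (4, 16, 64, 1024):
        print(n, polygonal_length(semicircle, n), polygonal_length(segment, n))
```

For the straight segment every partition already gives $|A - B|$, whereas for the semicircle the sums increase toward $\pi$ from below, in line with the statement that the segment is the unconstrained geodesic.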

2.1 Minimal Surfaces 

A minimal surface is a surface of least area among all those bounded
by a given closed curve. The problem of finding minimal surfaces,
called the Plateau problem, was first solved in three dimensions in
the 1930s by Douglas and, separately, by Rado, and then in the 1960s
several authors, including Almgren, De Giorgi, Fleming, and Federer,
addressed it using geometric measure-theoretical tools. This approach
gives existence of solutions in a “weak sense,” and establishing
their regularity is significantly more involved. De Giorgi proved
that minimal surfaces are analytic except on a singular set of
dimension at most $N - 1$. Later, Federer, based on earlier results
by Almgren and Simons, improved the dimension of the singular set to
$N - 8$. The sharpness of this estimate was confirmed with an example
by Bombieri, De Giorgi, and Giusti.

Important minimal surfaces are the nonparametric minimal surfaces,
which are given as graphs of real-valued functions. To be precise,
given an open set $\Omega \subset \mathbb{R}^N$ and a smooth function
$u\colon \Omega \to \mathbb{R}$, the area of the graph of $u$,
$\{(x, u(x)) : x \in \Omega\}$, is given by

$$F(u) := \int_\Omega \sqrt{1 + |\nabla u|^2}\,dx. \tag{2}$$
It can be shown that $u$ minimizes the area of its graph subject to
prescribed values on the boundary of $\Omega$ if

$$\operatorname{div}\left( \frac{\nabla u}{\sqrt{1 + |\nabla u|^2}} \right) = 0 \quad \text{in } \Omega.$$

2.2 The Willmore Functional

Many smooth surfaces, including tori, have recently been obtained as
minima or critical points of certain geometrical functionals in the
calculus of variations. An important example is the Willmore (or
bending) energy of a compact surface $S$ embedded in $\mathbb{R}^3$,
namely the surface integral $\mathcal{W}(S) := \int_S H^2\,d\sigma$,
where $H := \tfrac{1}{2}(k_1 + k_2)$ and $k_1$ and $k_2$ are the
principal curvatures of $S$. This energy has a wide scope of
applications, ranging from materials science (e.g., elastic shells,
bending energy) to mathematical biology (e.g., cell membranes) to
image segmentation in computer vision (e.g., staircasing).

Critical points of $\mathcal{W}$ are called Willmore surfaces and
satisfy the Euler-Lagrange equation

$$\Delta_S H + 2H(H^2 - K) = 0,$$

where $K := k_1 k_2$ is the Gaussian curvature and $\Delta_S$ is the
Laplace-Beltrami operator.

In the 1920s it was shown by Blaschke and, separately, by Thomsen
that the Willmore energy is invariant under conformal transformations
of $\mathbb{R}^3$. Also, the Willmore energy is minimized by spheres,
with resulting energy value $4\pi$. Therefore,
$\mathcal{W}(S) - 4\pi$ describes how much $S$ differs from a sphere
in terms of its bending. The problem of minimizing the Willmore
energy among the class of embedded tori $T$ was proposed by Willmore,
who conjectured in 1965 that $\mathcal{W}(T) \ge 2\pi^2$. This
conjecture was proved by Marques and Neves in 2012.

2.3 Isoperimetric Problems and the Wulff Shape

The understanding of the surface structure of crystals plays a
central role in many fields of physics, chemistry, and materials
science. If the dimension of the crystals is sufficiently small, then
the leading morphological mechanism is driven by the minimization of
surface energy. Since the work of Herring in the 1950s, a classical
problem in this field is to determine the crystalline shape that has
the smallest surface energy for a given volume. To be precise, we
seek to minimize the surface integral

(3)

over all smooth sets $E \subset \mathbb{R}^N$ with prescribed volume
and where $\nu(x)$ is the outward unit normal to $\partial E$ at $x$.
The right variational framework for this problem is within the class
of sets of finite perimeter. The solution, which exists and is unique
up to translations, is called the Wulff shape. A key ingredient in
the proof is the Brunn-Minkowski inequality

$$(\mathcal{L}^N(A))^{1/N} + (\mathcal{L}^N(B))^{1/N} \le (\mathcal{L}^N(A + B))^{1/N}, \tag{4}$$

which holds for all Lebesgue measurable sets
$A, B \subset \mathbb{R}^N$ such that $A + B$ is also Lebesgue
measurable. Here, $\mathcal{L}^N$ stands for the $N$-dimensional
Lebesgue measure.

3 The Euler-Lagrange Equation

Consider the functional (1), in the scalar case $d = 1$ and where $f$
is of class $C^1$ and $X$ is the Sobolev space
$X = W^{1,p}(\Omega)$, $1 \le p \le +\infty$, of all functions
$u \in L^p(\Omega)$ whose distributional gradient $\nabla u$ belongs
to $L^p(\Omega; \mathbb{R}^N)$. Let $u \in X$ be a local minimizer of
the functional $F$; that is,

$$\int_U f(x, u(x), \nabla u(x))\,dx \le \int_U f(x, v(x), \nabla v(x))\,dx$$

for every open subset $U$ compactly contained in $\Omega$, and for
all $v$ such that $u - v \in W_0^{1,p}(U)$, where $W_0^{1,p}(U)$ is
the space of all functions in $W^{1,p}(U)$ that “vanish” on the
boundary $\partial U$. Note that $v$ will then coincide with $u$
outside the set $U$. If $\varphi \in C_c^1(\Omega)$, then
$u + t\varphi$, $t \in \mathbb{R}$, are admissible, and thus

$$t \mapsto g(t) := F(u + t\varphi)$$

has a minimum at $t = 0$. Therefore, under appropriate growth
conditions on $f$, we have that $g'(0) = 0$, i.e.,

$$\int_\Omega \left( \sum_{i=1}^{N} \frac{\partial f}{\partial \xi_i}(x, u, \nabla u)\frac{\partial \varphi}{\partial x_i} + \frac{\partial f}{\partial u}(x, u, \nabla u)\varphi \right) dx = 0. \tag{5}$$

A function $u \in X$ satisfying (5) is said to be a weak solution of
the Euler-Lagrange equation associated to (1).

Under suitable regularity conditions on $f$ and $u$, (5) can be
written in the strong form

$$\operatorname{div}(\nabla_\xi f(x, u, \nabla u)) = \frac{\partial f}{\partial u}(x, u, \nabla u), \tag{6}$$

where $\nabla_\xi f(x, u, \xi)$ is the gradient of the function
$f(x, u, \cdot)$.

In the vectorial case $d > 1$, the same argument leads to a system of
partial differential equations (PDEs) in place of (5).
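A one-dimensional example makes the vanishing of the first variation concrete. The sketch below assumes, for illustration only, the density $f(x,u,\xi) = \tfrac{1}{2}\xi^2 - u$ on $(0,1)$ with zero boundary values, for which the strong form reads $u'' = -1$ and is solved by $u(x) = x(1-x)/2$. The grid discretization, the test function $\sin(\pi x)$, and the function names are choices made here.

```python
import math

def energy(u, h):
    """Discrete version of F(u) = integral over (0,1) of (u'^2 / 2 - u) dx."""
    total = 0.0
    for i in range(len(u) - 1):
        du = (u[i + 1] - u[i]) / h              # slope on the cell
        total += h * (0.5 * du * du - 0.5 * (u[i] + u[i + 1]))
    return total

def first_variation(u, phi, h, t=1e-3):
    """Central-difference estimate of g'(0), where g(t) = F(u + t*phi)."""
    plus = energy([a + t * b for a, b in zip(u, phi)], h)
    minus = energy([a - t * b for a, b in zip(u, phi)], h)
    return (plus - minus) / (2.0 * t)

n = 200
h = 1.0 / n
xs = [i * h for i in range(n + 1)]
u_star = [x * (1.0 - x) / 2.0 for x in xs]      # solves u'' = -1, u(0) = u(1) = 0
w = [x * (1.0 - x) for x in xs]                 # same boundary values, not a solution
phi = [math.sin(math.pi * x) for x in xs]       # test function vanishing at 0 and 1

if __name__ == "__main__":
    print(first_variation(u_star, phi, h))      # close to zero
    print(first_variation(w, phi, h))           # clearly nonzero
```

The first variation vanishes (up to rounding) at the solution of the Euler-Lagrange equation and not at the competitor, and adding a multiple of the test function to the solution raises the discrete energy.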
4 Variational Inequalities and Free-Boundary
and Free-Discontinuity Problems

We now add a constraint to the minimization problem considered in the
previous section. To be precise, let $d = 1$ and let $\phi$ be a
function in $\Omega$. If $u$ is a local minimizer of (1) among all
functions $v \in W^{1,p}(\Omega)$ subject to the constraint
$v \ge \phi$ in $\Omega$, then the variation $u + t\varphi$ is
admissible if $\varphi \ge 0$ and $t \ge 0$. Therefore, the function
$g$ satisfies $g'(0) \ge 0$, and the Euler-Lagrange equation (5)
becomes the variational inequality

$$\int_\Omega \left( \sum_{i=1}^{N} \frac{\partial f}{\partial \xi_i}(x, u, \nabla u)\frac{\partial \varphi}{\partial x_i} + \frac{\partial f}{\partial u}(x, u, \nabla u)\varphi \right) dx \ge 0$$

for all nonnegative $\varphi \in C_c^1(\Omega)$. This is called the
obstacle problem, and the coincidence set $\{u = \phi\}$ is not known
a priori and is called the free boundary. This is an example of a
broad class of variational inequalities and free boundary problems
that have applications in a variety of contexts, including the
modeling of the melting of ice (the Stefan problem), lubrication, and
the filtration of a liquid through a porous medium.
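A small discrete obstacle problem shows how the coincidence set emerges from the computation rather than being prescribed. The sketch below is an illustration under assumptions made here: the Dirichlet density $f = \tfrac{1}{2}\xi^2$ on $(0,1)$, zero boundary values, a parabolic obstacle, and a projected Gauss-Seidel iteration, which is a standard numerical technique and not something described in the text.

```python
def solve_obstacle(obstacle, n=50, sweeps=5000):
    """Minimize the discrete Dirichlet energy subject to u >= obstacle, u(0) = u(1) = 0.
    Each sweep replaces u_i by the neighbor average, then projects onto u_i >= obstacle_i."""
    h = 1.0 / n
    xs = [i * h for i in range(n + 1)]
    ob = [obstacle(x) for x in xs]
    u = [0.0] * (n + 1)
    for i in range(1, n):
        u[i] = max(0.0, ob[i])                  # feasible starting guess
    for _ in range(sweeps):
        for i in range(1, n):
            u[i] = max(ob[i], 0.5 * (u[i - 1] + u[i + 1]))
    return xs, ob, u

def parabola(x):
    """Illustrative obstacle: positive near x = 1/2, negative at both endpoints."""
    return 0.5 - 8.0 * (x - 0.5) ** 2

def residuals(u):
    """Discrete analogue of -u'' (unscaled): 2 u_i - u_{i-1} - u_{i+1}."""
    return [2.0 * u[i] - u[i - 1] - u[i + 1] for i in range(1, len(u) - 1)]

if __name__ == "__main__":
    xs, ob, u = solve_obstacle(parabola)
    contact = [x for x, a, b in zip(xs, u, ob) if a - b < 1e-12]
    print("coincidence set spans", contact[0], "to", contact[-1])
```

In this discrete setting the variational inequality becomes the statement that the residual is nonnegative everywhere and vanishes wherever the solution lies strictly above the obstacle.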

A related class of minimization problems in which 
the unknowns are both an underlying field u and a sub- 
set E of Q is the class of free discontinuity problems 
that are characterized by the competition between a 
volume energy of the type (1) and a surface energy, e.g., 
as in (3). Important examples are in the study of liquid 
crystals, the optimal design of composite materials in 
continuum mechanics (see section 13.3), and image 
segmentation in computer vision (see section 13.4). 

5 Lagrange Multipliers 

The method of Lagrange multipliers in Banach spaces is used to find
extrema of functionals $G\colon X \to \mathbb{R}$ subject to a
constraint

$$\{x \in X : \Psi(x) = 0\}, \tag{7}$$

where $\Psi\colon X \to Y$ is another functional and $X$ and $Y$ are
Banach spaces. It can be shown that if $G$ and $\Psi$ are of class
$C^1$ and $u \in X$ is an extremum of $G$ subject to (7), and if the
derivative $D\Psi(u)\colon X \to Y$ is surjective, then there exists
a continuous, linear functional $\lambda\colon Y \to \mathbb{R}$ such
that

$$DG(u) + \lambda \circ D\Psi(u) = 0, \tag{8}$$

where $\circ$ stands for the composition operator between functions.
The functional $\lambda$ is called a Lagrange multiplier.

In the special case in which $Y = \mathbb{R}$, $\lambda$ may be
identified with a scalar, still denoted by $\lambda$, and (8) takes
the familiar form

$$DG(u) + \lambda D\Psi(u) = 0.$$

Therefore, candidates for extrema may be found among all critical
points of the family of functionals $G + \lambda\Psi$,
$\lambda \in \mathbb{R}$.

If $G$ has the form (1) and $X = W^{1,p}(\Omega; \mathbb{R}^d)$,
$1 \le p \le +\infty$, then typical examples of $\Psi$ are

$$\Psi(u) := \int_\Omega |u|^s\,dx - c_1 \quad\text{or}\quad \Psi(u) := \int_\Omega u\,dx - c_2$$

for some constants $c_1 \in \mathbb{R}$, $c_2 \in \mathbb{R}^d$, and
$1 \le s < +\infty$.
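In finite dimensions the scalar case of (8) reduces to the familiar multiplier rule, which can be checked by hand. The sketch below uses a toy problem chosen here for illustration, with $X = \mathbb{R}^2$, $Y = \mathbb{R}$, $G(x) = x_1^2 + 2x_2^2$, and $\Psi(x) = x_1 + x_2 - 1$; none of these appear in the text. Solving $DG + \lambda D\Psi = 0$ together with the constraint gives $\lambda = -4/3$ and $x = (2/3, 1/3)$.

```python
def G(x):
    return x[0] ** 2 + 2.0 * x[1] ** 2

def DG(x):
    return (2.0 * x[0], 4.0 * x[1])

def Psi(x):
    return x[0] + x[1] - 1.0

def DPsi(x):
    return (1.0, 1.0)

# Stationarity 2 x1 + lam = 0 and 4 x2 + lam = 0 give x1 = -lam/2, x2 = -lam/4;
# the constraint x1 + x2 = 1 then fixes lam = -4/3.
lam = -4.0 / 3.0
x_star = (-lam / 2.0, -lam / 4.0)

def stationarity_residual(x, lam):
    """Components of DG(x) + lam * DPsi(x)."""
    return tuple(g + lam * p for g, p in zip(DG(x), DPsi(x)))

if __name__ == "__main__":
    print(x_star, Psi(x_star), stationarity_residual(x_star, lam))
    # Compare G along the constraint line x = (s, 1 - s).
    print(min(G((s / 100.0, 1.0 - s / 100.0)) for s in range(-200, 301)), G(x_star))
```

Sampling $G$ along the constraint line confirms that the critical point of $G + \lambda\Psi$ found this way is the constrained minimizer in this example.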

6 Minimax Methods 

Minimax methods are used to establish the existence of saddle points
of the functional (1), i.e., critical points that are not extrema.
More generally, for $C^1$ functionals $G\colon X \to \mathbb{R}$,
where $X$ is an infinite-dimensional Banach space, as introduced in
section 5, the Palais-Smale compactness condition (hereafter simply
referred to as the PS condition) plays the role of compactness in the
finite-dimensional case. To be precise, $G$ satisfies the PS
condition if whenever $\{u_n\} \subset X$ is such that $\{G(u_n)\}$
is a bounded sequence in $\mathbb{R}$ and $DG(u_n) \to 0$ in the dual
$X'$ of $X$, then $\{u_n\}$ admits a convergent subsequence.

An important result for the existence of saddle points that uses the
PS condition is the mountain pass lemma of Ambrosetti and Rabinowitz,
which states that, if $G$ satisfies the PS condition, if $G(0) = 0$,
and if there are $r > 0$ and $u_0 \in X \setminus B(0,r)$ such that

$$\inf_{\partial B(0,r)} G > 0 \quad\text{and}\quad G(u_0) \le 0,$$

then

$$\inf_{\gamma \in C} \sup_{u \in \gamma} G(u)$$

is a critical value, where $C$ is the set of all continuous curves
from $[0,1]$ into $X$ joining $0$ to $u_0$.

In addition, minimax methods can be used to prove 
the existence of multiple critical points of functionals 
G that satisfy certain symmetry properties, for exam- 
ple, the generalization of the result by Ljusternik and 
Schnirelmann for symmetric functions to the infinite- 
dimensional case. 

7 Lower Semicontinuity 
7.1 The Direct Method 

The direct method in the calculus of variations provides conditions
on the function space $X$ and on a functional $G$, as introduced in
section 5, that guarantee the existence of minimizers of $G$. The
method consists of the following steps.

Step 1. Consider a minimizing sequence $\{u_n\} \subset X$, i.e.,
$\lim_{n\to\infty} G(u_n) = \inf_{u \in X} G(u)$.

Step 2. Prove that $\{u_n\}$ admits a subsequence $\{u_{n_k}\}$
converging to some $u_0 \in X$ with respect to some (weak) topology
$\tau$ in $X$. When $G$ has an integral representation of the form
(1), this is usually a consequence of a priori coercivity conditions
on the integrand $f$.

Step 3. Establish the sequential lower semicontinuity of $G$ with
respect to $\tau$, i.e., $\liminf_{n\to\infty} G(v_n) \ge G(v)$
whenever the sequence $\{v_n\} \subset X$ converges weakly to
$v \in X$ with respect to $\tau$.

Step 4. Conclude that $u_0$ minimizes $G$. Indeed,

$$\inf_{u \in X} G(u) = \lim_{n\to\infty} G(u_n) = \lim_{k\to\infty} G(u_{n_k}) \ge G(u_0) \ge \inf_{u \in X} G(u).$$

7.2 Integrands: Convex, Polyconvex, Quasiconvex, 
and Rank-One Convex 

In view of step 3 above, it is important to characterize the class of
integrands $f$ in (1) for which the corresponding functional $F$ is
sequentially lower semicontinuous with respect to $\tau$. In the case
in which $X$ is the Sobolev space $W^{1,p}(\Omega; \mathbb{R}^d)$,
$1 \le p \le +\infty$, and $\tau$ is the weak topology (or the
weak-* topology if $p = +\infty$), this is related to convexity-type
properties of $f(x, u, \cdot)$. If $\min\{d, N\} = 1$, then under
appropriate growth and regularity conditions it can be shown that
convexity of $f(x, u, \cdot)$ is necessary and sufficient. More
generally, if $\min\{d, N\} > 1$, then the corresponding condition is
called quasiconvexity; to be precise, $f(x, u, \cdot)$ is said to be
quasiconvex if

$$f(x, u, \xi) \le \int_{(0,1)^N} f(x, u, \xi + \nabla\varphi(y))\,dy$$

for all $\xi \in \mathbb{R}^{d\times N}$ and all
$\varphi \in W_0^{1,\infty}((0,1)^N; \mathbb{R}^d)$, whenever the
right-hand side in this inequality is well defined. Since this
condition is nonlocal, in applications in mechanics one often studies
related classes of integrands, such as polyconvex and rank-one convex
functions, for which there are algebraic criteria.

8 Relaxation 

In most applications, step 3 in section 7.1 fails, and this
leads to an important topic at the core of the calculus
of variations, namely, the introduction of a relaxed, or
effective, energy $\mathcal{G}$ that is related to $G$, as introduced in
section 5, as follows.

(a) $\mathcal{G}$ is sequentially lower semicontinuous with
respect to $\tau$.

(b) $\mathcal{G} \le G$ and $\mathcal{G}$ inherits coercivity properties from $G$.

(c) $\min_{u\in X} \mathcal{G} = \inf_{u\in X} G$.

When $G$ is of the type (1), a central problem is to understand
if $\mathcal{G}$ has an integral form of the type (1) for some
new integrand $h$ and then, if it has, to understand what
the relation between $h$ and the original integrand $f$ is.

If $X = W^{1,p}(\Omega)$, $p \ge 1$, and $\tau$ is the weak topology,
then under appropriate growth and regularity conditions
it can be shown that $h(x,u,\cdot)$ is the convex envelope
of $f(x,u,\cdot)$, i.e., the greatest convex function less
than $f(x,u,\cdot)$. In the vectorial case $d > 1$, the convex
envelope is replaced by a similar notion of quasiconvex
envelope (see section 7.2).
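
The convex envelope can be made concrete in the simplest scalar setting. The sketch below is only an illustration under stated assumptions: the integrand is taken to depend on the gradient variable alone, the double-well example $f(\xi) = (\xi^2 - 1)^2$ is hypothetical (it does not come from the text), and the envelope is approximated by the lower convex hull of sampled values. The function name `lower_convex_envelope` is likewise an illustrative choice.

```python
def lower_convex_envelope(xs, ys):
    """Greatest convex function below the samples (xs, ys), evaluated on xs.

    xs must be strictly increasing. The envelope is the lower convex hull
    of the sample points, filled in by linear interpolation.
    """
    if len(xs) == 1:
        return list(ys)
    hull = []  # indices of the vertices of the lower hull
    for i in range(len(xs)):
        # Drop the last vertex while it lies on or above the chord to point i.
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            cross = ((xs[i1] - xs[i0]) * (ys[i] - ys[i0])
                     - (ys[i1] - ys[i0]) * (xs[i] - xs[i0]))
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    env = [0.0] * len(xs)
    for a, b in zip(hull, hull[1:]):
        for j in range(a, b + 1):
            t = (xs[j] - xs[a]) / (xs[b] - xs[a])
            env[j] = (1 - t) * ys[a] + t * ys[b]
    return env


# Hypothetical double-well integrand sampled on [-2, 2].
xs = [-2.0 + 0.01 * i for i in range(401)]
ys = [(x * x - 1.0) ** 2 for x in xs]
env = lower_convex_envelope(xs, ys)
```

In this example `env[200]`, the value at $\xi = 0$, is approximately zero although $f(0) = 1$: the envelope flattens the region between the two wells and agrees with $f$ outside $[-1, 1]$.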

9 $\Gamma$-Convergence

In physical problems the behavior of a system is often
described in terms of a sequence $\{G_n\}$, $n \in \mathbb{N}$, of energy
functionals $G_n : X \to [-\infty,+\infty]$, where $X$ is a metric
space with a metric $d$. Is it possible to identify a limiting
energy $G_\infty$ that sheds light on the qualitative properties
of this family and that has the property that minimizers
of $G_n$ converge to minimizers of $G_\infty$?

The notion of $\Gamma$-convergence, which was introduced
by De Giorgi, provides a tool for answering these questions.
To motivate this concept with an example, consider
a fluid confined in a container $\Omega \subset \mathbb{R}^N$. Assume
that the total mass of the fluid is $m$, so that admissible
density distributions $u : \Omega \to \mathbb{R}$ satisfy the constraint
$\int_\Omega u(x)\,dx = m$. The total energy is given by the functional
$u \mapsto \int_\Omega W(u(x))\,dx$, where $W : \mathbb{R} \to [0,\infty)$ is the
energy per unit volume. Assume that $W$ supports two
phases $a < b$, that is, $W$ is a double-well potential, with
$\{u \in \mathbb{R} : W(u) = 0\} = \{a,b\}$. Then any density distribution
$u$ that renders the body stable in the sense of
Gibbs is a minimizer of the following problem:

\[
\min\Big\{ \int_\Omega W(u(x))\,dx : \int_\Omega u(x)\,dx = m \Big\}. \tag{$P_0$}
\]

If $\mathcal{L}^N(\Omega) = 1$ and $a < m < b$, then given any measurable
set $E \subset \Omega$ with $\mathcal{L}^N(E) = (b - m)/(b - a)$, the
function $u = a\chi_E + b\chi_{\Omega\setminus E}$ is a solution of problem $(P_0)$.
This lack of uniqueness is due to the fact that interfaces
between the two phases $a$ and $b$ can be formed
without increasing the total energy. The physically preferred
solutions should be the ones that arise as limiting
cases of a theory that penalizes interfacial energy,
so it is expected that these solutions should minimize
the surface area of $\partial E \cap \Omega$.

In the van der Waals-Cahn-Hilliard theory of phase
transitions, the energy depends not only on the density
$u$ but also on its gradient. To be precise, the energy is given by

\[
\int_\Omega W(u(x))\,dx + \varepsilon^2 \int_\Omega |\nabla u(x)|^2\,dx.
\]

Note that the gradient term penalizes rapid changes in
the density $u$, and thus it plays the role of an interfacial
energy. Stable density distributions $u$ are now
solutions of the minimization problem

\[
\min\Big\{ \int_\Omega W(u(x))\,dx + \varepsilon^2 \int_\Omega |\nabla u(x)|^2\,dx \Big\}, \tag{$P_\varepsilon$}
\]

where the minimum is taken over all smooth functions
$u$ satisfying $\int_\Omega u(x)\,dx = m$. In 1983 Gurtin
conjectured that the limits, as $\varepsilon \to 0$, of solutions of
$(P_\varepsilon)$ are solutions of $(P_0)$ with minimal surface area.
Using results of Modica and Mortola, this conjecture
was proved independently by Modica and by Sternberg
in the setting of $\Gamma$-convergence.

The $\Gamma$-limit $G_\infty : X \to [-\infty,+\infty]$ of $\{G_n\}$ with respect
to a metric $d$, when it exists, is defined uniquely by the
following properties.

(i) The lim inf inequality. For every sequence $\{u_n\} \subset X$
converging to $u \in X$ with respect to $d$,

\[
G_\infty(u) \le \liminf_{n\to\infty} G_n(u_n).
\]

(ii) The lim sup inequality. For every $u \in X$ there
exists a sequence $\{u_n\} \subset X$ converging to $u \in X$
with respect to $d$ such that

\[
G_\infty(u) \ge \limsup_{n\to\infty} G_n(u_n).
\]

This notion may be extended to the case in which the
convergence of the sequences is taken with respect to
some weak topology rather than the topology induced
by the metric $d$. In this context, we remark that, when
the sequence $\{G_n\}$ reduces to a single energy functional
$\{G\}$, under appropriate growth and coercivity
assumptions, $G_\infty$ coincides with the relaxed energy $\mathcal{G}$,
as discussed in section 8.

Other important applications of $\Gamma$-convergence include
the Ginzburg-Landau theory for superconductivity
(see section 13.5), homogenization of variational
problems (see section 13.3), dimension reduction problems
in elasticity (see section 13.2), and free-discontinuity
problems in image segmentation in computer
vision (see section 13.4) and in fracture mechanics.


10 Regularity 

Optimal regularity of minimizers and local minimizers
of the energy (1) in the vectorial case $d \ge 2$, and
when $X = W^{1,p}(\Omega;\mathbb{R}^d)$, $1 \le p \le +\infty$, is mostly an open
question. In the scalar case $d = 1$ there is an extensive
body of literature on the regularity of weak solutions
of the Euler-Lagrange equation (5), stemming from a
fundamental result of De Giorgi in the late 1950s that
was independently obtained by Nash. For $d \ge 2$, (local)
minimizers of (1) are not generally everywhere smooth.
On the other hand, under suitable hypotheses on
the integrand $f$, it can be shown that partial regularity
holds, i.e., if $u$ is a local minimizer, then there exists
an open subset $\Omega_0$ of $\Omega$ of full measure such that
$u \in C^{1,\alpha}(\Omega_0;\mathbb{R}^d)$ for some $\alpha \in (0,1)$. Sharp estimates
of $\alpha$ and of the Hausdorff dimension of the singular set
$\Sigma_u := \Omega \setminus \Omega_0$ are still unknown.

11 Symmetrization

Rearrangements of sets preserve their measure while 
modifying their geometry to achieve specific symme- 
tries. In turn, rearrangements of a function u yield 
new functions that have desired symmetry properties 
and that are obtained via suitable rearrangements of 
the $t$-superlevel sets of $u$, $\Omega_t := \{x \in \Omega : u(x) > t\}$.
These tools are used in a variety of contexts, from har- 
monic analysis and PDEs to the spectral theory of dif- 
ferential operators. In the calculus of variations, they 
are often found in the study of extrema of function- 
als of type (1). Among the most common rearrange- 
ments we mention the directional monotone decreas- 
ing rearrangement, the star-shaped rearrangement, the 
directional Steiner symmetrization, the Schwarz sym- 
metrization, the circular and spherical symmetrization, 
and the radial symmetrization. 

Of these, we highlight the Schwarz symmetrization, 
which is the most frequently used in the calculus of 
variations. If $u$ is a nonnegative measurable function
with compact support in $\mathbb{R}^N$, then its Schwarz symmetric
rearrangement is the (unique) spherically symmetric
and decreasing function $u^*$ such that for all
$t > 0$ the $t$-superlevel sets of $u$ and $u^*$ have the same
measure.

When $\Omega = \mathbb{R}^N$, it can be shown that $u^*$ preserves the
$L^p$-norm of $u$ and the regularity of $u$ up to first order;
that is, if $u$ belongs to $W^{1,p}(\mathbb{R}^N)$, then so does $u^*$,
$1 \le p \le +\infty$. Moreover, by the Pólya-Szegő inequality,
$\|\nabla u^*\|_p \le \|\nabla u\|_p$, and we remark that for $p = \infty$
this is obtained using the Brunn-Minkowski inequality
discussed in section 2.3.

Another important inequality relating $u$ and $u^*$ is
the Riesz inequality, and the Faber-Krahn inequality
compares eigenvalues of the Dirichlet problems in $\Omega$
and in $\Omega^*$. Classical applications of rearrangements
include the derivation of the sharp constant in the
Sobolev-Gagliardo-Nirenberg inequality in $W^{1,p}(\mathbb{R}^N)$,
$1 < p < N$, as well as in the Young inequality and the
Hardy-Littlewood-Sobolev inequality.

Finally, we remark that the first and most important 
application of Steiner symmetrization is the isoperi- 
metric property of balls (see Dido’s problem in sec- 
tion 1). 

12 Duality Theory 

Duality theory associates with a minimization problem
$(P)$ a maximization problem $(P^*)$, called the dual problem,
and studies the relation between these two. It has
important applications in several disciplines, includ- 
ing economics and mechanics, and different areas of 
mathematics, such as the calculus of variations, convex 
analysis, and numerical analysis. 

The theory of dual problems is inspired by the notion
of duality in convex analysis and by the Fenchel transform
$f^*$ of a function $f : \mathbb{R}^N \to [-\infty,+\infty]$, defined
as

\[
f^*(\eta) := \sup\{\eta\cdot\xi - f(\xi) : \xi \in \mathbb{R}^N\} \quad \text{for } \eta \in \mathbb{R}^N.
\]
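
As a small numerical illustration of the Fenchel transform, the sketch below approximates $f^*$ in one dimension by maximizing over a finite grid. The choice $f(\xi) = |\xi|^p/p$ with $p = 3$ is an assumption made for the example; for it, the classical closed form $f^*(\eta) = |\eta|^q/q$ with $1/p + 1/q = 1$ is available for comparison. The helper name `fenchel_conjugate` and the grid are hypothetical.

```python
def fenchel_conjugate(f, eta, xi_grid):
    """Approximate f*(eta) = sup_xi (eta*xi - f(xi)) by a maximum over a grid."""
    return max(eta * xi - f(xi) for xi in xi_grid)


p = 3.0
q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1


def f(xi):
    # Hypothetical example integrand |xi|^p / p.
    return abs(xi) ** p / p


xi_grid = [-5.0 + 0.001 * i for i in range(10001)]  # grid on [-5, 5]
eta = 1.7
approx = fenchel_conjugate(f, eta, xi_grid)
exact = abs(eta) ** q / q
```

Because the maximum is taken over a subset of $\mathbb{R}$, `approx` can only underestimate the supremum; the gap shrinks as the grid is refined, provided the maximizer lies inside the grid.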

As an example, consider the minimization problem

\[
\inf\Big\{ \int_\Omega f(\nabla u)\,dx : u \in W_0^{1,p}(\Omega) \Big\}, \tag{$P$}
\]

with $f : \mathbb{R}^N \to \mathbb{R}$. If $f$ satisfies appropriate growth and
convexity conditions, then the dual problem $(P^*)$ is
given by

\[
\sup\Big\{ -\int_\Omega f^*(v(x))\,dx : v \in L^q(\Omega;\mathbb{R}^N),\ \operatorname{div} v = 0 \text{ in } \Omega \Big\},
\]

where $1/p + 1/q = 1$. The latter problem may be simpler
to handle in specific situations, e.g., for nonparametric
minimal surfaces and with $f$ given as in (2), where, due
to lack of coercivity, $(P)$ may not admit a solution in $X$.

13 Some Contemporary Applications 

There is a plethora of applications of the calculus 
of variations. Classical ones include Hamiltonians and 
Lagrangians, the Hamilton-Jacobi equation, conserva- 
tion laws, Noether’s theorem, and optimal control. 


Below we focus on a few contemporary applications 
that are pushing the frontiers of the theory in novel 
directions. 

13.1 Elasticity 

Consider an elastic body that occupies a domain $\Omega \subset \mathbb{R}^3$
in a given reference configuration. The deformations
of the body can be described by maps $u : \Omega \to \mathbb{R}^3$. If
the body is homogeneous, then the total elastic energy
corresponding to $u$ is given by the functional

\[
F(u) := \int_\Omega f(\nabla u(x))\,dx, \tag{9}
\]

where $f$ is the stored-energy density of the material. In
order to prevent interpenetration of matter, the deformations
should be invertible and it should require an
infinite amount of energy to violate this property, i.e.,

\[
f(\xi) \to +\infty \quad \text{as } \det\xi \to 0^+. \tag{10}
\]

Also, $f$ needs to be frame indifferent, i.e.,

\[
f(R\xi) = f(\xi) \tag{11}
\]

for all rotations $R$ and all $\xi \in \mathbb{R}^{3\times 3}$.

Under appropriate coercivity and convex-type conditions
on $f$ (see section 7.2), and under suitable boundary
conditions, it can be shown that $F$ admits a global
minimizer $u_0$. However, the regularity of $u_0$ is still an
open problem, and the Euler-Lagrange equation cannot
therefore be derived (see section 3). In addition, the
existence of local minimizers remains unsolved.

13.2 Dimension Reduction 

An important problem in elasticity is the derivation 
of models for thin structures, such as membranes, 
shells, plates, rods, and beams, from three-dimensional 
elasticity theory. The mathematically rigorous analysis 
was initiated by Acerbi, Buttazzo, and Percivale in the 
1990s for rods, and this was followed by the work of 
Le Dret and Raoult for membranes. Recent contributions
by Friesecke, James, and Müller have allowed us
to handle the physical requirements (10) and (11). The
main tool underlying these works is $\Gamma$-convergence (see
section 9).

To illustrate the deduction in the case of membranes,
consider a thin cylindrical elastic body of thickness
$2\varepsilon > 0$ occupying the reference configuration
$\Omega_\varepsilon := \omega \times (-\varepsilon,\varepsilon)$, with $\omega \subset \mathbb{R}^2$. Using the typical rescaling

\[
(x_1,x_2,x_3) \mapsto (y_1,y_2,y_3) := (x_1,x_2,x_3/\varepsilon),
\]

the deformations $u$ of $\Omega_\varepsilon$ now correspond to deformations
$v$ of the fixed domain $\Omega_1$, through the formula
$v(y_1,y_2,y_3) = u(x_1,x_2,x_3)$. Therefore,

\[
\frac{1}{\varepsilon}\int_{\Omega_\varepsilon} f(\nabla u)\,dx
= \int_{\Omega_1} f\Big(\frac{\partial v}{\partial y_1}, \frac{\partial v}{\partial y_2}, \frac{1}{\varepsilon}\frac{\partial v}{\partial y_3}\Big)\,dy.
\]

The right-hand side of the previous equality yields a
family of functionals to which the theory of $\Gamma$-convergence
is applied.


13.3 Homogenization 

Homogenization theory [II.17] is used to describe
the macroscopic behavior of heterogeneous compos- 
ite materials, which are characterized by having two 
or more finely mixed material components. Composite 
materials have important technological and industrial 
applications as their effective properties are often bet- 
ter than the corresponding properties of the individual 
constituents. The study of these materials falls within 
the so-called multiscale problems, with the two relevant 
scales here being the microscopic scale at the level of the 
heterogeneities and the macroscopic scale describing 
the resulting “homogeneous” material. Mathematically, 
the properties of composite materials can be described 
in terms of PDEs with fast oscillating coefficients or 
in terms of energy functionals that depend on a small 
parameter $\varepsilon$. As an example, consider a material matrix
$A$ with corresponding stored-energy density $f_A$, with
periodically distributed inclusions of another material
$B$ with stored-energy density $f_B$, whose periodicity cell
has side-length $\varepsilon$. The total energy of the composite is
then given by

\[
\int_\Omega \Big[ \Big(1 - \chi\Big(\frac{x}{\varepsilon}\Big)\Big) f_A(\nabla u) + \chi\Big(\frac{x}{\varepsilon}\Big) f_B(\nabla u) \Big]\,dx,
\]

where $\chi$ is the characteristic function of the locus of
material $B$ contained in the unit cube $Q$ of material $A$,
extended periodically to $\mathbb{R}^3$ with period $Q$. The goal
here is to characterize the “homogenized” energy when
$\varepsilon \to 0^+$ using $\Gamma$-convergence (see section 9).


13.4 Computer Vision 

Several problems in computer vision can be treated 
variationally, including image segmentation (e.g., the 
Mumford-Shah and Blake-Zisserman models), image 
morphing, image denoising (e.g., the Perona-Malik 
scheme and the Rudin-Osher-Fatemi total variation 
model), and inpainting (e.g., recolorization). 

The Mumford-Shah model provides a good example
of the use of the calculus of variations to treat free discontinuity
problems. Let $\Omega$ be a rectangle in the plane,
representing the locus of the image, with gray levels
given by a function $g : \Omega \to [0,1]$. We want to find an
approximation of $g$ that is smooth outside a set $K$ of
sharp contours related to the set of discontinuities of
$g$. This leads to the minimization of the functional

\[
\int_{\Omega\setminus K} \big( |\nabla u|^2 + \alpha (u - g)^2 \big)\,dx + \beta\,\mathrm{length}(K \cap \Omega)
\]

over all contour curves $K$ and functions $u \in C^1(\Omega \setminus K)$.
The first term in this energy functional is minimized
when $u$ is constant outside $K$, and it therefore forces $u$
not to vary much outside $K$. The second term is minimized
when $u = g$ outside $K$, and hence $u$ is required
to stay close to the original gray level $g$. The last term is
minimized when $K$ has length as short as possible. The
existence of a minimizing pair $(u,K)$ was established
by De Giorgi, Carriero, and Leaci, with $u$ in a class of
functions larger than $C^1(\Omega \setminus K)$; to be precise, the space
of functions of special bounded variation. The full regularity
of these solutions $u$ and the structure of $K$ remain
an open problem.


13.5 The Ginzburg-Landau Theory for 
Superconductivity 

In the 1950s Ginzburg and Landau proposed a mathe- 
matical theory to study phase transition problems in 
superconductivity; there are similar formulations to 
address problems in superfluids, e.g., helium II, and 
in XY magnetism. In its simplest form, the Ginzburg- 
Landau functional reduces to 

\[
F_\varepsilon(u) := \frac{1}{2}\int_\Omega |\nabla u|^2\,dx + \frac{1}{4\varepsilon^2}\int_\Omega (|u|^2 - 1)^2\,dx,
\]

where $\Omega \subset \mathbb{R}^2$ is a star-shaped domain, the condensate
wave function $u \in W^{1,2}(\Omega;\mathbb{R}^2)$ is an order parameter
with two degrees of freedom, and the parameter $\varepsilon$ is a
(small) characteristic length. Given $g : \partial\Omega \to S^1$, with $S^1$
the unit circle in $\mathbb{R}^2$ centered at the origin, we are interested
in characterizing the limits of minimizers $u_\varepsilon$ of
$F_\varepsilon$ subject to the boundary condition $u_\varepsilon = g$ on $\partial\Omega$.
Under suitable geometric conditions on $g$ (related to
the winding number), Bethuel, Brezis, and Hélein have
shown that there are no limiting functions $u$ in $C(\Omega; S^1)$
that satisfy the boundary condition. Rather, the limiting
functions are smooth outside a finite set of singularities,
called vortices. $\Gamma$-convergence techniques may be
used to study this family of functionals (see section 9).


13.6 Mass Transport 

Mass transportation was introduced by Monge in 1781, 
studied by the Nobel Prize winner Kantorovich in the 
1940s, and revived by Brenier in 1987. Since then, it 
has surfaced in a variety of areas, from economics to
optimization. Given a pile of sand of mass 1 and a hole
of volume 1, we want to fill the hole with the sand while
minimizing the cost of transportation. This problem is
formulated using probability theory as follows. The pile
and the hole are represented by probability measures
$\mu$ and $\nu$, with supports in measurable spaces $X$ and
$Y$, respectively. If $A \subset X$ and $B \subset Y$ are measurable
sets, then $\mu(A)$ measures the amount of sand in $A$ and
$\nu(B)$ measures the amount of sand that can fill $B$. The
cost of transportation is modeled by a measurable cost
function $c : X \times Y \to \mathbb{R} \cup \{+\infty\}$. Kantorovich’s optimal
transportation problem consists of minimizing

\[
\int_{X\times Y} c(x,y)\,d\pi(x,y)
\]

over all probability measures $\pi$ on $X \times Y$ such that
$\pi(A \times Y) = \mu(A)$ and $\pi(X \times B) = \nu(B)$, for all measurable sets
$A \subset X$ and $B \subset Y$. The main problem is to establish the
existence of minimizers and to obtain their characterization.
This depends strongly on the cost function $c$
and on the regularity of the measures $\mu$ and $\nu$. There is
a multitude of applications of this theory, and here we
mention only that it can be used to give a simple proof
of the Brunn-Minkowski inequality (see (4)).
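
A discrete toy version of Kantorovich's problem may help fix ideas. The sketch below assumes that both measures consist of $n$ atoms of equal mass $1/n$ on the real line and that $c(x,y) = (x-y)^2$; these choices, the sample positions, and all function names are hypothetical. For equal-mass atoms the minimum over couplings is attained at a permutation (Birkhoff's theorem), so a brute-force search over permutations is a valid, if inefficient, reference against which the monotone (sorted) matching can be compared.

```python
from itertools import permutations


def transport_cost(xs, ys, perm, cost):
    """Average cost of sending atom xs[i] of the pile to atom ys[perm[i]] of the hole."""
    n = len(xs)
    return sum(cost(xs[i], ys[perm[i]]) for i in range(n)) / n


def brute_force_optimal(xs, ys, cost):
    """Minimize over all permutations (feasible only for very small n)."""
    n = len(xs)
    return min(permutations(range(n)),
               key=lambda perm: transport_cost(xs, ys, perm, cost))


def sorted_coupling(xs, ys):
    """Match the k-th smallest x with the k-th smallest y."""
    n = len(xs)
    x_order = sorted(range(n), key=lambda i: xs[i])
    y_order = sorted(range(n), key=lambda j: ys[j])
    perm = [0] * n
    for i, j in zip(x_order, y_order):
        perm[i] = j
    return tuple(perm)


def quadratic(x, y):
    return (x - y) ** 2


xs = [0.9, 0.1, 0.5, 0.3]  # hypothetical positions of sand atoms
ys = [2.0, 1.2, 1.7, 1.1]  # hypothetical positions of hole atoms
best = brute_force_optimal(xs, ys, quadratic)
mono = sorted_coupling(xs, ys)
```

For this convex cost the two matchings have the same transport cost, which illustrates how strongly the structure of minimizers depends on the cost function $c$.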


13.7 Gradient Flows 

Given a function $h : \mathbb{R}^m \to \mathbb{R}$ of class $C^2$, the gradient
flow of $h$ is the family of maps $S_t : \mathbb{R}^m \to \mathbb{R}^m$, $t \ge 0$,
satisfying the following property. For every $w_0 \in \mathbb{R}^m$,
$S_0(w_0) := w_0$ and the curve $w_t := S_t(w_0)$, $t > 0$, is the
unique $C^1$ solution of the Cauchy problem

\[
\frac{d}{dt} w_t = -\nabla h(w_t) \quad \text{for } t > 0, \qquad \lim_{t\to 0^+} w_t = w_0, \tag{12}
\]

if it exists.

If $D^2 h \ge \alpha I$ for some $\alpha \in \mathbb{R}$, then it can be shown
that the gradient flow exists, that it is unique, and that
it satisfies a semigroup property, i.e.,

\[
S_{t+s}(w_0) = S_t(S_s(w_0)), \qquad \lim_{t\to 0^+} S_t(w_0) = w_0,
\]

for every $w_0 \in \mathbb{R}^m$.

A common way of approximating discretely the solution
of (12) is via the implicit Euler scheme, as follows.
Given a time step $\tau > 0$, consider the partition of
$[0,+\infty)$

\[
\{0 = t_\tau^0 < t_\tau^1 < \cdots < t_\tau^n < \cdots\},
\]

where $t_\tau^n := n\tau$. Define recursively a discrete sequence
$\{W_\tau^n\}$ as follows: assuming that $W_\tau^{n-1}$ has already been
defined, let $W_\tau^n$ be the unique minimizer of the function

\[
w \mapsto \frac{1}{2\tau}|w - W_\tau^{n-1}|^2 + h(w). \tag{13}
\]

Introduce the piecewise-linear function $W_\tau : [0,+\infty) \to \mathbb{R}^m$
given by

\[
W_\tau(t) := \frac{t_\tau^n - t}{\tau}\,W_\tau^{n-1} + \frac{t - t_\tau^{n-1}}{\tau}\,W_\tau^n
\]

for $t \in [t_\tau^{n-1}, t_\tau^n]$. If $W_\tau^0 \to w_0$ as $\tau \to 0^+$, then it can be
shown that $\{W_\tau\}_{\tau>0}$ converges to the solution of (12)
as $\tau \to 0^+$.
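
The convergence of the implicit Euler scheme can be observed in the simplest possible case. The sketch below assumes the scalar quadratic potential $h(w) = \lambda w^2/2$ (an example chosen here, not taken from the text), for which the minimizer of (13) is available in closed form and the exact gradient flow is $w_t = w_0 e^{-\lambda t}$. The function name and parameter values are illustrative.

```python
import math


def minimizing_movement_quadratic(w0, tau, n_steps, lam=1.0):
    """Implicit Euler steps for the assumed potential h(w) = lam * w**2 / 2.

    Each step minimizes w -> (w - w_prev)**2 / (2*tau) + h(w); setting the
    derivative to zero gives w = w_prev / (1 + lam*tau).
    """
    ws = [w0]
    for _ in range(n_steps):
        ws.append(ws[-1] / (1.0 + lam * tau))
    return ws


w0, T = 2.0, 1.0
exact = w0 * math.exp(-T)  # solution of w' = -w, w(0) = w0, at time T
errors = []
for n_steps in (10, 100, 1000):
    tau = T / n_steps
    approx = minimizing_movement_quadratic(w0, tau, n_steps)[-1]
    errors.append(abs(approx - exact))
```

The entries of `errors` decrease roughly in proportion to the time step, consistent with first-order convergence of the discrete values to the solution of (12).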

This approximation scheme, here described for the
finite-dimensional vector space $\mathbb{R}^m$, may be extended
to the case in which $\mathbb{R}^m$ is replaced by an infinite-dimensional
metric space $X$, the function $h$ is replaced
by a functional $G : X \to \mathbb{R}$, and the minimization procedure
in (13) is now a variational minimization problem
of the type addressed in section 7. This method
is known as De Giorgi’s minimizing movements. Important
applications include the study of a large class of
parabolic PDEs.

IV.7 Special Functions

Nico M. Temme 


1 Introduction 

Usually we call a function “special” when, like the loga- 
rithm, the exponential function, and the trigonometric 
functions (the elementary transcendental functions), it 
belongs to the toolbox of the applied mathematician, 
the physicist, the engineer, or the statistician. Each 
function has particular notation associated with it and 
has a great number of known properties. 

The study of special functions is a branch of math- 
ematics with a distinguished history involving great 
names such as Euler, Gauss, Fourier, Legendre, Bessel, 
and Riemann. Much of their work was inspired by prob- 
lems from physics and by the resulting differential 
equations. This activity culminated in the publication 
in 1927 of the standard, and greatly influential, work A 
Course of Modern Analysis by Whittaker and Watson. 

Many other monographs are now available, some that 
contain the formulas without explanation and others 
that explain how the special functions arise in problems 
from physics and statistics. A major project called the 
Digital Library of Mathematical Functions recently cul- 
minated in the NIST Handbook of Mathematical Func- 
tions (a successor to Abramowitz and Stegun’s famed 
Handbook of Mathematical Functions), which is also 
readily accessible online. 

In physics, special functions arise as solutions of the 
linear second-order differential equations that result 
from separating the variables in a partial differential 
equation in some special coordinate system (such as 
spherical or cylindrical). In this way solutions of the 
wave equation, the diffusion equation, and so on are 
written in the form of series or integrals. 

In statistics, special functions arise as cumulative dis- 
tribution functions (gamma and beta distributions, for 
example). In number theory, zeta functions, Dirichlet 
series, and modular forms are used. 

Some topics that fall outside the scope of this article 
are mentioned in the final section. 


2 Bernoulli Numbers, Euler Numbers, 
and Stirling Numbers 


The Bernoulli numbers $B_n$ are defined by the generating
function

\[
\frac{z}{e^z - 1} = \sum_{n=0}^{\infty} \frac{B_n}{n!} z^n, \qquad |z| < 2\pi.
\]

Because the function $z/(e^z - 1) - 1 + \frac{1}{2}z$ is even, all $B_n$
with odd index $n \ge 3$ vanish: $B_{2n+1} = 0$, $n = 1,2,3,\ldots$.
The first nonvanishing numbers are

\[
B_0 = 1, \quad B_1 = -\tfrac{1}{2}, \quad B_2 = \tfrac{1}{6}, \quad B_4 = -\tfrac{1}{30}, \quad B_6 = \tfrac{1}{42}.
\]
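
The values above can be reproduced with exact rational arithmetic. Multiplying the generating function by $(e^z - 1)/z$ and comparing coefficients gives the standard recurrence $\sum_{k=0}^{n} \binom{n+1}{k} B_k = 0$ for $n \ge 1$; the sketch below, whose function name is an illustrative choice, uses it directly.

```python
from fractions import Fraction
from math import comb


def bernoulli_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] with the convention B_1 = -1/2."""
    B = [Fraction(1)]
    for n in range(1, n_max + 1):
        # sum_{k=0}^{n} C(n+1, k) B_k = 0, solved for B_n (note C(n+1, n) = n+1).
        partial = sum(comb(n + 1, k) * B[k] for k in range(n))
        B.append(-partial / (n + 1))
    return B


B = bernoulli_numbers(6)
```

Here `B` contains $1, -\frac12, \frac16, 0, -\frac1{30}, 0, \frac1{42}$, in agreement with the list above.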


The Bernoulli numbers are named after Jakob Bernoulli,
who mentioned them in his posthumous Ars
Conjectandi of 1713. He discussed summae potestatum,
sums of equal powers of the first $n$ integers; for a
nonnegative integer $p$,

\[
\sum_{m=0}^{n-1} m^p = \frac{1}{p+1} \sum_{k=0}^{p} \binom{p+1}{k} B_k\, n^{p+1-k},
\]

where $\binom{n}{k} = n!/[k!(n-k)!]$. The Bernoulli numbers
occur in practically every field of mathematics and particularly
in combinatorial theory, finite-difference calculus,
numerical analysis, analytic number theory, and
probability theory.
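
The power-sum formula is easy to check against a direct summation. The sketch below is self-contained and recomputes the Bernoulli numbers from the standard recurrence $\sum_{k=0}^{n}\binom{n+1}{k}B_k = 0$ ($n \ge 1$); the function names are illustrative.

```python
from fractions import Fraction
from math import comb


def bernoulli_numbers(n_max):
    """Return [B_0, ..., B_{n_max}] with B_1 = -1/2."""
    B = [Fraction(1)]
    for n in range(1, n_max + 1):
        partial = sum(comb(n + 1, k) * B[k] for k in range(n))
        B.append(-partial / (n + 1))
    return B


def power_sum(n, p):
    """Sum of m**p for m = 0, ..., n-1, via the Bernoulli-number formula."""
    B = bernoulli_numbers(p)
    total = sum(comb(p + 1, k) * B[k] * n ** (p + 1 - k) for k in range(p + 1))
    return total / (p + 1)


cubes = power_sum(10, 3)                  # 0^3 + 1^3 + ... + 9^3
direct = sum(m ** 3 for m in range(10))   # same sum, computed term by term
```

Both `cubes` and `direct` equal 2025, the square of the triangular number $1 + 2 + \cdots + 9 = 45$.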

The Bernoulli polynomials are defined by the generating
function

\[
\frac{z e^{xz}}{e^z - 1} = \sum_{n=0}^{\infty} \frac{B_n(x)}{n!} z^n, \qquad |z| < 2\pi.
\]

The first few polynomials are

\[
B_0(x) = 1, \quad B_1(x) = x - \tfrac{1}{2}, \quad B_2(x) = x^2 - x + \tfrac{1}{6}.
\]

For the Euler numbers $E_n$, we have the generating
function

\[
\frac{1}{\cosh z} = \frac{2e^z}{e^{2z} + 1} = \sum_{n=0}^{\infty} \frac{E_n}{n!} z^n, \qquad |z| < \tfrac{1}{2}\pi.
\]

In contrast to the Bernoulli numbers, the Euler numbers
are integers. The first few are

\[
E_0 = 1, \quad E_2 = -1, \quad E_4 = 5, \quad E_6 = -61,
\]

while those with odd index are zero.
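
The Euler numbers can be generated in the same spirit. Multiplying the generating function by $\cosh z$ and comparing coefficients gives, for even $n \ge 2$, the recurrence $\sum_{j=0}^{n/2} \binom{n}{2j} E_{n-2j} = 0$. The following sketch (function name illustrative) solves it for $E_n$ using integer arithmetic only.

```python
from math import comb


def euler_numbers(n_max):
    """Return [E_0, ..., E_{n_max}] from cosh(z) * sum_n E_n z^n / n! = 1."""
    E = [0] * (n_max + 1)
    E[0] = 1
    for n in range(2, n_max + 1, 2):
        # The coefficient of z^n / n! in the product must vanish for n >= 1.
        E[n] = -sum(comb(n, 2 * j) * E[n - 2 * j] for j in range(1, n // 2 + 1))
    return E


E = euler_numbers(6)
```

The list `E` is `[1, 0, -1, 0, 5, 0, -61]`, matching the values quoted above.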

The numbers $s(n,k)$ in the generating function

\[
\sum_{n=0}^{\infty} s(n,k) \frac{x^n}{n!} = \frac{(\ln(1+x))^k}{k!}, \qquad |x| < 1,
\]

and the numbers $S(n,k)$ in the generating function

\[
\sum_{n=0}^{\infty} S(n,k) \frac{x^n}{n!} = \frac{(e^x - 1)^k}{k!}
\]

are called Stirling numbers of the first and second
kind, respectively. They are defined for $0 \le k \le n$
and are named after James Stirling. The Stirling numbers
of the second kind have the following combinatorial
interpretation: $S(n,k)$ is the number of partitions
of $\{1,2,\ldots,n\}$ into exactly $k$ nonempty subsets. For
example, $S(4,2) = 7$, since

\begin{align*}
\{1,2,3,4\} &= \{1\} \cup \{2,3,4\} = \{2\} \cup \{1,3,4\} \\
&= \{3\} \cup \{1,2,4\} = \{4\} \cup \{1,2,3\} \\
&= \{1,2\} \cup \{3,4\} = \{1,3\} \cup \{2,4\} \\
&= \{1,4\} \cup \{2,3\}.
\end{align*}
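
The combinatorial interpretation suggests a simple way to compute $S(n,k)$. The sketch below relies on the standard recurrence $S(n,k) = k\,S(n-1,k) + S(n-1,k-1)$, which is not stated in the text: the element $n$ either joins one of the $k$ existing blocks or forms a block of its own. The function name is an illustrative choice.

```python
def stirling2(n, k):
    """Number of partitions of {1, ..., n} into exactly k nonempty subsets."""
    # table[i][j] holds S(i, j); S(0, 0) = 1 and S(i, 0) = 0 for i > 0.
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for j in range(1, min(i, k) + 1):
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[n][k]


example = stirling2(4, 2)
```

The value of `example` is 7, the number of partitions listed above.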

3 The Gamma Function and Related Functions 

The triangular numbers $T_n = 1 + 2 + \cdots + n$ can be
written as $\frac{1}{2}n(n+1)$. Euler thought that it must be possible
to express $n! = 1\cdot 2 \cdots (n-1)\cdot n$ as a simple
formula (the factorial notation was not used by Euler).
In 1729 he proved that for $n!$ such a simple formula
does not exist, but he did come up with the formula
$n! = \int_0^1 (-\ln x)^n\,dx$. Nowadays, this integral is written
as

\[
\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt, \qquad \operatorname{Re} z > 0, \tag{1}
\]

which is obtained when we substitute $t = -\ln x$ and
$n = z - 1$. This notation was formulated by Legendre
in 1809, and he also coined the term “gamma function”
for $\Gamma$.

The fundamental property $\Gamma(z+1) = z\Gamma(z)$ follows
easily by integrating by parts in (1), and this relation
shows that the singularities at $z = 0,-1,-2,\ldots$ are
poles of first order. From the Maclaurin expansion of
$e^{-t}$ and the Prym decomposition

\[
\Gamma(z) = \int_0^1 t^{z-1} e^{-t}\,dt + \int_1^\infty t^{z-1} e^{-t}\,dt,
\]

we obtain the expansion due to Mittag-Leffler

\[
\Gamma(z) = \sum_{n=0}^{\infty} \frac{(-1)^n}{n!\,(n+z)} + \int_1^\infty t^{z-1} e^{-t}\,dt,
\]

where $z \ne 0,-1,-2,\ldots$. As the integral is an entire
function of $z$, we see that the pole at $z = -n$ has a
residue $(-1)^n/n!$.

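There is another definition as an infinite product,

\[
\frac{1}{\Gamma(z)} = z e^{\gamma z} \prod_{n=1}^{\infty} \Big[ \Big(1 + \frac{z}{n}\Big) e^{-z/n} \Big],
\]

where $\gamma$ is Euler’s constant, defined by the limit

\[
\gamma = \lim_{n\to\infty} \Big( 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n} - \ln n \Big) = 0.5772\ldots. \tag{2}
\]

From the infinite product it follows that

\[
\Gamma(z) = \lim_{n\to\infty} \frac{n!\, n^z}{z(z+1)\cdots(z+n)}.
\]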

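Important properties are the reflection formula (due
to Euler (1771))

\[
\Gamma(z)\Gamma(1-z) = \frac{\pi}{\sin(\pi z)}, \qquad z \ne \text{an integer}, \tag{3}
\]

and the duplication formula

\[
\Gamma(\tfrac{1}{2})\Gamma(2z) = 2^{2z-1}\,\Gamma(z)\,\Gamma(z + \tfrac{1}{2}),
\]

where $2z \ne 0,-1,-2,\ldots$ and $\Gamma(\frac{1}{2}) = \sqrt{\pi}$.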
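
Both identities can be spot-checked numerically with the standard library. The snippet below is a sanity check at a single arbitrarily chosen real point, $z = 0.3$, and is not a proof.

```python
import math

z = 0.3  # arbitrary test point, not an integer

# Reflection formula: Gamma(z) Gamma(1 - z) = pi / sin(pi z).
reflection_lhs = math.gamma(z) * math.gamma(1.0 - z)
reflection_rhs = math.pi / math.sin(math.pi * z)

# Duplication formula: Gamma(1/2) Gamma(2z) = 2^(2z-1) Gamma(z) Gamma(z + 1/2).
duplication_lhs = math.gamma(0.5) * math.gamma(2.0 * z)
duplication_rhs = 2.0 ** (2.0 * z - 1.0) * math.gamma(z) * math.gamma(z + 0.5)
```

In each pair the two sides agree to within floating-point rounding.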

A closely related special function is the beta integral,
defined for $\operatorname{Re} p > 0$ and $\operatorname{Re} q > 0$ by

\[
B(p,q) = \int_0^1 t^{p-1} (1-t)^{q-1}\,dt. \tag{4}
\]

The relationship between the beta integral and the
gamma function is given by

\[
B(p,q) = \frac{\Gamma(p)\Gamma(q)}{\Gamma(p+q)}.
\]

The derivative of the gamma function itself does not
play an important role in the theory and applications of
special functions. It is not a very manageable function.
Much more interesting is the logarithmic derivative:

\[
\psi(z) = \frac{d}{dz}\ln\Gamma(z) = \frac{\Gamma'(z)}{\Gamma(z)}.
\]

It satisfies the recursion relation $\psi(z+1) = \psi(z) + 1/z$
and it has an interesting series expansion:

\[
\psi(z) = -\gamma + \sum_{n=0}^{\infty} \Big( \frac{1}{n+1} - \frac{1}{z+n} \Big),
\]

for $z \ne 0,-1,-2,\ldots$.

Stirling's asymptotic formula for factorials

\[
n! \sim \sqrt{2\pi n}\, n^n e^{-n}, \qquad n \to \infty, \tag{5}
\]

has several refinements, first in the form

\[
\Gamma(z) = \sqrt{2\pi/z}\, z^z e^{-z+\mu(z)}, \qquad \mu(z) = \frac{\theta}{12z}, \tag{6}
\]

with $0 < \theta < 1$, if $z > 0$, and second in the form

\[
\Gamma(z) \sim \sqrt{2\pi/z}\, z^z e^{-z} \sum_{n=0}^{\infty} \frac{a_n}{z^n}, \tag{7}
\]

which is valid for large $z$ inside the sector $-\pi < \operatorname{ph} z < \pi$
(where $\operatorname{ph} z = \arg z$ is the phase or argument of $z$).
The first few coefficients are

\[
a_0 = 1, \quad a_1 = \tfrac{1}{12}, \quad a_2 = \tfrac{1}{288}, \quad a_3 = -\tfrac{139}{51\,840}.
\]

For the asymptotic expansion of the logarithm of the
gamma function the coefficients are explicitly known
in terms of Bernoulli numbers:

\[
\ln\Gamma(z) \sim (z - \tfrac{1}{2})\ln z - z + \tfrac{1}{2}\ln(2\pi)
+ \sum_{n=1}^{\infty} \frac{B_{2n}}{2n(2n-1)z^{2n-1}}, \tag{8}
\]

which is again valid for large $z$ inside the sector $-\pi <
\operatorname{ph} z < \pi$. This expansion is more efficient than the one
in (7) because it is in powers of $z^{-2}$. Moreover, estimates
of the remainder in the expansion in (8) are available.
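
To see how quickly the expansion of $\ln\Gamma(z)$ becomes accurate, the sketch below truncates the Bernoulli sum after at most three terms, using $B_2 = \frac16$, $B_4 = -\frac1{30}$, $B_6 = \frac1{42}$, and compares the result with `math.lgamma` for real $z$. The function name and the test point $z = 10$ are illustrative choices.

```python
import math


def ln_gamma_asymptotic(z, terms=3):
    """Truncated asymptotic expansion of ln Gamma(z) for large real z (terms <= 3)."""
    bernoulli_even = [1.0 / 6.0, -1.0 / 30.0, 1.0 / 42.0]  # B_2, B_4, B_6
    value = (z - 0.5) * math.log(z) - z + 0.5 * math.log(2.0 * math.pi)
    for n in range(1, terms + 1):
        value += bernoulli_even[n - 1] / (2 * n * (2 * n - 1) * z ** (2 * n - 1))
    return value


z = 10.0
error_stirling_only = abs(ln_gamma_asymptotic(z, terms=0) - math.lgamma(z))
error_three_terms = abs(ln_gamma_asymptotic(z, terms=3) - math.lgamma(z))
```

With no correction terms the error at $z = 10$ is of order $10^{-2}$; three terms already reduce it below $10^{-9}$.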

4 The Riemann Zeta Function 

The Riemann zeta function is defined by the series

\[
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}, \qquad \operatorname{Re} s > 1. \tag{9}
\]

This function was known to Euler, but its main properties
were discovered by Riemann. A well-known result
in analysis is the divergence of the harmonic series
$\sum_{n=1}^{\infty} (1/n)$ (see (2)), and indeed, the Riemann zeta
function has a singular point at $s = 1$, a pole of order 1.
The limit $\lim_{s\to 1}(s-1)\zeta(s)$ exists and equals 1. Apart
from this pole, $\zeta(s)$ is analytic throughout the complex
$s$-plane.

The relationship between the Riemann zeta function
and the gamma function is shown in the reflection
formula:

\[
\zeta(1-s) = 2(2\pi)^{-s}\,\Gamma(s)\,\zeta(s)\cos(\tfrac{1}{2}\pi s), \qquad s \ne 0. \tag{10}
\]

The most remarkable thing about $\zeta(s)$ is its relationship
with the theory of prime numbers. We demonstrate
one aspect of this relationship here. Assume that
$\operatorname{Re} s > 1$. Subtract the series for $2^{-s}\zeta(s)$ from the one
in (9). We then obtain

\[
(1 - 2^{-s})\zeta(s) = \frac{1}{1^s} + \frac{1}{3^s} + \frac{1}{5^s} + \frac{1}{7^s} + \cdots.
\]

Similarly, we obtain

\[
(1 - 2^{-s})(1 - 3^{-s})\zeta(s) = \sum \frac{1}{n^s},
\]

where the summation now runs over $n \ge 1$, except for
multiples of 2 and 3. Now, let $p_n$ denote the $n$th prime
number, starting with $p_1 = 2$. By repeating the above
procedure we obtain

\[
\zeta(s) \prod_{n=1}^{m} (1 - p_n^{-s}) = 1 + \sum \frac{1}{n^s},
\]

where the summation runs over integers $n > 1$, except
for multiples of the primes $p_1, p_2, \ldots, p_m$. The sum of
this series vanishes as $m \to \infty$ (since $p_m \to \infty$). From
this we obtain the result:

\[
\frac{1}{\zeta(s)} = \prod_{n=1}^{\infty} (1 - p_n^{-s}), \qquad \operatorname{Re} s > 1. \tag{11}
\]

This formula is of fundamental importance to the rela- 
tionship between the Riemann zeta function and the 
theory of prime numbers. 
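
The product formula (11) can be tested numerically at $s = 2$, where the classical value $\zeta(2) = \pi^2/6$ (not derived in the text) gives an exact reference. The sketch below uses a simple sieve to list the primes up to a chosen bound; the function names and the bound $10^4$ are illustrative.

```python
import math


def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes p <= limit."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, limit + 1, p):
                sieve[multiple] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]


def inverse_zeta_partial_product(s, limit):
    """Product of (1 - p**(-s)) over the primes p <= limit."""
    product = 1.0
    for p in primes_up_to(limit):
        product *= 1.0 - p ** (-s)
    return product


approx = inverse_zeta_partial_product(2.0, 10000)
exact = 6.0 / math.pi ** 2  # 1 / zeta(2)
```

Since every omitted factor is slightly less than 1, the partial product `approx` lies just above `exact` and approaches it as the prime bound grows.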


An immediate consequence is that $\zeta(s)$ does not
have zeros in the half-plane $\operatorname{Re} s > 1$. The reflection
formula (10) makes it clear that the only zeros in the
half-plane $\operatorname{Re} s < 0$ occur at the points $-2, -4, -6, \ldots$.
These are called the trivial zeros of the zeta function.

Riemann conjectured that all the zeros in the strip
$0 \le \operatorname{Re} s \le 1$ (it is known that there are infinitely many
of them) are located on the line $\operatorname{Re} s = \frac{1}{2}$. This conjecture,
the Riemann hypothesis, has not yet been proved
or disproved. An important part of number theory is 
based on this conjecture. Much time has been spent on 
attempting to verify or disprove the Riemann hypoth- 
esis, both analytically and numerically. It is one of the 
seven Millennium Prize Problems of the Clay Mathemat- 
ics Institute, with a prize of $1 000 000 on offer for its 
resolution. 


5 Gauss Hypergeometric Functions 


The Gauss hypergeometric function,

\[
F(a,b;c;z) = 1 + \frac{ab}{c}\,z + \frac{a(a+1)b(b+1)}{c(c+1)\,2!}\,z^2 + \cdots,
\]

plays a central role in the theory of special functions.
(We always assume that $c \ne 0,-1,-2,\ldots$.) With more
compact notation,

\[
F(a,b;c;z) = \sum_{n=0}^{\infty} \frac{(a)_n (b)_n}{(c)_n\, n!}\, z^n, \qquad |z| < 1, \tag{12}
\]

where Pochhammer’s symbol $(a)_n$ is defined by

\[
(a)_n = \frac{\Gamma(a+n)}{\Gamma(a)} = (-1)^n \frac{\Gamma(1-a)}{\Gamma(1-a-n)},
\]

using (3). Many special cases arise in the area of orthogonal
polynomials (see section 7), in probability theory
(as distribution functions: see section 6), and in physics
(as Legendre functions: see section 10).

When $a = 0,-1,-2,\ldots$, the power series in (12)
terminates and $F(a,b;c;z)$ becomes a polynomial of
degree $-a$. The same holds for $b$; note the symmetry
$F(a,b;c;z) = F(b,a;c;z)$.

Euler knew of the hypergeometric function, but 
Gauss made a more systematic study of it. The name 
“hypergeometric series” was introduced by John Wallis 
in 1655. 

The geometric series $1/(1-z) = 1 + z + z^2 + \cdots$ is
the simplest example, and more generally we have

\[
(1-z)^{-a} = \sum_{n=0}^{\infty} \frac{(a)_n}{n!}\, z^n = F(a,b;b;z), \qquad |z| < 1, \tag{13}
\]

for any $b$. Other elementary examples are

\begin{align*}
F(\tfrac{1}{2},1;\tfrac{3}{2};z^2) &= \tfrac{1}{2} z^{-1} \ln([1+z]/[1-z]), \\
F(\tfrac{1}{2},1;\tfrac{3}{2};-z^2) &= z^{-1} \arctan z, \\
F(\tfrac{1}{2},\tfrac{1}{2};\tfrac{3}{2};z^2) &= z^{-1} \arcsin z, \\
F(\tfrac{1}{2},\tfrac{1}{2};\tfrac{3}{2};-z^2) &= z^{-1} \ln(z + \sqrt{1+z^2}),
\end{align*}

as well as complete elliptic integrals. We have

\[
K(k) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - k^2\sin^2\theta}} = \frac{\pi}{2}\, F(\tfrac{1}{2},\tfrac{1}{2};1;k^2)
\]

for $0 \le k^2 < 1$. Similarly, for $0 \le k^2 \le 1$,

\[
E(k) = \int_0^{\pi/2} \sqrt{1 - k^2\sin^2\theta}\,d\theta = \frac{\pi}{2}\, F(-\tfrac{1}{2},\tfrac{1}{2};1;k^2).
\]
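
The elementary special cases provide a convenient test of a direct summation of the Gauss series. The sketch below sums the series (12) term by term using the ratio of consecutive terms; it is meant for real arguments with $|z| < 1$ only, and the function name and the truncation at 200 terms are illustrative choices.

```python
import math


def hyp2f1_series(a, b, c, z, terms=200):
    """Partial sum of the Gauss series; intended for |z| < 1 only."""
    total = 0.0
    term = 1.0  # holds (a)_n (b)_n z^n / ((c)_n n!), starting at n = 0
    for n in range(terms):
        total += term
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
    return total


z = 0.5
log_case = hyp2f1_series(0.5, 1.0, 1.5, z * z)       # (1/(2z)) ln((1+z)/(1-z))
arctan_case = hyp2f1_series(0.5, 1.0, 1.5, -z * z)   # arctan(z)/z
arcsin_case = hyp2f1_series(0.5, 0.5, 1.5, z * z)    # arcsin(z)/z
arsinh_case = hyp2f1_series(0.5, 0.5, 1.5, -z * z)   # ln(z + sqrt(1+z^2))/z
```

Each partial sum agrees with the corresponding elementary function to near machine precision at $z = 0.5$; closer to $|z| = 1$ many more terms would be needed.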


The integral representation

\[
F(a,b;c;z) = \frac{\Gamma(c)}{\Gamma(b)\Gamma(c-b)} \int_0^1 t^{b-1} (1-t)^{c-b-1} (1-tz)^{-a}\,dt \tag{14}
\]

can be used when $\operatorname{Re} c > \operatorname{Re} b > 0$ and for $|\operatorname{ph}(1-z)| < \pi$.
This representation extends the $z$-domain of the
function defined by the power series in (12) considerably.
The relationship between the integral in (14)
and the hypergeometric function follows if we expand
$(1-tz)^{-a}$ as in (13) and use the beta integral (4).
Several other integral representations along contours
in the complex plane that have fewer or no restrictions
on the parameters $a$, $b$, and $c$ can be obtained from (14).

The function $F(a,b;c;z)$ satisfies the differential
equation (given by Gauss)

\[
z(1-z)F'' + (c - (a+b+1)z)F' - abF = 0. \tag{15}
\]

From the theory of differential equations it follows that
(15) has three regular singular points, at $z = 0$, $z = 1$,
and $z = \infty$.

A function may be expressed in terms of hypergeometric
functions in several different ways. The
example $(1-z)^{-1} = F(1,b;b;z)$ demonstrates the basic
idea. We can write

\[
\frac{1}{1-z} = \frac{-1}{z(1 - 1/z)} = -\frac{1}{z}\, F(1,b;b;1/z). \tag{16}
\]

We therefore have a second representation of $(1-z)^{-1}$
but with a different domain of convergence of the
power series, $|z| > 1$. In the general case, $F(a,b;c;z)$
can be written as a linear combination of other $F$-functions,
with different $a$, $b$, and $c$, and power series in $z$,
$1/z$, $1-z$, $z/(z-1)$, $1/(1-z)$, or $1 - 1/z$.


The simplest general set of such relations is

$$\begin{aligned}
F(a,b;c;z) &= (1-z)^{-a} F(a, c-b; c; z/(z-1)) \\
&= (1-z)^{-b} F(c-a, b; c; z/(z-1)) \\
&= (1-z)^{c-a-b} F(c-a, c-b; c; z).
\end{aligned}$$

The first of these follows most easily from changing
the variable of integration in (14), $t \to 1 - t$, the second
from the symmetry in $F(a,b;c;z)$ with respect to
$a$ and $b$, and the third from using the first or second
relation twice. This gives another way of extending the
$z$-domain of the function defined by the power series
in (12).
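To see how these relations extend the domain in practice, here is a small sketch under simple assumptions (real $z$, direct summation of the series). The first relation maps any real $z < \tfrac12$ to the point $z/(z-1)$ inside the unit disk, so the series can be used even for $z = -3$, where the original series diverges. The function names are illustrative.

```python
import math

def gauss_series(a, b, c, z, tol=1e-16, max_terms=100_000):
    """Direct sum of the Gauss series; requires |z| < 1."""
    if abs(z) >= 1:
        raise ValueError("the power series requires |z| < 1")
    term = total = 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) * z / ((c + n) * (n + 1))
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

def hyp2f1_first_relation(a, b, c, z):
    """F(a,b;c;z) = (1-z)^(-a) F(a, c-b; c; z/(z-1)), usable for real z < 1/2."""
    if z >= 0.5:
        raise ValueError("z/(z-1) lies inside the unit disk only for z < 1/2")
    return (1.0 - z) ** (-a) * gauss_series(a, c - b, c, z / (z - 1.0))

def hyp2f1_third_relation(a, b, c, z):
    """F(a,b;c;z) = (1-z)^(c-a-b) F(c-a, c-b; c; z), for |z| < 1."""
    return (1.0 - z) ** (c - a - b) * gauss_series(c - a, c - b, c, z)

# Inside the unit disk all three evaluations should agree.
a, b, c, z = 0.3, 1.2, 2.1, 0.3
direct = gauss_series(a, b, c, z)
print(direct, hyp2f1_first_relation(a, b, c, z), hyp2f1_third_relation(a, b, c, z))

# Outside the unit disk: F(1/2, 1; 3/2; -3) = arctan(sqrt(3)) / sqrt(3).
extended = hyp2f1_first_relation(0.5, 1.0, 1.5, -3.0)
reference = math.atan(math.sqrt(3.0)) / math.sqrt(3.0)
print(extended, reference)
```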

The Gauss hypergeometric function has many generalizations,
of which we mention the most natural one,
which takes the form

$${}_pF_q(a_1, \ldots, a_p;\, b_1, \ldots, b_q;\, z) = \sum_{n=0}^{\infty} \frac{(a_1)_n (a_2)_n \cdots (a_p)_n}{(b_1)_n (b_2)_n \cdots (b_q)_n}\, \frac{z^n}{n!}, \tag{17}$$

which is convergent for $|z| < 1$ if $p = q + 1$ or
for all $z$ when $p \le q$. All $b_j$ should be different
from $0, -1, -2, \ldots$. Note that the Gauss hypergeometric
function is $F = {}_2F_1$.

6 Probability Functions 

An essential concept in probability theory is the cumulative
distribution function $F(x)$. It is defined on the
real line, and it is nondecreasing with $F(-\infty) = 0$ and
$F(\infty) = 1$. Other intervals are also used. The normal or
Gaussian distribution

$$P(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$$

is a major example. The error functions

$$\operatorname{erf} x = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt, \qquad \operatorname{erfc} x = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt$$

are also used, and the following relationship holds:
$\operatorname{erf} x + \operatorname{erfc} x = 1$. $P(x)$ and the complementary function
$Q(x) = 1 - P(x)$ are related to erfc in the following
ways:

$$P(x) = \tfrac12 \operatorname{erfc}(-x/\sqrt{2}), \qquad Q(x) = \tfrac12 \operatorname{erfc}(x/\sqrt{2}).$$
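These two relations translate directly into code. The sketch below uses Python's `math.erfc` and compares the result with `statistics.NormalDist`; the optional `mean` and `std` arguments are an illustrative way of including the extra parameters, not notation taken from the text.

```python
import math
from statistics import NormalDist

def normal_cdf(x, mean=0.0, std=1.0):
    """P(x) = erfc(-x / sqrt(2)) / 2, after standardizing x."""
    return 0.5 * math.erfc(-(x - mean) / (std * math.sqrt(2.0)))

def normal_sf(x, mean=0.0, std=1.0):
    """Q(x) = 1 - P(x) = erfc(x / sqrt(2)) / 2, after standardizing x."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2.0)))

for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, normal_cdf(x), NormalDist().cdf(x), normal_sf(x))

# Far in the tail, Q computed through erfc keeps its relative accuracy,
# whereas forming 1 - P(x) in floating point would lose it.
print(normal_sf(10.0))
```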

Error functions occur in other branches of applied 
mathematics, heat conduction, for example. Extra pa- 
rameters such as mean and variance can be included 
in the basic forms. An extensive introduction to prob- 
ability functions can be found in Johnson et al. (1994, 
1995). 
More degrees of freedom are available in the incomplete
beta function ratio $I_x(p,q)$, which is based on the
beta integral in (4) and is given by

$$I_x(p,q) = \frac{1}{B(p,q)} \int_0^x t^{p-1} (1-t)^{q-1}\, dt,$$

with $x \in [0,1]$ and $p, q > 0$. Special cases are the
F-(variance-ratio) distribution and Student's t-distribution.
The function $I_x(p,q)$ can be written as a Gauss
hypergeometric function.

For the gamma distribution we split up the integral
in (1) that defines the gamma function. This gives the
incomplete gamma function ratios

$$P(a,x) = \frac{1}{\Gamma(a)} \int_0^x t^{a-1} e^{-t}\, dt, \qquad Q(a,x) = \frac{1}{\Gamma(a)} \int_x^{\infty} t^{a-1} e^{-t}\, dt,$$

where $a > 0$. The functions $P(a,x)$ and $Q(a,x)$ can be
written in terms of confluent hypergeometric functions
(see section 8).

A further generalization is the noncentral $\chi^2$ distribution,
which can be defined by the integral

$$P_\mu(x,y) = \int_0^y \left(\frac{z}{x}\right)^{(\mu-1)/2} e^{-x-z}\, I_{\mu-1}(2\sqrt{xz})\, dz, \tag{18}$$

where $I_\mu$ is a modified Bessel function (see section 9).
The complementary function is $Q_\mu = 1 - P_\mu$, and this
function also plays a role in physics, for instance in
problems on radar communications, where it is called
the generalized Marcum Q-function.

In terms of the incomplete gamma function ratios we
have the expansions

$$P_\mu(x,y) = e^{-x} \sum_{n=0}^{\infty} \frac{x^n}{n!}\, P(\mu + n, y), \qquad Q_\mu(x,y) = e^{-x} \sum_{n=0}^{\infty} \frac{x^n}{n!}\, Q(\mu + n, y).$$
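The expansion of $P_\mu(x,y)$ lends itself to a short numerical sketch. The code below is an illustration under stated assumptions rather than a robust algorithm: `gamma_p` sums the standard series $P(a,x) = x^a e^{-x}\sum_n x^n/\Gamma(a+n+1)$, which is only efficient for moderate $x$, and `noncentral_p_integral` evaluates the integral (18) by Simpson's rule, with the Bessel function expanded in its power series, purely as a cross-check for $\mu \ge 1$. All function names are hypothetical.

```python
import math

def gamma_p(a, x, tol=1e-16, max_terms=100_000):
    """Incomplete gamma ratio P(a,x) for a > 0, x >= 0, by its power series."""
    if x <= 0.0:
        return 0.0
    term = math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term <= tol * total:
            break
    return total

def noncentral_p(mu, x, y, tol=1e-16, max_terms=100_000):
    """P_mu(x,y) = e^(-x) sum_n x^n/n! P(mu+n, y): Poisson-weighted gamma ratios."""
    weight = math.exp(-x)  # the weight e^(-x) x^n / n! for n = 0
    total = 0.0
    for n in range(max_terms):
        total += weight * gamma_p(mu + n, y)
        weight *= x / (n + 1)
        if n + 1 > x and weight < tol:  # weights decrease once n exceeds x
            break
    return total

def noncentral_p_integral(mu, x, y, n=4000, terms=60):
    """Simpson's rule applied to (18); intended as a check for mu >= 1 only."""
    def integrand(z):
        # (z/x)^((mu-1)/2) I_{mu-1}(2 sqrt(xz)) = sum_k x^k z^(k+mu-1) / (k! Gamma(k+mu))
        s = sum(x ** k * z ** (k + mu - 1) / (math.factorial(k) * math.gamma(k + mu))
                for k in range(terms))
        return math.exp(-x - z) * s
    h = y / n
    s = integrand(0.0) + integrand(y)
    s += sum((4 if i % 2 else 2) * integrand(i * h) for i in range(1, n))
    return s * h / 3

print(gamma_p(1.0, 0.7), 1.0 - math.exp(-0.7))       # P(1,x) = 1 - e^(-x)
print(gamma_p(0.5, 0.7), math.erf(math.sqrt(0.7)))   # P(1/2,x) = erf(sqrt(x))
print(noncentral_p(1.0, 1.5, 2.0), noncentral_p_integral(1.0, 1.5, 2.0))
```

The complementary function follows as `1.0 - noncentral_p(mu, x, y)` when the result is not too close to zero; in the far tail one would instead sum the expansion of $Q_\mu$ directly.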

In statistics and probability theory one is more familiar
with the definition through the $\chi^2$ probability functions,
which are defined by

$$P(\chi^2 \mid \nu) = P(a,x), \qquad Q(\chi^2 \mid \nu) = Q(a,x),$$

where $\nu = 2a$ and $\chi^2 = 2x$. The noncentral $\chi^2$ distribution
functions are then defined by

$$P(\chi^2 \mid \nu, \lambda) = \sum_{n=0}^{\infty} e^{-\lambda/2}\, \frac{(\lambda/2)^n}{n!}\, P(\chi^2 \mid \nu + 2n),$$

$$Q(\chi^2 \mid \nu, \lambda) = \sum_{n=0}^{\infty} e^{-\lambda/2}\, \frac{(\lambda/2)^n}{n!}\, Q(\chi^2 \mid \nu + 2n),$$

where $\lambda \ge 0$ is called the noncentrality parameter.


7 Orthogonal Polynomials 

These special functions arise in many branches of pure
and applied mathematics. For example, the Hermite
polynomials $H_n$ play a role in the form $e^{-x^2/2} H_n(x)$ as
eigenfunctions of the Schrödinger equation for a linear
harmonic oscillator.

Let $p_n$ be a polynomial of degree $n$ defined on
$(a,b)$, a real interval, where $a = -\infty$ and/or $b = +\infty$
are allowed. Let $w$ be a nonnegative weight function
defined on $(a,b)$. Suppose that

$$\int_a^b p_n(x)\, p_m(x)\, w(x)\, dx = 0 \tag{19}$$

if and only if $n \ne m$. Then the family $\{p_n\}$ constitutes
a system of orthogonal polynomials on $(a,b)$ with
respect to weight $w$.

Orthogonal polynomials can also be defined with 
Lebesgue measures, on curves in the complex plane 
(such as the unit circle), and with respect to discrete 
weight functions or measures. In the last case, the 
integral in (19) becomes a sum. 

The families of polynomials associated with the
names of Jacobi, Gegenbauer, Chebyshev, Legendre,
Laguerre, and Hermite are called the classical orthogonal
polynomials. They share many features, and they
have the following characteristics:

(i) the family $\{p_n'\}$ is also an orthogonal system;

(ii) the polynomial $p_n$ satisfies a second-order linear
differential equation $A(x)y'' + B(x)y' + \lambda_n y = 0$,
where $A$ and $B$ do not depend on $n$ and $\lambda_n$ does
not depend on $x$; and

(iii) there is a Rodrigues formula of the form

$$p_n(x) = \frac{1}{K_n w(x)}\, \frac{d^n}{dx^n} \bigl(w(x) X^n(x)\bigr),$$

where $X(x)$ is a polynomial with coefficients not
depending on $n$, and $K_n$ does not depend on $x$.

These three properties are so characteristic that any
system of orthogonal polynomials that has them can
be reduced to classical orthogonal polynomials.

Classical orthogonal polynomials satisfy recurrence
relations of the form

$$p_{n+1}(x) = (a_n x + b_n)\, p_n(x) - c_n\, p_{n-1}(x) \tag{20}$$

for $n = 1, 2, \ldots$.

In 1815, Gauss introduced the use of orthogonal
polynomials in numerical quadrature for the Legendre
case $a = -1$, $b = 1$, $w(x) = 1$. The general formula
for a family $\{p_n\}$ of polynomials that are orthogonal
on the interval $(a,b)$ with weight $w(x)$ is

$$\int_a^b w(x) f(x)\, dx = \sum_{k=1}^{n} \lambda_{k,n}\, f(x_{k,n}) + R_n,$$

where $x_{k,n}$ (the nodes) are the zeros of $p_n$. Several
forms are available for the weights $\lambda_{k,n}$; if the $p_n$ are
orthonormal ($\int_a^b p_n^2(x) w(x)\, dx = 1$),

$$\frac{1}{\lambda_{k,n}} = \sum_{j=0}^{n-1} p_j^2(x_{k,n}).$$

The remainder $R_n = 0$ for an arbitrary polynomial
$f(x) = q_m(x)$ with degree $m$ not exceeding $2n - 1$.
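For the Legendre case the quadrature rule can be assembled in a few lines. The sketch below is one possible construction, assuming Newton's method with the usual cosine starting guesses for the zeros of $P_n$; the weights are then computed from the sum over orthonormal polynomials given above, using the fact that $\sqrt{(2j+1)/2}\,P_j$ is orthonormal on $(-1,1)$. Function names are illustrative.

```python
import math

def legendre(n, x):
    """Return (P_n(x), P_n'(x)) from the three-term recurrence, for |x| < 1."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp

def gauss_legendre(n):
    """Nodes and weights of the n-point Gauss rule on (-1, 1) with w(x) = 1."""
    nodes, weights = [], []
    for k in range(1, n + 1):
        x = math.cos(math.pi * (k - 0.25) / (n + 0.5))  # starting guess for a zero
        for _ in range(100):                            # Newton iteration on P_n
            p, dp = legendre(n, x)
            dx = p / dp
            x -= dx
            if abs(dx) < 1e-15:
                break
        # weight from 1/lambda = sum_{j<n} p_j(x)^2 with p_j = sqrt((2j+1)/2) P_j
        s, p_prev, p = 0.0, 0.0, 1.0
        for j in range(n):
            s += 0.5 * (2 * j + 1) * p * p
            p_prev, p = p, ((2 * j + 1) * x * p - j * p_prev) / (j + 1)
        nodes.append(x)
        weights.append(1.0 / s)
    return nodes, weights

def quad(f, nodes, weights):
    return sum(w * f(x) for x, w in zip(nodes, weights))

nodes, weights = gauss_legendre(5)
print(quad(lambda x: x ** 8, nodes, weights), 2 / 9)    # degree 8 <= 2n - 1
print(quad(lambda x: x ** 10, nodes, weights), 2 / 11)  # degree 10 > 2n - 1
```

With $n = 5$ the rule reproduces the integral of $x^8$ to rounding error, while the integral of $x^{10}$ shows a visible remainder, in line with the degree bound $2n - 1$.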

As an example, consider the Hermite polynomials
$H_n(x)$. They have weight function $e^{-x^2}$ on $(-\infty, \infty)$;
they follow from the Rodrigues formula

$$H_n(x) = (-1)^n e^{x^2} \frac{d^n}{dx^n} e^{-x^2};$$

they satisfy the differential equation $y'' - 2xy' + 2ny = 0$;
they satisfy the recursion (20) with $a_n = 2$,
$b_n = 0$, $c_n = 2n$ and initial values $H_0(x) = 1$, $H_1(x) = 2x$;
and they have the generating function

$$e^{2xz - z^2} = \sum_{n=0}^{\infty} \frac{H_n(x)}{n!}\, z^n.$$


Jacobi, Gegenbauer, Legendre, and Chebyshev poly- 
nomials are special cases of the Gauss hypergeometric 
function. Laguerre and Hermite polynomials are spe- 
cial cases of the confluent hypergeometric function (see 
section 8). 
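The Hermite recursion and generating function quoted above are easy to confirm numerically. This is a minimal sketch; truncating the generating series at 40 terms is an assumption that is adequate for the small values of $x$ and $z$ used here.

```python
import math

def hermite(n, x):
    """H_n(x) from H_{k+1} = 2x H_k - 2k H_{k-1}, with H_0 = 1 and H_1 = 2x."""
    if n == 0:
        return 1.0
    h_prev, h = 1.0, 2.0 * x
    for k in range(1, n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h

def generating_sum(x, z, terms=40):
    """Partial sum of sum_n H_n(x) z^n / n!."""
    return sum(hermite(n, x) * z ** n / math.factorial(n) for n in range(terms))

x, z = 0.7, 0.3
print(hermite(3, x), 8 * x ** 3 - 12 * x)             # H_3(x) = 8x^3 - 12x
print(generating_sum(x, z), math.exp(2 * x * z - z * z))
```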


8 Confluent Hypergeometric or Kummer Functions

When we take $b \to \infty$ in $F(a,b;c;z/b)$, using the series
(12) and $\lim_{b\to\infty} (b)_n / b^n = 1$, the result is ${}_1F_1(a;c;z)$
(put $p = q = 1$ in (17)):

$${}_1F_1(a;c;z) = \sum_{n=0}^{\infty} \frac{(a)_n}{(c)_n}\, \frac{z^n}{n!}. \tag{21}$$

This series converges for all complex $z$, with the usual
exception $c = 0, -1, -2, \ldots$. When $a = c$, ${}_1F_1(a;a;z) = e^z$.
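The limit that defines the confluent function can be observed numerically. The sketch below sums the generalized series (17) for arbitrary parameter lists (the name `hyp_series` is illustrative) and compares $F(a,b;c;z/b)$ with ${}_1F_1(a;c;z)$ for increasing $b$; the difference should shrink roughly like $1/b$.

```python
import math

def hyp_series(num, den, z, tol=1e-16, max_terms=100_000):
    """Sum the pFq series with numerator parameters num and denominator parameters den."""
    term = total = 1.0
    for n in range(max_terms):
        ratio = z / (n + 1)
        for a in num:
            ratio *= a + n
        for b in den:
            ratio /= b + n
        term *= ratio
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

a, c, z = 0.7, 2.3, 1.5
kummer = hyp_series([a], [c], z)
diffs = []
for b in (10.0, 1e3, 1e6):
    gauss = hyp_series([a, b], [c], z / b)   # needs |z/b| < 1
    diffs.append(abs(gauss - kummer))
    print(f"b={b:9.0f}  F(a,b;c;z/b)={gauss:.12f}  1F1={kummer:.12f}")

print(hyp_series([a], [a], z), math.exp(z))  # 1F1(a;a;z) = e^z
```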

The function $F(a,b;c;z/b)$ satisfies a differential
equation with three regular singularities: at $z = 0$, $z = b$,
and $z = \infty$. In the limit as $b \to \infty$, two singularities
merge. This limiting process is called a confluence, and
so ${}_1F_1(a;c;z)$ is called the confluent hypergeometric
function. It satisfies the differential equation

$$zF'' + (c - z)F' - aF = 0, \tag{22}$$

which has a regular singularity at $z = 0$ and an irregular
singularity at $z = \infty$.

We have the integral representation

$${}_1F_1(a;c;z) = \frac{\Gamma(c)}{\Gamma(a)\Gamma(c-a)} \int_0^1 t^{a-1} (1-t)^{c-a-1} e^{zt}\, dt,$$

which is valid when $\operatorname{Re} c > \operatorname{Re} a > 0$. When we expand
the exponential function, we obtain the series in (21).
For a second solution of (22), we have

$$U(a,c,z) = \frac{1}{\Gamma(a)} \int_0^{\infty} t^{a-1} (1+t)^{c-a-1} e^{-zt}\, dt,$$

where we assume $\operatorname{Re} z > 0$ and $\operatorname{Re} a > 0$.

The functions iFi(a;c;z) and U(a,c,z ) are named 
after Kummer. Special cases are Coulomb functions, 
Laguerre polynomials, Bessel functions, parabolic cylin- 
der functions, incomplete gamma functions, Fresnel 
integrals, error functions, and exponential integrals. 

The Whittaker functions are an alternative pair of
Kummer functions and they have the following definitions:

$$M_{\kappa,\mu}(z) = e^{-z/2} z^{1/2+\mu}\, {}_1F_1(\tfrac12 + \mu - \kappa;\, 1 + 2\mu;\, z),$$

$$W_{\kappa,\mu}(z) = e^{-z/2} z^{1/2+\mu}\, U(\tfrac12 + \mu - \kappa,\, 1 + 2\mu,\, z).$$

These functions satisfy the differential equation

$$w'' + \left(-\frac14 + \frac{\kappa}{z} + \frac{\frac14 - \mu^2}{z^2}\right) w = 0.$$

Solutions of the differential equation

$$w'' - (\tfrac14 z^2 + a)\, w = 0 \tag{23}$$

can be expressed in terms of Kummer functions, but
they are often called Weber parabolic cylinder functions,
after Heinrich Weber, because they arise when
solving the Laplace equation $\Delta V = 0$ by separating
the variables into parabolic cylindrical coordinates,
$(\xi, \eta, z)$. These parabolic cylindrical coordinates are
related to rectangular coordinates $(x, y, z)$ through the
equations $x = \tfrac12 c(\xi^2 - \eta^2)$ and $y = c\xi\eta$, where $c$ is a
scale factor. When $\xi$ or $\eta$ is kept constant, say $\xi = \xi_0$
or $\eta = \eta_0$, then we have

$$y^2 = -c\xi_0^2 (2x - c\xi_0^2), \qquad y^2 = c\eta_0^2 (2x + c\eta_0^2),$$

which are parabolas with foci at the origin.

Solutions of (23) can first be written in terms of Kummer
functions, and then linear combinations give the
Weber functions, $U(a, \pm z)$ and $V(a, \pm z)$. One can also
use the notation $D_\nu(z) = U(-\tfrac12 - \nu, z)$; this function
can be written in terms of a Hermite polynomial when
$\nu = 0, 1, 2, \ldots$.

9 Bessel Functions


Bessel functions show up in many physics and engi- 
neering problems, in Fourier theory and abstract har- 
monic analysis, and in statistics and probability theory. 
Most frequently they occur in connection with differen- 
tial equations. The earliest systematic study was under- 
taken by Bessel in 1824 for a problem connected with 
planetary motion. For historical notes and an extensive 
treatment, see Watson (1944). 

We have seen the modified Bessel function $I_\mu(z)$ in
the noncentral $\chi^2$ distribution, (18). In mathematical
physics Bessel functions are most commonly associated
with the partial differential equations of the potential
problem, wave motion, or diffusion, in cylindrical or
spherical coordinates. By separating the variables with
respect to these coordinates in the time-independent
wave equation (the Helmholtz equation) $\Delta v + k^2 v = 0$,
several differential equations are obtained and one of
them can be put in the form of the Bessel differential
equation:

$$z^2 w'' + z w' + (z^2 - \nu^2)\, w = 0. \tag{24}$$

Proper normalizations and combinations of solutions
of (24) give the ordinary Bessel functions

$$J_\nu(z), \quad Y_\nu(z), \quad H_\nu^{(1)}(z), \quad H_\nu^{(2)}(z), \tag{25}$$

also called cylinder functions, where $\nu$ is the order of
these functions. The modified Bessel functions $I_\nu(z)$
and $K_\nu(z)$ follow from these with $z$ replaced by $\pm iz$.

In physical problems with circular or cylindrical symmetry
Bessel functions with order $\nu = n$, an integer,
are used, while in spherical coordinates the Bessel
functions arise with $\nu = n + \tfrac12$.
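The half-integer orders that appear in spherical coordinates reduce to elementary functions, which gives a convenient numerical check. The following sketch sums the standard ascending series $J_\nu(z) = \sum_k (-1)^k (z/2)^{2k+\nu} / (k!\,\Gamma(k+\nu+1))$; dropping the alternating sign gives $I_\nu(z)$, reflecting the replacement of $z$ by $\pm iz$. The function name and the `sign` argument are illustrative, and the series is only suitable for moderate real $z > 0$ because of cancellation for large arguments.

```python
import math

def bessel_series(nu, z, sign=-1.0, tol=1e-17, max_terms=500):
    """Ascending series: sign=-1 gives J_nu(z), sign=+1 gives I_nu(z); z > 0, nu > -1."""
    term = (0.5 * z) ** nu / math.gamma(nu + 1.0)
    total = term
    for k in range(1, max_terms):
        term *= sign * 0.25 * z * z / (k * (k + nu))
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

z = 2.0
scale = math.sqrt(2.0 / (math.pi * z))
print(bessel_series(0.5, z), scale * math.sin(z))                     # J_{1/2}
print(bessel_series(1.5, z), scale * (math.sin(z) / z - math.cos(z))) # J_{3/2}
print(bessel_series(0.5, z, sign=1.0), scale * math.sinh(z))          # I_{1/2}
print(bessel_series(0.0, 2.404825557695773))   # close to the first zero of J_0
```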

Bessel functions of order $\pm\tfrac13$ with argument $\tfrac23 z^{3/2}$
are named Airy functions after George Biddell Airy,
a British astronomer, who used them when studying
rainbow phenomena. Airy functions are solutions of
the differential equation $w'' = zw$. The real solutions
are oscillatory for $z < 0$ and exponential for $z > 0$.
Airy's equation is the simplest second-order linear differential
equation that shows such a turning point (at
$z = 0$).

Other special functions with turning-point behavior
can be approximated in terms of Airy functions.
The function $w(z) = \sqrt{z}\, C_\nu(\nu z)$, where $C_\nu(z)$ is any
cylinder function that appears in (25), satisfies the
differential equation

$$w'' - \left(\nu^2\, \frac{1 - z^2}{z^2} - \frac{1}{4z^2}\right) w = 0.$$

For large values of $\nu$ this equation has turning points
at $z = \pm 1$. In fact, the cylinder functions $C_\nu(z)$ of
large order $\nu$ show turning-point behavior at $z = \nu$.
Airy functions can be used to give powerful asymptotic
approximations (Olver 1997).


10 Legendre Functions 

The associated Legendre functions $P_\nu^\mu(z)$ and $Q_\nu^\mu(z)$
satisfy Legendre's differential equation

$$(1 - z^2)\, w'' - 2z\, w' + \left(\nu(\nu+1) - \frac{\mu^2}{1 - z^2}\right) w = 0. \tag{26}$$

This equation has regular singularities at $\pm 1$ and $\infty$,
and it can therefore be transformed into the equation
of the Gauss hypergeometric functions (15). In fact,

$$P_\nu^\mu(z) = \frac{\zeta^{\mu/2}}{\Gamma(1-\mu)}\, F(-\nu, \nu+1; 1-\mu; \tfrac12 - \tfrac12 z), \tag{27}$$

where $\zeta = (z+1)/(z-1)$ and the branch cut is such
that $\operatorname{ph} \zeta = 0$ if $z \in (1, \infty)$. When $z = x \in (-1, 1)$, a real
solution is defined by replacing $\zeta$ with $(1+x)/(1-x)$
in (27).

We describe a few special cases that are relevant in 
boundary-value problems for special choices of Legen- 
dre functions and for specific domains: spheres, cones, 
and tori. Many problems in other domains, however, 
such as in a spheroid or a hyperboloid of revolution, 
can be solved using Legendre functions. 


10.1 Spherical Harmonics 


These functions arise in a variety of applications, in par- 
ticular in investigating gravitational wave signals from 
a pulsar and in tomographic reconstruction of signals 
from outer space. 

First we consider the subclass of associated Legendre
functions defined by the Rodrigues-type formula

$$P_n^m(x) = (-1)^m\, \frac{(1 - x^2)^{m/2}}{2^n n!}\, \frac{d^{n+m}}{dx^{n+m}} (x^2 - 1)^n,$$

where $n$, $m$ are nonnegative integers and $x \in (-1, 1)$.
For fixed order $m$ they are orthogonal with respect to
the degree $n$, with weight function $w(x) = 1$ on $(-1, 1)$.
For $m = 0$ they become the well-known Legendre
polynomials, $P_n(x)$.

The $P_n^m(x)$ are used in representations of spherical
harmonics, which are given by

$$Y_n^m(\theta, \phi) = N_n^m\, P_n^m(\cos\theta)\, e^{im\phi}, \tag{28}$$

where $N_n^m$ is used for normalization and does not
depend on $\theta$ or $\phi$. These variables represent colatitude
and longitude, respectively, on the sphere in the intervals
$\theta \in [0, \pi]$ (from north pole to south pole) and
$\phi \in [0, 2\pi)$ (also called the azimuth).

The general bounded solution to Laplace's equation,
$\Delta f = 0$, inside the unit sphere centered at the origin,
$r = 0$, is a linear combination

$$f(r, \theta, \phi) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} f_n^m\, r^n\, Y_n^m(\theta, \phi). \tag{29}$$

The constants $f_n^m$ can be computed when boundary
values on the sphere are given using orthogonality
relationships for $P_n^m(x)$ and the trigonometric
elements. When the boundary values are given as
a real square-integrable function, the expansion (29)
converges inside the unit sphere.

10.2 Conical Functions 

Conical functions $P^\mu_{-1/2+i\tau}(x)$ and $Q^\mu_{-1/2+i\tau}(x)$ appear
in a large number of applications in engineering,
applied physics, cosmology, and quantum physics.
They occur in boundary-value problems involving configurations
of a conical shape. They are also the kernel
of the Mehler-Fock transform, which has numerous
applications. The functions were introduced by Gustav
Ferdinand Mehler in 1868, when he was working with
series that express the distance of a point on the axis
of a cone to a point located on the surface of the cone.
The integral representation

$$P^\mu_{-1/2+i\tau}(\cos\theta) = \sqrt{\frac{2}{\pi}}\, \frac{(\sin\theta)^\mu}{\Gamma(\tfrac12 - \mu)} \int_0^\theta \frac{\cosh(\tau t)\, dt}{(\cos t - \cos\theta)^{\mu + 1/2}},$$

with real $\tau$, $0 < \theta < \pi$, and $\operatorname{Re} \mu < \tfrac12$, shows that this
function is real for these values of the parameters.

The specific focus in physics is on the case when $\mu = m$,
an integer. When this is the case, we can use

$$P^m_{-1/2+i\tau}(x) = (\tfrac12 + i\tau)_m\, (\tfrac12 - i\tau)_m\, P^{-m}_{-1/2+i\tau}(x)$$

to obtain representations for all $m$.

10.3 Toroidal Harmonics 

Toroidal harmonics are used to solve the potential
problem in a region bounded by a torus. The toroidal
coordinates $(\xi, \eta, \phi)$, with $0 \le \xi < \infty$, $-\pi < \eta, \phi \le \pi$,
are related to the rectangular coordinates $x$, $y$, $z$ (or
$r$, $\phi$, $z$ in cylindrical coordinates) through $x = r\cos\phi$,
$y = r\sin\phi$,

$$r = \frac{c \sinh\xi}{\cosh\xi - \cos\eta}, \qquad z = \frac{r \sin\eta}{\sinh\xi},$$

where $c$ is a scale factor. The coordinates are chosen so
that $\xi = \xi_0$ represents the toroidal surface.

The general solution of $\Delta\Psi = 0$ in toroidal coordinates
can be written in the form

$$\Psi(\xi, \eta, \phi) = \sqrt{\cosh\xi - \cos\eta}\, \sum_{n,m=0}^{\infty} \cos n(\eta - \eta_{mn})\, \cos m(\phi - \phi_{mn}) \times \bigl(A_{mn} P^m_{n-1/2}(\cosh\xi) + B_{mn} Q^m_{n-1/2}(\cosh\xi)\bigr),$$

where $A_{mn}$, $B_{mn}$, $\eta_{mn}$, and $\phi_{mn}$ have to be determined
from the boundary condition.

11 Functions from Other Boundary-Value Problems

The special functions mentioned in sections 5-10 are 
all of hypergeometric type. All these functions were 
known by 1850, usually as a result of the interaction 
between mathematicians and physicists. In 1868, Math- 
ieu used a different curvilinear coordinate system from 
those used up to that point when he considered ellip- 
tic cylinder coordinates. Other, more general systems 
(such as oblate and prolate spheroidal and ellipsoidal 
systems) were introduced soon after. 

Actually, there are eleven three-dimensional coordi- 
nate systems in which the time-independent wave equa- 
tion, Av + k 2 v = 0, is separable. The Laplace equation, 
Av = 0, is separable in two more systems (the bipolar 
and bispherical systems). 

The systems introduced since the time of Mathieu 
can be solved in terms of special functions, but these 
functions are not of hypergeometric type. 

Mathieu's differential equation [III.21] is

$$w'' + (\lambda - 2h^2 \cos 2x)\, w = 0.$$

The substitution $t = \cos x$ transforms this equation
into an algebraic form that has two regular singularities
at $t = \pm 1$ and one irregular singularity at infinity. This
implies that, in general, the solutions of Mathieu's equation
cannot be expressed in terms of hypergeometric
functions.

Another special feature is that, in general, no explicit 
representations of solutions of Mathieu’s equation 
(in the form of integrals, say) are known. Nor can 
explicit power-series expansions or Fourier expansions 
be derived: the coefficients are solutions of three-term 
recurrence relations, and with a given value of h, con- 
vergent expansions can be obtained for an infinite num- 
ber of eigenvalues A. These features also apply for other 
differential equations that follow from the boundary-
value problems with solutions that are beyond the class 
of hypergeometric functions. 


12 Painlevé Transcendents


For linear second-order differential equations the loca- 
tion and nature of the singularities of the equation 
can be investigated easily; we need only know the 
coefficients in the differential equations. 

For nonlinear equations this is not the case, as we see
from the simple example $y' = 1 + y^2$. No singularities
occur in the coefficients of this equation. The solution
$y = \tan(x - x_0)$, however, has poles galore. In the example
$y' = -y^2$, we have the solution $y = 1/(x - x_0)$,
where $x_0$ is again a free constant.

The properties of these nonlinear equations are also
found in Painlevé equations [III.24], a topic that has
attracted many researchers in recent decades. The solutions
of the simple equations just mentioned are analytic
except for poles. In the context of Painlevé equations,
the poles of $y = \tan(x - x_0)$ and $y = 1/(x - x_0)$
are called movable poles: their locations change according
to the initial values. Types of singularities other
than poles, such as branch points and logarithmic
singularities, may occur in other examples.
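The dependence of the pole position on the initial value can be made concrete for the two simple examples. In the sketch below the initial condition is imposed at $x = 0$ (an assumption made for illustration), so that $y(0) = y_0$ gives $x_0 = -\arctan y_0$ for $y' = 1 + y^2$ and $x_0 = -1/y_0$ for $y' = -y^2$. A classical fourth-order Runge-Kutta integration, stopped short of the pole, is compared with the closed-form solutions.

```python
import math

def rk4(f, y0, x_end, steps=2000):
    """Integrate the autonomous equation y' = f(y) from x = 0 to x_end."""
    h = x_end / steps
    y = y0
    for _ in range(steps):
        k1 = f(y)
        k2 = f(y + 0.5 * h * k1)
        k3 = f(y + 0.5 * h * k2)
        k4 = f(y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return y

def tan_pole(y0):
    """First pole to the right of x = 0 for y' = 1 + y^2, y(0) = y0."""
    return math.pi / 2 - math.atan(y0)

def reciprocal_pole(y0):
    """Pole x0 of y = 1/(x - x0), the solution of y' = -y^2 with y(0) = y0 != 0."""
    return -1.0 / y0

for y0 in (0.0, 1.0, 5.0):
    print(f"y(0)={y0}: pole of tan solution at x={tan_pole(y0):.6f}")

y_num = rk4(lambda y: 1 + y * y, 1.0, 0.5)
y_exact = math.tan(0.5 + math.atan(1.0))      # tan(x - x0) with x0 = -atan(y0)
print(y_num, y_exact)

w_num = rk4(lambda y: -y * y, -2.0, 0.4)
w_exact = 1.0 / (0.4 - reciprocal_pole(-2.0))  # pole at x0 = 0.5
print(w_num, w_exact)
```

Changing `y0` moves the pole, while the equations themselves have constant coefficients; this is the sense in which the poles are movable.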

In 1900, Painlevé found the six equations now named
after him by classifying second-order ordinary differential
equations in a certain class. Painlevé equations have
the property that all movable singularities in the complex
plane of all solutions are poles, and this property
is called the Painlevé property. The first three equations
are

$$y'' = 6y^2 + x, \qquad y'' = 2y^3 + xy + \alpha,$$

$$y'' = \frac{(y')^2}{y} - \frac{y'}{x} + \frac{1}{x}(\alpha y^2 + \beta) + \gamma y^3 + \frac{\delta}{y},$$

where $\alpha$, $\beta$, $\gamma$, and $\delta$ are constants.

We have already discussed the important role that 
special functions (Bessel, Kummer, and so on) play in 
mathematical physics as solutions of linear differential 
equations. The Painlevé transcendents play an analogous
role for nonlinear ordinary differential equations,
and applications can be found in many areas of physics 
including nonlinear waves, plasma physics, statistical 
mechanics, nonlinear optics, and fibre optics. 


13 Concluding Remarks 

We have given an overview of a selection of the classical
special functions, but there are many other functions
we could have chosen to look at. We mention a
few other important topics.

The theory of Lie groups, and in particular their representation
theory, has shown how special functions
can be interpreted from a completely different point
of view. In the setting of q-functions, difference equations
become a source for special functions. In recent
decades we have seen a boom in q-hypergeometric
function research.

Other areas of active research are the study of Jacobi 
elliptic functions, the study of theta functions, and 
the study of Weierstrass elliptic and related functions. 
Since the turn of the century the relationship between 
the theory of elliptic functions and the theory of ellip- 
tic curves has been extensively explored, and this rela- 
tionship was used by Andrew Wiles to prove Fermat’s 
last theorem. A class of elliptic curves is used in some 
cryptographic applications as well as for integer fac- 
torization. Applications of theta functions are found in 
physics, where they are used as solutions of diffusion 
equations. 

In the further reading list below we mention only a 
few key works. In the NIST Handbook of Mathematical 
Functions, nearly all of the formulas we have discussed 
are given with extensive references to the relevant lit- 
erature, including where to find a proof. Graphs of the 
functions are also shown. 

IV.8 Spectral Theory

E. Brian Davies 


1 Introduction 

Applications of spectral theory arise in a wide range of 
areas in applied mathematics, physics, and engineer- 
ing, but there are also applications to geometry, prob- 
ability, and many other fields. The origins of spectral 
theory can be traced to Laplace at the start of the nine- 
teenth century and, if one includes its connections with 
musical harmonies, back to Pythagoras 2500 years ago. 
Its present-day manifestations cover everything from 
the design and testing of mechanical structures to the 
algorithms that are used in the Google search engine. 
There is an extensive mathematical theory behind the 
subject, and there are computer packages (some highly 
specialized but others, such as MATLAB, of a more gen- 
eral character) that enable scientists and engineers to 
determine the spectral features of the problems that 
they are studying. Spectral theory is often important 
for its own sake, but it can also be used as a way of writ- 
ing the solution of a problem as a linear combination 
of the solutions of a sequence of eigenvalue problems. 

The word eigenvalue is a combination of the German
root “eigen,” meaning characteristic or distinctive, and
the English word “value,” which suggests that one is
seeking a numerical quantity $\lambda$ that is naturally associated
with the application being considered. This number
is often interpreted as the energy or frequency of
some nonzero solution $f$ of an equation of the form

$$Lf = \lambda f. \tag{1}$$

The symbol $L$ refers to the linear operator that describes
the particular problem being considered. The
solution $f$ of the equation is called the eigenfunction
(or eigenvector, depending on the context) corresponding
to the eigenvalue $\lambda$, and it may be a function
of one or more variables. Although much of the
literature concentrates on determining the eigenvalues
of some model, the eigenfunctions carry much more
information.

Depending on the particular evolution equation involved,
if an eigenvalue $\lambda = u + iv$ associated with
some problem is complex, then $u$ is interpreted as the
frequency of oscillation or vibration of the system being
studied. If $v < 0$ then the vibration being considered is
stable and $v$ is interpreted as its rate of decay. If $v > 0$
then the vibration is unstable and the size of $v$ determines
how unstable; unstable vibrations in engineering
structures can lead to catastrophic failure.

We conclude the introduction by mentioning inverse
problems [IV.15], in which one seeks to determine
important features of some problem from measure- 
ments of its associated spectrum. This field has a 
wide variety of important applications, ranging from 
engineering to seismology. 

2 Some Applications 

In this section we describe a few applications of spec- 
tral theory. These applications raise issues of great 
importance in their respective subjects as presently 
practised. Further applications are mentioned at the 
end of the section, but these by no means exhaust the 
possibilities. 

Many physical structures, from violin strings and 
drums to turbines and skyscrapers, vibrate under suit- 
able circumstances, and in general the larger the struc- 
ture, the more slowly it vibrates. It is often important to 
know what the precise frequency of vibration is going 
to be before manufacturing a structure because adjust- 
ments after construction may be very expensive or even 
impossible. Calculating the frequencies of the impor- 
tant modes of vibration is now a well-developed branch 
of engineering. Nevertheless, mistakes are occasionally 
made. 

As an example we mention the Millennium Bridge, 
which crosses the Thames river in London and was 
opened on June 10, 2000. Two days later it was closed, 
for two years, because of an unexpected flaw in its 
design. The problem was that when a large number 
of pedestrians were on the bridge, they experienced a 
lateral wobble, and this wobble caused them to adjust 
their steps in order to stay upright. Unfortunately, this 
caused them to start walking in synchrony, with the 
effect that the lateral vibrations were enhanced until
they became potentially dangerous (to the people on
the bridge rather than to the bridge itself). The modeling
of the vibrations of the bridge had clearly not been ade- 
quate. The problem was eventually resolved by insert- 
ing a number of specially designed viscous dampers 
underneath the structure. In spectral terms the effect 
was to move the eigenvalues further away from the real 
axis so that there was still a damping effect even when 
the oscillations were driven by pedestrian effects. This 
solved the problem, but at a cost of about £5 million. 
The design of aircraft involves similar problems but
in a much more serious form. Since the cost of a mod- 
ern airliner can be over 100 million dollars, it is imper- 
ative that unstable vibrations in the wings and control 
surfaces be anticipated and eliminated before manu- 
facturing starts. Aircraft have to operate over a wide 
range of speeds and altitudes and with different load- 
ings. Because an airliner uses so much fuel, its land- 
ing weight can be a third less than it was when tak- 
ing off. The relevant eigenvalue calculations are time- 
consuming and expensive, and one cannot repeat them 
for every possible set of parameters that might be 
encountered in operation. The best that one can do 
is to try to ensure that all of the relevant eigenvalues 
have large enough negative imaginary parts for a range 
of plausible values of the operational parameters. For 
commercial airliners, going too far in this direction 
reduces the efficiency of the design and hence the 
profitability. 

We turn to applications of spectral theory in chem- 
istry. In the nineteenth century, Joseph von Fraunhofer 
and others discovered that if one heated a chemical ele- 
ment until it glowed and then passed the light emit- 
ted through a spectroscope, one could see a series of 
sharp lines. The frequencies of these lines could be 
measured precisely, and they could be used to identify 
the elements involved. 

The spectroscopic lines for hydrogen have frequencies
of the form

$$c \left(\frac{1}{m^2} - \frac{1}{n^2}\right),$$

where $m$ and $n$ are positive integers, but no reason was
known for the validity of this formula until quantum
theory was invented in 1925/26; we shall refer below to
energies, which are proportional to frequencies in this
context. It was found that the element hydrogen was
described by a differential operator whose (discrete, or
bound state) eigenvalues were given by $\lambda_n = -c/n^2$,
where $c$ is the same constant as above. The spectroscopic
energies are not the eigenvalues of the operator
but differences of eigenvalues, the reason being
that they measure the energy of the photon emitted
by an atom when it makes a transition between one
energy level and another. By the law of conservation
of energy, this equals the difference between the two
relevant energy levels of the hydrogen atom.
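As a small illustration of spectral lines as differences of eigenvalues, the sketch below takes $\lambda_n = -c/n^2$ and forms $\lambda_n - \lambda_m$ for $n > m$. The numerical value used for $c$ (the Rydberg energy, about 13.6 eV) and the choice of electronvolts as the unit are assumptions made here for concreteness; the text itself leaves $c$ unspecified.

```python
RYDBERG_EV = 13.605693  # assumed value of the constant c, in electronvolts

def level(n):
    """Bound-state eigenvalue lambda_n = -c / n^2."""
    return -RYDBERG_EV / n ** 2

def photon_energy(m, n):
    """Energy released in a transition from level n down to level m (n > m)."""
    return level(n) - level(m)   # equals c (1/m^2 - 1/n^2)

# Lines ending on the level m = 2 (the visible hydrogen lines).
for n in range(3, 7):
    print(f"n={n} -> m=2: {photon_energy(2, n):.4f} eV")
```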

The new quantum theory [IV.23] was fully accepted
when the energy levels of helium were calculated
and found to be in agreement with spectroscopic
observations. Since that time, tens of thousands of
spectral lines of elements and chemical compounds
have been calculated and observed. If relativistic effects
are included in the models, the calculations agree with
observation in great detail.

The fact that such calculations are now possible 
depends on the astonishing growth of computer power 
over the last fifty years. In 1970 theoretical chemists 
were often dismissed by “real” chemists, who knew 
that their problems would never be solved by purely 
theoretical methods. Some people persisted in spite of 
this discouragement, and in the end two theoretical 
chemists, Walter Kohn and John Pople, were awarded 
a joint Nobel Prize in 1998 for their work developing 
computational quantum chemistry over three decades. 
The hard grind of these pioneers has now placed them 
at the center of chemistry and molecular biology. 

One of the stranger “applications” of spectral theory 
is the apparent connection between the distribution of 
the zeros of the Riemann zeta function and the dis- 
tribution of the eigenvalues of a large random self- 
adjoint matrix. There is no known rigorous argument 
relating the zeros of the Riemann zeta function to the 
eigenvalues of any self-adjoint matrix or operator, but 
the numerical similarities observed in the two fields 
have led to a number of deep conjectures, some of 
which have been proved rigorously. There is no basis 
for assuming that this line of investigation will lead 
to a proof of the Riemann hypothesis, but anything 
that prompts worthwhile conjectures must be taken 
seriously. 

Finally, we mention that this volume contains a separate
article on random-matrix theory [IV.24], in
which spectral theory is only one of several techniques
involved. We have also avoided any discussion
of spectral issues in fluid mechanics [IV.28] and
solid mechanics [IV.32].

3 The Mathematical Context 

Spectral theory involves finding the spectra of linear 
operators that approximate the behavior of a variety of 
systems; we mention some nonlinear eigenvalue prob- 
lems in the final section. The subject has a body of gen- 
eral theory, some of which is quite difficult, but also a 
large range of techniques that are applicable only in 
particular cases. These are constantly being refined, 
and completely new ways of approaching problems 
also appear at irregular intervals.$^{1}$


1. A variety of recent developments in the subject are described in 
Davies (2007). 
Applications of spectral theory require the introduction
of appropriate mathematical models. The choice of
model always involves a compromise between simplic- 
ity and accuracy. Pure mathematicians are more inter- 
ested in obtaining insights that can be applied to a 
range of similar operators, and they tend to consider 
simple generic models. Obtaining insights has tradi- 
tionally been associated with the proof of theorems, 
but experimental mathematicians also obtain insights 
by testing a wide range of examples using appropriate 
software. The ultimate goal of numerical analysts is to 
produce software that can be used by a wide variety of 
people who have no interest in how it works and who 
may use it for purposes that were never envisaged when 
it was written. On the other hand, applied mathemati- 
cians and physicists may be willing to spend months 
or even years studying particular problems that are of 
crucial importance for their group of researchers. 

3.1 Finite Matrices 

We start by considering the simplest case, in which one
wishes to solve $Af = \lambda f$, where $A$ is a general
$n \times n$ matrix and $f$ is a column vector with length
$n$. The existence of a nonzero solution $f$ is equivalent
to $\lambda$ being a solution of the equation

$$\det(A - \lambda I) = 0.$$

This is a polynomial equation of degree $n$, so it must
have $n$ solutions $\lambda_1, \dots, \lambda_n$ (counted
with multiplicity) by the fundamental theorem of algebra.
The simplest case arises when these solutions are all
different. If this happens, let $v_r$ be an eigenvector
associated with $\lambda_r$ for each $r$. Then
$v_1, \dots, v_n$ is a basis for $\mathbb{C}^n$ and $A$ has
a diagonal matrix with respect to this basis.

In general, no such basis exists. At the other extreme,
the Jordan block

$$\begin{pmatrix} c & 1 & 0 & 0 \\ 0 & c & 1 & 0 \\ 0 & 0 & c & 1 \\ 0 & 0 & 0 & c \end{pmatrix}$$

has only one eigenvalue, namely $c$, and only one
eigenvector, up to scalar multiples. Every finite matrix is
similar to a sum of such Jordan blocks (see Jordan
canonical form [II.22]).

It used to be argued that such pathologies are irrelevant
in the real world; generically, all roots of a polynomial
are different, and hence all eigenvalues of an $n \times n$
matrix are generically distinct. Unfortunately, highly
non-self-adjoint matrices of even moderate size are often
ill-conditioned, in the sense that the attempt to
diagonalize them leads to highly unstable computations, in
spite of the fact that the eigenvalues are all distinct.
Numerical analysts have to take ill-conditioning into
account for all algorithms that might be applied to
non-self-adjoint matrices. One of the ways of doing this
involves pseudospectral theory, a new branch of the subject
that investigates the implications of, and connections
between, different types of ill-conditioning (see Trefethen
and Embree 2005).
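The sensitivity described above can be seen in a small numerical experiment. The following is only a sketch, assuming NumPy is available; the block size `n = 16`, the value `eps = 1e-8`, and the helper name `jordan_block` are illustrative choices, not taken from the text. Changing a single entry of a Jordan block by `eps` moves every eigenvalue a distance `eps**(1/n)` from `c`, which is vastly larger than `eps` itself.

```python
import numpy as np

def jordan_block(c: float, n: int) -> np.ndarray:
    """n x n Jordan block: c on the diagonal, 1 on the superdiagonal."""
    return c * np.eye(n) + np.diag(np.ones(n - 1), k=1)

n, c, eps = 16, 0.0, 1e-8
J = jordan_block(c, n)

# Perturb only the bottom-left entry by eps.
E = np.zeros((n, n))
E[n - 1, 0] = eps

perturbed = np.linalg.eigvals(J + E)

# In exact arithmetic the perturbed eigenvalues solve (lambda - c)**n = eps,
# so they lie on a circle of radius eps**(1/n) centred at c.
predicted_radius = eps ** (1.0 / n)
observed_radius = np.abs(perturbed - c).max()
print(predicted_radius, observed_radius)
```

With these values a perturbation of size $10^{-8}$ displaces the eigenvalues by roughly $0.3$.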

Problems of the above type do not occur for self-adjoint
or normal matrices, i.e., those that satisfy $A = A^*$ or
$AA^* = A^*A$, respectively. In both cases one has

$$\|(\lambda I - A)^{-1}\| = \operatorname{dist}(\lambda, \operatorname{Spec}(A))^{-1}$$

for all $\lambda$ that do not lie in the spectrum
$\operatorname{Spec}(A)$ of $A$, provided $\|\cdot\|$ is the
operator norm, defined below. This is in sharp contrast to
the general case, in which the left-hand side may be vastly
bigger than the right-hand side, even for fairly small and
“reasonable” matrices. In such cases, it has been argued
that the value of determining the spectrum is considerably
reduced.
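The contrast between normal and non-normal matrices can be checked directly. This is a minimal sketch assuming NumPy; the two $2 \times 2$ matrices `S` and `T`, the point `z`, and the helper names are hypothetical examples chosen for illustration.

```python
import numpy as np

def resolvent_norm(A: np.ndarray, z: complex) -> float:
    """Operator (spectral) norm of (zI - A)^{-1}."""
    n = A.shape[0]
    return float(np.linalg.norm(np.linalg.inv(z * np.eye(n) - A), ord=2))

def inverse_distance_to_spectrum(A: np.ndarray, z: complex) -> float:
    """1 / dist(z, Spec(A)), computed from the eigenvalues of A."""
    return float(1.0 / np.abs(np.linalg.eigvals(A) - z).min())

S = np.array([[2.0, 1.0], [1.0, 3.0]])    # symmetric, hence normal
T = np.array([[1.0, 100.0], [0.0, 2.0]])  # non-normal, eigenvalues 1 and 2
z = 1.5 + 0.5j

for name, A in (("normal", S), ("non-normal", T)):
    print(name, resolvent_norm(A, z), inverse_distance_to_spectrum(A, z))
```

For `S` the two numbers agree to rounding error; for `T` the resolvent norm exceeds the inverse distance to the spectrum by more than a factor of one hundred.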

3.2 Formalism 

Most of the operators considered in spectral theory act in
infinite-dimensional vector spaces over the complex number
field. The detailed study of such operators is rather
technical, not because analysts like technicalities but
because spectra can have properties that do not conform to
one’s naive intuitions. One starts with a complex Hilbert
or Banach space [I.2 §19.4] $\mathcal{B}$, with norm
$\|\cdot\|$. A linear operator
$A : \mathcal{B} \to \mathcal{B}$ is said to be bounded if
its operator norm

$$\|A\| = \sup\{\|Af\| : \|f\| \le 1\}$$

is finite. One says that $\lambda \in \mathbb{C}$ lies in
the spectrum $\operatorname{Spec}(A)$ of $A$ if
$\lambda I - A$ does not have a two-sided inverse in the
set of all bounded operators on $\mathcal{B}$. The spectrum
of a bounded linear operator is always closed, bounded, and
nonempty. It contains any eigenvalues that $A$ might have,
but can be larger than the set of all eigenvalues. The
spectral radius of $A$ is defined by

$$\rho(A) = \max\{|\lambda| : \lambda \in \operatorname{Spec}(A)\},$$

and it satisfies

$$\rho(A) = \lim_{n \to \infty} \|A^n\|^{1/n} \le \|A\|.$$
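The limit formula for the spectral radius can be watched converging in finite dimensions. The sketch below assumes NumPy; the matrix `A` and the helper name `gelfand_estimate` are illustrative assumptions. The matrix is chosen so that $\|A\|$ is about 10 while $\rho(A) = 0.5$.

```python
import numpy as np

A = np.array([[0.5, 10.0],
              [0.0, 0.5]])

rho = np.abs(np.linalg.eigvals(A)).max()  # spectral radius
op_norm = np.linalg.norm(A, ord=2)        # operator norm

def gelfand_estimate(A: np.ndarray, n: int) -> float:
    """Return ||A^n||^(1/n), which tends to the spectral radius as n grows."""
    return float(np.linalg.norm(np.linalg.matrix_power(A, n), ord=2) ** (1.0 / n))

for n in (1, 10, 100, 400):
    print(n, gelfand_estimate(A, n))
print("spectral radius:", rho, "operator norm:", op_norm)
```

The estimates decrease slowly toward $\rho(A)$, which illustrates how far the norm of a non-normal matrix can lie above its spectral radius.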

The above definitions are not directly applicable to
differential operators or other unbounded linear operators.
In such cases one considers a linear operator
$A : \mathcal{D} \to \mathcal{B}$, where the domain
$\mathcal{D}$ of $A$ is a norm-dense linear subspace of
$\mathcal{B}$. One needs to assume or prove that $A$ is
closed in the sense that, if
$\lim_{n \to \infty} \|f_n - f\| = 0$ and
$\lim_{n \to \infty} \|Af_n - g\| = 0$, then
$f \in \mathcal{D}$ and $Af = g$. The definition of the
spectrum of an unbounded operator is similar to that in the
bounded case but also involves reference to its domain. The
spectrum of an unbounded linear operator may be empty and
it may equal $\mathbb{C}$, but it is always a closed set.

The spectrum of a linear operator is often hard to
determine. The fact that the spectrum of an arbitrarily
small perturbation of $A$ may differ radically from the
spectrum of $A$ implies that one has to be extremely
cautious about using numerical methods to determine spectra
unless one knows that this extreme sensitivity does not
arise for the operator of interest. A simple example of
this phenomenon is obtained as follows. Given
$s \in \mathbb{R}$, one defines the bounded linear operator
$A_s$ acting on the space $\ell^2(\mathbb{Z})$ by

$$(A_s f)_n = \begin{cases} s f_{n+1} & \text{if } n = 0, \\ f_{n+1} & \text{otherwise.} \end{cases}$$

Some routine calculations establish that
$\operatorname{Spec}(A_s) = \{z : |z| = 1\}$ if $s \neq 0$
but $\operatorname{Spec}(A_0) = \{z : |z| \le 1\}$. Indeed,
every $z$ such that $|z| < 1$ is an eigenvalue of $A_0$, in
the sense that the corresponding eigenvector lies in
$\ell^2(\mathbb{Z})$.
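One of the routine calculations mentioned above can be written out as follows. This is a sketch of the eigenvector computation for the case $s = 0$ and $0 < |z| < 1$ only, read off from the definition of the operator; it does not address the rest of the spectrum.

```latex
% Solving A_0 f = z f on \ell^2(\mathbb{Z}) for 0 < |z| < 1.
\begin{align*}
(A_0 f)_0 = 0 = z f_0
  &\;\Longrightarrow\; f_0 = 0, \\
f_{n+1} = z f_n \quad (n \le -1)
  &\;\Longrightarrow\; f_n = 0 \text{ for all } n \le 0, \\
f_{n+1} = z f_n \quad (n \ge 1)
  &\;\Longrightarrow\; f_n = z^{\,n-1} f_1 \text{ for all } n \ge 1, \\
\|f\|^2 = |f_1|^2 \sum_{n \ge 1} |z|^{2(n-1)}
  &= \frac{|f_1|^2}{1 - |z|^2} < \infty .
\end{align*}
```

The geometric series converges exactly when $|z| < 1$, which is why these points are eigenvalues. For $z = 0$ the sequence that equals 1 at $n = 1$ and 0 elsewhere is an eigenvector.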


3.3 Self-adjoint Operators 

A bounded linear operator $A$ acting on a Hilbert space
$\mathcal{H}$ with inner product
$\langle \cdot, \cdot \rangle$ is said to be self-adjoint if

$$\langle Av, w \rangle = \langle v, Aw \rangle \qquad (2)$$

for all $v, w \in \mathcal{H}$. The definition of self-adjointness
for unbounded operators is more technical. It implies 
that (2) holds for all v, w in the domain of A, but it is 
not implied by that condition, which is called symmetry 
by specialists. The difference between self-adjointness 
and symmetry was emphasized by John von Neumann 
in the early days of quantum mechanics. It is often 
evident that an operator of interest is symmetric, but 
proving that it is self-adjoint can be very hard; indeed, 
much of the literature before 1970 was devoted to this 
question. The spectral theory of self-adjoint operators 
is much more detailed and well understood than that 
of non-self-adjoint operators. 

The centerpiece of the self-adjoint theory is the spectral
theorem, which was proved around 1930 by von Neumann. There
are various statements of this, some easier to understand
than others. Perhaps the simplest is the statement that, if
$H$ is a self-adjoint operator on the abstract Hilbert
space $\mathcal{H}$, then there exists a unitary operator
$U$ from $L^2(X, dx)$ to $\mathcal{H}$ for some measure
space $(X, dx)$ and a function $f : X \to \mathbb{R}$ such
that

$$(U^{-1} H U \phi)(x) = f(x)\phi(x)$$

for all $\phi \in L^2(X, dx)$ and almost every $x \in X$.
Expressed more simply, $H$ is unitarily equivalent to the
multiplication operator associated with $f$. This is an
analogue of the spectral theorem for self-adjoint matrices
because multiplication operators are the
infinite-dimensional analogue of diagonal matrices. The
spectral theorem implies that the spectrum of $H$ is real;
in particular, every eigenvalue of $H$ is real. The
implications of this theorem are wide-ranging: it reduces
proofs of many of the properties of self-adjoint operators
to the status of obvious trivialities, and its absence
makes the spectral theory of non-self-adjoint operators far
less transparent.

Another version of the spectral theorem introduces the
functional calculus formula

$$H = \int_{\mathbb{R}} \lambda \, dP(\lambda),$$

where $P(\lambda)$ is a spectral projection of $H$ for
every choice of $\lambda \in \mathbb{R}$. This leads to the
formula

$$f(H) = \int_{\mathbb{R}} f(\lambda) \, dP(\lambda)$$

for a wide variety of bounded and unbounded functions $f$
provided one masters the technicalities involved. This
version is more attractive, at least to mathematicians,
than the previous one, but it is somewhat less easy to use.

If the spectrum of the operator $H$ consists of a countable
set of eigenvalues $\lambda_n$, where $n \in \mathbb{N}$,
then the spectral theorem may also be written in the form

$$H = \sum_{n=1}^{\infty} \lambda_n P_n.$$

In this formula each $P_n$ is an orthogonal projection
whose range is the subspace consisting of all eigenvectors
associated with the eigenvalue $\lambda_n$. The rank of
$P_n$ equals the multiplicity of the eigenvalue
$\lambda_n$. The projections are orthogonal in the sense
that $P_m P_n = P_n P_m = 0$ if $m \neq n$, and
$\sum_{n=1}^{\infty} P_n = I$.
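For a finite symmetric matrix the decomposition into spectral projections can be verified numerically. The sketch below assumes NumPy; the $3 \times 3$ matrix `H` is a hypothetical example with three distinct positive eigenvalues, so every projection has rank one. It also illustrates the functional calculus by forming a square root of `H` from the same projections.

```python
import numpy as np

H = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eigh(H)

# Rank-one orthogonal projections P_n = v_n v_n^T (the matrix is real symmetric,
# so no complex conjugation is needed).
projections = [np.outer(v, v) for v in eigenvectors.T]

H_rebuilt = sum(lam * P for lam, P in zip(eigenvalues, projections))
identity = sum(projections)

# Functional calculus: f(H) = sum_n f(lambda_n) P_n, here with f = sqrt.
sqrt_H = sum(np.sqrt(lam) * P for lam, P in zip(eigenvalues, projections))

print(np.allclose(H_rebuilt, H), np.allclose(identity, np.eye(3)))
print(np.allclose(sqrt_H @ sqrt_H, H))
```

If an eigenvalue were repeated, the corresponding rank-one projections would have to be added together to obtain the projection onto the whole eigenspace.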

A more general formulation of the spectral theorem 
applies to several commuting self-adjoint operators 
simultaneously. However, there is nothing analogous 
for even two self-adjoint operators that do not com- 
mute. Von Neumann’s theory of operator algebras was 
an attempt to understand the complexities involved in
the study of noncommuting families of operators. 

3.4 Classification of the Spectrum 

In infinite dimensions the spectrum of a self-adjoint 
operator may be divided into parts that have different 
qualitative features. These include the absolutely con- 
tinuous spectrum, the singular continuous spectrum, 
the essential spectrum, the point spectrum, and the dis- 
crete spectrum. The definitions are technical and will be 
avoided here except for the following. One says that $\lambda$
lies in the discrete spectrum of a self-adjoint operator 
H if it is an isolated eigenvalue with finite multiplicity. 
The rest of the spectrum is called the essential spec- 
trum. For many operators the essential spectrum coin- 
cides with the continuous spectrum, but this is not true 
for the Anderson model, which is discussed below. 

For non-self-adjoint operators the situation is even 
more complicated. Spectral theorists are aware that 
there are several distinct definitions of the essential 
spectrum, but many have contented themselves with 
using only one. This suffices for many purposes, but in 
some situations the others are of real importance. 

Let $H$ be a self-adjoint differential operator acting in
$L^2(\mathbb{R}^N)$. For many but not all such operators,
one can divide classical solutions of the differential
equation $Hg = \lambda g$ into three types. For some
$\lambda$, $g$ may decay rapidly at infinity and $\lambda$
then lies in the discrete spectrum of $H$. For other
$\lambda$ this may be false, but $g$ is bounded at infinity
and $\lambda$ lies in the essential spectrum of $H$. If
neither case pertains for any solution $g$ of
$Hg = \lambda g$, then $\lambda$ does not lie in the
spectrum of $H$. The discrete spectrum of $H$ is a finite
or countable set, while the essential spectrum is typically
an infinite interval or a union of several disjoint
intervals. In many cases the essential spectrum can be
determined in closed form by using perturbation arguments,
but the analysis of the discrete spectrum is almost always
harder, whether one approaches this task from the
theoretical end or the computational one.

A further issue is that the spectrum of an operator can
depend on the Banach space in which it operates. For
example, the spectrum of the Laplace-Beltrami operator on
the hyperbolic space $\mathbb{H}^n$ of dimension $n \ge 2$
depends in an essential way on $p$ when considered as
acting in $L^p(\mathbb{H}^n)$. This phenomenon is much less
common for differential operators acting in L 2 


3.5 The Variational Approach 

One can obtain valuable upper and lower bounds on the
eigenvalues of many self-adjoint operators by using
variational methods, which go back to Rayleigh and Ritz in
the early years of the twentieth century (see variational
principle [II.35]). The simplest context assumes that $H$
has a complete orthonormal sequence of eigenvectors
$\phi_n$, where $n \in \mathbb{N}$, that the corresponding
eigenvalues satisfy $\lambda_n \le \lambda_{n+1}$ for all
$n$, and that $\lim_{n \to \infty} \lambda_n = +\infty$.

The variational formula involves the set $\mathcal{L}$ of
all finite-dimensional subspaces $L$ of the domain of $H$.
It is not directly useful, but it leads to rigorous
spectral inequalities that have great value. Given
$L \in \mathcal{L}$, one first defines

$$\lambda(L) = \sup\{\langle Hf, f \rangle : f \in L \text{ and } \|f\| = 1\}.$$

The variational formula is then

$$\lambda_n = \inf\{\lambda(L) : \dim(L) = n\}.$$

Applications of the variational formula depend on comparing
the eigenvalues of two self-adjoint operators. If $H_1$ and
$H_2$ have the same domain, then one writes $H_1 \le H_2$
if $\langle H_1 f, f \rangle \le \langle H_2 f, f \rangle$
for all $f$ in the common domain. Many important
applications require one to define the notion
$H_1 \le H_2$ even when they do not have the same domain,
and this can be done by using the theory of quadratic
forms.

It follows immediately from the variational formula for the
eigenvalues that $H_1 \le H_2$ implies that
$\lambda_n^{(1)} \le \lambda_n^{(2)}$ for all $n$, using an
obvious notation. If the eigenvalues of $H_1$ or $H_2$ are
already known, this leads to rigorous upper or lower bounds
on the eigenvalues of the other operator.
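Both the variational formula and the resulting eigenvalue comparison can be tested on symmetric matrices. The following is a sketch assuming NumPy; the random matrices, the seed, and the helper name `lam_of_subspace` are illustrative assumptions. Indices are zero-based in the code, so `lam1[k - 1]` plays the role of the $k$th eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3

M = rng.standard_normal((n, n))
H1 = (M + M.T) / 2          # a symmetric matrix
C = rng.standard_normal((n, n))
H2 = H1 + C @ C.T           # H1 <= H2, since C C^T is positive semidefinite

lam1 = np.linalg.eigvalsh(H1)  # eigenvalues in increasing order
lam2 = np.linalg.eigvalsh(H2)

def lam_of_subspace(H: np.ndarray, basis: np.ndarray) -> float:
    """sup of <Hf, f> over unit vectors f in the column span of `basis`."""
    Q, _ = np.linalg.qr(basis)  # orthonormal basis of the subspace
    return float(np.linalg.eigvalsh(Q.T @ H @ Q).max())

random_L = rng.standard_normal((n, k))   # an arbitrary k-dimensional subspace
_, vecs = np.linalg.eigh(H1)
optimal_L = vecs[:, :k]                  # span of the first k eigenvectors

print(np.all(lam1 <= lam2))
print(lam_of_subspace(H1, random_L), ">=", lam1[k - 1])
print(lam_of_subspace(H1, optimal_L), "==", lam1[k - 1])
```

Any trial subspace gives an upper bound on the $k$th eigenvalue, and the infimum is attained on the span of the first $k$ eigenvectors.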

This idea can be recast in several ways, and it allows one
to obtain rigorous numerical upper and lower bounds on the
eigenvalues of some types of self-adjoint operator. It also
has the following theoretical consequence. Let $\Omega_r$,
$r = 1, 2$, be two bounded regions in $\mathbb{R}^N$ such
that $\Omega_2 \subseteq \Omega_1$, and let
$H_r = -\Delta$ act in $L^2(\Omega_r)$ subject to Dirichlet
boundary conditions. The eigenvalues of the two operators
then satisfy $\lambda_n^{(1)} \le \lambda_n^{(2)}$ for all
$n$, as above. This is called domain monotonicity and is
one ingredient in the proof of Weyl’s law, which dates from
1913 and states the following.

Let $H$ be the Laplacian (with the sign convention
$H = -\Delta$ used above) acting in $L^2(\Omega)$ subject
to Dirichlet boundary conditions, where $\Omega$ is any
bounded region in $\mathbb{R}^N$. Let

$$N_H(s) = \#\{\lambda \in \operatorname{Spec}(H) : \lambda \le s\},$$

where each eigenvalue is counted according to its
multiplicity. Then

$$N_H(s) = c_N s^{N/2} |\Omega| + o(s^{N/2})$$

as $s \to \infty$, where

$$c_N = (4\pi)^{-N/2} \, \Gamma(N/2 + 1)^{-1},$$

and $\Gamma$ is the Gamma function.
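Weyl’s law can be checked on the unit square, where the Dirichlet eigenvalues of $-\Delta$ are known in closed form to be $\pi^2(j^2 + k^2)$ for positive integers $j, k$. The sketch below assumes NumPy; the choice of domain, the value `s = 1e5`, and the function names are illustrative assumptions.

```python
import math
import numpy as np

def weyl_constant(N: int) -> float:
    """c_N = (4 pi)^(-N/2) / Gamma(N/2 + 1)."""
    return (4 * math.pi) ** (-N / 2) / math.gamma(N / 2 + 1)

def count_unit_square(s: float) -> int:
    """Number of Dirichlet eigenvalues pi^2 (j^2 + k^2) <= s on the unit square."""
    m = int(math.sqrt(s) / math.pi) + 1   # no index larger than this can contribute
    j = np.arange(1, m + 1)
    J, K = np.meshgrid(j, j)
    return int(np.count_nonzero(math.pi**2 * (J**2 + K**2) <= s))

s = 1e5
leading_term = weyl_constant(2) * s * 1.0   # the unit square has area 1
print(count_unit_square(s), leading_term, count_unit_square(s) / leading_term)
```

The ratio is slightly below 1 and approaches 1 as `s` grows; the shortfall reflects the next term in the asymptotic expansion.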

A rigorous analysis of the next term in the asymptotic
expansion of $N_H(s)$ was given by Victor Ivrii and
Richard Melrose in the above context, but the general 
solution of this problem for a much wider class of oper- 
ators was obtained only in the 1980s by Safarov and 
Vassiliev (1996). 

Unfortunately, the domain monotonicity mentioned above
holds only for Dirichlet boundary conditions, and the Weyl
law can be false for $K = -\Delta$ acting in $L^2(\Omega)$
subject to Neumann or other boundary conditions unless one
assumes that the boundary $\partial\Omega$ has some
regularity properties. Assuming always that $\Omega$ is
bounded, if $\partial\Omega$ is Lipschitz continuous, then
$K$ has discrete spectrum and the asymptotic eigenvalue
distribution of $K$ follows the same Weyl law as it does
for Dirichlet boundary conditions. If $\partial\Omega$ is
Hölder continuous, then $K$ has discrete spectrum, but its
spectral asymptotics may be non-Weyl. If one makes no
assumptions on $\partial\Omega$, then $K$ need not have
discrete spectrum; indeed, the spectrum of $K$ may equal
$[0, \infty)$.

We finally comment that there is no obvious analogue 
of the variational method for non-self-adjoint opera- 
tors. When self-adjoint theorems have non-self-adjoint 
analogues, this is often because one can use analytic 
continuation arguments or some other aspect of ana- 
lytic function theory. However, the most interesting 
aspects of non-self-adjoint spectral theory are those 
that have no self-adjoint analogues. 

3.6 Evolution Equations 

Applied mathematics and physics yield many examples 
of systems that evolve over time according to one of the 
following equations. 

• The wave equation: $d^2 f/dt^2 = Lf$.

• The evolution equation: $df/dt = Lf$, which is also
called the heat equation when appropriate.

• The Schrödinger equation: $df/dt = -iLf$, where $i$
is the square root of $-1$.

In these equations $L$ is commonly a self-adjoint operator,
but non-self-adjoint applications are of increasing
interest and involve radically new ideas. In many
applications $f$ is a function on some region $U$ in
$\mathbb{R}^N$, or possibly on a Riemannian manifold, and
$L$ is a partial differential operator. Solving these
evolution equations starts with specifying the precise
class of functions that is to be considered. If $U$ has a
boundary, then all admissible $f$ must satisfy certain
boundary conditions, which are intrinsic to the model. The
same evolution equation has entirely different solutions if
one changes the boundary conditions. Once one has specified
the problem in sufficient technical detail, one may look
for solutions that vary very simply in time. For example,
solutions of the wave equation that are of the form
$f(t) = e^{ikt} g$, where $g = f(0)$, correspond to
solutions of the eigenvalue problem $Lg = \lambda g$, where
$\lambda = -k^2$.
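Once the eigenvalues and eigenvectors of $L$ are known, each of the evolution equations above can be solved mode by mode. The sketch below assumes NumPy and takes $L$ to be a finite-difference approximation of $d^2/dx^2$ on $(0, \pi)$ with Dirichlet boundary conditions; the grid size, the initial data, and the function names are illustrative assumptions. The wave solution shown is the one with zero initial velocity.

```python
import numpy as np

n = 100
h = np.pi / (n + 1)
x = h * np.arange(1, n + 1)   # interior grid points of (0, pi)

# Second-difference approximation of d^2/dx^2 with Dirichlet boundary conditions.
L = (np.diag(-2.0 * np.ones(n))
     + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / h**2

lam, phi = np.linalg.eigh(L)  # all eigenvalues are negative

def heat_solution(f0: np.ndarray, t: float) -> np.ndarray:
    """Solve df/dt = Lf: each eigen-coefficient is multiplied by exp(lambda t)."""
    return phi @ (np.exp(lam * t) * (phi.T @ f0))

def wave_solution(f0: np.ndarray, t: float) -> np.ndarray:
    """Solve d^2f/dt^2 = Lf with zero initial velocity, using lambda = -k^2."""
    k = np.sqrt(-lam)
    return phi @ (np.cos(k * t) * (phi.T @ f0))

f0 = np.sin(x)  # an eigenvector of the discrete operator, eigenvalue close to -1
print(lam[-1], np.linalg.norm(heat_solution(f0, 1.0)) / np.linalg.norm(f0))
```

Changing the boundary conditions changes `L`, and hence its eigenvalues and all of the solutions built from them.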

We finally mention that the abstract study of evolu- 
tion equations has led to a well-developed theory of 
one-parameter semigroups. 

4 Schrödinger Operators

Spectral theory can help one to understand differen- 
tial operators of any order, whether or not they are 
self-adjoint. These operators can act on $L^2(U)$, where
U is a region in Euclidean space or in a manifold, if 
the manifold is provided with a measure. This section 
focuses on one class of differential operators, which 
have been studied with great intensity because of their 
applications to quantum theory. 

4.1 Spectral Theory of Schrödinger Operators

Schrödinger operators are self-adjoint second-order
partial differential operators. They play a fundamen- 
tal role in describing the properties of elementary par- 
ticles, atoms, and molecules as well as in describing 
the collisions between particles that occur continually 
inside fluids. There is a separate article in this vol- 
ume on the underlying physics (see quantum theory 
[IV.23]), and we restrict attention here to the quali- 
tative properties of a few simple examples, in which 
Planck’s constant h and the masses, charges, and spins 
of the particles do not appear. If relativistic effects are 
important, one needs to use the Dirac operator, whose 
spectral properties are quite different. 

Quantum theory has a special vocabulary because 
of its history. For example, eigenvectors in the rele- 
vant Hilbert space are called bound states and eigen- 
values are often called energy levels, but this is not a 
substantial problem, and we shall use both languages. 
At the above level of simplification a Schrödinger
operator is a differential operator of the form

$$(Hf)(x) = -\tfrac{1}{2} \Delta f(x) + V(x) f(x)$$

acting on functions $f \in L^2(\mathbb{R}^N)$, where
$\Delta$ is the Laplace operator. The function $V$ is
called the potential. For quantum particles moving in three
dimensions, one puts $N = 3$, while the case $N = 2$
describes particles moving on a flat interface between two
media. Both of these are subjects of great current interest
in connection with electronic components and computers.

The spectral properties of H depend heavily on the 
choice of the potential V. Much, but by no means all, of 
the analysis to date has focused on one of the following 
classes of potential. (All of the results below, and many 
others that we cannot state here, have rigorous proofs
under suitable technical assumptions.) 

If $V(x) \to +\infty$ as $|x| \to \infty$, then $H$ has
discrete spectrum. In other words, there exists a complete
orthonormal sequence of eigenfunctions $\phi_n$ of $H$
whose corresponding eigenvalues $\lambda_n$ are
monotonically increasing with
$\lim_{n \to \infty} \lambda_n = +\infty$. The smallest
eigenvalue $\lambda_1$ has multiplicity 1 and the
corresponding eigenfunction $\phi_1$ is strictly positive,
after multiplying by a suitable constant.
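A confining potential can be explored numerically in one dimension. The sketch below assumes NumPy and uses the harmonic potential $V(x) = x^2/2$, which is an illustrative choice not made in the text; for it the eigenvalues of $-\tfrac{1}{2} d^2/dx^2 + V$ are known to be $\tfrac12, \tfrac32, \tfrac52, \dots$. The interval $[-10, 10]$ and the grid size are further assumptions of the discretization.

```python
import numpy as np

n, half_width = 1000, 10.0
x = np.linspace(-half_width, half_width, n)
h = x[1] - x[0]

# -(1/2) d^2/dx^2 by second differences: (f_i - f_{i+1}/2 - f_{i-1}/2) / h^2.
kinetic = (np.diag(np.ones(n))
           - 0.5 * np.diag(np.ones(n - 1), 1)
           - 0.5 * np.diag(np.ones(n - 1), -1)) / h**2
potential = np.diag(0.5 * x**2)   # V(x) = x^2 / 2 tends to +infinity
H = kinetic + potential

eigenvalues, eigenvectors = np.linalg.eigh(H)

# Fix the overall sign of the ground state so that it is positive near x = 0.
ground_state = eigenvectors[:, 0]
ground_state = ground_state * np.sign(ground_state[n // 2])

print(eigenvalues[:4])
```

The lowest computed eigenvalues are close to the exact values, and the ground state has a single sign wherever it is not negligibly small, in line with the statement above.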

If $V(x) \to 0$ as $|x| \to \infty$, then the spectrum is
the union of $[0, \infty)$ and a (possibly empty) sequence
of negative eigenvalues $\lambda_n$ that can be written in
increasing order. If there are infinitely many eigenvalues,
then they must converge to 0 as $n \to \infty$. If there
exist positive constants $c$ and $\varepsilon$ such that
$|V(x)| \le c/|x|^{2+\varepsilon}$ for all large enough
$x$, then there can only be a finite number of negative
eigenvalues.

One says that $H$ is periodic if there exists a discrete
group $G$ of translations acting on $\mathbb{R}^N$ such
that $V(x + y) = V(x)$ for all $x \in \mathbb{R}^N$ and all
$y \in G$. It may be proved that $\operatorname{Spec}(H)$
is the union of a finite or infinite sequence of intervals
$[a_n, b_n]$ called bands that are separated by gaps
$(b_n, a_{n+1})$. One may label the bands and gaps so that
$a_1 < b_1 < a_2 < b_2 < a_3 < \cdots$.

If $N = 1$ then generically there are infinitely many gaps,
and the conditions on $V$ under which there are only
finitely many have been studied in great detail. In higher
dimensions it is known that there are only finitely many
gaps.

In the Anderson model one studies Schrödinger operators for
which the potential $V$ is random in the sense that it is
any potential chosen from a precisely defined class, which
is provided with a probability measure. The class and the
probability measure are assumed to be invariant under a
discrete group of translations of $\mathbb{R}^N$, but
individual potentials are not. An ergodicity assumption
implies that the spectrum of $H$ almost surely does not
depend on the choice of $V$ within the class. The spectrum
almost surely does not contain any isolated eigenvalues.
However, its detailed structure depends on the dimension
$N$ and is not fully understood at a rigorous level, in
spite of very substantial progress for $N = 1$.

A simple waveguide is obtained by considering a Schrödinger
operator on $\mathbb{R} \times U$, where $U$ is a bounded
set in $\mathbb{R}^{N-1}$; most of the publications on this
problem assume that $N = 2$ or $N = 3$. This model is
called a quantum waveguide if one imposes Dirichlet
boundary conditions on $\mathbb{R} \times \partial U$;
there is also a substantial literature on similar operators
subject to Neumann boundary conditions because of their
applications in fluid mechanics. The two ends of the
waveguide may point in different asymptotic directions and
the shape of the waveguide may vary substantially in a
bounded region within $\mathbb{R}^N$. However, it is
usually assumed that the potential and the cross section of
the waveguide are asymptotically constant far enough away
from the origin. Standard methods in spectral and
scattering theory allow one to determine the continuous or
essential spectrum of such operators, but there may also be
eigenvalues. The dependence of the eigenvalues on the
geometry of the waveguide has been studied in some detail
because of its potential applications to quantum devices.

All of the above problems have discrete analogues, but the
case $N = 1$ has had the most attention. One replaces
$L^2(\mathbb{R})$ by $\ell^2(\mathbb{Z})$ and considers
discrete Schrödinger operators of the form

$$(Hf)_n = f_{n-1} + V_n f_n + f_{n+1}.$$

These operators are simpler to analyze than the usual type
of Schrödinger operators because they are usually bounded
and one may often base proofs on inductive arguments that
are not possible in the continuous context or in higher
dimensions.
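A discrete Schrödinger operator restricted to finitely many sites is a tridiagonal matrix, so its eigenvalues are easy to compute. The sketch below assumes NumPy; the truncation to `n = 200` sites, the single-site potential of depth 3, and the function name are illustrative assumptions. For a potential equal to $v$ at one site and 0 elsewhere on $\ell^2(\mathbb{Z})$, substituting $f_n = r^{|n|}$ into the eigenvalue equation gives one eigenvalue $-\sqrt{v^2 + 4}$ when $v < 0$, which the truncation reproduces closely.

```python
import numpy as np

def discrete_schrodinger(V: np.ndarray) -> np.ndarray:
    """Matrix of (Hf)_n = f_{n-1} + V_n f_n + f_{n+1} on a finite block of sites."""
    n = len(V)
    return np.diag(V) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)

n = 200

# Zero potential: the eigenvalues of the truncation fill out the interval (-2, 2).
free = np.linalg.eigvalsh(discrete_schrodinger(np.zeros(n)))

# A single negative site adds one eigenvalue below -2.
well = np.zeros(n)
well[n // 2] = -3.0
with_well = np.linalg.eigvalsh(discrete_schrodinger(well))

print(free.min(), free.max())
print(with_well[0], -np.sqrt(13.0))
```

The remaining eigenvalues of the second matrix stay inside the band, so the lowest one is isolated from the rest.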

4.2 Scattering Theory 

Scattering theory is a difficult subject to explain at an
elementary level, but it has important spectral
implications. The subject may be studied at an abstract
operator-theoretic level, using trace class operators or
other technology, but this section will describe the theory
only for the time-dependent Schrödinger equation under very
standard conditions. The idea is to compare the evolution
of a system with respect to two different Schrödinger
operators, $H_0 = -\tfrac{1}{2}\Delta$ and
$H = -\tfrac{1}{2}\Delta + V$, assuming that the potential
$V(x)$ decreases rapidly enough as $|x| \to \infty$.$^{2}$
If $V$ is central (rotationally invariant), it is possible
to obtain a very explicit analysis by decomposing the
Hilbert space $L^2(\mathbb{R}^N)$ with respect to the
irreducible representations of the rotation group. Current
research focuses on more general potentials and multibody
scattering. There is also a well-developed scattering
theory for the wave equation.

The above assumptions imply that the spectrum of $H_0$
equals $[0, \infty)$ while the spectrum of $H$ equals
$[0, \infty)$ together with a finite or infinite set of
negative eigenvalues. Solving the Schrödinger equation
$i f'(t) = Hf(t)$ leads to the unitary operators
$e^{-iHt}$ through

$$f(t) = e^{-iHt} f(0).$$

If there exist positive constants $c$, $\varepsilon$ such
that $|V(x)| \le c/|x|^{1+\varepsilon}$ for all large
enough $x$, then it may be shown that the wave operators
$W_\pm$, defined by

$$W_\pm f = \lim_{t \to \pm\infty} e^{-iHt} e^{iH_0 t} f,$$

exist for all $f \in L^2(\mathbb{R}^N)$. The wave operators
are isometries mapping $L^2(\mathbb{R}^N)$ one-one onto the
linear subspace $\mathcal{L}$ of $L^2(\mathbb{R}^N)$ that
consists of all $f$ that are orthogonal to every
eigenvector of $H$. The wave operators $W_\pm$ intertwine
$H$ and $H_0$ in the sense that $HW_\pm = W_\pm H_0$, and
this implies that the nonnegative spectra of $H$ and $H_0$
are identical in a very strong sense. It may be shown that
the scattering operator $S$ defined by

$$S = W_-^* W_+ = \lim_{t \to +\infty} e^{iH_0 t} e^{-2iHt} e^{iH_0 t}$$

is a unitary operator that commutes with $H_0$. By using
Fourier transforms, one may now analyze $S$ in great
detail, and in simple cases this yields a complete analysis
of $H$ in the sense of the spectral theorem for
self-adjoint operators.

This section has not considered the complete analysis
of short-range $N$-body scattering by Sigal and Soffer,
with important simplifications by Graf and others.
This outstanding result was achieved in 1987, but even 
describing the result informally would take far more 
space than is available here. 

4.3 Resonances 

In this section we consider only Schrödinger operators $H$,
although it will be clear that some aspects of the theory
can be developed at a greater level of generality. The
theory is based on the discovery that certain classes of
Schrödinger operators have pseudo-eigenvalues, usually
called resonances, with nonzero imaginary parts; resonances
are supposed to describe unstable states and their
imaginary parts determine the decay rates of the
corresponding states. However, every Schrödinger operator
is self-adjoint, so its spectrum is necessarily real and
one seems to have a contradiction. In fact, a resonance
$\lambda$ is associated with a “resonance eigenfunction”
according to the equation $Hf = \lambda f$, but the
function $f$ satisfies a condition at infinity that implies
that it does not lie in the Hilbert space $\mathcal{H}$.
Because of this, a number of mathematicians still feel
dissatisfied with the foundational aspects of resonance
theory. Nevertheless, the calculated resonances correspond
to quantities that can be measured experimentally, and some
comments are called for. The most important is that any
definition of resonance must depend not only on the
operator of interest but also on some further background
data.

2. For a much more systematic account, see Yafaev (2010).

Let $H$ be a Schrödinger operator acting on
$L^2(\mathbb{R}^N)$ and suppose that the potential $V$
decays at infinity rapidly enough. The resolvent operators
$(zI - H)^{-1}$ are then integral operators for all
$z \notin \mathbb{R}$. In other words,

$$((zI - H)^{-1} f)(x) = \int_{\mathbb{R}^N} G_z(x, y) f(y) \, d^N y$$

for all $z \notin \mathbb{R}$ and all sufficiently
well-behaved $f \in L^2(\mathbb{R}^N)$; $G$ is called the
Green function for the resolvent operator. The resolvent
operators are norm-analytic functions of $z$ and the Green
functions $G_z(x, y)$ are pointwise analytic functions of
$z$ for every $x, y$ (excluding the case $x = y$, for which
the Green function is infinite unless $N = 1$).

Under certain reasonable conditions one can prove 
that the Green function (but not the associated resol- 
vent operator) can be analytically continued through 
the real axis into a region in the lower half of the com- 
plex plane called the “unphysical sheet.” In this region 
it may have isolated poles, and by definition these are 
the resonances of the operator. It can be proved that 
the positions of the poles do not depend on the choice 
of x, y and that they coincide with the poles of ana- 
lytic functions that are associated with the scattering 
operator for H. 

5 Calculating Eigenvalues 

5.1 Exactly Soluble Problems 

Much of the nineteenth-century research in spectral 
theory was devoted to finding problems that could be 
solved in closed form, often by using a variety of special
functions. In many cases we can now see that the
ultimate explanation for this was the presence of some 
group symmetry. Some spectral problems for ordi- 
nary differential equations are soluble using orthog- 
onal polynomials, but the theory of orthogonal poly- 
nomials has now grown far beyond the confines of 
spectral theory or differential equations. 

A typical exactly soluble example is provided by circulant
matrices. These are $n \times n$ matrices $A$ whose entries
obey

$$A_{r,s} = \begin{cases} a_{r-s} & \text{if } r \ge s, \\ a_{r-s+n} & \text{if } r < s, \end{cases}$$

for some coefficients $a_r$. The eigenvalues of $A$ may be
computed by using the finite Fourier transform to
diagonalize $A$. They are the numbers

$$\lambda_r = \sum_{s=1}^{n} a_s e^{2\pi i r s/n},$$

where $1 \le r \le n$ (the index of $a$ being read modulo
$n$, so that $a_n = a_0$). The obvious infinite
generalization of this is the operator $Af = a * f$ acting
on $L^2(\mathbb{R})$, where $*$ denotes convolution,
defined by

$$(Af)(x) = \int_{-\infty}^{\infty} a(x - y) f(y) \, dy.$$

Every convolution operator commutes with the group of all
translations $(T_s f)(x) = f(x + s)$. $A$ may be
diagonalized by using the Fourier transform, the spectrum
of $A$ being 0 together with the set of all

$$\lambda_k = \int_{-\infty}^{\infty} a(x) e^{-ixk} \, dx,$$

where $k \in \mathbb{R}$.

Convolution operators lie at the heart of problems 
involving digital signal processing and image enhancement.
The fast Fourier transform [II.10] algorithm
of Cooley and Tukey, which was actually first described 
by Gauss in 1805, was seminal in enabling computa- 
tions to be carried out. 
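The diagonalization of a circulant matrix by the finite Fourier transform can be confirmed with a fast Fourier transform routine. The sketch below assumes NumPy and uses zero-based indices, so the coefficients are `a[0], ..., a[n-1]` and the entries are `a[(r - s) % n]`; the coefficient values and the function name `circulant` are illustrative. NumPy's `fft` uses the exponent $-2\pi i rs/n$, which yields the same set of eigenvalues as the opposite sign, listed in a different order.

```python
import numpy as np

def circulant(a: np.ndarray) -> np.ndarray:
    """n x n matrix whose (r, s) entry is a[(r - s) mod n]."""
    n = len(a)
    r, s = np.indices((n, n))
    return a[(r - s) % n]

a = np.array([4.0, 1.0, 0.0, 0.5, 2.0])
n = len(a)
A = circulant(a)

# Eigenvalues from the FFT of the coefficient vector.
fft_eigenvalues = np.fft.fft(a)

# The Fourier mode v_s = exp(2 pi i k s / n) is an eigenvector for fft_eigenvalues[k].
k = 2
v = np.exp(2j * np.pi * k * np.arange(n) / n)
residual = np.linalg.norm(A @ v - fft_eigenvalues[k] * v)

# Compare with a general-purpose eigenvalue solver.
direct = np.linalg.eigvals(A)
print(residual)
print(np.sort_complex(fft_eigenvalues))
print(np.sort_complex(direct))
```

The transform costs on the order of $n \log n$ operations, whereas a general dense eigenvalue solver costs on the order of $n^3$.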

Any constant-coefficient partial differential operator 
commutes with the group of all translations on R N 
and may similarly be diagonalized, i.e., represented 
as a multiplication operator, by means of Fourier 
transforms. 

5.2 Perturbation Techniques 

Before the advent of computers, calculating the eigenvalues
of problems that are not exactly soluble was extremely
laborious, and much effort was devoted to reducing the work
involved. Inevitably, the results obtained were only
applicable in special situations. Perturbation theory
allows one to calculate the eigenvalues of an operator
$A + tB$ for small $t \in \mathbb{C}$ when the eigenvalues
and eigenvectors of $A$ are already known. There are two
approaches to the theory, applicable in different
situations.

Analytic perturbation theory provides theorems that prove,
under suitable assumptions, that every eigenvalue of
$A + tB$ may be written as a convergent series of the form

$$\lambda(t) = a_0 + a_1 t + a_2 t^2 + \cdots,$$

where $a_0$ is an eigenvalue of $A$ and $a_n$ may be
calculated directly from $A$ and $B$. In practice, one
often needs to calculate only a few terms of the series to
obtain a reasonable approximation (see Kato 1995).
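The first few coefficients of such a series can be compared with exact eigenvalues in a small example. The sketch below assumes NumPy and restricts attention to a real diagonal $A$ with simple eigenvalues and a real symmetric $B$; in that setting the standard formulas are $a_1 = B_{nn}$ and $a_2 = \sum_{m \neq n} B_{mn}^2/(a_{0,n} - a_{0,m})$. These formulas and the particular matrices are assumptions of the illustration and are not stated in the text.

```python
import numpy as np

A = np.diag([1.0, 2.0, 4.0])          # unperturbed operator, simple eigenvalues
B = np.array([[0.3, 0.1, 0.0],
              [0.1, -0.2, 0.5],
              [0.0, 0.5, 0.1]])       # symmetric perturbation
t = 1e-3

a0 = np.diag(A)                        # zeroth-order terms
a1 = np.diag(B)                        # first order: <B v_n, v_n> with v_n = e_n

# Second order: sum over m != n of B_mn^2 / (a0_n - a0_m).
gaps = a0[:, None] - a0[None, :]
np.fill_diagonal(gaps, np.inf)         # removes the m == n terms
a2 = (B**2 / gaps).sum(axis=1)

exact = np.linalg.eigvalsh(A + t * B)
first_order = a0 + a1 * t
second_order = first_order + a2 * t**2

print(np.abs(exact - first_order).max())
print(np.abs(exact - second_order).max())
```

Each additional term reduces the error by roughly a factor of `t`, as one expects from a convergent power series.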

If the eigenvalue of $A$ that is being studied has
multiplicity greater than 1, the above method needs
modification. The power series above has to be replaced by
a series involving fractional powers of $t$, as one may see
by calculating the eigenvalues of the matrix

$$A + tB = \begin{pmatrix} 0 & t \\ 1 & 0 \end{pmatrix}$$

in closed form. The technology for dealing with such
problems was well established by the 1960s.
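The closed-form calculation for this $2 \times 2$ matrix is short enough to write out; the following sketch simply expands the characteristic polynomial.

```latex
\begin{align*}
\det\bigl(\lambda I - (A + tB)\bigr)
  &= \det\begin{pmatrix} \lambda & -t \\ -1 & \lambda \end{pmatrix}
   = \lambda^2 - t = 0, \\
\lambda_\pm(t) &= \pm\, t^{1/2}.
\end{align*}
```

At $t = 0$ the matrix is a Jordan block with the double eigenvalue 0, and the two eigenvalue branches are functions of $t^{1/2}$ rather than power series in $t$.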

The use of perturbation expansions of the above type 
is easiest to justify in finite dimensions. If the Hilbert 
or Banach space $\mathcal{H}$ is infinite dimensional, then some
condition is needed and the easiest is the assumption
that $B$ is bounded or relatively bounded with respect
to $A$. The latter condition requires the existence of a
constant $a \in (0,1)$ and a constant $b > 0$ such that

$$\|Bf\| \le a\|Af\| + b\|f\|$$

for all $f \in \mathrm{Dom}(A)$.

Asymptotic perturbation theory allows one to treat 
certain situations in which convergent perturbation 
expansions do not exist. An alternative approach to 
some such problems depends on what is called semi- 
classical analysis (see Zworski 2012). This involves 
much more geometrical input than most theorems in 
analytic perturbation theory. Much of the literature 
in this field involves the theory of pseudodifferential
operators, but an impression of the issues involved can
be conveyed by the following example, in which the
operator involved acts in $L^2(\mathbb{R}^N)$.

One assumes that the operator $H_h$ is self-adjoint
and that it depends on a small parameter in a rather
special manner; the parameter is normally denoted by
$h$ because of the origins of the subject in quantum
mechanics, where $h$ is Planck’s constant. One assumes
that $H_h$ is obtained by the “quantization” of a classical
Hamiltonian $H_{\mathrm{cl}}(p,q)$, where $p, q \in \mathbb{R}^N$. There are, in
fact, several types of quantization, and the difference
between these needs to be taken into account. A typical
example is the Schrödinger operator

$$(H_h f)(x) = -h^2 (\Delta f)(x) + V(x) f(x)$$

obtained by quantizing $H_{\mathrm{cl}}(p,q) = p^2 + V(q)$. The idea
is that one can understand the $h \to 0$ asymptotics of
various features of $H_h$ by utilizing intuitions that come
from differential geometry; for example, by an analysis 
of neighborhoods of the critical points of the potential. 
The nature of the results obtained precludes our delv- 
ing more deeply into this important field. However, it 
should be stated that, when perturbation expansions 
in powers of h exist, they are usually asymptotic series 
rather than convergent series. 

5.3 Numerical Analysis 

This is not the place to discuss methods for comput- 
ing the eigenvalues of partial differential operators in 
detail, but some general remarks may be in order. We 
assume that one wishes to determine several eigenvalues
of a self-adjoint differential operator $A$ that is
known to have an infinite sequence of eigenvalues $\lambda_n$
that can be listed in increasing order, each eigenvalue
being repeated according to its multiplicity, with $\lambda_n$
diverging as $n \to \infty$ at a known rate.

There are special methods that enable one to calcu- 
late large numbers of eigenvalues of an ordinary differ- 
ential operator in one dimension with great accuracy. 
Most general-purpose programs concentrate on two- 
dimensional problems. In three dimensions all meth- 
ods are computationally very demanding, because of 
the large size of the matrix approximations needed. 
As a result these programs are often optimized for 
a specific problem that is of commercial, military, or 
scientific importance. 

There are two general approaches to the calcula- 
tion of eigenvalues, which we call a priori and a pos- 
teriori, and each of them depends on calculating the 
eigenvalues associated with the truncation of A to a 
finite-dimensional subspace L of high dimension. 

The a priori method is far older and depends on the- 
orems that state that, if one carries out the compu- 
tations for a properly selected increasing sequence of 
subspaces $L_h$, where $h > 0$ is a small parameter, then
the computed eigenvalues converge to the true eigenvalues
of $A$ as $h \to 0$. Such theorems also provide an
estimate of the convergence rate. The weakness of this 
approach is that the apparent convergence is usually 
far faster than that yielded by the theorems. 

Assuming that the differential operator acts in a
bounded region $U$ in $\mathbb{R}^2$, the finite-element method
[II.12] is a prescription for generating a large finite-dimensional
space of functions on $U$ starting from
a mesh, i.e., a subdivision of $U$ into a collection of
small triangles, whose sizes are of order $h$ (see Boffi
2010). This is carried out by the program, as is the
next stage, which is the construction of a large sparse 
matrix, whose eigenvalues approximate a substantial 
number of the smaller eigenvalues of the differential 
operator. Under favorable circumstances one can make 
a priori estimates of the difference between the com- 
puted eigenvalues and the true eigenvalues. However, 
programs of this type often try to determine which part 
of the mesh is responsible for most of the error and 
then refine the mesh locally. It is known that one has 
to pay particular attention to corners of the region, par- 
ticularly reentrant corners, so the mesh is usually made 
much finer there from the start. 

A posteriori methods obtain rigorous upper and 
lower bounds on the eigenvalues of A from numeri- 
cal computations; these bounds might be based on the 
variational method described earlier (see Rump 2010, 
part 3). This method has three major differences from 
the a priori method. The first is that implementing the 
method has a considerably higher computational cost 
than that of the a priori method. The second is that, if it
is implemented using interval arithmetic, then one needs to
avoid various computational traps that are by now well known
to experts. The final feature is that one does not need
to prove that the upper and lower bounds converge to 
the eigenvalue of interest as $h \to 0$. One simply performs
the calculation for smaller and smaller $h$ until
either one obtains a result for which the error bounds 
are sufficiently small, or one accepts that the method is 
not useful with the available computational resources. 

Which of the two methods one uses must depend 
on the circumstances. A lot of ready-made code exists 
for the a priori method. Results obtained by using 
the a priori method may be the starting point even if 
one intends to use the a posteriori method eventually, 
because it is always useful to have a good idea about the 
approximate location of the solution before carrying 
out further rigorous calculations. 
6 Other Applications of Spectral Theory

In this section we gather together a miscellaneous collection
of applications. Although it is far from complete,
it may serve to illustrate the vigor of the field.

6.1 Fredholm Operators

The results below can be extended to unbounded oper- 
ators and to Banach spaces, but it is easiest to explain 
them in a more restricted context. 

A bounded operator $A$ acting on a Hilbert space $\mathcal{H}$ is
said to be Fredholm if its kernel $\mathrm{Ker}(A)$ is finite dimensional,
its range is closed, and the orthogonal complement
$\mathrm{Coker}(A)$ of its range is also finite dimensional.
The integer 

Ind(A) = Dim(Ker(A)) - Dim(Coker(A)) 

is called the index of A. The index is a very useful invari- 
ant of an operator because of the following fact. If A is 
Fredholm then any small enough perturbation of A is 
also Fredholm and the perturbation has the same index 
as $A$. Hence, if $A_s$ is a one-parameter family of operators
depending norm continuously on $s$, and each $A_s$
is Fredholm, then $\mathrm{Ind}(A_s)$ does not depend on $s$. This
may be used to prove that the index of an elliptic dif- 
ferential operator acting on the space of sections of a 
vector bundle does not depend on the detailed values 
of its coefficients. This fact is a key ingredient in the 
Atiyah-Singer index theorem, which has ramifications 
throughout global analysis. 

Under the same hypotheses, if $\mathrm{Ind}(A_0) \neq 0$ then
$\mathrm{Ind}(A_s) \neq 0$, so 0 is an eigenvalue of $A_s$ or of $A_s^*$.
In either case, $0 \in \mathrm{Spec}(A_s)$. This type of argument
can enable one to determine more about the spectra 
of non-self-adjoint operators than one could by more 
elementary methods. 

A particular example of this is afforded by Toeplitz
operators, whose applications spread far and wide (see
Böttcher and Silbermann 2006). Given a sequence
$a \in \ell^1(\mathbb{Z}_+)$, one can define the associated Toeplitz operator
$A$ on $\ell^2(\mathbb{Z}_+)$ by

$$(Af)_n = \sum_{m=0}^{\infty} a_{n-m} f_m$$

for all $f \in \ell^2(\mathbb{Z}_+)$. It may be shown that $\|A\| \le \|a\|_1$.
The function

$$\sigma(\theta) = \sum_{n=0}^{\infty} a_n e^{-in\theta}$$

is called the symbol of $A$, and under the above assumptions
it is a bounded continuous function that is
periodic with period $2\pi$. One may in fact define a
Toeplitz operator associated with any bounded mea- 
surable periodic symbol, but the technicalities become 
much greater, and some of the key results are different. 

Under the present assumptions one has 

$$\{\sigma(\theta) : 0 \le \theta \le 2\pi\} \subseteq \mathrm{Spec}(A),$$

but more is actually true. If a complex number $\lambda$ does
not lie in the range of $\sigma$, then $A - \lambda I$ is Fredholm and its
index equals the winding number of the closed curve $\sigma$
around $\lambda$. A theorem of Gohberg states that the spectrum
of $A$ consists of $S = \{\sigma(\theta) : 0 \le \theta \le 2\pi\}$ together
with every $\lambda \notin S$ for which the winding number is
nonzero.

All of these results can be generalized to vector-valued
Toeplitz operators, a fact that extends their
possible applications substantially.

6.2 Spectral Geometry 

In this subject one investigates the relationship between
the geometry of a Riemannian manifold, or a
bounded region in Euclidean space, and the eigenvalues
$\lambda_n$ of the associated Laplace-Beltrami operator $H = -\Delta$
(see Gilkey 2005). The Weyl law, already described,
establishes that one may determine the volume of a
region in $\mathbb{R}^N$ from the eigenvalues of a related Laplace
operator, but far more than this is possible. In this context
it turns out to be easier to investigate the small $t$
asymptotics of

$$\mathrm{tr}(e^{-Ht}) = \sum_{n=1}^{\infty} e^{-\lambda_n t}$$

than the large $s$ asymptotics of the spectral counting
function $N_H(s)$ defined earlier. In 1966 Mark Kac asked
whether one can “hear the shape of a drum”; more 
specifically he asked whether the size and shape of a 
bounded Euclidean region U are uniquely determined 
by the eigenvalues of the associated Laplace operator, 
assuming Dirichlet boundary conditions. Many positive 
results of great interest were obtained in the course of 
studying this problem, but in 1992, Gordon, Webb, and 
Wolpert constructed a very simple counterexample in 
which the two plane regions concerned are fairly simple 
polygons. 
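
To illustrate how the small-t asymptotics of the heat trace encode geometry,
consider the simplest case: the Dirichlet Laplacian on the interval (0, pi),
whose eigenvalues are n^2 for n = 1, 2, .... This one-dimensional example and
the function names below are illustrative assumptions, not taken from the
text. By the Jacobi theta-function identity, the heat trace agrees with
(sqrt(pi/t) - 1)/2 up to an exponentially small error as t tends to 0, so the
leading term recovers the length of the interval.

```python
import math

def heat_trace(t, terms=5000):
    """Truncated sum of exp(-n^2 t), n >= 1: heat trace of the Dirichlet Laplacian on (0, pi)."""
    return sum(math.exp(-n * n * t) for n in range(1, terms + 1))

def length_from_heat_trace(t):
    """Invert the two-term expansion L / (2 sqrt(pi t)) - 1/2 to estimate the interval length L."""
    return 2.0 * math.sqrt(math.pi * t) * (heat_trace(t) + 0.5)

for t in (0.1, 0.01):
    # The length estimates approach pi as t decreases.
    print(t, heat_trace(t), length_from_heat_trace(t))
```

In higher dimensions the analogous expansion has further terms involving the
boundary and curvature, which is what makes the heat trace a convenient tool
in this subject.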

6.3 Graph Laplacians 

The spectral analysis of graphs has many aspects, but 
the first problem is that the word “graph” has more 
than one meaning. The following two interpretations 
lead to different mathematical results. 
One may define a graph [II.16] to be a finite or countable
set $X$ of vertices together with a set $\mathcal{E}$ of undirected
edges. In this case the adjacency matrix $A$ is defined
by putting $A_{x,y} = 1$ if $(x,y) \in \mathcal{E}$, and $A_{x,y} = 0$ otherwise.
It may be seen that $A$ is associated with a bounded
operator on $\ell^2(X)$ provided the degrees of the vertices
are uniformly bounded, where the degree of a vertex
$x$ is defined to be $\#\{y : (x,y) \in \mathcal{E}\}$. The spectrum of
$(X, \mathcal{E})$ is by definition the spectrum of $A$, which is real
because A is self-adjoint. The spectrum provides a set 
of invariants of a graph. The spectral theory of finite 
graphs is now a subject in its own right (see Brouwer 
and Haemers 2012). 
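
For a concrete instance, take the cycle graph on n vertices, in which vertex
k is joined to k + 1 and k - 1 modulo n. The sketch below builds its
adjacency matrix and checks that the vectors with entries cos(2 pi j k / n)
are eigenvectors with eigenvalue 2 cos(2 pi j / n). These are standard facts
about cycle graphs rather than statements from the text, and the function
names are illustrative only.

```python
import math

def cycle_adjacency(n):
    """Adjacency matrix of the cycle graph C_n as a list of rows."""
    A = [[0] * n for _ in range(n)]
    for k in range(n):
        A[k][(k + 1) % n] = 1
        A[k][(k - 1) % n] = 1
    return A

def residual(A, j):
    """Max entry of |A v - lam v| for v_k = cos(2 pi j k / n), lam = 2 cos(2 pi j / n)."""
    n = len(A)
    lam = 2.0 * math.cos(2.0 * math.pi * j / n)
    v = [math.cos(2.0 * math.pi * j * k / n) for k in range(n)]
    Av = [sum(A[r][c] * v[c] for c in range(n)) for r in range(n)]
    return max(abs(Av[r] - lam * v[r]) for r in range(n))

A = cycle_adjacency(8)
degrees = [sum(row) for row in A]             # every vertex has degree 2
spectrum = sorted(2.0 * math.cos(2.0 * math.pi * j / 8) for j in range(8))
print(degrees, spectrum)
print(max(residual(A, j) for j in range(8)))  # close to zero
```

The spectrum lies in [-2, 2], consistent with the bound on the operator norm
by the maximal degree.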

Physicists are also interested in quantum graphs,
in which the edges are continuous intervals of given
lengths (see Berkolaiko and Kuchment 2013). The
Hilbert space of interest is then $L^2(E)$, where $E$ is
the union of the edges in $\mathcal{E}$. One considers the
second-order differential operator on $L^2(E)$ given by
$(Af)(x) = -f''(x)$ on each edge. In order to obtain
a self-adjoint operator one has to impose appropriate
boundary conditions at each of the vertices. The simplest
way of doing this is to require that if $f$ lies in
the domain of $A$, then $f$ is continuous at each vertex
and the sum of the outward-pointing first derivatives at 
each vertex vanishes. These are called free or Kirchhoff 
boundary conditions. 

There are two reasons for studying quantum graphs. 
The first is that they are mild generalizations of the
one-dimensional theory on $\mathbb{R}$, and there is some interest in
determining how the geometry of the graph affects the 
spectral theory of the operator A. The second is that 
quantum graphs arise as limits of quantum waveguides 
as the width of the waveguide decreases to zero. People 
are interested in quantum waveguides because of their 
potential applications, and quantum graphs provide a 
useful first approximation to their properties. 

6.4 Nonlinear Eigenvalue Problems 

This article has concentrated on linear spectral theory, 
which is by far the best-developed part of the subject. 
There are two types of generalization of this theory to 
a nonlinear context. In the first, one attempts to solve 
$Af = \lambda f$ when $f$ lies in some Banach or Hilbert space
of functions and $A$ is a nonlinear operator. The earliest
and best-known problem of this type is called the
Korteweg-de Vries equation. It is exactly soluble, subject
to the solution of an associated linear inverse scattering
problem. There are many papers that generalize
specified spectral properties of linear second-order
elliptic eigenvalue problems to the nonlinear context,
under suitable hypotheses. However, nonlinear prob- 
lems are much harder, whether one approaches them 
analytically or numerically, and current methods of 
solution depend heavily on the equation of interest. 

Another type of nonlinear eigenvalue problem looks 
for solutions / of equations such as 

$$\lambda^2 A_2 f + \lambda A_1 f + A_0 f = 0,$$

where $\lambda \in \mathbb{C}$ and $A_0$, $A_1$, $A_2$ are all linear operators.
This is called a quadratic eigenvalue problem
[IV.10 §5.8], and it is the most important case
of the more general polynomial eigenvalue problem. 
It has applications to engineering, the wave equation, 
control theory, nonlinear boundary-value problems, 
oceanography, and other fields. 

If $A_2$ is invertible then the above problem is equivalent
to a standard eigenvalue problem for the block
matrix

$$\begin{pmatrix} -A_1 A_2^{-1} & I \\ -A_0 A_2^{-1} & 0 \end{pmatrix}. \tag{3}$$


This transformation does not reduce the theory to a 
triviality because qualitative properties of the spectrum 
may be less easy to deduce from (3) than from the original
problem. For example, even if $A_0$, $A_1$, $A_2$ are all
self-adjoint, the matrix (3) is non-self-adjoint, with no obvious
special structure. There is a rich theory of
finite-dimensional quadratic pencils, which classifies them
into many different types, each with its own special 
features. 
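
The equivalence with the block matrix can be checked directly. The short
derivation below is a sketch: the auxiliary vectors $u$ and $v$ are notation
introduced here for illustration, and it assumes that the block matrix has
rows $(-A_1 A_2^{-1},\ I)$ and $(-A_0 A_2^{-1},\ 0)$.

```latex
% Sketch: a solution (lambda, f) of the quadratic problem yields an
% eigenvector (u, v) of the block matrix with the same eigenvalue.
\begin{align*}
&\text{Let } u = A_2 f \quad\text{and}\quad v = \lambda A_2 f + A_1 f
   = \lambda u + A_1 A_2^{-1} u. \\
&\text{First block row: } -A_1 A_2^{-1} u + v = \lambda u. \\
&\text{Second block row: } -A_0 A_2^{-1} u = -A_0 f
   = \lambda^2 A_2 f + \lambda A_1 f = \lambda v, \\
&\text{where the middle equality is }
   \lambda^2 A_2 f + \lambda A_1 f + A_0 f = 0. \\
&\text{Hence }
\begin{pmatrix} -A_1 A_2^{-1} & I \\ -A_0 A_2^{-1} & 0 \end{pmatrix}
\begin{pmatrix} u \\ v \end{pmatrix}
= \lambda \begin{pmatrix} u \\ v \end{pmatrix}.
\end{align*}
```

Reading the same steps backward, with $f = A_2^{-1} u$, shows that every
eigenvector of the block matrix with $u \neq 0$ gives a solution of the
quadratic problem.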
IV.9 Approximation Theory

Annie Cuyt 


1 Introduction 

Approximation theory is an area of mathematics that 
has become indispensable to the computational sci- 
ences. The approximation of magnitudes and func- 
tions describing some physical behavior is an integral 
part of scientific computing, queueing problems, neu- 
ral networks, graphics, robotics, network traffic, finan- 
cial trading, antenna design, floating-point arithmetic, 
image processing, speech analysis, and video signal 
filtering, to name just a few areas. 

The idea of seeking a simple mathematical function 
that describes some behavior approximately has two 
sources of motivation. The exact behavior that one is 
studying may not be able to be expressed in a closed 
mathematical formula. But even if an exact description 
is available it may be far too complicated for practical 
use. In both cases a best and simple approximation is 
required. What is meant by best and simple depends on 
the application at hand. 

The approximation is often used in a computer 
implementation, and therefore its evaluation needs to 
be efficient. The simplest and fastest functions for 
implementation are polynomials because they use only 
the fast hardware operations of addition and mul- 
tiplication. Next come rational functions, which also 
need the hardware operation of division, one or more 
depending on their representation as a quotient of 
polynomials or as a continued fraction. Rational func- 
tions offer the clear advantage that they can reproduce 
asymptotic behavior (vertical, horizontal, slant), which 
is something polynomials are incapable of doing. For 
periodic phenomena, linear combinations of trigonometric
functions make good candidates. For growth
models or decaying magnitudes, linear combinations
of exponentials can be used.

In approximation theory one distinguishes between 
interpolation and best approximation problems. In 
the former one wants the approximate model to take 
exactly the same values as prescribed by data given at 
precise argument values. In the latter a set of data (not 
necessarily discrete) is regarded as a trend and approx- 
imated by a simple model in one or other best sense. 
The difference is formalized in the following sections. 

Besides constructing a good and efficient mathemati- 
cal model, one should also take the following two issues 
into account. 

• What can be said about the convergence of the 
selected mathematical model? In more practical 
terms: does the model improve when one adds 
more data? 

• How sensitive is the mathematical model to per- 
turbations in the input data? Data errors are usu- 
ally unavoidable, and one wishes to know how 
much they can be magnified in the approximation 
process. 

In the following sections we comment on both issues 
where appropriate. We do not aim to discuss con- 
vergence or undertake a sensitivity analysis for every 
technique. 

Despite the need for and interest in multidimensional 
models and simulations, we restrict ourselves here 
mostly to one-dimensional approximation problems. In 
the penultimate section we include some brief remarks 
on multivariate interpolation and approximation and 
its additional complexity. 

2 Numerical Interpolation 

Let data $f_i$ be given at points $x_i \in [a,b]$, where
$i = 0, \ldots, n$. We assume that if some of the points $x_i$ are
repeated, then it is not only the value of some underlying
function $f(x)$ that is given (or measured) at $x_i$
but also as many higher derivatives $f^{(j)}(x_i)$ as there
are copies of the point $x_i$. The interpolation problem
is to find a function of a specified form that matches
all the data at the points. In this section we deal with
the two extreme cases: the one in which all the points
$x_i$ are mutually distinct and no derivative information
is available, and the one in which the value of the function
and that of the first $n$ derivatives are all given at
one single point $x_0$. Of course, intermediate situations
can also be dealt with. The approximating functions 
that we consider are polynomials, piecewise polynomials
(splines), and rational functions, each of which
has particular advantages. Finally, we present the connection
between exponential models and sparse interpolation,
on the one hand, and exponential models and
Padé approximation, on the other.


2.1 Polynomial Interpolation

For $n + 1$ given values $f_i = f(x_i)$ at mutually distinct
points $x_i$, the polynomial interpolation problem
of degree $n$,

$$p_n(x) = \sum_{j=0}^{n} a_j x^j, \qquad p_n(x_i) = f_i, \quad i = 0, \ldots, n, \tag{1}$$

has a unique solution for the coefficients $a_j$. Now let
us turn to the computation of $p_n(x)$. Essentially, two
approaches can be used, depending on the intended
subsequent use of the polynomial interpolant. If one
is interested in easily updating the polynomial interpolant
by adding an extra data point and consequently
increasing the degree of $p_n(x)$, then Newton’s formula
for the interpolating polynomial is very suitable. If one
wants to use the interpolant for several sets of values
$f_i$ while keeping the points $x_j$ fixed, then Lagrange’s
formula is most appropriate. A simple rearrangement
of the Lagrange form as in (2) below results in the
barycentric form, which combines the advantages of
both approaches.

In the Newton form one writes the interpolating
polynomial $p_n(x)$ as

$$p_n(x) = b_0 + b_1(x - x_0) + b_2(x - x_0)(x - x_1) + \cdots + b_n(x - x_0) \cdots (x - x_{n-1}).$$

The coefficients $b_j$ then equal the divided differences
$b_j = f[0, \ldots, j]$ obtained from the recursive scheme

$$\begin{aligned}
f[j] &= f_j, & j &= 0, \ldots, n, \\
f[0, j] &= \frac{f_j - f_0}{x_j - x_0}, & j &= 1, \ldots, n, \\
f[0, 1, \ldots, k-1, k, j] &= \frac{f[0, 1, \ldots, k-1, j] - f[0, 1, \ldots, k-1, k]}{x_j - x_k}, & 1 \le k &< j \le n.
\end{aligned}$$

Newton’s form for the interpolating polynomial is very
handy when one wants to update the interpolation with
an additional point $(x_{n+1}, f_{n+1})$. It suffices to add the
term

$$b_{n+1}(x - x_0) \cdots (x - x_n)$$

to $p_n(x)$ (which does not destroy the previous interpolation
conditions since it evaluates to zero at all the
previous $x_i$) and to complement the recursive scheme
for the computation of the divided differences with the
computation of the

$$f[0, 1, \ldots, k, n + 1], \qquad k = 0, \ldots, n.$$
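
The Newton scheme can be sketched in a few lines of Python. The function
names are illustrative, and the in-place update below is one possible way of
organizing the recursion for the divided differences f[0, ..., k, j]; it is
not claimed to be the layout the author has in mind. The `add_point` helper
computes only the extra differences f[0, ..., k, n+1] needed when a data
point is appended.

```python
def divided_differences(xs, fs):
    """Return Newton coefficients b_j = f[0, ..., j] for distinct points xs."""
    d = [float(f) for f in fs]
    n = len(xs) - 1
    for k in range(n):
        # After this pass d[j] holds f[0, ..., k, j] for j > k, and d[k] = b_k.
        for j in range(k + 1, n + 1):
            d[j] = (d[j] - d[k]) / (xs[j] - xs[k])
    return d

def newton_eval(xs, b, x):
    """Evaluate b_0 + b_1 (x - x_0) + ... by nested multiplication."""
    result = b[-1]
    for j in range(len(b) - 2, -1, -1):
        result = result * (x - xs[j]) + b[j]
    return result

def add_point(xs, b, x_new, f_new):
    """Append one data point, computing only f[0, ..., k, n+1] for k = 0, ..., n."""
    d = float(f_new)
    for k in range(len(xs)):
        d = (d - b[k]) / (x_new - xs[k])
    return xs + [x_new], b + [d]

xs, fs = [0.0, 1.0, 2.0], [1.0, 2.0, 5.0]   # samples of x^2 + 1
b = divided_differences(xs, fs)
xs2, b2 = add_point(xs, b, 3.0, 28.0)        # new value is not on the parabola
print(b, newton_eval(xs, b, 1.5))
print(b2, newton_eval(xs2, b2, 3.0))
```

The existing coefficients are left untouched by `add_point`, which is the
practical advantage of the Newton form.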


In the Lagrange form, which is especially suitable if
the interpolation needs to be repeated for different sets
of $f_i$ at the same points $x_j$, another form for $p_n(x)$ is
used. We write

$$p_n(x) = \sum_{j=0}^{n} c_j \beta_j(x), \qquad \beta_j(x) = \prod_{\substack{k=0 \\ k \neq j}}^{n} \frac{x - x_k}{x_j - x_k}.$$

The basis functions $\beta_j(x)$ satisfy a simple interpolation
condition themselves, namely,

$$\beta_j(x_i) = \begin{cases} 0 & \text{for } j \neq i, \\ 1 & \text{for } j = i. \end{cases}$$

The choice $c_j = f_j$ for the coefficients therefore solves
the interpolation problem. So when altering the $f_j$,
without touching the $x_j$ that make up the basis functions
$\beta_j(x)$, it takes no computation at all to get the
new coefficients $c_j$.

The barycentric form of the interpolation polynomial,

$$p_n(x) = (x - x_0) \cdots (x - x_n) \sum_{j=0}^{n} \frac{w_j}{x - x_j} f_j, \qquad w_j = \frac{1}{\prod_{k \neq j} (x_j - x_k)}, \tag{2}$$

is easy to update and is backward stable [I.2 §23] for
evaluation of $p_n(x)$.
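
A minimal Python sketch of barycentric evaluation follows. It assumes the
standard weights, equal to the reciprocal of the product of (x_j - x_k) over
all k different from j; the function names are illustrative. The weights
depend only on the interpolation points, so they are computed once and reused
for every new set of values f_j.

```python
def barycentric_weights(xs):
    """Weights w_j = 1 / prod_{k != j} (x_j - x_k); they depend on the points only."""
    weights = []
    for j, xj in enumerate(xs):
        prod = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                prod *= xj - xk
        weights.append(1.0 / prod)
    return weights

def barycentric_eval(xs, weights, fs, x):
    """Evaluate the interpolant at x using l(x) * sum_j w_j f_j / (x - x_j)."""
    ell = 1.0
    total = 0.0
    for xj, wj, fj in zip(xs, weights, fs):
        if x == xj:
            return fj  # exactly at a node: the interpolant reproduces the data
        ell *= x - xj
        total += wj * fj / (x - xj)
    return ell * total

xs = [0.0, 1.0, 2.0]
w = barycentric_weights(xs)          # computed once
for fs in ([1.0, 2.0, 5.0], [0.0, 1.0, 8.0]):
    print(barycentric_eval(xs, w, fs, 0.5), barycentric_eval(xs, w, fs, 3.0))
```

Each evaluation costs a number of operations proportional to n once the
weights are known, compared with n^2 for a direct evaluation of the Lagrange
basis.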

The sensitivity of the polynomial interpolant expressed
in the Lagrange form is measured by the
value

$$L_n = \max_{a \le x \le b} \sum_{j=0}^{n} |\beta_j(x)|, \tag{3}$$

which is also known as the Lebesgue constant. The
growth rate of $L_n$ with $n$ is only logarithmic when the
interpolation points are as in (5) below. This is the
slowest possible growth for polynomial interpolation. 
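
The difference in growth can be observed numerically. The sketch below
estimates the Lebesgue constant on [-1, 1] by sampling the sum of the
absolute values of the Lagrange basis functions on a uniform grid, for
equispaced points and for the Chebyshev zeros cos((2i + 1) pi / (2(n + 1))).
Sampling gives only a lower estimate of the true maximum, and the function
names and the grid size are illustrative assumptions.

```python
import math

def lebesgue_constant(xs, a=-1.0, b=1.0, samples=2001):
    """Estimate max over [a, b] of sum_j |beta_j(x)| by sampling on a uniform grid."""
    best = 0.0
    for s in range(samples):
        x = a + (b - a) * s / (samples - 1)
        total = 0.0
        for j, xj in enumerate(xs):
            beta = 1.0
            for k, xk in enumerate(xs):
                if k != j:
                    beta *= (x - xk) / (xj - xk)
            total += abs(beta)
        best = max(best, total)
    return best

def equispaced(n):
    return [-1.0 + 2.0 * i / n for i in range(n + 1)]

def chebyshev_zeros(n):
    return [math.cos((2 * i + 1) * math.pi / (2 * (n + 1))) for i in range(n + 1)]

for n in (10, 20):
    print(n, lebesgue_constant(equispaced(n)), lebesgue_constant(chebyshev_zeros(n)))
```

For equispaced points the estimate grows by orders of magnitude between
n = 10 and n = 20, whereas for the Chebyshev zeros it stays of modest size.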

Despite the simplicity and elegance of polynomial
interpolation, the technique has a significant drawback,
as we discuss next: it may not converge for data
$f_i = f(x_i)$ given at arbitrary points $x_i$, even if $f(x)$ is
continuous on $[a, b]$.
2.2 The Runge Phenomenon


What happens if we continue updating the interpolation
problem with new data? In other words, what happens
if we let the degree $n$ of the interpolating polynomial
$p_n(x)$ increase? Will the interpolating polynomial
of degree $n$ become better and better? The answer is
no, at least not for freely chosen points $x_i$. To see what
can go wrong, consider

$$f(x) = \frac{1}{1 + 25x^2}, \qquad -1 \le x \le 1, \tag{4}$$

and take equidistant interpolation points
$x_i = -1 + 2i/n$, $i = 0, \ldots, n$. The error $(f - p_n)(x)$ toward the
endpoints of the interval then increases dramatically with
$n$. Take a look at the bell-shaped $f(x)$ and the interpolating
polynomial $p_n(x)$ for $n = 10$ and $n = 20$ in
figure 1.

This phenomenon is called Runge’s phenomenon,
after Carl Runge, who described this behavior for real- 
valued interpolation in 1901. An explanation for it 
can be found in the fundamental theorem of algebra, 
which states that a polynomial has as many zeros as 
its degree. Each of these zeros can be real or complex. 
So if n is large and the zeros are all real, the poly- 
nomial under consideration displays rather oscillatory 
behavior. 

On the other hand, under certain simple conditions
for $f(x)$ besides continuity in $[a, b]$, it can be proved
that if the interpolation points $x_i$ equal

$$x_i = \frac{a + b}{2} + \frac{b - a}{2} \cos\left( \frac{(2i + 1)\pi}{2(n + 1)} \right), \qquad i = 0, \ldots, n, \tag{5}$$

where the values $\cos((2i + 1)\pi/(2(n + 1)))$ are the
zeros of the Chebyshev polynomial of the first kind of
degree $n + 1$ (defined in section 3.3), then

$$\lim_{n \to \infty} \|f - p_n\|_\infty = \lim_{n \to \infty} \max_{x \in [-1,1]} |(f - p_n)(x)| = 0.$$

The effect of this choice of interpolation points is
illustrated in figure 2(a). A similar result holds if the
zeros of the Chebyshev polynomial of degree $n + 1$ are
replaced by the extrema $\cos(i\pi/n)$, $i = 0, \ldots, n$, of the
Chebyshev polynomial of degree $n$.
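
The contrast between the two choices of points can be reproduced with a
short experiment. The sketch below interpolates the function 1/(1 + 25x^2) on
[-1, 1] at equidistant points and at the Chebyshev zeros, and reports the
largest error found on a sampling grid. The function names and the grid of
1001 sample points are illustrative assumptions, and the sampled maximum is
only an estimate of the true maximum error.

```python
import math

def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)

def lagrange_eval(xs, fs, x):
    """Evaluate the interpolating polynomial at x via the Lagrange form."""
    total = 0.0
    for j, xj in enumerate(xs):
        beta = 1.0
        for k, xk in enumerate(xs):
            if k != j:
                beta *= (x - xk) / (xj - xk)
        total += fs[j] * beta
    return total

def max_error(xs, samples=1001):
    """Largest sampled |f - p_n| on [-1, 1] for the interpolant through the points xs."""
    fs = [runge(x) for x in xs]
    grid = [-1.0 + 2.0 * s / (samples - 1) for s in range(samples)]
    return max(abs(runge(x) - lagrange_eval(xs, fs, x)) for x in grid)

def equidistant_points(n):
    return [-1.0 + 2.0 * i / n for i in range(n + 1)]

def chebyshev_points(n):
    return [math.cos((2 * i + 1) * math.pi / (2 * (n + 1))) for i in range(n + 1)]

for n in (10, 20):
    print(n, max_error(equidistant_points(n)), max_error(chebyshev_points(n)))
```

With equidistant points the error grows when n goes from 10 to 20, while with
the Chebyshev zeros it decreases.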

In order to make use of this result in real-world applications,
where the interpolation points $x_i$ cannot usually
be chosen arbitrarily, interpolation at the Chebyshev
zeros is mimicked, for instance by selecting a
proper subset of interpolation points from a fine
equidistant grid, each selected point being close to the
corresponding $x_i$ from (5). The grid is
considered to be sufficiently fine when the distance
between the points ensures that a grid point nearest
to a Chebyshev zero $x_i$ is never repeated. In a coarse
grid, the same grid point may be the closest one to more
than one Chebyshev zero, especially toward the ends of
the interval $[-1,1]$.

Figure 1 (a) Degree-10 and (b) degree-20 equidistant interpolation
(solid lines) for function $f$ in (4) (dashed lines).

This technique is called mock-Chebyshev interpola- 
tion. For comparison, in figure 2(b) we display the 
degree-20 mock-Chebyshev interpolant with the inter- 
polation points selected from an equispaced grid with 
gap 1/155. 
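
The selection step can be sketched as follows. For each Chebyshev zero on
[-1, 1] the code picks the nearest point of an equispaced grid and refuses
grids that are too coarse, i.e., grids on which the same point would be
selected twice. The function name, the rounding rule used to find the nearest
grid point, and the error raised for coarse grids are illustrative
assumptions rather than a prescribed algorithm.

```python
import math

def mock_chebyshev_nodes(n, gap):
    """For each Chebyshev zero on [-1, 1], pick the nearest point of the grid -1, -1 + gap, ..."""
    zeros = [math.cos((2 * i + 1) * math.pi / (2 * (n + 1))) for i in range(n + 1)]
    indices = [round((z + 1.0) / gap) for z in zeros]
    if len(set(indices)) != len(indices):
        raise ValueError("grid too coarse: a grid point was selected twice")
    return zeros, [-1.0 + idx * gap for idx in indices]

zeros, nodes = mock_chebyshev_nodes(20, 1.0 / 155.0)
print(max(abs(z - x) for z, x in zip(zeros, nodes)))  # at most half the grid gap

try:
    mock_chebyshev_nodes(20, 0.1)   # a much coarser grid
except ValueError as err:
    print(err)
```

The resulting nodes can then be used in any of the interpolation formulas of
section 2.1 in place of the exact Chebyshev zeros.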

If a lot of accurate data points have to be used in 
an interpolation scheme, then splines, which are dis- 
cussed in the next section, offer a better alternative 
than a monolithic high-degree polynomial interpolant. 
Figure 2 Degree-20 (a) Chebyshev and (b) mock-Chebyshev
interpolation (solid lines) for function $f$ in (4) (dashed lines).

2.3 Spline Interpolation 

In order to avoid the Runge phenomenon when inter- 
polating large data sets, piecewise polynomials, also 
called splines, can be used. To this end we divide the 
data set of n + 1 points into smaller sets, each con- 
taining two data points. Rather than interpolating the 
full data set by one polynomial of degree n, we inter- 
polate each of the smaller data sets by a low-degree 
polynomial. These separate polynomial functions are 
then pieced together in such a way that the resulting 
function is as continuously differentiable as possible. 

Take, for instance, the data set $(x_i, f_i)$ and consider
linear polynomials interpolating every two consecutive
$(x_i, f_i)$ and $(x_{i+1}, f_{i+1})$. These linear polynomial pieces
can be joined together at the data points $(x_i, f_i)$ to produce
a piecewise-linear continuous function or polygonal
curve. Note that this function is continuous but
not differentiable at the interpolation points since it is
polygonal.



Figure 3 A piecewise-cubic function that is 
not twice continuously differentiable. 

If we introduce two parameters, $\Delta$ and $D$, to respectively
denote the degree of the polynomial pieces and
the differentiability of the overall function, where obviously
$D \le \Delta$ (even $D < \Delta$ to avoid an overdetermined
system of defining equations, as explained below), then,
for the polygonal curve, $\Delta = 1$ and $D = 0$. With $\Delta = 2$
and $D = 1$, a piecewise-quadratic and smooth (meaning
continuously differentiable in the entire interval
$[x_0, x_n]$) function is constructed. The slope of a
smooth function is a continuous quantity. With $\Delta = 3$
and $D = 2$, a piecewise-cubic and twice continuously
differentiable function is obtained. Twice continuously
differentiable functions also enjoy continuous curvature.
Can the naked eye distinguish between continuous
and discontinuous curvature in a function? The
untrained eye certainly cannot! As an example we take
the cubic polynomial pieces

$$\begin{aligned}
c_1(x) &= x^3 - x^2 + x + 0.5, & x &\in [-1, 0], \\
c_2(x) &= x^3 + x^2 + x + 0.5, & x &\in [0, 1],
\end{aligned}$$

and join these together at $x = 0$ to obtain a new
piecewise-cubic function $c(x)$ on $[-1,1]$. The result
is a function that is continuous and differentiable at
the origin, but for the second derivatives at the origin,
we have $\lim_{x \to 0^-} c^{(2)}(x) = -2$ and $\lim_{x \to 0^+} c^{(2)}(x) = 2$.
Nevertheless, the result of the gluing procedure shown
in figure 3 is a very pleasing function that at first sight
looks fine. But while $\Delta$ equals 3, $D$ is only 1.
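
The claims about the two cubic pieces can be verified by differentiating them
directly. The computation below is a worked check written out for
convenience; it uses only the two polynomials given above.

```latex
\begin{align*}
c_1(0) &= 0.5, & c_2(0) &= 0.5
  && \text{(continuous at the origin)}, \\
c_1'(x) &= 3x^2 - 2x + 1, & c_2'(x) &= 3x^2 + 2x + 1,
  && c_1'(0) = c_2'(0) = 1 \text{ (differentiable)}, \\
c_1''(x) &= 6x - 2, & c_2''(x) &= 6x + 2,
  && c_1''(0) = -2 \neq 2 = c_2''(0).
\end{align*}
```

The jump of size 4 in the second derivative is what limits the overall
differentiability to order 1.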

Since a trained eye can spot these discontinuities, the
most popular choice for piecewise-polynomial interpolation
in industrial applications is $\Delta = 3$ and $D = 2$.
Indeed, for manufacturing the continuity of the
curvature is important.
Let us take a look at the general situation where
$\Delta = m$ and $D = m - 1$, for which the resulting piecewise
polynomial is called a spline. Assume we are given the
interpolation points $x_0, \ldots, x_n$. With these $n + 1$ points
we can construct $n$ intervals $[x_i, x_{i+1}]$. The points $x_0$
and $x_n$ are the endpoints, and the other $n - 1$ interpolation
points are called the internal points. If $\Delta = m$,
then for every interval $[x_i, x_{i+1}]$ we have to determine
$m + 1$ coefficients because the explicit formula for the
spline on $[x_i, x_{i+1}]$ is a polynomial of degree $m$:

$$S(x) = s_i(x), \quad x \in [x_i, x_{i+1}], \quad i = 0, \ldots, n - 1,$$

$$s_i(x) = \sum_{j=0}^{m} a_j^{(i)} x^j.$$

So, in total, $n(m + 1)$ unknown coefficients $a_j^{(i)}$ have
to be computed. From which conditions? There are the
$n + 1$ interpolation conditions $S(x_i) = f_i$, and we have
the smoothness or continuity requirements at the internal
points, meaning that a number of derivatives of
$s_{i-1}(x)$ evaluated at the right endpoint of the domain
$[x_{i-1}, x_i]$ should coincide with the derivatives of $s_i(x)$
when evaluated at the left endpoint of the domain
$[x_i, x_{i+1}]$:

$$s_{i-1}^{(k)}(x_i) = s_i^{(k)}(x_i), \qquad i = 1, \ldots, n - 1, \quad k = 0, \ldots, m - 1.$$

The latter requirements add another $(n - 1)m$ continuity
conditions. This brings us to a total of
$n + 1 + (n - 1)m = n(m + 1) - m + 1$ conditions for $n(m + 1)$
unknowns. In other words, we lack $m - 1$ conditions
to determine the degree-$m$ piecewise-polynomial interpolant
with overall smoothness of order $m - 1$. When
$m = 1$, which is the case for the piecewise-linear spline
or the polygonal curve, no conditions are lacking. When
$m = 2$, a value for $s_0'(x_0)$ is usually given as an additional
piece of information. When $m = 3$, which is the
case for the widely used cubic spline, values for $s_0''(x_0)$
and $s_{n-1}''(x_n)$ are often provided (the cubic spline with
clamped end conditions) or they are set to zero (the
natural cubic spline).
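
For the natural cubic spline the conditions above reduce to a tridiagonal
linear system. The Python sketch below uses a common reformulation in which
the unknowns are the second derivatives of the spline at the points x_i
rather than the monomial coefficients of each piece; this parametrization,
the function names, and the sample data are assumptions made for
illustration. The two missing conditions are supplied by setting the second
derivative to zero at both endpoints.

```python
def natural_spline_second_derivatives(xs, fs):
    """Solve the tridiagonal system for M_i = S''(x_i) with M_0 = M_n = 0."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)
    if n < 2:
        return m
    # Row r corresponds to the internal point x_{r+1}, r = 0, ..., n - 2.
    diag = [2.0 * (h[r] + h[r + 1]) for r in range(n - 1)]
    rhs = [6.0 * ((fs[r + 2] - fs[r + 1]) / h[r + 1] - (fs[r + 1] - fs[r]) / h[r])
           for r in range(n - 1)]
    for r in range(1, n - 1):                 # forward elimination
        factor = h[r] / diag[r - 1]
        diag[r] -= factor * h[r]
        rhs[r] -= factor * rhs[r - 1]
    m[n - 1] = rhs[n - 2] / diag[n - 2]       # back substitution
    for r in range(n - 3, -1, -1):
        m[r + 1] = (rhs[r] - h[r + 1] * m[r + 2]) / diag[r]
    return m

def spline_eval(xs, fs, m, x):
    """Evaluate the cubic spline with nodal second derivatives m at x in [x_0, x_n]."""
    i = 0
    while i < len(xs) - 2 and x > xs[i + 1]:
        i += 1
    h = xs[i + 1] - xs[i]
    left, right = xs[i + 1] - x, x - xs[i]
    return ((m[i] * left ** 3 + m[i + 1] * right ** 3) / (6.0 * h)
            + (fs[i] - m[i] * h * h / 6.0) * left / h
            + (fs[i + 1] - m[i + 1] * h * h / 6.0) * right / h)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fs = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]   # made-up data for illustration
m = natural_spline_second_derivatives(xs, fs)
print([spline_eval(xs, fs, m, x) for x in xs])   # reproduces fs up to rounding
print(m[0], m[-1])                                # both zero: natural end conditions
```

Solving for n - 1 unknowns with a tridiagonal matrix costs a number of
operations proportional to n, which is one reason cubic splines scale well to
large data sets.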

The natural cubic spline interpolant has a very elegant
property, namely, that it avoids oscillatory behavior
between interpolation points. More precisely, for
every twice continuously differentiable function $f(x)$
defined on $[a, b]$ and satisfying $f(x_i) = f_i$ for all $i$, we
have

$$\int_a^b S''(x)^2 \, dx \le \int_a^b f''(x)^2 \, dx.$$

A simple illustration is given in figure 4 for $n = 6$ with
$x_i = i$, $i = 1, \ldots, 6$.




Figure 4 (a) The polynomial interpolant 
and (b) the natural cubic spline. 

2.4 Padé Approximation

The rational equivalent of the Taylor series partial
sum is the irreducible rational function
$r_{k,\ell}(x) = p_{k,\ell}(x)/q_{k,\ell}(x)$ with numerator of degree at most $k$
and denominator of degree at most $\ell$ that satisfies

$$r_{k,\ell}^{(i)}(x_0) = f^{(i)}(x_0), \qquad i = 0, 1, \ldots, n, \tag{6}$$

with $n$ as large as possible. It is also called the $[k/\ell]$
Padé approximant. The aim is to have $n = k + \ell$. Note
that we are imposing one fewer condition than the total
number $k + \ell + 2$ of coefficients in $r_{k,\ell}$. The reason is
that one degree of freedom is lost because multiplying
$p_{k,\ell}$ and $q_{k,\ell}$ by a scalar does not change $r_{k,\ell}$.

A key question is whether $n$ can be less than $k + \ell$.
The answer to this question requires some analysis.
Computing the numerator and denominator coefficients
of $r_{k,\ell}(x)$ from (6) gives rise to a nonlinear system
of equations. So let us explore whether the Padé
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approximant can also be obtained from the linearized 
approximation conditions 

(f <lk,e ~ Vk,t) (l) (x o) = 0, i = 0, l,...,k + £. (7) 

We denote f^Hxo)/f- by dj, where dj = 0 for j < 
0. The linearized conditions (7) always have at least 
one nontrivial solution for the numerator coefficients 
a o, . . . ,ak and the denominator coefficients 
because they form a homogeneous linear system of 
k + £ + 1 conditions in k + £ + 2 unknowns: 

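$$\begin{aligned}
d_0 b_0 &= a_0, \\
d_1 b_0 + d_0 b_1 &= a_1, \\
&\;\;\vdots \\
d_k b_0 + \cdots + d_{k-\ell} b_\ell &= a_k, \\
d_{k+1} b_0 + \cdots + d_{k-\ell+1} b_\ell &= 0, \\
&\;\;\vdots \\
d_{k+\ell} b_0 + \cdots + d_k b_\ell &= 0.
\end{aligned}$$

Figure 5 Padé approximants $r_{9,0}(x)$ (dotted line) and $r_{5,4}(x)$ (dashed line) for $\arctan(x)$ (solid line).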

Moreover, all solutions $p_{k,\ell}(x)$ and $q_{k,\ell}(x)$ of (7) are equivalent in the sense that they have the same irreducible form. Every solution of (6) with $n = k + \ell$ therefore also satisfies (7), but not vice versa. From $p_{k,\ell}(x)$ and $q_{k,\ell}(x)$ satisfying (7) we find, for the unique irreducible form $p_{k,\ell}^*(x)/q_{k,\ell}^*(x)$, that

$$(f - r_{k,\ell})^{(i)}(x_0) = 0, \quad i = 0, \ldots, k' + \ell' + r, \qquad k' = \partial p_{k,\ell}^*, \quad \ell' = \partial q_{k,\ell}^*, \quad r \ge 0,$$

where $\partial p$ denotes the degree of the polynomial $p$. In some textbooks, the $[k/\ell]$ Padé approximation problem is said to have no solution if $k' + \ell' + r < k + \ell$; in others, the Padé approximant $r_{k,\ell}$ is identified with $r_{k',\ell'} = p_{k,\ell}^*/q_{k,\ell}^*$ if that is the case (this is the convention we adopt here). Let us illustrate the situation with a simple example. Take $x_0 = 0$ with $d_0 = 1$, $d_1 = 0$, $d_2 = 1$, and $k = 1 = \ell$. The linearized conditions (7) are then

$$b_0 = a_0, \quad b_1 = a_1, \quad b_0 = 0.$$

A solution is given by $p_{1,1}(x) = x$ and $q_{1,1}(x) = x$. We therefore find $r_{1,1}(x) = 1$, $k' = 0$, $\ell' = 0$, and

$$(f - r_{1,1})^{(2)}(x_0) = 2 \ne 0.$$

Since $r = 1$, we have $k' + \ell' + r = 1 < k + \ell = 2$.

This kind of complication does not occur when $\ell = 0$. The Padé approximant $r_{k,0}(x)$ is then merely the Taylor series partial sum of degree $k$. But when asymptotic behavior needs to be reproduced, a polynomial function is not very useful. In figure 5 one can compare the Taylor series partial sum of degree 9 with the $[5/4]$ Padé approximant for the function $f(x) = \arctan(x)$.


Table 1 The Padé table for sin(x). Rows give the numerator degree k, columns the denominator degree ℓ.

k\ℓ   0              1              2
1     x              x              x / (1 + x^2/6)
2     x              x              x / (1 + x^2/6)
3     x - x^3/6      x - x^3/6      (x - (7/60) x^3) / (1 + x^2/20)

Padé approximants can be organized in a table, where the numerator degree indicates the row and the denominator degree the column. To illustrate this we give part of the Padé table for $f(x) = \sin(x)$ in table 1. A sequence of Padé approximants in the Padé table can converge uniformly or in measure only to a function $f(x)$ that is meromorphic in a substantial part of its domain.
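As a small check of one table entry, the following Python sketch verifies the linearized conditions (7) for the $[3/2]$ Padé approximant of $\sin(x)$, namely $(x - \tfrac{7}{60}x^3)/(1 + \tfrac{1}{20}x^2)$, using exact rational arithmetic. The helper names `taylor_sin` and `linearized_residual` are illustrative choices and not notation from the text.

```python
from fractions import Fraction
from math import factorial


def taylor_sin(n_terms):
    """Taylor coefficients d_0, ..., d_{n_terms-1} of sin(x) at x_0 = 0."""
    return [Fraction(0) if j % 2 == 0
            else Fraction((-1) ** (j // 2), factorial(j))
            for j in range(n_terms)]


def linearized_residual(d, p, q, order):
    """Coefficients of f*q - p for the powers x^0, ..., x^order."""
    residual = []
    for i in range(order + 1):
        # Cauchy product of the series f with the polynomial q
        conv = sum(d[i - j] * q[j] for j in range(len(q)) if 0 <= i - j < len(d))
        residual.append(conv - (p[i] if i < len(p) else 0))
    return residual


d = taylor_sin(8)
p = [Fraction(0), Fraction(1), Fraction(0), Fraction(-7, 60)]  # x - (7/60) x^3
q = [Fraction(1), Fraction(0), Fraction(1, 20)]                # 1 + x^2/20
residual = linearized_residual(d, p, q, 7)
print(residual)
```

The coefficients of $fq - p$ vanish up to and including $x^5$, as required for $k + \ell = 5$. The coefficient of $x^6$ is also zero because $\sin$ is odd, and the first nonzero coefficient appears at $x^7$.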

2.5 Rational Interpolation

The rational equivalent of polynomial interpolation at mutually distinct interpolation points $x_i$ consists of finding an irreducible rational function $r_{k,\ell}(x)$, of numerator degree at most $k$ and denominator degree at most $\ell$, that satisfies

$$r_{k,\ell}(x_i) = f_i, \quad i = 0, \ldots, k + \ell, \tag{8}$$

where $f_i = f(x_i)$. Instead of solving (8) one considers the linearized equations

$$(f q_{k,\ell} - p_{k,\ell})(x_i) = 0, \quad i = 0, \ldots, k + \ell, \tag{9}$$

where $p_{k,\ell}(x)$ and $q_{k,\ell}(x)$ are polynomials of respective degree $k$ and $\ell$. Condition (9) is a homogeneous linear system of $k + \ell + 1$ equations in $k + \ell + 2$ unknowns, and it therefore always has a nontrivial solution. Moreover, as in the Padé approximation case, all solutions of (9) are equivalent in the sense that they deliver the same unique irreducible rational function.

In the construction of the irreducible form $r_{k,\ell}(x)$ of $p_{k,\ell}(x)/q_{k,\ell}(x)$, common factors in numerator and denominator are canceled, and it may well be that $r_{k,\ell}$ does not satisfy the interpolation conditions (8) anymore, despite $p_{k,\ell}$ and $q_{k,\ell}$ being solutions of (9), because one or more of the canceled factors may be of the form $x - x_i$ with $x_i$ an interpolation point. A simple example illustrates this. Let $x_0 = 0$, $x_1 = 1$, $x_2 = 2$ with $f_0 = 0$, $f_1 = 3$, $f_2 = 3$, and take $k = 1 = \ell$. The homogeneous linear system of interpolation conditions is then

$$\begin{aligned}
a_0 &= 0, \\
3(b_0 + b_1) - (a_0 + a_1) &= 0, \\
3(b_0 + 2b_1) - (a_0 + 2a_1) &= 0.
\end{aligned}$$

A solution is given by $p_{1,1}(x) = 3x$ and $q_{1,1}(x) = x$. Hence, $r_{1,1}(x) = 3$ and clearly $r_{1,1}(x_0) \ne f_0$. The interpolation point $x_0$ is then called unattainable. This problem can be fixed only by increasing the degrees $k$ and/or $\ell$ until the interpolation point is attainable. Note that unattainable interpolation points do not occur in polynomial interpolation ($\ell = 0$).
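The unattainable point in this example can be checked mechanically. The short Python sketch below evaluates the linearized conditions (9) for $p_{1,1}(x) = 3x$ and $q_{1,1}(x) = x$, and then tests the reduced form $r_{1,1}(x) = 3$ against the data. The variable and function names are illustrative only.

```python
from fractions import Fraction

xs = [0, 1, 2]   # interpolation points x_0, x_1, x_2
fs = [0, 3, 3]   # data f_0, f_1, f_2


def p(x):
    """Numerator p_{1,1}(x) = 3x."""
    return 3 * x


def q(x):
    """Denominator q_{1,1}(x) = x."""
    return x


def r_reduced(x):
    """Irreducible form after canceling the common factor x."""
    return Fraction(3)


# Linearized conditions (9): f_i * q(x_i) - p(x_i) = 0 at every point
linearized = [f * q(x) - p(x) for x, f in zip(xs, fs)]

# Original conditions (8) for the reduced form: r(x_i) = f_i
attained = [r_reduced(x) == f for x, f in zip(xs, fs)]

print(linearized)  # all zero
print(attained)    # the first point is not attained
```

The canceled factor $x - x_0$ is exactly what lets the pair $(p, q)$ satisfy (9) at $x_0$ while the reduced function misses the value $f_0$.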

A well-known problem with rational interpolation and Padé approximation is the occurrence of undesirable poles in the interpolant $r_{k,\ell}(x)$. One way to avoid this is to work with preassigned poles, either explicitly or implicitly, by determining the denominator polynomial $q_{k,\ell}(x)$ a priori. All $k + \ell + 1$ interpolation conditions are then imposed on the coefficients of the numerator polynomial, and consequently the degree of the numerator is raised to $k + \ell$.

Let the interpolation points $x_i$ be ordered such that $x_0 < x_1 < \cdots < x_n$ with $k + \ell = n$. A popular choice for the denominator polynomial that guarantees a pole-free real axis, unless the location of the poles needs to be controlled by other considerations, is

$$q_{n,n}(x) = \sum_{j=0}^{n} (-1)^j \prod_{\substack{k=0 \\ k \ne j}}^{n} (x - x_k).$$

With this choice, the rational interpolant can be written in a barycentric form similar to that in (2):

$$r_{n,n}(x) = \frac{\displaystyle\sum_{j=0}^{n} \frac{(-1)^j}{x - x_j} f_j}{\displaystyle\sum_{j=0}^{n} \frac{(-1)^j}{x - x_j}}.$$


Again, this form is very stable for interpolation. Its numerical sensitivity is measured by

$$M_n = \max_{x \in [x_0, x_n]} \sum_{i=0}^{n} \frac{|q_{n,n}(x_i)\,\beta_i(x)|}{|q_{n,n}(x)|}.$$

And there is more good news now: in the case of equidistant interpolation points, $M_n$ grows as slowly with $n$ as the Lebesgue constant $L_n$ in (3) for polynomial interpolation in the Chebyshev zeros. The latter makes the technique very useful in practice.
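A minimal Python sketch of the barycentric rational interpolant with weights $(-1)^j$ is given below, evaluated on equidistant points. The function name `berrut_interpolant` and the choice of Runge's function $1/(1 + 25x^2)$ as test data are assumptions made for illustration; they do not come from the text.

```python
def berrut_interpolant(xs, fs):
    """Return r(x) = sum_j (-1)^j f_j / (x - x_j) / sum_j (-1)^j / (x - x_j)."""
    def r(x):
        num = 0.0
        den = 0.0
        for j, (xj, fj) in enumerate(zip(xs, fs)):
            if x == xj:
                # at an interpolation point the interpolant equals the data value
                return fj
            w = (-1.0) ** j / (x - xj)
            num += w * fj
            den += w
        return num / den
    return r


# 21 equidistant points on [-1, 1] (an illustrative choice)
nodes = [-1.0 + 2.0 * i / 20 for i in range(21)]
values = [1.0 / (1.0 + 25.0 * x * x) for x in nodes]
r = berrut_interpolant(nodes, values)

# The same construction applied to constant data reproduces the constant
constant = berrut_interpolant(nodes, [1.0] * len(nodes))

print(r(0.05), 1.0 / (1.0 + 25.0 * 0.05 ** 2))
print(constant(0.123))
```

Numerator and denominator share the same weights, so constant data are reproduced exactly, and no linear system has to be solved to obtain the interpolant.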

More practical choices for the denominator polynomial $q_{n,n}(x)$ are possible, guaranteeing other features, such as rapid convergence, comonotonicity, or coconvexity (coconcavity).


2.6 Sparse Interpolation 


When interpolating

$$f(x) = \alpha_1 + \alpha_2 x^{100}$$

by a polynomial, the previous techniques require 101 samples of $f(x)$ to determine that $f(x)$ is itself a polynomial, while only four values need to be computed from the data points, namely the two exponents 0 and 100 and the two coefficients $\alpha_1$ and $\alpha_2$. So it would be nice if we could solve this polynomial reconstruction problem from only four samples.

The above is a special case of the more general sparse interpolation, which was studied as long ago as 1795, by Gaspard de Prony, in which the complex values $\phi_j$ and $\alpha_j$ in the interpolant

$$f(x) = \sum_{j=1}^{n} \alpha_j e^{\phi_j x}, \quad \alpha_j, \phi_j \in \mathbb{C}, \tag{10}$$

are to be determined from only $2n$ samples of $f(x)$.

While the nonlinear interpolation problems of Padé approximation and rational interpolation are solved by linearizing the conditions as in (7) and (9), the nonlinear problem of sparse interpolation is solved by separating the computation of the $\phi_j$ and the $\alpha_j$ into two linear algebra subproblems. Let $f(x)$ be sampled at the equidistant points $x_i = i\Delta$, $i = 0, \ldots, 2n - 1$, and let us denote $f(x_i)$ by $f_i$. We introduce the $n \times n$ Hankel matrices

$$H_n^{(r)} := \begin{pmatrix} f_r & \cdots & f_{r+n-1} \\ \vdots & \ddots & \vdots \\ f_{r+n-1} & \cdots & f_{r+2n-2} \end{pmatrix}$$

and $\lambda_j = e^{\phi_j \Delta}$, $j = 1, \ldots, n$. The $\lambda_j$ are then retrieved as the generalized eigenvalues of the problem

$$H_n^{(1)} v_j = \lambda_j H_n^{(0)} v_j, \quad j = 1, \ldots, n,$$

where the $v_j$ are the generalized right eigenvectors. From the values $\lambda_j$, the complex numbers $\phi_j$ can be retrieved uniquely subject to the restriction that $|\operatorname{Im}(\phi_j \Delta)| < \pi$. In order to satisfy this restriction, the sampling interval $\Delta$ is usually adapted to the range of the values $\operatorname{Im}(\phi_j)$.

The $\alpha_j$ are computed from the interpolation conditions

$$\sum_{j=1}^{n} \alpha_j e^{\phi_j x_i} = f_i, \quad i = 0, \ldots, 2n - 1, \tag{11}$$

either by solving the system in the least-squares sense or by solving a subset of $n$ consecutive interpolation conditions. Note that

$$e^{\phi_j x_i} = \lambda_j^i$$

and that the coefficient matrix of (11) is therefore a Vandermonde matrix.
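For $n = 2$ the two subproblems can be carried out by hand, which makes for a compact illustration. In the Python sketch below, the generalized eigenvalues of the $2 \times 2$ Hankel pencil are obtained as the roots of the quadratic $\det(H_2^{(1)} - \lambda H_2^{(0)}) = 0$, and the $\alpha_j$ follow from the first two Vandermonde conditions. This is a sketch under the assumption of exact, noise-free samples; for larger $n$ one would use a numerical generalized eigenvalue solver instead. The function name `prony_two_terms` and the test signal are illustrative.

```python
import cmath
import math


def prony_two_terms(f, delta):
    """Recover (alpha_j, phi_j), j = 1, 2, from four samples f_i = f(i * delta)."""
    f0, f1, f2, f3 = f
    # det(H^(1) - lam * H^(0)) = a * lam^2 + b * lam + c with
    # H^(0) = [[f0, f1], [f1, f2]] and H^(1) = [[f1, f2], [f2, f3]]
    a = f0 * f2 - f1 * f1
    b = f1 * f2 - f0 * f3
    c = f1 * f3 - f2 * f2
    root = cmath.sqrt(b * b - 4 * a * c)
    lam1 = (-b + root) / (2 * a)
    lam2 = (-b - root) / (2 * a)
    # Vandermonde system: alpha1 + alpha2 = f0, alpha1*lam1 + alpha2*lam2 = f1
    alpha2 = (f1 - lam1 * f0) / (lam2 - lam1)
    alpha1 = f0 - alpha2
    # phi_j from lam_j = exp(phi_j * delta), principal branch of the logarithm
    phi1 = cmath.log(lam1) / delta
    phi2 = cmath.log(lam2) / delta
    return (alpha1, phi1), (alpha2, phi2)


delta = 0.1
# illustrative signal: 2 exp(-x) + 3 exp(0.5 x), sampled at x_i = i * delta
samples = [2.0 * math.exp(-1.0 * i * delta) + 3.0 * math.exp(0.5 * i * delta)
           for i in range(4)]
terms = prony_two_terms(samples, delta)
print(terms)
```

Only four samples are used to recover the four unknowns, in line with the $2n$ samples required for $n$ terms.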

With the samples $f_i$ as coefficients we now define the power series

$$F(x) = \sum_{i=0}^{\infty} f_i x^i,$$

where $x_i = i\Delta$, $i \ge 0$. Since

$$f_i = \sum_{j=1}^{n} \alpha_j e^{\phi_j i \Delta} = \sum_{j=1}^{n} \alpha_j \lambda_j^i,$$

we can rewrite $F(x)$ as follows:

$$F(x) = \sum_{j=1}^{n} \frac{\alpha_j}{1 - x\lambda_j}. \tag{12}$$

So we see that $F(x)$ is itself a rational function of degree $n - 1$ in the numerator and $n$ in the denominator, with poles $1/\lambda_j$. Hence, from Padé approximation theory we know (as is to be expected) that $r_{n-1,n}(x)$ reconstructs $F(x)$; in other words,

$$r_{n-1,n}(x) = F(x)$$

with

$$q_{n-1,n}(x) = \prod_{j=1}^{n} (1 - x\lambda_j).$$

The partial fraction decomposition (12) is the Laplace transform of the exponential model (10), which explains why this approach is known as the Padé-Laplace method.

This connection between approximation theory and harmonic analysis is clearly not accidental. More constructions from harmonic analysis, including wavelets and Fourier series, also provide important insights into central problems in approximation theory. Other mathematical models in which the major features of a data set are represented using only a few terms are considered in the theory of compressed sensing [VII.10].


3 Best Approximation 

When the quality of the data does not justify the impo- 
sition of an exact match on the approximating func- 
tion, or when the quantity of the data is simply over- 
whelming and depicts a trend rather than very pre- 
cise measurements, interpolation techniques are of no 
use. It is better to find a linear combination of suit- 
able basis functions that approximates the data in some 
best sense. We first discuss the existence and unique- 
ness of a best approximant and the discrete linear least- 
squares problem. How the bestness or nearness of the 
approximation is measured is then explained. Differ- 
ent measures lead to different approximants and are 
to be used in different contexts. We discuss the impor- 
tance of orthogonal basis functions and describe the 
continuous linear least-squares problem and the mini- 
max approximation. A discussion of a connection with 
Fourier series and the interpolation and approximation 
of periodic data concludes the section. 

3.1 Existence and Uniqueness 

First and foremost we discuss the existence and uniqueness of a best approximant $p^*$ from a finite-dimensional subspace $P$ to an element $f$ from a normed linear space $V$. More specifically, we ask for which of the $\ell_1$-, $\ell_2$-, or $\ell_\infty$-norms we can guarantee that either at least one or exactly one solution exists to the approximation problem of finding $p^* \in P$ such that

$$\|f - p^*\| \le \|f - p\|, \quad p \in P.$$

The answer to the existence problem is affirmative 
for all three mentioned norms. To guarantee unique- 
ness of p* , either the norm or the subspace P under 
consideration must satisfy additional conditions. And 
we must distinguish between discrete and continuous 
approximation and norms. 

When $V$ is strictly convex, in other words, when a sphere in $V$ does not contain line segments, so that

$$\|x_1 - c\| = r = \|x_2 - c\| \;\Rightarrow\; \|\lambda x_1 + (1 - \lambda)x_2 - c\| < r, \quad 0 < \lambda < 1,$$

then the best approximant $p^*$ to $f$ is unique. This applies, for instance, to the $\ell_2$- or Euclidean norm, in both the discrete and continuous cases.

In the discussion of the role of $P$ with respect to the uniqueness of $p^*$, we deal with the continuous case first. When a basis $\{b_0(x), \ldots, b_n(x)\}$ for $P$ satisfies the Haar condition, meaning that every linear combination

$$q_n(x) = \lambda_0 b_0(x) + \cdots + \lambda_n b_n(x)$$

has at most $n$ zeros, then the continuous best $L_1$ and best $L_\infty$ approximation problems also have a unique solution.

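Let us now look at the discrete best approximation problem in somewhat more detail. We consider a large data set of values $f_i$ that we want to approximate by a linear combination of some linearly independent basis functions $b_j(x)$:

$$\lambda_0 b_0(x_i) + \cdots + \lambda_n b_n(x_i) = f_i, \quad i = 0, \ldots, m > n. \tag{13}$$

This $(m + 1) \times (n + 1)$ linear system can be written compactly as

$$A\lambda = f, \quad \lambda = \begin{pmatrix} \lambda_0 \\ \vdots \\ \lambda_n \end{pmatrix}, \quad f = \begin{pmatrix} f_0 \\ \vdots \\ f_m \end{pmatrix}, \quad A = (b_j(x_i))_{i=0,\ldots,m;\; j=0,\ldots,n} \in \mathbb{R}^{(m+1) \times (n+1)}. \tag{14}$$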


Unless the right-hand side $f$ lies in the column space of $A$, the system cannot be solved exactly. The residual vector is given by

$$r = f - A\lambda \in \mathbb{R}^{m+1},$$

and the solution $\lambda$ we are looking for is the one that solves the system best, in other words, the one that makes the magnitude (or norm) of the residual vector minimal. The least-squares problem corresponds to using the Euclidean norm or the $\ell_2$-norm $\|r\|_2 = (r_0^2 + \cdots + r_m^2)^{1/2}$ to measure the residual vector, and the optimization problem translates to

$$(A^T A)\lambda = A^T f,$$

which is a square linear system of equations called the normal equations [IV.10 §7.1]. If the matrix $A$ of the overdetermined linear system has maximal column rank, then the matrix $A^T A$ is nonsingular and the solution is unique.
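To make the normal equations concrete, the following Python sketch fits a straight line $\lambda_0 + \lambda_1 x$ (so $n = 1$, $b_0(x) = 1$, $b_1(x) = x$) to a handful of data points by forming and solving the $2 \times 2$ system $(A^T A)\lambda = A^T f$ directly. The function name and the data are made up for illustration.

```python
def fit_line_normal_equations(xs, fs):
    """Least-squares fit of lambda_0 + lambda_1 * x via the normal equations."""
    m1 = len(xs)                              # number of data points, m + 1
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sf = sum(fs)
    sxf = sum(x * f for x, f in zip(xs, fs))
    # A^T A = [[m1, sx], [sx, sxx]] and A^T f = [sf, sxf]; solve by Cramer's rule
    det = m1 * sxx - sx * sx
    lam0 = (sxx * sf - sx * sxf) / det
    lam1 = (m1 * sxf - sx * sf) / det
    return lam0, lam1


xs = [0.0, 1.0, 2.0, 3.0]
fs = [1.1, 2.9, 5.2, 6.8]                     # illustrative noisy data
lam0, lam1 = fit_line_normal_equations(xs, fs)

# The residual is orthogonal to the columns of A, which is what A^T (f - A lam) = 0 says
residual = [f - (lam0 + lam1 * x) for x, f in zip(xs, fs)]
print(lam0, lam1)
print(sum(residual), sum(x * r for x, r in zip(xs, residual)))
```

The two printed sums are the components of $A^T r$ and vanish up to rounding, which is the geometric content of the normal equations.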

When every $(n + 1) \times (n + 1)$ submatrix of the matrix $A$ in (14) is nonsingular, then the discrete best $\ell_\infty$ approximation problem has a unique solution as well. An example showing the lack of uniqueness of the best $\ell_1$ approximation under the same condition is easy to find. Take



in (14). Then the minimum of $\|A\lambda - f\|_1$ with $n = 0$ and $b_0(x) = 1$ is the same for all $-2 \le \lambda_0 \le 1$.

In practice, instead of solving the normal equa- 
tions, more numerically stable techniques based on 
orthogonal transformations [IV. 10 §7.1] are ap- 
plied directly to the overdetermined system (14). These 
transformations do not alter the Euclidean norm of the 
residual vector r and hence have no impact on the 
optimization criterion. 

Let us now see whether the Euclidean norm is the 
correct norm to use. 

3.2 Choice of Norm 

If the optimal solution to the overdetermined linear system is the one that makes the norm $\|r\|$ of the residual minimal, then we must decide which norm to use to measure $r$. Although norms are in a sense equivalent, because each is bounded by a constant multiple of any other, with a constant depending only on the dimension, it makes quite a difference whether we minimize $\|r\|_1$, $\|r\|_2$, or $\|r\|_\infty$. Let us perform the following experiment.

Using a Gaussian random number generator with mean $\mu$ and standard deviation $\sigma$, we generate $m + 1$ numbers $f_i$. The approximation problem we consider is the computation of an estimate for $\mu$ from the data points $f_i$, where $\sigma$ expresses a tight or loose spread around $\mu$. Compare this with a real-life situation where the data $f_i$ are collected by performing some measurements of a magnitude $\mu$, and $\sigma$ represents the accuracy of the measuring tool used to obtain the $f_i$.

In the notation of (13), we want to fit the $f_i$ by a multiple of the basis function $b_0(x) = 1$ because we are looking for the constant $\mu$. The overdetermined linear system takes the form

$$\lambda_0 \cdot 1 = f_i, \quad i = 0, \ldots, m.$$

It is clear that this linear system does not have an exact solution. The residual vector is definitely nonzero. We shall see that different criteria or norms can be used to express the closeness of the estimate $\lambda_0$ for $\mu$ to the data points $f_i$ or, in other words, the magnitude of the residual vector $r$ with components $f_i - \lambda_0$, and that the standard deviation $\sigma$ will also play a role.

If the Euclidean norm is used, then the optimal estimate $\lambda_0^{(2)}$ is the mean of the $m + 1$ measurements $f_i$:

$$\lambda_0^{(2)} = \frac{1}{m + 1} \sum_{i=0}^{m} f_i.$$

If we choose the $\ell_1$-norm $\|r\|_1 = \sum_{i=0}^{m} |r_i|$ as a way to measure distances, then the value $\lambda_0^{(1)}$ that renders the $\ell_1$-norm of the residual vector minimal is the median of the values $f_i$. Any change that makes the larger values extremely large or the smaller values extremely small therefore has no impact on $\lambda_0^{(1)}$, which is rather insensitive to outliers.

When choosing as distance function the $\ell_\infty$-norm $\|r\|_\infty = \max_{i=0,\ldots,m} |r_i|$, the optimal solution $\lambda_0^{(\infty)}$ to the problem is given by

$$\lambda_0^{(\infty)} = \frac{1}{2}\Bigl(\min_{i=0,\ldots,m} f_i + \max_{i=0,\ldots,m} f_i\Bigr).$$

This can also be understood intuitively. The value for $\lambda_0$ that makes $\|r\|_\infty$ minimal is the one that makes the largest deviation minimal, so it should be right in the middle between the extremes.

So the $\ell_\infty$-norm criterion performs particularly well in the context of rather accurate data (in this experiment meaning small standard deviation $\sigma$) that suffer relatively small input errors (such as roundoff errors). When outliers or additional errors (such as from manual data input) are suspected, use of the $\ell_1$-norm is recommended. If the measurement errors are believed to be normally distributed with mean zero, then the $\ell_2$-norm is the usual choice. Approximation problems of this type are therefore called least-squares problems.
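The three estimates are easy to compare on a small data set. The Python sketch below computes the mean, the median, and the midrange for a set of measurements and for the same set with one value replaced by an outlier. The data are invented for illustration and the function name `best_constant` is not from the text.

```python
import statistics


def best_constant(fs):
    """Best constant fit to the data fs in the l2, l1, and l-infinity sense."""
    return {
        "l2": statistics.mean(fs),          # minimizes the sum of squared residuals
        "l1": statistics.median(fs),        # minimizes the sum of absolute residuals
        "linf": (min(fs) + max(fs)) / 2.0,  # minimizes the largest absolute residual
    }


data = [9.8, 10.1, 10.0, 9.9, 10.2]
with_outlier = [9.8, 10.1, 10.0, 9.9, 100.2]   # last value mistyped

clean = best_constant(data)
dirty = best_constant(with_outlier)
print(clean)
print(dirty)
```

The median is unchanged by the outlier, the mean is pulled noticeably toward it, and the midrange is affected most of all, which matches the recommendations above.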


3.3 Orthogonal Basis Functions 

In the same way that we prefer to draw a graph using an orthogonal set of axes (the smaller the angle between the axes, the more difficult it becomes to make a clear drawing), it is preferred to use a so-called orthogonal set of basis functions $b_j(x)$ in (13). Orthogonal basis functions $b_j(x)$ can tremendously improve the conditioning or sensitivity of the problem (14). They are also useful in continuous least-squares problems.

The notion of orthogonality in a function space parallels that of orthogonality in the vector space $\mathbb{R}^k$: for a positive weight function $w(x)$ defined on the interval $[a, b]$, we say that the functions $f$ and $g$ are $w$-orthogonal if

$$\langle f, g \rangle_w = \int_a^b f(x) g(x) w(x) \, dx = 0.$$

The function $w(x)$ can assign a larger weight to certain parts of the interval $[a, b]$. For instance, the function $w(x) = 1/\sqrt{1 - x^2}$ on $[-1, 1]$ assigns more weight toward the endpoints of the interval.

For $w(x) = 1$ and $[a, b] = [-1, 1]$, a sequence of orthogonal polynomials $L_i(x)$ satisfying

$$\int_{-1}^{1} L_j(x) L_k(x) \, dx = 0, \quad j \ne k,$$

is given by

$$L_0(x) = 1, \quad L_1(x) = x, \quad L_{i+1}(x) = \frac{2i + 1}{i + 1} x L_i(x) - \frac{i}{i + 1} L_{i-1}(x), \quad i \ge 1.$$

The polynomials $L_i(x)$ are called the Legendre polynomials. For $w(x) = 1/\sqrt{1 - x^2}$ and $[a, b] = [-1, 1]$, a sequence of orthogonal polynomials $T_i(x)$ satisfying

$$\int_{-1}^{1} T_j(x) T_k(x) \frac{1}{\sqrt{1 - x^2}} \, dx = 0, \quad j \ne k,$$

is given by

$$T_0(x) = 1, \quad T_1(x) = x, \quad T_{i+1}(x) = 2x T_i(x) - T_{i-1}(x), \quad i \ge 1.$$

The polynomials $T_i(x)$ are called the Chebyshev polynomials (of the first kind). They are also very useful in (continuous as well as discrete) least-squares problems, as discussed below.
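Both three-term recurrences translate directly into code. The Python sketch below evaluates $L_n(x)$ and $T_n(x)$ by running the recurrences forward; the function names are illustrative. The Chebyshev values can be compared with the well-known identity $T_n(\cos\theta) = \cos(n\theta)$.

```python
import math


def legendre(n, x):
    """Evaluate L_n(x) via L_{i+1} = ((2i+1) x L_i - i L_{i-1}) / (i+1)."""
    prev, cur = 1.0, x
    if n == 0:
        return prev
    for i in range(1, n):
        prev, cur = cur, ((2 * i + 1) * x * cur - i * prev) / (i + 1)
    return cur


def chebyshev(n, x):
    """Evaluate T_n(x) via T_{i+1} = 2 x T_i - T_{i-1}."""
    prev, cur = 1.0, x
    if n == 0:
        return prev
    for _ in range(1, n):
        prev, cur = cur, 2.0 * x * cur - prev
    return cur


theta = 0.7
print(chebyshev(5, math.cos(theta)), math.cos(5 * theta))
print(legendre(3, 0.3), (5 * 0.3 ** 3 - 3 * 0.3) / 2)
```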

When the polynomials are to be used on an interval $[a, b]$ different from $[-1, 1]$, the simple change of variable

$$x \to \frac{2}{b - a}\Bigl(x - \frac{a + b}{2}\Bigr)$$

transforms the interval $[a, b]$ to the interval $[-1, 1]$, on which the orthogonal polynomials are defined.

Orthogonal polynomials also satisfy the Haar condition, so every linear combination

$$q_n(x) = a_0 p_0(x) + \cdots + a_n p_n(x)$$

of the orthogonal polynomials $p_i(x)$ of degree $i = 0, \ldots, n$ has at most $n$ zeros. Therefore, orthogonal polynomials are also a suitable basis in which to express an interpolating function; the system of interpolation conditions

$$\sum_{j=0}^{n} a_j p_j(x_i) = f_i, \quad i = 0, \ldots, n,$$

has a coefficient matrix that is guaranteed to be nonsingular for mutually distinct points $x_i$.

The importance of orthogonal basis functions in interpolation and approximation cannot be overstated. Problems become numerically better conditioned and formulas simplify. For instance, the Chebyshev polynomials $T_i(x)$ also satisfy the discrete orthogonality

$$\sum_{i=0}^{n} T_j(x_i) T_k(x_i) = (1 + \delta_{k0}) \frac{n + 1}{2} \delta_{jk}, \quad j, k = 0, 1, \ldots, n,$$

where $\delta_{ij}$ is the Kronecker delta [I.2 §2, table 3] and

$$x_i = \cos\Bigl(\frac{(2i + 1)\pi}{2(n + 1)}\Bigr)$$

are the zeros of the Chebyshev polynomial $T_{n+1}$. When expressing the polynomial interpolant $p_n(x)$ in (1) of degree $n$ in the Chebyshev basis,

$$p_n(x) = \sum_{j=0}^{n} a_j T_j(x),$$

an easy explicit formula for $p_n(x)$ interpolating the values $f_i$ at the points $x_i$ can be given:

$$p_n(x) = \frac{\sum_{i=0}^{n} f_i}{n + 1} + \sum_{j=1}^{n} \Bigl(\frac{2 \sum_{i=0}^{n} f_i T_j(x_i)}{n + 1}\Bigr) T_j(x).$$

Another elegant explicit formula, based on the contin- 
uous orthogonality property of the Chebyshev polyno- 
mials, is given in (16). 
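The explicit interpolation formula in the Chebyshev basis can be sketched in a few lines of Python. The code below builds the interpolant at the zeros of $T_{n+1}$, with constant coefficient $\sum_i f_i/(n+1)$ and coefficients $2\sum_i f_i T_j(x_i)/(n+1)$ for $j \ge 1$. It evaluates $T_j$ through $T_j(x) = \cos(j \arccos x)$, which is valid on $[-1, 1]$. The function names and the choice of $e^x$ with $n = 5$ are illustrative.

```python
import math


def chebyshev_T(j, x):
    """T_j(x) = cos(j * arccos(x)) for x in [-1, 1]."""
    return math.cos(j * math.acos(x))


def chebyshev_interpolant(f, n):
    """Degree-n interpolant of f at the zeros of T_{n+1}, in the Chebyshev basis."""
    nodes = [math.cos((2 * i + 1) * math.pi / (2 * (n + 1))) for i in range(n + 1)]
    fs = [f(x) for x in nodes]
    c0 = sum(fs) / (n + 1)
    cs = [2.0 * sum(fi * chebyshev_T(j, xi) for fi, xi in zip(fs, nodes)) / (n + 1)
          for j in range(1, n + 1)]

    def p(x):
        return c0 + sum(c * chebyshev_T(j, x) for j, c in enumerate(cs, start=1))

    return nodes, fs, p


nodes, fs, p = chebyshev_interpolant(math.exp, 5)
print(max(abs(p(x) - fx) for x, fx in zip(nodes, fs)))  # interpolation error at the nodes
print(abs(p(0.3) - math.exp(0.3)))                       # error away from the nodes
```

No linear system is solved: the discrete orthogonality of the $T_j$ at these points turns the interpolation conditions into simple sums.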

The practical utility of Chebyshev polynomials is 
illustrated by the open source Chebfun software sys- 
tem (see www.chebfun.org) for numerical computation 
with functions, which is built on piecewise-polynomial 
interpolation at the extrema of Chebyshev polynomials, 
or what is equivalent (via the fast Fourier transform 
[II. 10]), expansions in Chebyshev polynomials. 


3.4 Chebyshev Series 

Let us now choose the basis functions $b_j(x) = T_j(x)$ and look for the coefficients $\lambda_j$ that make the $L_2$-norm of

$$f(x) - \sum_{j=0}^{n} \lambda_j T_j(x), \quad -1 \le x \le 1,$$

minimal for $f(x)$ defined on $[-1, 1]$, for simplicity. We are looking for the polynomial $p_n(x)$ of degree $n$ that is closest to $f(x)$, where we measure the distance between the functions using the inner product

$$\|f - p_n\|_2^2 = \langle f - p_n, f - p_n \rangle = \int_{-1}^{1} \frac{(f - p_n)^2(x)}{\sqrt{1 - x^2}} \, dx. \tag{15}$$

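This is a continuous least-squares problem because the norm of a function is minimized instead of the norm of a finite-dimensional vector. Since

$$\begin{aligned}
\Bigl\langle f - \sum_{j=0}^{n} \lambda_j T_j, \; f - \sum_{j=0}^{n} \lambda_j T_j \Bigr\rangle = \|f\|_2^2 &- \sum_{j=0}^{n} \Bigl(\frac{\langle f, T_j \rangle}{\|T_j\|_2}\Bigr)^2 \\
&+ \sum_{j=0}^{n} \Bigl(\frac{\langle f, T_j \rangle}{\|T_j\|_2} - \lambda_j \|T_j\|_2\Bigr)^2,
\end{aligned}$$

in which only the last sum of squares depends on $\lambda_j$, the minimum is attained for the so-called Chebyshev coefficients

$$\lambda_j = \langle f, T_j \rangle / \langle T_j, T_j \rangle. \tag{16}$$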


The partial sum of degree $n$ of the Chebyshev series development of a function,

$$\sum_{j=0}^{n} \frac{\langle f, T_j \rangle}{\langle T_j, T_j \rangle} T_j(x),$$

is therefore the best polynomial approximation of degree $n$ to $f(x)$ in the $L_2$ sense. Since

$$\Bigl| f(x) - \sum_{j=0}^{n} \frac{\langle f, T_j \rangle}{\langle T_j, T_j \rangle} T_j(x) \Bigr| \le \sum_{j=n+1}^{\infty} \Bigl| \frac{\langle f, T_j \rangle}{\langle T_j, T_j \rangle} \Bigr|,$$

this error can be made arbitrarily small when the series of Chebyshev coefficients converges absolutely, a condition that is automatically satisfied for functions that are continuously differentiable in $[-1, 1]$.

The above technique can be generalized to any weight 
function and its associated family of orthogonal poly- 
nomials: when switching the weight function, the norm 
criterion (15) changes and the orthogonal basis is 
changed. 

The Chebyshev series partial sums are good overall approximations to a function $f(x)$ defined on the interval $[-1, 1]$ (or $[a, b]$ after a suitable change of variable). To illustrate this, in figure 6 we compare, for the function $f(x) = \arctan(x)$, the error plots of the Chebyshev series partial sum of degree 9 with the Taylor series partial sum of the same degree. Its Chebyshev series and Taylor series developments are, respectively, given by

$$\arctan(x) = \sum_{i=0}^{\infty} (-1)^i \frac{2(\sqrt{2} - 1)^{2i+1}}{2i + 1} T_{2i+1}(x),$$

$$\arctan(x) = \sum_{i=0}^{\infty} (-1)^i \frac{1}{2i + 1} x^{2i+1}.$$
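The difference between the two degree-9 partial sums for $\arctan(x)$ is easy to reproduce numerically. The Python sketch below sums the first five terms of each development (degrees 1, 3, 5, 7, 9) and compares the errors at the endpoint $x = 1$, where the Taylor partial sum is at its worst. The function names are illustrative, and $T_{2i+1}(x)$ is evaluated as $\cos((2i+1)\arccos x)$.

```python
import math


def taylor_arctan(x, terms):
    """Taylor partial sum of arctan with the given number of odd-degree terms."""
    return sum((-1) ** i * x ** (2 * i + 1) / (2 * i + 1) for i in range(terms))


def chebyshev_arctan(x, terms):
    """Chebyshev partial sum of arctan on [-1, 1] with the given number of terms."""
    c = math.sqrt(2.0) - 1.0
    theta = math.acos(x)
    return sum((-1) ** i * 2.0 * c ** (2 * i + 1) / (2 * i + 1)
               * math.cos((2 * i + 1) * theta)
               for i in range(terms))


x = 1.0
taylor_error = abs(taylor_arctan(x, 5) - math.atan(x))
chebyshev_error = abs(chebyshev_arctan(x, 5) - math.atan(x))
print(taylor_error, chebyshev_error)
```

At $x = 1$ the Taylor error is of the order of a few hundredths, while the Chebyshev error is several orders of magnitude smaller, since its coefficients decay like $(\sqrt{2} - 1)^{2i+1}$.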

Although explicit formulas for the Taylor series expansion of most elementary and special functions are known, the same is not true for Chebyshev series expansions. For most functions, the coefficients (16) have to be computed numerically because no analytic expression for (16) can be given.

Figure 6 Error plots of Chebyshev (solid line) and Taylor (dashed line) partial sums of degree 9 for $\arctan(x)$.


3.5 The Minimax Approximation 


Instead of minimizing the $L_2$-distance (15) between a function $f(x) \in C([a, b])$ and a polynomial model for $f(x)$, we can consider the problem of minimizing the $L_\infty$-distance. Every continuous function $f(x)$ defined on a closed interval $[a, b]$ has a unique so-called minimax polynomial approximant of degree $n$. This means that there exists a unique polynomial $p_n = p_n^*$ of degree $\partial p_n \le n$ that minimizes

$$\|f - p_n\|_\infty = \max_{x \in [a, b]} \Bigl| f(x) - \sum_{j=0}^{n} a_j x^j \Bigr|. \tag{17}$$


More generally, if the set of basis functions $\{b_0(x), \ldots, b_n(x)\}$ satisfies the Haar condition, then there exists a unique approximant

$$q_n^*(x) = \lambda_0^* b_0(x) + \cdots + \lambda_n^* b_n(x)$$

that minimizes

$$\|f - q_n\|_\infty = \max_{x \in [a, b]} \Bigl| f(x) - \sum_{j=0}^{n} \lambda_j b_j(x) \Bigr|.$$


The minimum is attained and is not an infimum. When $b_j(x) = x^j$ the polynomial $p_n^*(x)$ is computed using the Remez algorithm, which is based on its characterization given by the alternation property of the function $(f - p_n^*)(x)$: $p_n^*$ is the best polynomial approximant of degree $n$ if the error $\|f - p_n^*\|_\infty$ is attained by the function $f - p_n^*$ in at least $n + 2$ points $y_0, \ldots, y_{n+1}$ in the interval $[a, b]$ and this with alternating sign, meaning that

$$\exists\, y_0 > y_1 > \cdots > y_{n+1} \in [a, b] : \quad (f - p_n^*)(y_i) = s(-1)^i \|f - p_n^*\|_\infty, \quad s = \pm 1, \quad i = 0, \ldots, n + 1.$$

The Remez algorithm is an iterative procedure, and the polynomial $p_n^*(x)$ is obtained as the limit. The above characterization is also called the equioscillation property. We illustrate it in figure 7, where we plot the error $e^x - p_n^*(x)$ on $[-1, 1]$. Compare this figure with figures 8 and 9, in which the error oscillates but does not equioscillate.
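The equioscillation property can be observed without running the Remez iteration in the simplest case. For degree $n = 1$ and a convex function such as $e^x$ on $[-1, 1]$, the minimax line is known in closed form: its slope is the slope of the chord through the endpoints, and the error attains its extreme value at both endpoints and at the interior point where the derivative equals that slope. The Python sketch below uses this closed form, which is a swapped-in shortcut and not the Remez algorithm itself; all names are illustrative.

```python
import math

# Slope of the chord through (-1, e^-1) and (1, e): (e - 1/e) / 2 = sinh(1)
m = math.sinh(1.0)
# Interior extremum of the error, where the derivative of e^x equals the slope
c = math.log(m)
# Intercept chosen so that the errors at x = -1 and x = c are equal with opposite sign
b = (math.exp(-1.0) + 2.0 * m - m * c) / 2.0


def error(x):
    """Error e^x - (m x + b) of the linear minimax approximant."""
    return math.exp(x) - (m * x + b)


E = error(-1.0)
grid = [-1.0 + 2.0 * k / 1000 for k in range(1001)]
print(E, error(c), error(1.0))
print(max(abs(error(x)) for x in grid))
```

The error takes the values $+E$, $-E$, $+E$ at the $n + 2 = 3$ points $-1$, $c$, $1$, and nowhere on the interval does its magnitude exceed $E$.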

How much better the (nonlinear) minimax approximation is, compared with a linear approximation procedure of degree $n$ such as polynomial interpolation, Chebyshev approximation, and the like, is expressed by the norm $\|P_n\|_\infty = \sup_{\|f\|_\infty \le 1} \|P_n(f)\|_\infty$ of the linear operator $P_n$ that associates with a function its particular linear approximant. Since $P_n(p_n^*) = p_n^*$, we have

$$\begin{aligned}
\|f - P_n(f)\|_\infty &= \|f - p_n^* + p_n^* - P_n(f)\|_\infty \\
&= \|f - p_n^* + P_n(p_n^* - f)\|_\infty \\
&\le (1 + \|P_n\|_\infty) \|f - p_n^*\|_\infty.
\end{aligned}$$

Figure 8 Error plot of the Chebyshev partial sum of degree 3 for $e^x$.

Figure 9 Error plot of the polynomial interpolant of degree 3 in the Chebyshev zeros for $e^x$.


The value $\|P_n\|_\infty$ is called the Lebesgue constant. When dealing with polynomial interpolation, $\|P_n\|_\infty = L_n$, with $L_n$ given by (3).

The quality of the continuous best $L_2$ approximant (such as that in figure 8) is expressed by

$$\|P_n\|_\infty = \frac{1}{\pi} \int_0^{\pi} \Bigl| \frac{\sin((n + \frac{1}{2})\theta)}{\sin(\frac{1}{2}\theta)} \Bigr| \, d\theta,$$

which again grows only logarithmically with $n$. Continuous $L_2$ polynomial approximation and Lagrange interpolation in the Chebyshev zeros (such as in figure 9) can therefore be considered near-best polynomial approximants.
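The slow growth of this constant can be seen by evaluating the integral $\frac{1}{\pi}\int_0^\pi |\sin((n + \frac12)\theta)/\sin(\frac12\theta)|\,d\theta$ numerically. The Python sketch below uses a plain midpoint rule, which is an assumption of convenience and not a method from the text; the function name and the number of panels are illustrative.

```python
import math


def chebyshev_projection_norm(n, panels=20000):
    """Midpoint-rule estimate of (1/pi) * int_0^pi |sin((n+1/2)t) / sin(t/2)| dt."""
    h = math.pi / panels
    total = 0.0
    for k in range(panels):
        theta = (k + 0.5) * h   # midpoints avoid the removable singularity at 0
        total += abs(math.sin((n + 0.5) * theta) / math.sin(0.5 * theta))
    return total * h / math.pi


norms = {n: chebyshev_projection_norm(n) for n in (0, 1, 10, 100)}
for n, value in norms.items():
    print(n, value)
```

For $n = 0$ the integrand is identically 1, and for $n = 1$ the integral can be done by hand and equals $\frac13 + \frac{2\sqrt3}{\pi}$. Going from $n = 1$ to $n = 100$ only roughly doubles the constant.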

3.6 Fourier Series 

Let us return to a discrete approximation problem. Our interest is now in data exhibiting some periodic behavior, such as the description of rotation-invariant geometric figures or the sampling of a sound waveform. A suitable set of orthogonal basis functions is the set

$$1, \cos(x), \cos(2x), \ldots, \cos(nx), \sin(x), \sin(2x), \ldots, \sin(nx) \tag{18}$$

as long as the distinct data points $x_i$ with $i = 0, \ldots, m$ are evenly spaced on an interval of length $2\pi$ because then, for any two basis functions $b_j(x)$ and $b_k(x)$ from (18), we have

$$\begin{aligned}
\langle b_j, b_k \rangle &= \sum_{i=0}^{m} b_j(x_i) b_k(x_i) = \frac{m + 1}{2} \delta_{jk}, \quad j \ne 0, \\
\langle b_0, b_k \rangle &= \sum_{i=0}^{m} b_0(x_i) b_k(x_i) = (m + 1) \delta_{0k},
\end{aligned}$$


where $\delta_{ij}$ is the Kronecker delta. For simplicity we assume that the real data are given on $[0, 2\pi)$ at

$$x_0 = 0, \quad x_1 = \frac{2\pi}{m + 1}, \quad x_2 = \frac{4\pi}{m + 1}, \quad \ldots, \quad x_m = \frac{2m\pi}{m + 1}.$$

Because of the periodicity, the value at $x_{m+1} = 2\pi$ equals the value at $x_0$ and therefore it is not repeated. Let $m \ge 2n$ and consider the approximation

$$\frac{\lambda_0}{2} + \sum_{j=1}^{n} \lambda_{2j} \cos(jx) + \sum_{j=1}^{n} \lambda_{2j-1} \sin(jx).$$


The values

$$\lambda_{2j} = \frac{2}{m + 1} \sum_{i=0}^{m} f_i \cos(jx_i), \quad j = 0, \ldots, n,$$

$$\lambda_{2j-1} = \frac{2}{m + 1} \sum_{i=0}^{m} f_i \sin(jx_i), \quad j = 1, \ldots, n,$$

minimize the $\ell_2$-norm of the residual, or equivalently the sum of squares

$$\sum_{i=0}^{m} \Bigl( \frac{\lambda_0}{2} + \sum_{j=1}^{n} \lambda_{2j} \cos(jx_i) + \sum_{j=1}^{n} \lambda_{2j-1} \sin(jx_i) - f_i \Bigr)^2.$$

When $m = 2n$, the minimum is zero because the least-squares approximant becomes a trigonometric interpolant. Note that we have replaced $\lambda_0$ by $\frac{1}{2}\lambda_0$ because $\langle b_j, b_j \rangle$ is smaller by a factor of 2 when $j \ne 0$.

If we form a single complex quantity $\Lambda_j = \lambda_{2j} - \mathrm{i}\lambda_{2j-1}$ for $j = 1, \ldots, n$, where $\mathrm{i} = \sqrt{-1}$, these summations can be computed using a discrete Fourier transform that maps the data $f_k$ at the points $x_k$ to the $\Lambda_j$:

$$\Lambda_j = \frac{2}{m + 1} \sum_{k=0}^{m} f_k e^{-\mathrm{i} j x_k}, \quad j = 0, \ldots, n.$$
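The coefficient formulas for evenly spaced periodic data can be tried out directly. The Python sketch below computes the cosine and sine coefficients by plain summation (not by a fast Fourier transform) for $m + 1 = 5$ samples and $n = 2$, so that $m = 2n$ and the least-squares approximant interpolates the data. The function names and the test signal $1 + 2\cos x - 3\sin 2x$ are illustrative.

```python
import math


def trig_coefficients(fs, n):
    """Cosine and sine coefficients for data fs at x_i = 2*pi*i/(m+1), i = 0..m."""
    m1 = len(fs)                                   # m + 1 samples
    xs = [2.0 * math.pi * i / m1 for i in range(m1)]
    cos_c = [2.0 / m1 * sum(f * math.cos(j * x) for f, x in zip(fs, xs))
             for j in range(n + 1)]                # j = 0, ..., n
    sin_c = [2.0 / m1 * sum(f * math.sin(j * x) for f, x in zip(fs, xs))
             for j in range(1, n + 1)]             # j = 1, ..., n
    return xs, cos_c, sin_c


def trig_eval(cos_c, sin_c, x):
    """Evaluate cos_c[0]/2 + sum_j cos_c[j] cos(jx) + sum_j sin_c[j-1] sin(jx)."""
    value = cos_c[0] / 2.0
    value += sum(c * math.cos(j * x) for j, c in enumerate(cos_c[1:], start=1))
    value += sum(s * math.sin(j * x) for j, s in enumerate(sin_c, start=1))
    return value


def signal(x):
    return 1.0 + 2.0 * math.cos(x) - 3.0 * math.sin(2.0 * x)


n = 2
samples = [signal(2.0 * math.pi * i / 5) for i in range(5)]
xs, cos_c, sin_c = trig_coefficients(samples, n)
print(cos_c, sin_c)
print(max(abs(trig_eval(cos_c, sin_c, x) - f) for x, f in zip(xs, samples)))
```

The constant term of the signal shows up as half of the first cosine coefficient, which is the reason for the factor $\frac12$ in front of the coefficient with index 0.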


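The functions in (18) also satisfy a continuous orthogonality property:

$$\begin{aligned}
\int_0^{2\pi} \cos(jx) \cos(kx) \, dx &= \pi \delta_{jk}, \quad j \ne 0, \\
\int_0^{2\pi} \cos(jx) \cos(kx) \, dx &= 2\pi \delta_{jk}, \quad j = 0, \\
\int_0^{2\pi} \cos(jx) \sin(kx) \, dx &= 0, \\
\int_0^{2\pi} \sin(jx) \sin(kx) \, dx &= \pi \delta_{jk}.
\end{aligned}$$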


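They can therefore be used for a Fourier series representation of a function $f(x)$:

$$f(x) = \frac{\lambda_0}{2} + \sum_{j=1}^{\infty} \lambda_{2j} \cos(jx) + \sum_{j=1}^{\infty} \lambda_{2j-1} \sin(jx),$$

$$\lambda_{2j} = \frac{\langle f(x), \cos(jx) \rangle}{\pi}, \quad j = 0, 1, \ldots, \qquad \lambda_{2j-1} = \frac{\langle f(x), \sin(jx) \rangle}{\pi}, \quad j = 1, 2, \ldots.$$

The partial sum of trigonometric degree $n$ of this series minimizes the $L_2$-norm

$$\Bigl\| f(x) - \frac{\lambda_0}{2} - \sum_{j=1}^{n} \lambda_{2j} \cos(jx) - \sum_{j=1}^{n} \lambda_{2j-1} \sin(jx) \Bigr\|_2.$$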



4 Multivariate Interpolation and Approximation

The approximation of multivariate functions (continuous ones as well as discontinuous ones) is an active field of research due to its large variety of applications in the computational sciences and engineering. A wide range of multivariate generalizations of the above interpolation and approximation problems to functions of several variables $x, y, z, \ldots$ have therefore been developed: polynomial and rational ones, discrete and continuous ones.

A fundamental issue in multivariate interpolation and approximation is the so-called curse of dimensionality [I.3 §2], meaning that, when the dimensionality increases, the number of different combinations of variables grows exponentially in the dimensionality. A polynomial of degree 3 in eight variables already has 165 terms! Another problem is that the polynomial basis of the multinomials does not satisfy the Haar condition and there is no easy generalization of this property to the multivariate case.

In order to counter both problems, the theory of radial basis functions has been developed.

Let us consider data $f_i$ given at corresponding multidimensional vectors $x_i$, $i = 0, \ldots, n$. The data vectors $x_i$ do not have to form a grid but can be scattered. A radial basis function is a function whose value depends only on the distance from the origin or from another point, so its variable is $r = \|x\|_2$ or $r = \|x - c\|_2$. When centering a radial basis function $B(r)$ at each data point, there is a basis function $B(\|x - x_j\|_2)$ for each $j = 0, \ldots, n$. The coefficients $a_j$ in a radial basis function interpolant $\sum_{j=0}^{n} a_j B(\|x - x_j\|_2)$ are then computed from the linear system

$$\sum_{j=0}^{n} a_j B(\|x_i - x_j\|_2) = f_i, \quad i = 0, \ldots, n.$$

Several commonly used types of radial basis functions $B(r)$ guarantee nonsingular systems of interpolation conditions, in other words, nonsingular matrices

$$\bigl(B(\|x_i - x_j\|_2)\bigr)_{i,j=0,\ldots,n}.$$

We mention as examples the Gaussian, multiquadric, inverse multiquadric, and a member of the Matérn family, respectively, given by

$$B(r) = e^{-(sr)^2},$$

$$B(r) = \sqrt{1 + (sr)^2},$$

$$B(r) = 1/\sqrt{1 + (sr)^2},$$

$$B(r) = (1 + sr)e^{-sr}.$$


Figure 10 Gaussian basis function with (a) $s = 0.4$, (b) $s = 1$, and (c) $s = 3$.

The real parameter $s$ is called a shape parameter. As can be seen in figure 10, different choices for $s$ greatly influence the shape of $B(r)$. Smaller shape parameters correspond to a flatter or wider basis function. The choice of $s$ has a significant impact on the accuracy of the approximation, and finding an optimal shape parameter is not an easy problem. Another concern is the numerical conditioning of the radial basis interpolation problem, especially when the shape parameter is small.
The user often has to find the right trade-off between 
accuracy and conditioning. 
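A minimal Python sketch of Gaussian radial basis function interpolation on scattered two-dimensional data follows. It assembles the matrix of basis function values at the data points, solves the linear system with a small hand-written Gaussian elimination, and evaluates the interpolant. The helper `solve`, the sample points, the data function, and the shape parameter value 2 are all assumptions chosen for illustration.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def gaussian_rbf_interpolant(points, values, s):
    """Interpolant sum_j a_j * exp(-(s * ||x - x_j||_2)^2) through the given data."""
    def basis(p, q):
        return math.exp(-(s * math.dist(p, q)) ** 2)

    matrix = [[basis(p, q) for q in points] for p in points]
    coeffs = solve(matrix, values)

    def interpolant(x):
        return sum(a * basis(x, q) for a, q in zip(coeffs, points))

    return interpolant


# Scattered data points in the plane (no grid required) and illustrative data values
points = [(0.0, 0.0), (1.0, 0.2), (0.3, 0.9), (0.8, 0.7), (0.5, 0.4)]
values = [x + y * y for x, y in points]
rbf = gaussian_rbf_interpolant(points, values, s=2.0)

print(max(abs(rbf(p) - v) for p, v in zip(points, values)))
print(rbf((0.6, 0.5)))
```

Rerunning the sketch with a much smaller shape parameter makes the rows of the matrix nearly identical, which is the conditioning problem mentioned above.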


The concept of radial basis function also allows one 
to work mesh free. In a mesh or grid of data points each 
point has a fixed number of neighbors, and this connec- 
tivity between neighbors is used to define mathemat- 
ical operators such as the derivative and the divided 
difference. Multivariate mesh-free methods allow one
to generalize these concepts and are especially useful
when the mesh is difficult to maintain (e.g., in
high-dimensional problems, when there is nonlinear
behavior, discontinuities, singularities, etc.).

5 Future Research 

Especially in multivariate approximation theory, many 
research questions remain unsolved: theory for the 
multivariate case is not nearly as well developed as it 
is for the univariate case. But researchers continue to 
push the boundaries in the one-variable case as well: 
what is the largest function class or the most gen- 
eral domain for which a result holds, for example? 
Many papers can be found on Jackson-type inequalities 
(approximation error bounds in terms of the function’s 
smoothness), Bernstein-type inequalities (bounds on 
derivatives of polynomials), and convergence proper- 
ties of particular approximations (polynomial, spline, 
rational, trigonometric), to name just a few fundamen- 
tal topics. The development of orthogonal basis func- 
tions, on disconnected regions or in more variables, 
also deserves (and is getting) a lot of attention.

IV.10 Numerical Linear Algebra and Matrix Analysis

Nicholas J. Higham 

Matrices are ubiquitous in applied mathematics. Ordi- 
nary differential equations (ODEs) and partial differen- 
tial equations (PDEs) are solved numerically by finite- 
difference or finite-element methods, which lead to sys- 
tems of linear equations or matrix eigenvalue prob- 
lems. Nonlinear equations and optimization problems 
are typically solved using linear or quadratic models, 
which again lead to linear systems. 

Solving linear systems of equations is an ancient task, 
undertaken by the Chinese around 1 C.E., but the study 
of matrices per se is relatively recent, originating with 
Arthur Cayley’s 1858 “A memoir on the theory of matri- 
ces.” Early research on matrices was largely theoret- 
ical, with much attention focused on the development 
of canonical forms, but in the twentieth century the 
practical value of matrices started to be appreciated. 
Heisenberg used matrix theory as a tool in the develop- 
ment of quantum mechanics in the 1920s. Early propo- 
nents of the systematic use of matrices in applied math- 
ematics included Frazer, Duncan, and Collar, whose 
1938 book Elementary Matrices and Some Applications 
to Dynamics and Differential Equations emphasized the 
important role of matrices in differential equations and 
mechanics. The continued growth of matrices in appli- 
cations, together with the advent of mechanical and 
then digital computing devices, allowing ever larger 
problems to be solved, created the need for greater 
understanding of all aspects of matrices from theory 
to computation. 

This article treats two closely related topics: matrix 
analysis, which is the theory of matrices with a focus 
on aspects relevant to other areas of mathematics, and 
numerical linear algebra (also called matrix computa- 
tions), which is concerned with the construction and 
analysis of algorithms for solving matrix problems as 
well as related topics such as problem sensitivity and 
rounding error analysis. 

Important themes that are discussed in this article 
include the matrix factorization paradigm, the use of 
unitary transformations for their numerical stability, 
exploitation of matrix structure (such as sparsity, sym- 
metry, and definiteness), and the design of algorithms 
to exploit evolving computer architectures. 

Throughout the article, uppercase letters are used for 
matrices and lowercase letters for vectors and scalars. 


Matrices and vectors are assumed to be complex, unless
otherwise stated, and $A^* = (\bar{a}_{ji})$ denotes the conjugate
transpose of $A = (a_{ij})$. An unsubscripted norm $\|\cdot\|$
denotes a general vector norm and the corresponding
subordinate matrix norm. Particular norms used here
are the 2-norm $\|\cdot\|_2$ and the Frobenius norm $\|\cdot\|_F$. All of
these norms are defined in the language of applied
mathematics [I.2 §§19.3, 20]. The notation “$i = 1:n$”
means that the integer variable $i$ takes on the values
$1, 2, \ldots, n$.


1 Nonsingularity and Conditioning 


Nonsingularity of a matrix is a key requirement in many
problems, such as in the solution of $n$ linear equations
in $n$ unknowns. For some classes of matrices, nonsingularity
is guaranteed. A good example is the class of diagonally
dominant matrices. The matrix $A \in \mathbb{C}^{n \times n}$ is strictly
diagonally dominant by rows if

$$\sum_{j \ne i} |a_{ij}| < |a_{ii}|, \qquad i = 1:n,$$

and strictly diagonally dominant by columns if $A^*$ is
strictly diagonally dominant by rows. Any matrix that
is strictly diagonally dominant by rows or columns
is nonsingular (a proof can be obtained by applying
Gershgorin’s theorem in section 5.1).
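The definition translates directly into a row-sum test. The sketch below assumes NumPy; the function name and the 3 × 3 example matrix are hypothetical and chosen only for illustration. Dominance by columns is tested by applying the same function to the conjugate transpose.

```python
import numpy as np

def is_strictly_diagonally_dominant_by_rows(A):
    """Return True if sum_{j != i} |a_ij| < |a_ii| holds for every row i."""
    abs_A = np.abs(np.asarray(A))
    diag = np.diag(abs_A)
    off_diag_sums = abs_A.sum(axis=1) - diag
    return bool(np.all(off_diag_sums < diag))

# Hypothetical example: dominant by rows but not by columns.
A = np.array([[4.0, 1.0, -2.0],
              [1.0, -5.0, 3.0],
              [0.5, 0.5, 2.0]])

dominant_by_rows = is_strictly_diagonally_dominant_by_rows(A)
dominant_by_cols = is_strictly_diagonally_dominant_by_rows(A.conj().T)
print(dominant_by_rows, dominant_by_cols, np.linalg.det(A))
```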

Since data is often subject to uncertainty, we wish
to gauge the sensitivity of problems to perturbations,
which is done using condition numbers [I.2 §22]. An
appropriate condition number for the matrix inverse is

$$\lim_{\epsilon \to 0} \; \sup_{\|\Delta A\| \le \epsilon \|A\|} \frac{\|(A + \Delta A)^{-1} - A^{-1}\|}{\epsilon \|A^{-1}\|}.$$

This expression turns out to equal $\kappa(A) = \|A\| \, \|A^{-1}\|$,
which is called the condition number of $A$ with respect
to inversion. This condition number occurs in many
contexts. For example, suppose $A$ is contaminated
by errors and we perform a similarity transformation
$X^{-1}(A + E)X = X^{-1}AX + F$. Then $\|F\| = \|X^{-1}EX\| \le \kappa(X)\|E\|$
and this bound is attainable for some $E$.
Hence the errors can be multiplied by a factor as large
as $\kappa(X)$. We therefore prefer to carry out similarity
and other transformations with matrices that are
well-conditioned, that is, ones for which $\kappa(X)$ is close to its
lower bound of 1. By contrast, a matrix for which $\kappa$ is
large is called ill-conditioned. For any unitary matrix $X$,
$\kappa_2(X) = 1$, so in numerical linear algebra transformations
by unitary or orthogonal matrices are preferred
and usually lead to numerically stable algorithms.

In practice we often need an estimate of the matrix
condition number $\kappa(A)$ but do not wish to go to the
expense of computing $A^{-1}$ in order to obtain it. Fortunately,
there are algorithms that can cheaply produce a
reliable estimate of $\kappa(A)$ once a factorization of $A$ has
been computed.

Note that the determinant, $\det(A)$, is rarely computed
in numerical linear algebra. Its magnitude gives
no useful information about the conditioning of $A$, not
least because of its extreme behavior under scaling:
$\det(\alpha A) = \alpha^n \det(A)$.
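A small numerical illustration of this point, assuming NumPy and two hypothetical matrices chosen for the purpose: a scaled identity has a tiny determinant but is perfectly conditioned, while a 2 × 2 triangular matrix with determinant 1 is ill-conditioned.

```python
import numpy as np

# Tiny determinant, perfect conditioning: A = 0.1 * I of order 100.
A = 0.1 * np.eye(100)
det_A = np.linalg.det(A)     # alpha^n * det(I) = 0.1**100
cond_A = np.linalg.cond(A)   # 2-norm condition number

# Determinant 1, poor conditioning.
B = np.array([[1.0, 1.0e4],
              [0.0, 1.0]])
det_B = np.linalg.det(B)
cond_B = np.linalg.cond(B)

print(f"det(A) = {det_A:.1e}, cond(A) = {cond_A:.1f}")
print(f"det(B) = {det_B:.1f}, cond(B) = {cond_B:.1e}")
```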

2 Matrix Factorizations 

The method of Gaussian elimination (GE) for solving a
nonsingular linear system $Ax = b$ of $n$ equations in
$n$ unknowns reduces the matrix $A$ to upper triangular
form and then solves for $x$ by substitution [I.3 §7.1].
GE is typically described by writing down the equations
$a_{ij}^{(k+1)} = a_{ij}^{(k)} - a_{ik}^{(k)} a_{kj}^{(k)} / a_{kk}^{(k)}$ (and similarly for $b$) that
describe how the starting matrix $A = A^{(1)} = (a_{ij}^{(1)})$
changes on each of the $n - 1$ steps of the elimination
in its progress toward upper triangular form $U$.
Working at the element level in this way leads to a profusion
of symbols, superscripts, and subscripts that
tend to obscure the mathematical structure and hinder
insights being drawn into the underlying process.
One of the key developments in the last century was
the recognition that it is much more profitable to work
at the matrix level. Thus the basic equation above is
written as $A^{(k+1)} = M_k A^{(k)}$, where $M_k$ agrees with the
identity matrix except below the diagonal in the $k$th
column, where its $(i,k)$ element is $m_{ik} = -a_{ik}^{(k)} / a_{kk}^{(k)}$,
$i = k+1:n$. Recurring the matrix equation gives
$U := A^{(n)} = M_{n-1} \cdots M_1 A$. Taking the $M_k$ matrices over to
the left-hand side leads, after some calculations, to the
equation $A = LU$, where $L$ is unit lower triangular, with
$(i,k)$ element $-m_{ik}$. The prefix “unit” means that $L$ has
ones on the diagonal.

GE is therefore equivalent to factorizing the matrix
$A$ as the product of a lower triangular matrix and an
upper triangular matrix, something that is not at all
obvious from the element-level equations. Solving the
linear system $Ax = b$ now reduces to the task of solving
the two triangular systems $Ly = b$ and $Ux = y$.
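The matrix-level description can be sketched in a few lines. This is an illustrative implementation only, assuming NumPy, no pivoting, and nonzero pivots throughout; the function names and the example matrix are hypothetical. The multipliers follow the sign convention $m_{ik} = -a_{ik}^{(k)}/a_{kk}^{(k)}$, so $L$ stores $-m_{ik}$.

```python
import numpy as np

def lu_no_pivoting(A):
    """Return L, U with A = L U via Gaussian elimination (assumes nonzero pivots)."""
    U = np.array(A, dtype=float)
    n = U.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        m = -U[k + 1:, k] / U[k, k]            # multipliers m_ik in column k of M_k
        U[k + 1:, :] += np.outer(m, U[k, :])   # apply M_k to the current matrix
        U[k + 1:, k] = 0.0                     # entries below the pivot are eliminated
        L[k + 1:, k] = -m                      # (i, k) element of L is -m_ik
    return L, U

def forward_substitution(L, b):
    """Solve L y = b for lower triangular L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_substitution(U, y):
    """Solve U x = y for upper triangular U."""
    x = np.zeros(len(y))
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

A = np.array([[4.0, 3.0, 2.0],
              [2.0, 1.0, 3.0],
              [3.0, 4.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

L, U = lu_no_pivoting(A)
x = back_substitution(U, forward_substitution(L, b))
print(np.allclose(L @ U, A), np.allclose(A @ x, b))
```

Once `L` and `U` are available, further right-hand sides need only the two substitution calls.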

Interpreting GE as LU factorization separates the
computation of the factors from the solution of the triangular
systems. It is then clear how to solve efficiently
several systems $Ax_i = b_i$, $i = 1:r$, with different right-hand
sides but the same coefficient matrix $A$: compute
the LU factors once and then reuse them to solve for
each $x_i$ in turn.


This matrix factorization 1 viewpoint dates from 
around the 1940s and has been extremely successful 
in matrix computations. In general, a factorization is a 
representation of a matrix as a product of “simpler” 
matrices. Factorization is a tool that can be used to 
solve a variety of problems, as we will see below. 

Two particular benefits of factorizations are unity 
and modularity. GE, for example, can be organized 
in several different ways, corresponding to different 
orderings of the three nested loops that it comprises, 
as well as the use of different blockings of the matrix 
elements. Yet all of them compute the same LU factor- 
ization, carrying out the same mathematical operations 
in a different order. Without the unifying concept of a 
factorization, reasoning about these GE variants would 
be difficult. 

Modularity refers to the way that a factorization 
breaks a problem down into separate tasks that can 
be analyzed or programmed independently. To carry 
out a rounding error analysis of GE, we can analyze the 
LU factorization and the solution of the triangular sys- 
tems by substitution separately and then put the analy- 
ses together. The rounding error analysis of substitu- 
tion can be reused in the many other contexts in which 
triangular systems arise. 

An important example of the use of LU factorization
is in iterative refinement. Suppose we have used
GE to obtain a computed solution $\hat{x}$ to $Ax = b$ in
floating-point arithmetic. If we form $r = b - A\hat{x}$ and
solve $Ae = r$, then in exact arithmetic $y = \hat{x} + e$ is
the true solution. In computing $e$ we can reuse the LU
factors of $A$, so obtaining $y$ from $\hat{x}$ is inexpensive. In
practice, the computation of $r$, $e$, and $y$ is subject to
rounding errors so the computed $\hat{y}$ is not equal to $x$.
But under suitable assumptions $\hat{y}$ will be an improved
approximation and we can iterate this refinement process.
Iterative refinement is particularly effective if $r$
can be computed using extra precision.
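The refinement loop can be sketched as follows, assuming NumPy and a hypothetical random test problem. As a simplification, each correction is obtained by calling `np.linalg.solve` on the single-precision matrix, which refactorizes every time; a practical code would reuse stored LU factors as described above. The residual is computed in double precision, playing the role of the extra precision.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = rng.standard_normal((n, n))      # hypothetical test matrix
x_true = rng.standard_normal(n)
b = A @ x_true

A32 = A.astype(np.float32)           # working (single) precision copy of A

# Initial computed solution x_hat in single precision.
x_hat = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
err_before = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)

y = x_hat.copy()
for _ in range(5):
    r = b - A @ y                                           # residual in double precision
    e = np.linalg.solve(A32, r.astype(np.float32))          # correction in single precision
    y = y + e.astype(np.float64)                            # updated approximation

err_after = np.linalg.norm(y - x_true) / np.linalg.norm(x_true)
print(f"relative error before: {err_before:.1e}, after: {err_after:.1e}")
```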

Two other key factorizations are the following. 

• Cholesky factorization: for Hermitian positive-definite
$A \in \mathbb{C}^{n \times n}$, $A = R^*R$, where $R$ is upper triangular
with positive diagonal elements, and this
factorization is unique.

• QR factorization: for $A \in \mathbb{C}^{m \times n}$ with $m \ge n$, $A = QR$,
where $Q \in \mathbb{C}^{m \times m}$ is unitary ($Q^*Q = I_m$) and
$R \in \mathbb{C}^{m \times n}$ is upper trapezoidal, that is,
$R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix}$ with $R_1 \in \mathbb{C}^{n \times n}$ upper triangular.


1. Or decomposition; the two terms are essentially synonymous.

These two factorizations are related: if $A \in \mathbb{C}^{m \times n}$ with
$m \ge n$ has full rank and $A = QR$ is a QR factorization,
in which without loss of generality we can assume that
$R$ has positive diagonal, then $A^*A = R^*R$, so $R$ is the
Cholesky factor of $A^*A$.
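This relation is easy to check numerically. The sketch below assumes NumPy and a hypothetical random complex matrix. Note that `np.linalg.qr` returns the reduced factorization by default (so its `R` is the square block $R_1$) and does not normalize the diagonal of `R`, while `np.linalg.cholesky` returns a lower triangular factor $L$ with $A^*A = LL^*$, so the upper triangular Cholesky factor is $L^*$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))

Q, R = np.linalg.qr(A)                # reduced QR: Q is 6 x 4, R is 4 x 4

# Rescale so that R has a positive real diagonal: A = (Q D)(D^* R) with |D_ii| = 1.
d = np.diag(R)
phases = d / np.abs(d)
R_pos = np.conj(phases)[:, None] * R  # scale the rows of R
Q_pos = Q * phases[None, :]           # scale the columns of Q

L = np.linalg.cholesky(A.conj().T @ A)
R_chol = L.conj().T                   # upper triangular Cholesky factor of A^* A

print(np.allclose(Q_pos @ R_pos, A), np.allclose(R_pos, R_chol))
```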

The Cholesky factorization can be computed by what
is essentially a symmetric and scaled version of GE. The
QR factorization can be computed in three main ways,
one of which is the classical Gram-Schmidt orthogonalization.
The most widely used method constructs $Q$ as
a product of Householder reflectors, which are unitary
matrices of the form $H = I - 2vv^*/(v^*v)$, where $v$
is a nonzero vector. Note that $H$ is a rank-1 perturbation
of the identity, and since it is Hermitian and unitary,
it is its own inverse; that is, it is involutory. The
third approach builds $Q$ as a product of Givens rotations,
each of which is a $2 \times 2$ matrix $\begin{bmatrix} c & s \\ -s & c \end{bmatrix}$ embedded
into two rows and columns of an $m \times m$ identity matrix,
where (in the real case) $c^2 + s^2 = 1$.
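The stated properties of a Householder reflector can be verified directly. The following sketch assumes NumPy; the function name and the example vector are hypothetical. The choice $v = x + \|x\|_2 e_1$ (for $x_1 > 0$) is the usual one for mapping $x$ onto a multiple of the first unit vector.

```python
import numpy as np

def householder_reflector(v):
    """Return H = I - 2 v v^* / (v^* v) for a nonzero vector v."""
    v = np.asarray(v, dtype=complex).reshape(-1, 1)
    return np.eye(len(v)) - 2.0 * (v @ v.conj().T) / (v.conj().T @ v)

x = np.array([3.0, 4.0, 12.0])                            # ||x||_2 = 13
v = x + np.linalg.norm(x) * np.array([1.0, 0.0, 0.0])     # v = x + ||x||_2 e_1
H = householder_reflector(v)

print(np.allclose(H, H.conj().T))         # Hermitian
print(np.allclose(H @ H, np.eye(3)))      # involutory, hence unitary
print(np.round((H @ x).real, 12))         # x is mapped to -||x||_2 e_1
```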

The Cholesky factorization helps us to make the
most of the very desirable property of positive-definiteness.
For example, suppose $A$ is Hermitian positive-definite
and we wish to evaluate the scalar $\alpha = x^*A^{-1}x$.
We can rewrite it as $x^*(R^*R)^{-1}x = (x^*R^{-1})(R^{-*}x) = z^*z$,
where $z = R^{-*}x$. So once the Cholesky factorization
has been computed we need just one triangular
solve to compute $\alpha$, and of course there is no need to
explicitly invert the matrix $A$.
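A short sketch of this computation, assuming NumPy and a hypothetical real symmetric positive-definite test matrix. NumPy's `cholesky` returns the lower triangular $L = R^*$, so $z = R^{-*}x = L^{-1}x$. NumPy has no dedicated triangular solver, so `np.linalg.solve` is used here as a stand-in for the single triangular solve; it does not exploit the triangular structure.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((5, 5))
A = M.T @ M + 5.0 * np.eye(5)       # hypothetical symmetric positive-definite matrix
x = rng.standard_normal(5)

L = np.linalg.cholesky(A)           # A = L L^T, i.e. R = L^T
z = np.linalg.solve(L, x)           # stand-in for the triangular solve R^{-*} x
alpha = z @ z                       # alpha = z^* z

alpha_ref = x @ np.linalg.solve(A, x)   # reference value x^* A^{-1} x
print(alpha, alpha_ref)
```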

A matrix factorization might involve a larger number
of factors: $A = N_1 N_2 \cdots N_k$, say. It is immediate
that $A^T = N_k^T N_{k-1}^T \cdots N_1^T$. This factorization of the
transpose may have deep consequences in a particular
application. For example, the discrete Fourier transform
is the matrix-vector product $y = F_n x$, where the
$n \times n$ matrix $F_n$ has $(p,q)$ element $\exp(-2\pi i (p-1)(q-1)/n)$;
$F_n$ is a complex, symmetric matrix. The fast
Fourier transform [II.10] (FFT) is a way of evaluating
$y$ in $O(n \log_2 n)$ operations, as opposed to the $O(n^2)$
operations that are required by a standard matrix-vector
multiplication. Many variants of the FFT have
been proposed since the original 1965 paper by Cooley
and Tukey. It turns out that different FFT variants
correspond to different factorizations of $F_n$ with
$k = \log_2 n$ sparse factors. Some of these methods
correspond simply to transposing the factorization in
another method (recall that $F_n^T = F_n$), though this
was not realized when the methods were developed.
Transposition also plays an important role in automatic
differentiation [VI.7]; the so-called reverse or
adjoint mode can be obtained by transposing a matrix
factorization representation of the forward mode.
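The definition of $F_n$ can be checked against a library FFT. This sketch assumes NumPy, whose `np.fft.fft` uses the same sign convention; the function name `dft_matrix` is hypothetical. Zero-based indices $p, q = 0, \ldots, n-1$ replace $(p-1)(q-1)$.

```python
import numpy as np

def dft_matrix(n):
    """n x n matrix F_n with entries exp(-2 pi i p q / n), p, q = 0, ..., n-1."""
    p = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(p, p) / n)

n = 8
F = dft_matrix(n)
rng = np.random.default_rng(4)
x = rng.standard_normal(n)

y_matvec = F @ x            # O(n^2) matrix-vector product
y_fft = np.fft.fft(x)       # O(n log n) fast Fourier transform

print(np.allclose(F, F.T), np.allclose(y_matvec, y_fft))
```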

The factorizations described in this section are in 
“plain vanilla” form, but all have variants that incor- 
porate pivoting. Pivoting refers to row or column inter- 
changes carried out at each step of the factorization as 
it is computed, introduced either to ensure that the fac- 
torization succeeds and is numerically stable or to pro- 
duce a factorization with certain desirable properties 
usually associated with rank deficiency. For GE, partial
pivoting is normally used: at the start of the $k$th stage of
the elimination, an element $a_{rk}^{(k)}$ of largest modulus in
the $k$th column on or below the diagonal is brought into the
$(k,k)$ (pivot) position by interchanging rows $k$ and $r$.
Partial pivoting avoids dividing by zero (if $a_{kk}^{(k)} = 0$ after
the interchange then the pivot column is zero below the
diagonal and the elimination step can be skipped). More
importantly, partial pivoting ensures numerical stability
(see section 8). The overall effect of GE with partial
pivoting is to produce an LU factorization $PA = LU$,
where $P$ is a permutation matrix.
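A compact sketch of GE with partial pivoting, assuming NumPy; the function name and the example (whose $(1,1)$ entry is zero, so elimination without pivoting would break down) are hypothetical, and this is not a substitute for a library routine.

```python
import numpy as np

def lu_partial_pivoting(A):
    """Return P, L, U with P A = L U, using row interchanges (partial pivoting)."""
    U = np.array(A, dtype=float)
    n = U.shape[0]
    L = np.eye(n)
    perm = np.arange(n)
    for k in range(n - 1):
        r = k + np.argmax(np.abs(U[k:, k]))   # row holding the largest-modulus entry
        if r != k:                            # interchange rows k and r
            U[[k, r], :] = U[[r, k], :]
            L[[k, r], :k] = L[[r, k], :k]
            perm[[k, r]] = perm[[r, k]]
        if U[k, k] == 0.0:
            continue                          # column is zero below the diagonal: skip
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k:] -= np.outer(L[k + 1:, k], U[k, k:])
        U[k + 1:, k] = 0.0
    P = np.eye(n)[perm]                       # row i of P A is row perm[i] of A
    return P, L, U

A = np.array([[0.0, 2.0, 1.0],
              [1.0, 1.0, 1.0],
              [2.0, 1.0, 0.0]])
P, L, U = lu_partial_pivoting(A)
print(np.allclose(P @ A, L @ U), np.max(np.abs(L)))
```

Because the pivot has the largest modulus in its column, every multiplier stored in `L` is bounded by 1 in modulus.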

Pivoted variants of Cholesky factorization and QR
factorization take the form $P^TAP = R^*R$ and
$AP = Q\begin{bmatrix} R \\ 0 \end{bmatrix}$, where $P$ is a permutation matrix and $R$ satisfies
the inequalities

$$|r_{kk}|^2 \ge \sum_{i=k}^{j} |r_{ij}|^2, \qquad j = k+1:n, \quad k = 1:n.$$

If $A$ is rank deficient then $R$ has the form $R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & 0 \end{bmatrix}$
with $R_{11}$ nonsingular, and the rank of $A$ is the dimension
of $R_{11}$. Equally importantly, when $A$ is nearly rank
deficient this tends to be revealed by a small trailing
diagonal block of $R$.

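A factorization of great importance in a wide variety
of applications is the singular value decomposition
[II.32] (SVD) of $A \in \mathbb{C}^{m \times n}$:

$$A = U \Sigma V^*, \qquad \Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_p) \in \mathbb{R}^{m \times n}, \tag{1}$$

where $p = \min(m, n)$, $U \in \mathbb{C}^{m \times m}$ and $V \in \mathbb{C}^{n \times n}$ are
unitary, and the singular values $\sigma_i$ satisfy
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$. For a square $A$ ($m = n$), the 2-norm
condition number is given by $\kappa_2(A) = \sigma_1/\sigma_n$.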
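The decomposition (1) and the formula for $\kappa_2$ can be checked with NumPy's SVD, which returns the singular values in decreasing order together with $V^*$ rather than $V$. The test matrix below is a hypothetical random example.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))

U, sigma, Vh = np.linalg.svd(A)     # A = U diag(sigma) Vh, with Vh = V^*
kappa2 = sigma[0] / sigma[-1]       # kappa_2(A) = sigma_1 / sigma_n

print(np.allclose(U @ np.diag(sigma) @ Vh, A))
print(kappa2, np.linalg.cond(A, 2))
```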

The polar decomposition of $A \in \mathbb{C}^{m \times n}$ with $m \ge n$
is a factorization $A = UH$ in which $U \in \mathbb{C}^{m \times n}$ has
orthonormal columns and $H \in \mathbb{C}^{n \times n}$ is Hermitian
positive-semidefinite. The matrix $H$ is unique and is
given by $(A^*A)^{1/2}$, where the exponent $1/2$ denotes the
principal square root [II.14], while $U$ is unique if $A$
has full rank. The polar decomposition generalizes to
matrices the polar representation $z = re^{i\theta}$ of a complex
number. The Hermitian polar factor $H$ is also known as
the matrix absolute value, $|A|$, and is much studied in
matrix analysis and functional analysis.

One reason for the importance of the polar decomposition
is that it provides an optimal way to orthogonalize
a matrix; a result of Fan and Hoffman (1955) says
that $U$ is the nearest matrix with orthonormal columns
to $A$ in any unitarily invariant norm (a unitarily invariant
norm is one with the property that $\|UAV\| = \|A\|$
for any unitary $U$ and $V$; the 2-norm and the Frobenius
norm are particular examples). In various applications
a matrix $A \in \mathbb{R}^{n \times n}$ that should be orthogonal
drifts from orthogonality because of rounding or other
errors; replacing it by the orthogonal polar factor $U$ is
then a good strategy.
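One way to obtain the polar factors is from the SVD: if $A = P\Sigma Q^*$ is a reduced SVD then $U = PQ^*$ and $H = Q\Sigma Q^*$. The sketch below assumes NumPy; the function name and the perturbed orthogonal matrix are hypothetical. It restores orthogonality to a matrix that has drifted and compares distances in the Frobenius norm.

```python
import numpy as np

def polar_decomposition(A):
    """Return U (orthonormal columns) and H (Hermitian psd) with A = U H, via the SVD."""
    P, sigma, Qh = np.linalg.svd(A, full_matrices=False)
    U = P @ Qh
    H = Qh.conj().T @ (sigma[:, None] * Qh)   # Q diag(sigma) Q^*
    return U, H

rng = np.random.default_rng(6)
Q0 = np.linalg.qr(rng.standard_normal((4, 4)))[0]    # an exactly orthogonal matrix
A = Q0 + 1e-3 * rng.standard_normal((4, 4))          # simulated drift from orthogonality

U, H = polar_decomposition(A)
print(np.linalg.norm(A.T @ A - np.eye(4)))           # A is not orthogonal
print(np.linalg.norm(U.T @ U - np.eye(4)))           # U is orthogonal to rounding error
print(np.linalg.norm(A - U), np.linalg.norm(A - Q0)) # U is at least as close to A as Q0
```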

The polar decomposition also solves the orthogonal
Procrustes problem, for $A, B \in \mathbb{C}^{m \times n}$,

$$\min\{\|A - BQ\|_F : Q \in \mathbb{C}^{n \times n},\ Q^*Q = I\},$$

for which any solution $Q$ is a unitary polar factor of
$B^*A$. This problem comes from factor analysis and multidimensional
scaling in statistics, where the aim is to
see whether two data sets $A$ and $B$ are the same up to
an orthogonal transformation.
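A small check of this statement, assuming NumPy and hypothetical synthetic data: $A$ is built as $BQ_{\mathrm{true}}$ for a known orthogonal $Q_{\mathrm{true}}$, and the unitary polar factor of $B^*A$ (computed from its SVD) is expected to recover it.

```python
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((10, 3))                         # hypothetical data set
Q_true = np.linalg.qr(rng.standard_normal((3, 3)))[0]    # hidden orthogonal transformation
A = B @ Q_true                                           # second data set

# Unitary polar factor of B^* A from the SVD B^* A = P Sigma W^*: Q = P W^*.
P, _, Wh = np.linalg.svd(B.T @ A)
Q = P @ Wh

print(np.allclose(Q, Q_true), np.linalg.norm(A - B @ Q))
```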

Either of the SVD and the polar decomposition can 
be derived, or computed, from the other. Histori- 
cally, the SVD came first (Beltrami, in 1873), with the 
polar decomposition three decades behind (Autonne, 
in 1902). 

3 Distance to Singularity and 
Low-Rank Perturbations 

The question commonly arises of whether a given perturbation
of a nonsingular matrix $A$ preserves nonsingularity.
In a sense, this question is trivial. Recalling
that a square matrix is nonsingular when all its eigenvalues
are nonzero, and that the product of two matrices
is nonsingular unless one of them is singular, from
$A + \Delta A = A(I + A^{-1}\Delta A)$ we see that $A + \Delta A$ is nonsingular
as long as $A^{-1}\Delta A$ has no eigenvalue equal to $-1$.
However, this is not an easy condition to check, and in
practice we may not know $\Delta A$ but only a bound for its
norm. Since the norm of a matrix is at least as large as the modulus
of every eigenvalue, a sufficient condition for $A + \Delta A$
to be nonsingular is that $\|A^{-1}\Delta A\| < 1$, which is certainly
true if $\|A^{-1}\| \, \|\Delta A\| < 1$. This condition can be
rewritten as the inequality $\|\Delta A\|/\|A\| < \kappa(A)^{-1}$, where
$\kappa(A) = \|A\| \, \|A^{-1}\| \ge 1$ is the condition number introduced
in section 1. It turns out that we can always
find a perturbation $\Delta A$ such that $A + \Delta A$ is singular
and $\|\Delta A\|/\|A\| = \kappa(A)^{-1}$. It follows that the relative
distance to singularity

$$d(A) = \min\{\|\Delta A\|/\|A\| : A + \Delta A \text{ is singular}\} \tag{2}$$

is given by $d(A) = \kappa(A)^{-1}$. This reciprocal relation
between problem conditioning and the distance to a
singular problem (one with an infinite condition number)
is common to a variety of problems in linear algebra
and control theory, as shown by James Demmel in
the 1980s.
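For the 2-norm the minimizing perturbation can be written down from the SVD: $\Delta A = -\sigma_n u_n v_n^*$ removes the smallest singular value. The sketch below assumes NumPy and a hypothetical random matrix; it checks that $A + \Delta A$ is numerically singular and that $\|\Delta A\|_2/\|A\|_2 = 1/\kappa_2(A)$.

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 5))

U, sigma, Vh = np.linalg.svd(A)
dA = -sigma[-1] * np.outer(U[:, -1], Vh[-1, :])    # rank-1 perturbation -sigma_n u_n v_n^*

relative_size = np.linalg.norm(dA, 2) / np.linalg.norm(A, 2)
smallest_sv = np.linalg.svd(A + dA, compute_uv=False)[-1]

print(relative_size, 1.0 / np.linalg.cond(A, 2))   # the two values agree
print(smallest_sv)                                 # zero to rounding error
```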

We may want a more refined test for whether $A + \Delta A$
is nonsingular. To obtain one we will need to make
some assumptions about the perturbation. Suppose
that $\Delta A$ has rank 1: $\Delta A = xy^*$ for some vectors $x$ and
$y$. From the analysis above we know that $A + \Delta A$ will
be nonsingular if $A^{-1}\Delta A = A^{-1}xy^*$ has no eigenvalue
equal to $-1$. Using the fact that the nonzero eigenvalues
of $AB$ are the same as those of $BA$ for any conformable
matrices $A$ and $B$, we see that the nonzero eigenvalues
of $(A^{-1}x)y^*$ are the same as those of $y^*A^{-1}x$. Hence
$A + xy^*$ is nonsingular as long as $y^*A^{-1}x \ne -1$.

Now that we know when $A + xy^*$ is nonsingular, we
might ask if there is an explicit formula for the inverse.
Since $A + xy^* = A(I + A^{-1}xy^*)$, we can take $A = I$
without loss of generality. So we are looking for the
inverse of $B = I + xy^*$. One way to find it is to guess
that $B^{-1} = I + \theta xy^*$ for some scalar $\theta$ and equate the
product with $B$ to $I$, to obtain $\theta(1 + y^*x) + 1 = 0$. Thus
$(I + xy^*)^{-1} = I - xy^*/(1 + y^*x)$. The corresponding
formula for $(A + xy^*)^{-1}$ is

$$(A + xy^*)^{-1} = A^{-1} - A^{-1}xy^*A^{-1}/(1 + y^*A^{-1}x),$$

which is known as the Sherman-Morrison formula.
This formula and its generalizations originated in the
1940s and have been rediscovered many times. The
corresponding formula for a rank-$p$ perturbation is
the Sherman-Morrison-Woodbury formula: for $U, V \in \mathbb{C}^{n \times p}$,

$$(A + UV^*)^{-1} = A^{-1} - A^{-1}U(I + V^*A^{-1}U)^{-1}V^*A^{-1}.$$
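Both formulas can be verified on a small example. The sketch assumes NumPy and hypothetical real data (so $y^* = y^T$ and $V^* = V^T$); explicit inverses are formed only to check the identities, not as a recommended way of computing.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 4.0]])
A_inv = np.linalg.inv(A)

# Rank-1 update (Sherman-Morrison).
x = np.array([1.0, 0.0, 1.0])
y = np.array([1.0, 1.0, 0.0])
denom = 1.0 + y @ A_inv @ x                      # must be nonzero
sm_inverse = A_inv - np.outer(A_inv @ x, y @ A_inv) / denom

# Rank-2 update (Sherman-Morrison-Woodbury).
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]])
small = np.eye(2) + V.T @ A_inv @ U              # p x p matrix I + V^* A^{-1} U
smw_inverse = A_inv - A_inv @ U @ np.linalg.solve(small, V.T @ A_inv)

print(np.allclose(sm_inverse, np.linalg.inv(A + np.outer(x, y))))
print(np.allclose(smw_inverse, np.linalg.inv(A + U @ V.T)))
```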

Important applications of these formulas are in optimization,
where rank-1 or rank-2 updates are made
to Hessian approximations in quasi-Newton methods
[IV.11 §4.2] and to basis matrices in the simplex
method [IV.11 §3.1]. More generally, the task of updating
the solution to a problem after a coefficient matrix
has undergone a low-rank change, or has had a row or
column added or removed, arises in many applications,
including signal processing [IV.35], where new data
is continually being received and old data is discarded.

The minimal distance in the definition (2) of the distance
to singularity $d(A)$ can be shown to be attained
for a rank-1 matrix $\Delta A$. Rank-1 matrices often feature
in the solutions of matrix optimization problems.

4 Computational Cost 

In order to compare competing methods and predict 
their practical efficiency, we need to know their com- 
putational cost. Traditionally, computational cost has 
been measured by counting the number of scalar arith- 
metic operations and retaining only the highest-order 
terms in the total. For example, using GE we can solve
a system of $n$ linear equations in $n$ unknowns with
$n^3/3 + O(n^2)$ additions, $n^3/3 + O(n^2)$ multiplications,
and $O(n)$ divisions. This is typically summarized as
$2n^3/3$ flops, where a flop denotes any of the scalar
operations +, -, *, /. Most standard problems involving
$n \times n$ matrices can be solved with a cost of order $n^3$
flops or less, so the interest is in the exponent (1, 2, or 3)
and the constant of the dominant term. However, the 
costs of moving data around a computer’s hierarchi- 
cal memory and the costs of communicating between 
different processors on a multiprocessor system can 
be equally important. Simply counting flops does not 
therefore necessarily give a good guide to performance 
in practice. 

Seemingly trivial problems can offer interesting chal- 
lenges as regards minimizing arithmetic costs. For 
matrices A, B, and C of any dimensions such that the 
product ABC is defined, how should we compute the 
product? The associative law for matrix multiplication 
tells us that (AB)C = A(BC), but this mathematical 
equivalence is not a computational one. To see why, 
note that for three vectors $a, b, c \in \mathbb{R}^n$ we can write

$$\underbrace{(ab^*)}_{n \times n} \, c = a \, \underbrace{(b^*c)}_{1 \times 1}.$$

Evaluation of the left-hand side requires $O(n^2)$ flops, as
there is an outer product $ab^*$ and then a matrix-vector
product to evaluate, while evaluation of the right-hand
side requires just $O(n)$ flops, as it involves only vector
operations: an inner product and a vector scaling.
One should always be alert for opportunities to use the 
associative law to save computational effort. 
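In code the two orderings look as follows; a minimal sketch assuming NumPy and hypothetical random vectors. Both expressions give the same vector up to rounding, but the first forms an $n \times n$ matrix while the second needs only $O(n)$ storage and work.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500
a, b, c = rng.standard_normal((3, n))

left = np.outer(a, b) @ c    # (a b^*) c: n x n outer product, then a matrix-vector product
right = a * (b @ c)          # a (b^* c): inner product, then a vector scaling

print(np.allclose(left, right))
```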

5 Eigenvalue Problems 

The eigenvalue problem $Ax = \lambda x$ for a square matrix
$A \in \mathbb{C}^{n \times n}$, which seeks an eigenvalue $\lambda \in \mathbb{C}$ and an
eigenvector $x \ne 0$, arises in many forms. Depending on
the application we may want all the eigenvalues or just
a subset, such as the ten that have the largest real part,
and eigenvectors may or may not be required as well.
Whether the problem is Hermitian or non-Hermitian
changes its character greatly. In particular, while a Hermitian
matrix has real eigenvalues and a linearly independent
set of $n$ eigenvectors that can be taken to
be orthonormal, the eigenvalues of a non-Hermitian
matrix can be anywhere in the complex plane and there
may not be a set of eigenvectors that spans $\mathbb{C}^n$.

Figure 1 Gershgorin disks for the matrix in (3);
the eigenvalues are marked as solid dots.

5.1 Bounds and Localization 

One of the first questions to ask is whether we can find
a finite region containing the eigenvalues. The answer
is yes because $Ax = \lambda x$ implies $|\lambda| \, \|x\| = \|Ax\| \le \|A\| \, \|x\|$,
and hence $|\lambda| \le \|A\|$. So all the eigenvalues lie
in a disk of radius $\|A\|$ about the origin. More refined
bounds are provided by Gershgorin’s theorem.

Theorem 1 (Gershgorin’s theorem, 1931). The eigenvalues
of $A \in \mathbb{C}^{n \times n}$ lie in the union of the $n$ disks in
the complex plane

$$D_i = \Bigl\{ z \in \mathbb{C} : |z - a_{ii}| \le \sum_{j \ne i} |a_{ij}| \Bigr\}, \qquad i = 1:n.$$

An extension of the theorem says that if $k$ disks form
a connected region that is isolated from the other disks
then there are precisely $k$ eigenvalues in this region.
The Gershgorin disks for the matrix

[Matrix (3): layout lost in extraction; surviving entries: 1/3, 1/3, 1/3, 3/2, -2, 1/4]    (3)

are shown in figure 1. We can conclude that there is one
eigenvalue in the disk centered at 3, one in the disk
centered at 6, and two in the union of the other two
disks.




Gershgorin's theorem is most useful for matrices
that are close to diagonal, such as those eventually produced
by the Jacobi iterative method for eigenvalues of
Hermitian matrices. Improved estimates can be sought
by applying Gershgorin’s theorem to a matrix $D^{-1}AD$
similar to $A$, with the diagonal matrix $D$ chosen in an
attempt to isolate and shrink the disks. Many variants
of Gershgorin's theorem exist with disks replaced by
other shapes.
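The disks are cheap to compute. The sketch below assumes NumPy; the function name and the example are hypothetical (the example is not the matrix (3) of the text). It returns the centers $a_{ii}$ and radii $\sum_{j \ne i} |a_{ij}|$ and checks that each computed eigenvalue lies in at least one disk.

```python
import numpy as np

def gershgorin_disks(A):
    """Return the centers a_ii and radii sum_{j != i} |a_ij| of the Gershgorin disks."""
    A = np.asarray(A)
    centers = np.diag(A)
    radii = np.abs(A).sum(axis=1) - np.abs(centers)
    return centers, radii

# Hypothetical nearly diagonal matrix.
A = np.array([[5.0, 0.5, 0.2],
              [0.1, -1.0, 0.3],
              [0.2, 0.1, 2.0]])

centers, radii = gershgorin_disks(A)
eigenvalues = np.linalg.eigvals(A)

# in_some_disk[k] is True if eigenvalue k lies in at least one disk.
in_some_disk = [bool(np.any(np.abs(lam - centers) <= radii + 1e-12)) for lam in eigenvalues]
print(centers, radii, in_some_disk)
```

Since the three disks here are disjoint, the extension of the theorem implies that each contains exactly one eigenvalue.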

The spectral radius $\rho(A)$ (the largest absolute value
of any eigenvalue of $A$) satisfies $\rho(A) \le \|A\|$, as shown
above, but this inequality can be arbitrarily weak, as
the matrix $\begin{bmatrix} 1 & \theta \\ 0 & 1 \end{bmatrix}$ shows for $|\theta| \gg 1$. It is natural to ask
whether there are any sharper relations between the
spectral radius and norms. One answer is the equality

$$\rho(A) = \lim_{k \to \infty} \|A^k\|^{1/k}. \tag{4}$$

Another is the result that given any $\epsilon > 0$ there is a
norm such that $\|A\| \le \rho(A) + \epsilon$; however, the norm
depends on $A$. This result can be used to give a proof of
the fact, discussed in the article on the Jordan canonical
form [II.22], that the powers of $A$ converge to zero
if $\rho(A) < 1$.
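The weakness of $\rho(A) \le \|A\|$ and the limit (4) can both be seen on the 2 × 2 example. A minimal sketch assuming NumPy, with the hypothetical value $\theta = 100$: the spectral radius is 1, the 2-norm is about 100, and $\|A^k\|_2^{1/k}$ decreases slowly toward 1.

```python
import numpy as np

theta = 100.0
A = np.array([[1.0, theta],
              [0.0, 1.0]])

rho = np.max(np.abs(np.linalg.eigvals(A)))    # spectral radius
norm2 = np.linalg.norm(A, 2)                  # 2-norm, roughly |theta|

# ||A^k||_2^(1/k) for increasing k; A^k = [[1, k*theta], [0, 1]].
ks = (1, 10, 100, 1000)
estimates = [np.linalg.norm(np.linalg.matrix_power(A, k), 2) ** (1.0 / k) for k in ks]

print(rho, norm2)
print(estimates)
```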

The field of values, also known as the numerical
range, is a tool that can be used for localization and
many other purposes. It is defined for $A \in \mathbb{C}^{n \times n}$ by

$$F(A) = \Bigl\{ \frac{z^*Az}{z^*z} : 0 \ne z \in \mathbb{C}^n \Bigr\}.$$

The set $F(A)$ is compact and convex (a nontrivial property
proved by Toeplitz and Hausdorff), and it contains
all the eigenvalues of $A$. For normal matrices it
is the convex hull [II.8] of the eigenvalues. The normal
matrices $A$ are those for which $AA^* = A^*A$, and
they include the Hermitian, the skew-Hermitian, and
the unitary matrices. For a Hermitian matrix, $F(A)$ is
a segment of the real axis, while for a skew-Hermitian
matrix it is a segment of the imaginary axis. Figure 2
illustrates two fields of values, the second of which is
the convex hull of the eigenvalues because a circulant
matrix [I.2 §18] is normal.




Figure 2 Fields of values for (a) a pentadiagonal Toeplitz 
matrix and (b) a circulant matrix, both of dimension 32. The 
eigenvalues are denoted by crosses. 


5.2 Eigenvalue Sensitivity

If $A$ is perturbed, how much do its eigenvalues change?
This question is easy to answer for a simple eigenvalue
$\lambda$, one that has algebraic multiplicity [II.22] 1. We
need the notion of a left eigenvector of $A$ corresponding
to $\lambda$, which is a nonzero vector $y$ such that $y^*A = \lambda y^*$.
If $\lambda$ is simple with right and left eigenvectors $x$ and
$y$, respectively, then there is an eigenvalue $\lambda + \Delta\lambda$ of
$A + \Delta A$ such that $\Delta\lambda = y^*\Delta A x/(y^*x) + O(\|\Delta A\|_2^2)$ and
so

$$|\Delta\lambda| \le \frac{\|y\|_2 \|x\|_2}{|y^*x|} \|\Delta A\|_2 + O(\|\Delta A\|_2^2).$$

The term $\|y\|_2\|x\|_2/|y^*x|$ can be shown to be an
(absolute) condition number for $\lambda$. It is at least 1 and
tends to infinity as $y$ and $x$ approach orthogonality
(which can never exactly be achieved for simple $\lambda$), so
$\lambda$ can be very ill-conditioned. However, if $A$ is Hermitian
then we can take $y = x$ and the bound simplifies
to $|\Delta\lambda| \le \|\Delta A\|_2 + O(\|\Delta A\|_2^2)$, so all the eigenvalues of
a Hermitian matrix are perfectly conditioned.

Much research has been done to obtain eigenvalue
perturbation bounds under both weaker and stronger
assumptions about the problem. Suppose we drop the
requirement that $\lambda$ is simple. Consider the matrix and
perturbation

$$A = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \qquad \Delta A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ \epsilon & 0 & 0 \end{bmatrix}.$$

The eigenvalues of $A$ are all zero, and those of $A + \Delta A$
are the third roots of $\epsilon$. The change in the eigenvalue
is proportional not to $\epsilon$ but to a fractional power of $\epsilon$.
In general, the sensitivity of an eigenvalue depends on
the Jordan structure [II.22] for that eigenvalue.
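The fractional-power behavior for a multiple eigenvalue is easy to observe numerically. The sketch below assumes NumPy and takes the 3 × 3 matrix with ones on the superdiagonal, perturbed by $\epsilon$ in its $(3,1)$ position, with the hypothetical value $\epsilon = 10^{-6}$; the computed eigenvalues all have modulus close to $\epsilon^{1/3} = 10^{-2}$, which is ten thousand times larger than the perturbation itself.

```python
import numpy as np

eps = 1e-6
A = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])     # all eigenvalues are zero
dA = np.zeros((3, 3))
dA[2, 0] = eps                      # perturbation of norm eps

perturbed_eigenvalues = np.linalg.eigvals(A + dA)
moduli = np.abs(perturbed_eigenvalues)

print(moduli, eps ** (1.0 / 3.0))
```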


5.3 Companion Matrices and the Characteristic 
Polynomial 

The eigenvalues of a matrix $A$ are the roots of its
characteristic polynomial [I.2 §20], $\det(\lambda I - A)$.
Conversely, associated with the polynomial

$$p(\lambda) = \lambda^n - a_{n-1}\lambda^{n-1} - \cdots - a_0$$

is the companion matrix

$$C = \begin{bmatrix}
a_{n-1} & a_{n-2} & \cdots & \cdots & a_0 \\
1 & 0 & \cdots & \cdots & 0 \\
0 & 1 & \ddots & & \vdots \\
\vdots & & \ddots & 0 & \vdots \\
0 & \cdots & \cdots & 1 & 0
\end{bmatrix},$$

and the eigenvalues of $C$ are the roots of $p$, as noted in
methods of solution [I.3 §7.2].

This relation means that the roots of a polynomial 
can be found by computing the eigenvalues of an n x n 
matrix, and this approach is used by some computer 
codes, for example the roots function of MATLAB. 
While standard eigenvalue algorithms do not exploit 
the structure of C, this approach has proved competi- 
tive with specialist polynomial root-finding algorithms. 
Another use for the relation is to obtain bounds for 
roots of polynomials from bounds for matrix eigen- 
values, and vice versa. 
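As an illustration, the sketch below (assuming NumPy; the function name `companion` is hypothetical) builds the companion matrix in the form given above, with the coefficients $a_{n-1}, \ldots, a_0$ in the first row, for $p(\lambda) = (\lambda - 1)(\lambda - 2)(\lambda - 3) = \lambda^3 - 6\lambda^2 + 11\lambda - 6$, so that $a_2 = 6$, $a_1 = -11$, and $a_0 = 6$ under the sign convention of the text.

```python
import numpy as np

def companion(a):
    """Companion matrix of p(t) = t^n - a[0] t^(n-1) - ... - a[n-1], a = [a_{n-1}, ..., a_0]."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    C = np.zeros((n, n))
    C[0, :] = a                    # first row holds a_{n-1}, ..., a_0
    C[1:, :-1] = np.eye(n - 1)     # ones on the subdiagonal
    return C

C = companion([6.0, -11.0, 6.0])
roots_from_eig = np.sort(np.linalg.eigvals(C).real)
roots_from_numpy = np.sort(np.roots([1.0, -6.0, 11.0, -6.0]).real)

print(roots_from_eig, roots_from_numpy)
```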

Companion matrices have many interesting properties.
For example, any nonderogatory [II.22] $n \times n$
matrix is similar to a companion matrix. Companion
matrices have therefore featured strongly in matrix 
analysis and also in control theory. However, similar- 
ity transformations to companion form are little used 
in practice because of problems with ill-conditioning 
and numerical instability. 

Returning to the characteristic polynomial,
$p(\lambda) = \det(\lambda I - A) = \lambda^n - a_{n-1}\lambda^{n-1} - \cdots - a_0$, we know that
$p(\lambda_i) = 0$ for every eigenvalue $\lambda_i$ of $A$. The Cayley-Hamilton
theorem says that $p(A) = A^n - a_{n-1}A^{n-1} - \cdots - a_0 I = 0$
(which cannot be obtained simply by
putting “$\lambda = A$” in the previous expression!). Hence the
$n$th power of $A$ and inductively all higher powers are
expressible as a linear combination of $I, A, \ldots, A^{n-1}$.
Moreover, if $A$ is nonsingular then from $A^{-1}p(A) = 0$ it
follows that $A^{-1}$ can also be written as a polynomial in
$A$ of degree at most $n - 1$. These relations are not useful
for practical computation because the coefficients
$a_i$ can vary tremendously in magnitude, and it is not
possible to compute them to high relative accuracy.

5.4 Eigenvalue Inequalities for Hermitian Matrices 

The eigenvalues of Hermitian matrices $A \in \mathbb{C}^{n \times n}$,
which in this section we order $\lambda_n \le \cdots \le \lambda_1$, satisfy
many beautiful inequalities. Among the most important
are those in the Courant-Fischer theorem (1905),
which states that every eigenvalue is the solution of a
min-max problem over a suitable subspace S of $\mathbb{C}^n$:

$$\lambda_i = \min_{\dim(S) = n-i+1} \; \max_{0 \ne x \in S} \frac{x^*Ax}{x^*x}.$$

Special cases are $\lambda_n = \min_{x \ne 0} x^*Ax/(x^*x)$ and
$\lambda_1 = \max_{x \ne 0} x^*Ax/(x^*x)$.

Taking x to be a unit vector $e_i$ in the previous formula
for $\lambda_1$ gives $\lambda_1 \ge a_{ii}$ for all i. This inequality is just
the first in a sequence of inequalities relating sums of
eigenvalues to sums of diagonal elements, obtained by
Schur in 1923:

$$\sum_{i=1}^{k} \lambda_i \ge \sum_{i=1}^{k} \tilde{a}_{ii}, \quad k = 1:n, \qquad (5)$$

where $\{\tilde{a}_{ii}\}$ is the set of diagonal elements of A
arranged in decreasing order: $\tilde{a}_{11} \ge \cdots \ge \tilde{a}_{nn}$. There
is equality for $k = n$, since both sides equal trace(A).
These inequalities say that the vector $[\lambda_1, \dots, \lambda_n]$ of
eigenvalues majorizes the vector $[\tilde{a}_{11}, \dots, \tilde{a}_{nn}]$ of
diagonal elements.
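The majorization inequalities can be checked numerically. The sketch below, which assumes NumPy and uses an arbitrary random Hermitian matrix, compares partial sums of the eigenvalues with partial sums of the sorted diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
A = (X + X.conj().T) / 2                      # random Hermitian test matrix

eigs = np.sort(np.linalg.eigvalsh(A))[::-1]   # lambda_1 >= ... >= lambda_n
diag = np.sort(np.diag(A).real)[::-1]         # diagonal in decreasing order

partial_eigs = np.cumsum(eigs)                # sums of the k largest eigenvalues
partial_diag = np.cumsum(diag)                # sums of the k largest diagonal entries

tol = 1e-12                                   # allowance for rounding errors
majorizes = bool(np.all(partial_eigs >= partial_diag - tol))
trace_gap = abs(partial_eigs[-1] - partial_diag[-1])   # equality for k = n
```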

In general there is no useful formula for the eigen- 
values of a sum A + B of Hermitian matrices. How- 
ever, the Courant-Fischer theorem yields the upper and 
lower bounds 

$$\lambda_k(A) + \lambda_n(B) \le \lambda_k(A + B) \le \lambda_k(A) + \lambda_1(B),$$

from which it follows that $|\lambda_k(A + B) - \lambda_k(A)| \le
\max(|\lambda_n(B)|, |\lambda_1(B)|) = \|B\|_2$. The latter inequality
again shows that the eigenvalues of a Hermitian matrix
are well-conditioned under perturbation.

The Cauchy interlace theorem has a different flavor. It 
relates the eigenvalues of successive leading principal 
submatrices $A_k = A(1:k, 1:k)$ by

$$\lambda_{k+1}(A_{k+1}) \le \lambda_k(A_k) \le \lambda_k(A_{k+1}) \le \cdots \le \lambda_2(A_{k+1}) \le \lambda_1(A_k) \le \lambda_1(A_{k+1})$$

for $k = 1:n-1$, showing that the eigenvalues of $A_k$
interlace those of $A_{k+1}$.
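A small numerical check of the interlacing property, sketched with NumPy on an arbitrary random real symmetric matrix, compares the eigenvalues of successive leading principal submatrices.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 6))
A = (X + X.T) / 2                      # real symmetric, hence Hermitian

def eigs_desc(M):
    """Eigenvalues of a Hermitian matrix in decreasing order."""
    return np.linalg.eigvalsh(M)[::-1]

tol = 1e-12                            # allowance for rounding errors
interlaces = True
for k in range(1, A.shape[0]):
    small = eigs_desc(A[:k, :k])           # lambda_i(A_k),     i = 1:k
    big = eigs_desc(A[:k + 1, :k + 1])     # lambda_i(A_{k+1}), i = 1:k+1
    # lambda_{i+1}(A_{k+1}) <= lambda_i(A_k) <= lambda_i(A_{k+1})
    ok = np.all(big[1:] <= small + tol) and np.all(small <= big[:-1] + tol)
    interlaces = interlaces and bool(ok)
```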

In 1962 Alfred Horn made a conjecture that a cer- 
tain set of linear inequalities involving real numbers 
$\alpha_i$, $\beta_i$, and $\gamma_i$, $i = 1:n$, is necessary and sufficient for
the existence of $n \times n$ Hermitian matrices A, B, and C
with eigenvalues the $\alpha_i$, $\beta_i$, and $\gamma_i$, respectively, such
that C = A + B. The conjecture was open for many 
years but was finally proved to be true in papers pub- 
lished by Klyachko in 1998 and by Knutson and Tao in 
1999, which exploited deep connections with algebraic 
geometry, representations of Lie groups, and quantum 
cohomology. 





5.5 Solving the Non-Hermitian Eigenproblem 

The simplest method for computing eigenvalues, the 
power method, computes just one: the largest in mod- 
ulus. It comprises repeated multiplication of a starting 
vector x by A. Since the resulting sequence is liable to 
overflow or underflow in floating-point arithmetic, one 
normalizes the vector after each iteration. Therefore 
one step of the power method has the form $x \leftarrow Ax$,
$x \leftarrow \nu^{-1}x$, where $\nu = x_j$ with $|x_j| = \max_i |x_i|$. If A
has a unique eigenvalue $\lambda$ of largest modulus and the
starting vector has a component in the direction of the
corresponding eigenvector, then $\nu$ converges to $\lambda$ and x
converges to the corresponding eigenvector. The power
method is most often applied to $(A - \mu I)^{-1}$, where $\mu$ is
an approximation to an eigenvalue of interest. In this
form it is known as inverse iteration and convergence is
to the eigenvalue closest to $\mu$. We now turn to methods
that compute all the eigenvalues.
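Before moving on, here is a minimal sketch of the power method and inverse iteration as just described, assuming NumPy. The function names, iteration counts, and test matrix are illustrative choices; a practical code would use a convergence test and would factorize $A - \mu I$ once rather than calling a dense solver at every step.

```python
import numpy as np

def power_method(A, x, iters=200):
    """Power method with x <- A x, x <- x / nu, nu = entry of largest modulus."""
    nu = 0.0
    for _ in range(iters):
        x = A @ x
        nu = x[np.argmax(np.abs(x))]
        x = x / nu
    return nu, x

def inverse_iteration(A, mu, x, iters=50):
    """Power method applied to (A - mu*I)^{-1}, implemented with linear solves."""
    B = A - mu * np.eye(A.shape[0])
    nu = 0.0
    for _ in range(iters):
        x = np.linalg.solve(B, x)
        nu = x[np.argmax(np.abs(x))]
        x = x / nu
    # nu approximates 1 / (lambda - mu), so lambda is recovered as mu + 1 / nu.
    return mu + 1.0 / nu, x

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 1.0]])
x0 = np.ones(3)
lam_max, v_max = power_method(A, x0)              # eigenvalue of largest modulus
lam_near, v_near = inverse_iteration(A, 1.0, x0)  # eigenvalue closest to mu = 1
```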

Since similarities $X^{-1}AX$ preserve the eigenvalues
and change the eigenvectors in a controlled way, carrying
out a sequence of similarity transformations to
reduce A to a simpler form is a natural way to tackle the
eigenproblem. Some early methods used nonunitary X,
but such transformations are now avoided because of
numerical instability when X is ill-conditioned. Since
the 1960s the focus has been on using unitary similarities
to compute the Schur decomposition $A = QTQ^*$,
where Q is unitary and T is upper triangular. The diagonal
entries of T are the eigenvalues of A, and they can
be made to appear in any order by appropriate choice
of Q. The first k columns of Q span an invariant
subspace [I.2 §20] corresponding to the eigenvalues
$t_{11}, \dots, t_{kk}$. Eigenvectors can be obtained by solving
triangular systems involving T.

For some matrices the Schur factor T is diagonal;
these are precisely the normal matrices defined in section 5.1.
The real Schur decomposition contains only
real matrices when A is real: $A = QRQ^T$, where Q is
orthogonal and R is real upper quasitriangular, which
means that R is upper triangular except for $2 \times 2$ blocks
on the diagonal corresponding to complex conjugate
eigenvalues.

The standard algorithm for solving the non-Hermitian
eigenproblem is the QR algorithm, which was
proposed independently by John Francis and Vera
Kublanovskaya in 1961. The matrix $A \in \mathbb{C}^{n \times n}$ is
first unitarily reduced to upper Hessenberg form
$H = U^*AU$ ($h_{ij} = 0$ for $i > j + 1$), with U a product of
Householder matrices. The QR iteration constructs a
sequence of upper Hessenberg matrices beginning with
$H_1 = H$ defined by $H_k - \mu_k I =: Q_kR_k$ (QR factorization,
computed using Givens rotations), $H_{k+1} := R_kQ_k + \mu_k I$,
where the $\mu_k$ are shifts chosen to accelerate the convergence
of $H_k$ to upper triangular form. It is easy to
check that $H_{k+1} = Q_k^*H_kQ_k$, so the QR iteration carries
out a sequence of unitary similarity transformations.
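The shifted QR step can be sketched in a few lines, assuming NumPy. Note that `np.linalg.qr` computes the factorization with Householder reflections rather than Givens rotations, and the shift used here (the trailing diagonal entry) is just one simple choice made for illustration; none of the refinements needed for a practical algorithm are included.

```python
import numpy as np

def qr_step(H, mu):
    """One shifted QR step: factor H - mu*I = Q R and return (R Q + mu*I, Q)."""
    n = H.shape[0]
    Q, R = np.linalg.qr(H - mu * np.eye(n))
    return R @ Q + mu * np.eye(n), Q

# A small symmetric tridiagonal (hence upper Hessenberg) test matrix.
H1 = np.array([[4.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 1.0, 1.0]])

# One step is a unitary (here orthogonal) similarity: H2 = Q1^* H1 Q1.
H2, Q1 = qr_step(H1, mu=H1[-1, -1])
similarity_error = np.linalg.norm(H2 - Q1.T @ H1 @ Q1)

# Repeating the step drives the last subdiagonal entry toward zero,
# at which point the trailing diagonal entry is an eigenvalue.
H = H1.copy()
for _ in range(10):
    H, _ = qr_step(H, mu=H[-1, -1])
last_subdiag = abs(H[-1, -2])
converged_eigenvalue = H[-1, -1]
```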

Why the QR iteration works is not obvious but can
be elegantly explained by analyzing the subspaces
spanned by the columns of $Q_k$. To produce a practical
and efficient algorithm various refinements of the
iteration are needed, which include

• deflation, whereby when an element on the first
subdiagonal of $H_k$ becomes small, that element
is set to zero and the problem is split into two
smaller problems that are solved independently,

• a double shift technique for real A that allows 
two QR steps with complex conjugate shifts to be 
carried out entirely in real arithmetic and gives 
convergence to the real Schur form, and 

• a multishift technique for including m different 
shifts in a single QR iteration. 

A proof of convergence is lacking for all current shift 
strategies. Implementations introduce a random shift 
when convergence appears to be stagnating. The QR 
algorithm works very well in practice and continues 
to be the method of choice for the non-Hermitian 
eigenproblem. 

5.6 Solving the Hermitian Eigenproblem 

The eigenvalue problem for Hermitian matrices is eas- 
ier to solve than that for non-Hermitian matrices, and 
the range of available numerical methods is much 
wider. 

To solve the complete Hermitian eigenproblem we
need to compute the spectral decomposition
$A = QDQ^*$, where $D = \operatorname{diag}(\lambda_i)$ contains the eigenvalues
and the columns of the unitary matrix Q are the corresponding
eigenvectors. Many methods begin by unitary
reduction to tridiagonal form $T = U^*AU$, where
$t_{ij} = 0$ for $|i - j| > 1$ and the unitary matrix U is
constructed as a product of Householder matrices. The
eigenvalue problem for T is much simpler, though still
nontrivial. The most widely used method is the QR algorithm,
which has the same form as in the non-Hermitian
case but with the upper Hessenberg $H_k$ replaced by
the Hermitian tridiagonal $T_k$ and the shifts chosen to
accelerate the convergence of $T_k$ to diagonal form. The
Hermitian QR algorithm with appropriate shifts has
been proved to converge at a cubic rate.

Another method for solving the Hermitian tridiagonal
eigenproblem is the divide and conquer method.
This method decouples T in the form

$$T = \begin{bmatrix} T_{11} & 0 \\ 0 & T_{22} \end{bmatrix} + \alpha vv^*,$$

where only the trailing diagonal element of $T_{11}$ and the
leading diagonal element of $T_{22}$ differ from the corresponding
elements of T and hence the vector v has
only two nonzero elements. The eigensystems of $T_{11}$
and $T_{22}$ are found by applying the method recursively,
yielding $T_{11} = Q_1\Lambda_1Q_1^*$ and $T_{22} = Q_2\Lambda_2Q_2^*$. Then

$$T = \begin{bmatrix} Q_1\Lambda_1Q_1^* & 0 \\ 0 & Q_2\Lambda_2Q_2^* \end{bmatrix} + \alpha vv^*
= \operatorname{diag}(Q_1, Q_2)\bigl(\operatorname{diag}(\Lambda_1, \Lambda_2) + \alpha\tilde{v}\tilde{v}^*\bigr)\operatorname{diag}(Q_1, Q_2)^*,$$

where $\tilde{v} = \operatorname{diag}(Q_1, Q_2)^*v$. The eigensystem of a rank-1
perturbed diagonal matrix $D + \rho zz^*$ can be found by
solving the secular equation obtained by equating the
characteristic polynomial to zero:

$$f(\lambda) = 1 + \rho \sum_{j=1}^{n} \frac{|z_j|^2}{d_{jj} - \lambda} = 0.$$

Putting the pieces together yields the overall eigendecomposition.

Other methods are suitable for computing just a portion
of the spectrum. Suppose we want to compute the
kth smallest eigenvalue of T and that we can somehow
compute the integer $N(x)$ equal to the number of
eigenvalues of T that are less than or equal to x. Then
we can apply the bisection method [I.4 §2] to $N(x)$
to find the point where $N(x)$ jumps from $k - 1$ to k.
We can compute $N(x)$ by making use of the following
result about the inertia of a Hermitian matrix, defined
by $\operatorname{inertia}(A) = (\nu, \zeta, \pi)$, where $\nu$ is the number of negative
eigenvalues, $\zeta$ is the number of zero eigenvalues,
and $\pi$ is the number of positive eigenvalues.


Theorem 2 (Sylvester’s inertia theorem). If A is
Hermitian and M is nonsingular, then
$\operatorname{inertia}(A) = \operatorname{inertia}(M^*AM)$.


Sylvester’s inertia theorem says that the number
of negative, zero, and positive eigenvalues does not
change under congruence transformations. By using GE
we can factorize $T - xI = LDL^*$, where D is diagonal
and L is unit lower bidiagonal (a bidiagonal matrix
is one that is both triangular and tridiagonal). (The
factorization may not exist, but if it does not we can simply
perturb T slightly and try again without any loss of
numerical stability.) Then
$\operatorname{inertia}(T - xI) = \operatorname{inertia}(D)$, so the number of negative
or zero diagonal elements of D equals the number
of eigenvalues of $T - xI$ less than or equal to 0, which
is the number of eigenvalues of T less than or equal
to x, that is, $N(x)$. The $LDL^*$ factors of a tridiagonal
matrix can be computed in $O(n)$ flops, so this bisection
process is efficient. An alternative approach can be
built by using properties of Sturm sequences, which are
sequences comprising the characteristic polynomials
of leading principal submatrices of $T - \lambda I$.
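The bisection idea can be sketched in plain Python for a real symmetric tridiagonal matrix stored as a diagonal `d` and an off-diagonal `e`. The function names, the Gershgorin starting interval, the tolerance, and the crude treatment of a zero pivot (replacing it by a tiny negative number as a stand-in for perturbing T slightly) are all assumptions made for illustration.

```python
def count_leq(d, e, x):
    """N(x): number of eigenvalues <= x of the symmetric tridiagonal T,
    obtained from the signs of the pivots in T - x*I = L D L^T."""
    count = 0
    pivot = 1.0
    for i in range(len(d)):
        pivot = d[i] - x - (e[i - 1] ** 2 / pivot if i > 0 else 0.0)
        if pivot == 0.0:
            pivot = -1e-300        # stand-in for a slight perturbation of T
        if pivot < 0.0:
            count += 1
    return count

def kth_smallest_eigenvalue(d, e, k, tol=1e-12):
    """Bisection on N(x) for the k-th smallest eigenvalue (k = 1, 2, ...)."""
    n = len(d)
    # Gershgorin discs give an interval containing all the eigenvalues.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
              for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_leq(d, e, mid) >= k:
            hi = mid               # at least k eigenvalues lie at or below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Tridiagonal matrix with 2 on the diagonal and -1 on the off-diagonals.
d = [2.0, 2.0, 2.0, 2.0]
e = [-1.0, -1.0, -1.0]
eigs = [kth_smallest_eigenvalue(d, e, k) for k in range(1, 5)]
```

For this test matrix the eigenvalues are known in closed form, $2 - 2\cos(j\pi/5)$ for $j = 1:4$, which gives a convenient check on the output.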

5.7 Computing the SVD 

For a rectangular matrix $A \in \mathbb{C}^{m \times n}$ the eigenvalues of
the Hermitian matrix $\begin{bmatrix} 0 & A \\ A^* & 0 \end{bmatrix}$ of dimension $m + n$ are
plus and minus the nonzero singular values of A along
with $m + n - 2\min(m, n)$ zeros. Hence the SVD can
be computed via the eigendecomposition of this larger 
matrix. However, this would be inefficient, and instead 
one uses algorithms that work directly on A and are 
analogues of the algorithms for Hermitian matrices. 
The standard approach is to reduce A to bidiagonal 
form B by Householder transformations applied on the 
left and the right and then to apply an adaptation of the 
QR algorithm that works on the bidiagonal factor (and 
implicitly applies the QR algorithm to the tridiagonal 
matrix B*B). 
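The relation between singular values and the eigenvalues of the larger Hermitian matrix is easy to confirm numerically. This sketch assumes NumPy and an arbitrary random test matrix; it is a check of the identity, not a recommended way to compute the SVD.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 5, 3
A = rng.standard_normal((m, n))

# Hermitian matrix [0 A; A^* 0] of dimension m + n.
C = np.block([[np.zeros((m, m)), A],
              [A.conj().T, np.zeros((n, n))]])

eigs = np.sort(np.linalg.eigvalsh(C))
sigma = np.linalg.svd(A, compute_uv=False)     # singular values of A

# Expected spectrum: +sigma_i, -sigma_i, and m + n - 2*min(m, n) zeros.
zeros = np.zeros(m + n - 2 * min(m, n))
expected = np.sort(np.concatenate([sigma, -sigma, zeros]))
max_diff = np.max(np.abs(eigs - expected))
```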

5.8 Generalized Eigenproblems 

The generalized eigenvalue problem (GEP) $Ax = \lambda Bx$,
with $A, B \in \mathbb{C}^{n \times n}$, can be converted into a standard
eigenvalue problem if B (say) is nonsingular: $B^{-1}Ax = \lambda x$.
However, such a transformation is inadvisable
numerically unless B is very well-conditioned. If A and
B have a common null vector z, the problem takes on a
different character because then $(A - \lambda B)z = 0$ for any
$\lambda$; such a problem is called singular. We will assume
that the problem is regular, so that $\det(A - \lambda B) \not\equiv 0$. The
linear polynomial $A - \lambda B$ is sometimes called a pencil.

It is convenient to write $\lambda = \alpha/\beta$, where $\alpha$ and $\beta$ are
not both zero, and rephrase the problem in the more
symmetric form $\beta Ax = \alpha Bx$. If x is a nonzero vector
such that $Bx = 0$, then, since the problem is assumed
to be regular, $Ax \ne 0$ and so $\beta = 0$. This means that
$\lambda = \infty$ is an eigenvalue. Infinite eigenvalues may seem
a strange concept, but in fact they are no different in
most respects to finite eigenvalues.

An important special case is the definite generalized
eigenvalue problem, in which A and B are Hermitian
and B (say) is positive-definite. If $B = R^*R$ is a Cholesky
factorization, then $Ax = \lambda Bx$ can be rewritten as
$R^{-*}AR^{-1} \cdot Rx = \lambda Rx$, which is a standard eigenproblem
for the Hermitian matrix $C = R^{-*}AR^{-1}$. This argument
shows that the eigenvalues of a definite problem
are all real. Definite generalized eigenvalue problems
arise in many physical situations where an energy-minimization
principle is at work, such as in problems
in engineering and physics.
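The Cholesky-based reduction can be sketched as follows, assuming NumPy and arbitrary random test matrices. NumPy's `cholesky` returns the lower triangular factor G with $B = GG^*$, so $R = G^*$ in the notation above; the general-purpose `solve` is used where a triangular solver would normally be preferred.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
X = rng.standard_normal((n, n))
A = (X + X.T) / 2                      # Hermitian (real symmetric) A
Y = rng.standard_normal((n, n))
B = Y @ Y.T + n * np.eye(n)            # Hermitian positive-definite B

G = np.linalg.cholesky(B)              # B = G G^*, so R = G^*

# C = R^{-*} A R^{-1} = G^{-1} A G^{-*}, formed with solves, not inverses.
W = np.linalg.solve(G, A)              # G^{-1} A
C = np.linalg.solve(G, W.conj().T)     # G^{-1} (G^{-1} A)^* = G^{-1} A G^{-*}
C = (C + C.conj().T) / 2               # remove rounding-level asymmetry

lam, Yv = np.linalg.eigh(C)            # real eigenvalues of the definite GEP
Xv = np.linalg.solve(G.conj().T, Yv)   # y = R x, so x = R^{-1} y
residual = np.linalg.norm(A @ Xv - (B @ Xv) * lam)
```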

A generalization of the QR algorithm called the QZ 
algorithm computes a generalization to two matrices 
of the Schur decomposition: Q*AZ = T, Q*BZ = S, 
where Q and Z are unitary and T and S are upper tri- 
angular. The generalized Schur decomposition yields 
the eigenvalues as the ratios $t_{ii}/s_{ii}$ and enables eigenvectors
to be computed by substitution.

The quadratic eigenvalue problem (QEP) $Q(\lambda)x =
(\lambda^2A_2 + \lambda A_1 + A_0)x = 0$, where $A_i \in \mathbb{C}^{n \times n}$, $i = 0:2$,
arises most commonly in the dynamic analysis of structures
when the finite-element method is used to discretize
the original PDE into a system of second-order
ODEs $A_2\ddot{q}(t) + A_1\dot{q}(t) + A_0q(t) = f(t)$. Here, the $A_i$
are usually Hermitian (though $A_1$ is skew-Hermitian in
gyroscopic systems) and positive (semi)definite. Analogously
to the GEP, the QEP is said to be regular if
$\det(Q(\lambda)) \not\equiv 0$. The quadratic problem differs fundamentally
from the linear GEP because a regular problem
has 2n eigenvalues, which are the roots of $\det(Q(\lambda)) = 0$,
but at most n linearly independent eigenvectors,
and a vector may be an eigenvector for two different
eigenvalues. For example, the QEP with

$$Q(\lambda) = \lambda^2 I + \lambda \begin{bmatrix} -1 & -6 \\ 2 & -9 \end{bmatrix} + \begin{bmatrix} 0 & 12 \\ -2 & 14 \end{bmatrix}$$

has eigenvalues 1, 2, 3, and 4, with eigenvectors $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$,
$\begin{bmatrix} 0 \\ 1 \end{bmatrix}$, $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$, and $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$, respectively. Moreover, there is no
Schur form for three or more matrices; that is, we cannot
in general find unitary matrices U and V such that
$U^*A_iV$ is triangular for $i = 0:2$.

Associated with the QEP is the matrix $Q(X) = A_2X^2 + A_1X + A_0$,
with $X \in \mathbb{C}^{n \times n}$. From the relation

$$Q(\lambda) - Q(X) = A_2(\lambda^2 I - X^2) + A_1(\lambda I - X)
= (\lambda A_2 + A_2X + A_1)(\lambda I - X),$$

it is clear that if we can find a matrix X such that
$Q(X) = 0$, known as a solvent, then we have reduced
the QEP to finding the eigenvalues of X and solving one
$n \times n$ GEP. For the $2 \times 2$ Q above there are five solvents,
one of which is $\begin{bmatrix} 4 & 0 \\ 2 & 2 \end{bmatrix}$. The existence and enumeration
of solvents is nontrivial and leads into the theory
of matrix polynomials. In general, matrix polynomials
are matrices of the form $\sum_{i=0}^{k} \lambda^i A_i$ whose elements are
polynomials in a complex variable; an older term for
such matrices is $\lambda$-matrices.

The standard approach for numerical solution of the
QEP mimics the conversion of the scalar polynomial
root problem into a matrix eigenproblem described in
section 5.3. From the relation

$$L(\lambda)z = \left( \begin{bmatrix} A_1 & A_0 \\ I & 0 \end{bmatrix} + \lambda \begin{bmatrix} A_2 & 0 \\ 0 & -I \end{bmatrix} \right) \begin{bmatrix} \lambda x \\ x \end{bmatrix} = \begin{bmatrix} Q(\lambda)x \\ 0 \end{bmatrix}$$

we see that the eigenvalues of the quadratic Q are the
eigenvalues of the $2n \times 2n$ linear polynomial $L(\lambda)$. This
is an example of an exact linearization process, thanks
to the hidden $\lambda$ in the eigenvector! The eigenvalues of L
can be found using the QZ algorithm. The eigenvectors
of L have the form $z = \begin{bmatrix} \lambda x \\ x \end{bmatrix}$, where x is an eigenvector
of Q, and so x can be obtained from either the first n
(if $\lambda \ne 0$) or the last n components of z.
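The linearization can be tried out on a small problem, assuming NumPy. The coefficient matrices below are assumed to be those of the $2 \times 2$ example, for which the eigenvalues should come out as 1, 2, 3, and 4. Because NumPy has no QZ routine, and because the matrix multiplying $\lambda$ is $\operatorname{diag}(I, -I)$ here (perfectly conditioned), the sketch converts the pencil to a standard eigenproblem; for general $A_2$ a QZ-based solver would be the safer choice.

```python
import numpy as np

n = 2
A2 = np.eye(n)
A1 = np.array([[-1.0, -6.0], [2.0, -9.0]])    # assumed example coefficients
A0 = np.array([[0.0, 12.0], [-2.0, 14.0]])
I, Z = np.eye(n), np.zeros((n, n))

# L(lambda) = X + lambda * Y, with z = [lambda * x; x].
X = np.block([[A1, A0], [I, Z]])
Y = np.block([[A2, Z], [Z, -I]])

# (X + lambda*Y) z = 0 is equivalent to (-Y^{-1} X) z = lambda z.
lam, Zvec = np.linalg.eig(-np.linalg.solve(Y, X))
order = np.argsort(lam.real)
lam, Zvec = lam[order].real, Zvec[:, order].real

# Recover x from the last n components of z and check Q(lambda) x = 0.
residuals = []
for lam_i, z in zip(lam, Zvec.T):
    x = z[n:]
    Qx = (lam_i**2 * A2 + lam_i * A1 + A0) @ x
    residuals.append(np.linalg.norm(Qx) / np.linalg.norm(x))
```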


6 Sparse Linear Systems 

For linear systems coming from discretization of dif- 
ferential equations it is common that A is banded, 
that is, the nonzero elements lie in a band about the 
main diagonal. An extreme case is a tridiagonal matrix, 
of which the classic example is the second-difference 
matrix, illustrated for n = 4 by 

$$A = \begin{bmatrix} -2 & 1 & 0 & 0 \\ 1 & -2 & 1 & 0 \\ 0 & 1 & -2 & 1 \\ 0 & 0 & 1 & -2 \end{bmatrix}.$$

This matrix corresponds to a centered finite-difference
approximation [II.11] to a second derivative:
$f''(x) \approx (f(x + h) - 2f(x) + f(x - h))/h^2$. Note
that $A^{-1}$ is a full matrix. For banded matrices, GE produces
banded LU factors and its computational cost is
proportional to n times the square of the bandwidth.

A matrix is sparse if advantage can be taken of the 
zero entries because of either their number or their dis- 
tribution. A banded matrix is a special case of a sparse 
matrix. Sparse matrices are stored on a computer not as 
a square array but in a special format that records only 
the nonzeros and their location in the matrix. This can 
be done with three vectors: one to store the nonzero 
entries and the other two to define the row and column 
indices of the elements in the first vector. 
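The three-vector storage scheme can be sketched in plain Python. The example assumes a $4 \times 4$ second-difference matrix with $-2$ on the diagonal and 1 on the off-diagonals (matching the finite-difference formula), and the names `values`, `rows`, `cols`, and `coo_matvec` are illustrative.

```python
# Coordinate storage: one list of nonzeros plus their row and column indices.
n = 4
values, rows, cols = [], [], []
for i in range(n):
    for j, a in ((i - 1, 1.0), (i, -2.0), (i + 1, 1.0)):
        if 0 <= j < n:
            values.append(a)
            rows.append(i)
            cols.append(j)

def coo_matvec(values, rows, cols, x, n):
    """Compute y = A x touching only the stored nonzeros."""
    y = [0.0] * n
    for a, i, j in zip(values, rows, cols):
        y[i] += a * x[j]
    return y

nnz = len(values)                  # 3n - 2 stored entries instead of n^2
y = coo_matvec(values, rows, cols, [1.0, 2.0, 3.0, 4.0], n)
```

The matrix-vector product costs one multiplication per stored nonzero, which is the property iterative methods for sparse systems rely on.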
Sparse matrices help to explain the tenet: never solve
a linear system $Ax = b$ by computing $x = A^{-1} \times b$. The
reasons for eschewing $A^{-1}$ are threefold.

• Computing $A^{-1}$ requires three times as many flops
as solving $Ax = b$ by GE with partial pivoting.

• GE with partial pivoting is backward stable for
solving $Ax = b$ (see section 8) but solution via
$A^{-1}$ is not.

• If A is sparse, $A^{-1}$ is generally dense and so
requires much more storage than GE with partial
pivoting.

When GE is applied to a sparse matrix, fill-in occurs 
when the row operations cause a zero entry to become 
nonzero during the elimination. To minimize the stor- 
age and the computational cost, fill-in must be avoided 
as much as possible. This can be done by employing 
row and column interchanges to choose a suitable pivot 
from the active submatrix. The first such strategy was 
introduced by Markowitz in 1957. At the kth stage, with
$c_j^{(k)}$ denoting the number of nonzeros in rows k to n of
column j and $r_i^{(k)}$ the number of nonzeros in columns
k to n of row i, the Markowitz strategy finds the pair
$(r, s)$ that minimizes $(r_i^{(k)} - 1)(c_j^{(k)} - 1)$ over all nonzero
potential pivots $a_{ij}^{(k)}$ and then takes $a_{rs}^{(k)}$ as the pivot.
The quantity being minimized is a bound on the fill-in. 
In practice, the potential pivots must be restricted to 
those not too much smaller in magnitude than the par- 
tial pivot in order to preserve numerical stability. The 
result of GE with Markowitz pivoting is a factorization 
PAQ = LU, where P and Q are permutation matrices. 

The analogue of the Markowitz strategy for Hermitian
positive-definite matrices chooses a diagonal
entry $a_{ii}^{(k)}$ as the pivot, where $r_i^{(k)}$ is minimal. This is
the minimum-degree algorithm, which has been very
successful in practice. Figure 3 shows in the first
row a sparse and banded symmetric positive-definite
matrix A of dimension 225 followed to the right by its
Cholesky factor. The Cholesky factor has many more
nonzeros than A. The second row shows the matrix
$PAP^T$ produced by an approximate minimum-degree
ordering (produced by the MATLAB symamd function)
and its Cholesky factor. We can see that the permutations
have destroyed the band structure but have
greatly reduced the fill-in, producing a much sparser
Cholesky factor.

As an alternative to GE for solving sparse linear sys- 
tems one can apply iterative methods, described in sec- 
tion 9; for sufficiently large problems these are the only 
feasible methods. 
Figure 3 Sparsity plots of a symmetric positive-definite
matrix (left) and its Cholesky factor (right) for original
matrix (first row) and reordered matrix (second row); nz is
the number of nonzeros.

7 Overdetermined and Underdetermined Systems

Linear systems $Ax = b$ with a rectangular matrix
$A \in \mathbb{C}^{m \times n}$ are very common. They break into two categories:
overdetermined systems, with more equations
than unknowns (m > n), and underdetermined systems, 
with fewer equations than unknowns (m < n). Since in 
general there is no solution when m > n and there are 
many solutions when m < n, extra conditions must 
be imposed for the problems to be well defined. These 
usually involve norms, and different choices of norms 
are possible. We will restrict our discussion mainly to 
the 2-norm, which is the most important case, but other 
choices are also of practical interest. 

7.1 The Linear Least-Squares Problem 

When $m > n$ the residual $r = b - Ax$ cannot in general
be made zero, so we try to minimize its norm. The most
common choice of norm is the 2-norm, which gives the
linear least-squares problem

$$\min_{x \in \mathbb{C}^n} \|b - Ax\|_2. \qquad (6)$$

This choice can be motivated by statistical considerations
(the Gauss-Markov theorem) or by the fact that
the square of the 2-norm is differentiable, which makes
the problem explicitly solvable. Indeed, by setting the
gradient of $\|b - Ax\|_2^2$ to zero we obtain the normal
equations $A^*Ax = A^*b$, which any solution of the least-squares
problem must satisfy. If A has full rank then
$A^*A$ is positive-definite and so there is a unique solution,
which can be computed by solving the normal
equations using Cholesky factorization. For reasons of
numerical stability it is preferable to use a QR factorization:
if $A = Q\begin{bmatrix} R_1 \\ 0 \end{bmatrix}$ then the normal equations reduce
to the triangular system $R_1x = c$, where c is the first n
components of $Q^*b$.
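The two solution routes can be compared on a small full-rank problem. This sketch assumes NumPy and random test data; it uses the reduced QR factorization $A = Q_1R_1$ (the first n columns of Q) and the general-purpose `solve` in place of dedicated triangular solvers.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 8, 3
A = rng.standard_normal((m, n))        # full rank with probability 1
b = rng.standard_normal(m)

# Route 1: normal equations A^T A x = A^T b via Cholesky, A^T A = G G^T.
G = np.linalg.cholesky(A.T @ A)
x_ne = np.linalg.solve(G.T, np.linalg.solve(G, A.T @ b))

# Route 2: reduced QR factorization, then R1 x = c with c = Q1^T b.
Q1, R1 = np.linalg.qr(A)               # Q1 is m x n, R1 is n x n upper triangular
c = Q1.T @ b
x_qr = np.linalg.solve(R1, c)

# Reference solution, and the optimality condition A^T (b - A x) = 0.
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
gradient_norm = np.linalg.norm(A.T @ (b - A @ x_qr))
```

On a well-conditioned problem like this one the two routes agree to rounding error; the stability advantage of QR shows up when A is ill-conditioned, because forming $A^*A$ squares the condition number.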

When A is rank deficient there are many least-squares
solutions, which vary widely in norm. A natural choice
is one of minimal 2-norm, and in fact there is a unique
minimal 2-norm solution, $x_{LS}$, given by

$$x_{LS} = \sum_{i=1}^{r} \frac{u_i^*b}{\sigma_i} v_i,$$

where

$$A = U\Sigma V^*, \quad U = [u_1, \dots, u_m], \quad V = [v_1, \dots, v_n] \qquad (7)$$

is an SVD and $r = \operatorname{rank}(A)$. The use of this formula in
practice is not straightforward because a matrix stored
in floating-point arithmetic will rarely have any zero
singular values. Therefore r must be chosen by designating
which singular values can be regarded as negligible,
and this choice should take account of the accuracy
with which the elements of A are known.

Another choice of least-squares solution in the rank-deficient
case is a basic solution: one with at most r
nonzeros. Such a solution can be computed via the QR 
factorization with column pivoting. 

7.2 Underdetermined Systems 

When $m < n$ and A has full rank, there are infinitely
many solutions to $Ax = b$ and again it is natural to seek
one of minimal 2-norm. There is a unique such solution
$x_{LS} = A^*(AA^*)^{-1}b$, and it is best computed via a QR
factorization, this time of $A^*$. A basic solution, with
m nonzeros, can alternatively be computed. As a simple
example, consider the problem “find two numbers
whose sum is 5,” that is, solve $[1 \;\; 1]\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 5$. A basic
solution is $[5 \;\; 0]^T$, while the minimal 2-norm solution
is $[5/2 \;\; 5/2]^T$. Minimal 1-norm solutions to underdetermined
systems are important in compressed sensing
[VII.10].

7.3 Pseudoinverse 

The analysis in the previous two subsections can be
unified in a very elegant way by making use of the
Moore-Penrose pseudoinverse $A^+$ of $A \in \mathbb{C}^{m \times n}$, which
is defined as the unique $X \in \mathbb{C}^{n \times m}$ satisfying the
Moore-Penrose conditions

$$AXA = A, \quad XAX = X, \quad (AX)^* = AX, \quad (XA)^* = XA.$$

(It is certainly not obvious that these equations have
a unique solution.) In the case where A is square and
nonsingular, it is easily seen that $A^+$ is just $A^{-1}$. Moreover,
if $\operatorname{rank}(A) = n$ then $A^+ = (A^*A)^{-1}A^*$, while if
$\operatorname{rank}(A) = m$ then $A^+ = A^*(AA^*)^{-1}$. In terms of the
SVD (7),

$$A^+ = V \operatorname{diag}(\sigma_1^{-1}, \dots, \sigma_r^{-1}, 0, \dots, 0)\,U^*,$$

where $r = \operatorname{rank}(A)$. The formula $x_{LS} = A^+b$ holds for
all m and n, so the pseudoinverse yields the minimal
2-norm solution to both the least-squares (overdetermined)
problem $Ax = b$ and an underdetermined system
$Ax = b$. The pseudoinverse has many interesting
properties, including $(A^+)^+ = A$, but it is not always
true that $(AB)^+ = B^+A^+$.
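As an illustration, the SVD formula for the pseudoinverse can be applied to the "two numbers whose sum is 5" example. The sketch assumes NumPy; the helper name `pinv_svd` and the relative threshold `tol` used to decide which singular values count as nonzero are hypothetical choices.

```python
import numpy as np

def pinv_svd(A, tol=1e-12):
    """A^+ = V diag(1/sigma_1, ..., 1/sigma_r, 0, ..., 0) U^*,
    with r taken as the number of singular values above tol * sigma_1."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))
    return (Vh[:r].conj().T / s[:r]) @ U[:, :r].conj().T

A = np.array([[1.0, 1.0]])             # one equation, two unknowns
b = np.array([5.0])
X = pinv_svd(A)
x_ls = X @ b                           # minimal 2-norm solution

# The four Moore-Penrose conditions.
conditions = [
    np.allclose(A @ X @ A, A),
    np.allclose(X @ A @ X, X),
    np.allclose((A @ X).conj().T, A @ X),
    np.allclose((X @ A).conj().T, X @ A),
]
```

Here `x_ls` should equal $[5/2 \;\; 5/2]^T$, in agreement with the minimal 2-norm solution quoted in section 7.2.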

Although the pseudoinverse is a very useful theoret- 
ical tool, it is rarely necessary to compute it explicitly 
(just as for its special case the matrix inverse). 

The pseudoinverse is just one of many ways of gen- 
eralizing the notion of inverse to rectangular matri- 
ces, but it is the right one for minimum 2-norm solu- 
tions to linear systems. Other generalized inverses can 
be obtained by requiring only a subset of the four 
Moore-Penrose conditions to hold. 

8 Numerical Considerations 

Prior to the introduction of the first digital comput- 
ers in the 1940s, numerical computations were carried 
out by humans, sometimes with the aid of mechanical 
calculators. The human involvement in a sequence of 
calculations meant that potentially dangerous events 
such as dividing by a tiny number or subtracting two 
numbers that agree to almost all their significant digits 
could be observed, their effect monitored, and possible 
corrective action taken — such as temporarily increas- 
ing the precision of the calculations. On the very early 
computers intermediate results were observed on a 
cathode ray tube monitor, but this became impossi- 
ble as problem sizes increased (along with available 
computing power). Fears were raised in the 1940s that 
algorithms such as GE would suffer exponential growth 
of errors as the problem dimension increased, due to 
the rapidly increasing number of arithmetic operations, 
each having its associated rounding error [II.13].
These fears were particularly concerning given that the
error growth might be unseen and unsuspected.

The subject of rounding error analysis grew out of 
the need to understand the effect of rounding errors on 
algorithms. The person who did the most to develop the 
subject was James Wilkinson, whose influential papers 
and 1963 and 1965 books showed how backward
error analysis [I.2 §23] can be used to obtain deep
insights into numerical stability. We will discuss just 
two particular examples. 

Wilkinson showed that when a nonsingular linear
system $Ax = b$ is solved by GE in floating-point
arithmetic the computed solution $\hat{x}$ satisfies

$$(A + \Delta A)\hat{x} = b, \qquad \|\Delta A\|_\infty \le p(n)\rho_n u\|A\|_\infty.$$

Here, $p(n)$ is a cubic polynomial, the growth factor

$$\rho_n = \frac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|} \ge 1$$

measures the growth of elements during the elimination,
and u is the unit roundoff [II.13]. This is a backward
stability result: it says that the computed solution
$\hat{x}$ is the exact solution of a perturbed system. Ideally,
we would like $\|\Delta A\|_\infty \le u\|A\|_\infty$, which reflects the
uncertainty caused by converting the elements of A to
floating-point numbers. The polynomial term $p(n)$ is
pessimistic and might be more realistically replaced by
its square root. The danger term is the growth factor $\rho_n$,
and the conclusion from Wilkinson’s analysis is that a
pivoting strategy should aim to keep $\rho_n$ small. If no
pivoting is done, $\rho_n$ can be arbitrarily large (e.g., for
$A = \begin{bmatrix} \epsilon & 1 \\ 1 & 1 \end{bmatrix}$ with $0 < \epsilon \ll 1$, $\rho_n \approx 1/\epsilon$). For partial pivoting,
however, it can be shown that $\rho_n \le 2^{n-1}$ and that
this bound is attainable. In practice, $\rho_n$ is almost always
of modest size for partial pivoting ($\rho_n \le 50$, say); why
this should be so remains one of the great mysteries of
numerical analysis!
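The effect of a large growth factor is easy to reproduce in plain Python on the $2 \times 2$ example with a tiny leading entry. The value $\epsilon = 10^{-20}$, the right-hand side, and the helper name are illustrative assumptions; the exact solution of the system is very close to $[1 \;\; 1]^T$.

```python
def ge_2x2_no_pivoting(a11, a12, a21, a22, b1, b2):
    """Solve a 2 x 2 system by GE with no row interchange.
    Returns the computed solution and the growth factor."""
    m = a21 / a11                      # multiplier
    u22 = a22 - m * a12                # the only entry changed by elimination
    y2 = b2 - m * b1
    x2 = y2 / u22
    x1 = (b1 - a12 * x2) / a11
    original = max(abs(a11), abs(a12), abs(a21), abs(a22))
    growth = max(original, abs(u22)) / original
    return (x1, x2), growth

eps = 1e-20

# No pivoting: the pivot eps yields a multiplier of 1/eps and huge growth.
x_bad, growth_bad = ge_2x2_no_pivoting(eps, 1.0, 1.0, 1.0, 1.0, 2.0)

# Partial pivoting swaps the two rows (and right-hand side entries) first.
x_good, growth_good = ge_2x2_no_pivoting(1.0, 1.0, eps, 1.0, 2.0, 1.0)
```

Without pivoting the first component of the computed solution is lost entirely (it comes out as 0 instead of approximately 1), while the pivoted elimination has growth factor 1 and returns the solution to full accuracy.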

One of the benefits of Wilkinson’s backward error
analysis is that it enables us to identify classes of matrices
for which pivoting is not necessary, that is, for
which the LU factorization $A = LU$ exists and $\rho_n$ is
nicely bounded. One such class is the matrices that
are diagonally dominant by either rows or columns, for
which $\rho_n \le 2$.

The potential instability of GE can be attributed to the
fact that A is premultiplied by a sequence of nonunitary
transformations, any of which can be ill-conditioned.
Many algorithms, including Householder QR factorization
and the QR algorithm for eigenvalues, use exclusively
unitary transformations. Such algorithms are
usually (but not always) backward stable, essentially
because unitary transformations do not magnify errors:
$\|UAV\| = \|A\|$ for any unitary U and V for the 2-norm
and the Frobenius norm. As an example, the QR algorithm
applied to $A \in \mathbb{C}^{n \times n}$ produces a computed upper
triangular matrix $\hat{T}$ such that

$$\tilde{Q}^*(A + \Delta A)\tilde{Q} = \hat{T}, \qquad \|\Delta A\|_F \le p(n)u\|A\|_F,$$

where $\tilde{Q}$ is some exactly unitary matrix and $p(n)$ is a
cubic polynomial. The computed Schur factor $\hat{Q}$ is not
necessarily close to $\tilde{Q}$ (which in turn is not necessarily
close to the exact Q!), but it is close to being orthogonal:
$\|\hat{Q}^*\hat{Q} - I\|_F \le p(n)u$. This distinction between the
different Q matrices is an indication of the subtleties
of backward error analysis. For some problems it is not
clear exactly what form of backward error result it is
possible to prove while obtaining useful bounds. However,
the purpose of a backward error analysis is always
the same: either to show that an algorithm behaves in a
numerically stable way or to shed light on how it might
fail to do so and to indicate what quantities should be
monitored in order to identify potential instability.

9 Iterative Methods 

In numerical linear algebra, methods can broadly be 
divided into two classes: direct and iterative. Direct 
methods, such as GE, solve a problem in a fixed num- 
ber of arithmetic operations or a variable number that 
in practice is fairly constant, as for the QR algorithm for 
eigenvalues. Iterative methods are infinite processes 
that must be truncated at some point when the approx- 
imation they provide is “good enough.” Usually, iter- 
ative methods do not transform the matrix in ques- 
tion and access it only through matrix-vector products; 
this makes them particularly attractive for large, sparse 
matrices, where applying a direct method may not be 
practical. 

We have already seen in section 5.5 a simple iterative
method for the eigenvalue problem: the power method.
The stationary iterative methods are an important class
of iterative methods for solving a nonsingular linear
system $Ax = b$. These methods are best described in
terms of a splitting

$$A = M - N,$$

with M nonsingular. The system $Ax = b$ can be rewritten
$Mx = Nx + b$, which suggests constructing a
sequence $\{x^{(k)}\}$ from a given starting vector $x^{(0)}$ via

$$Mx^{(k+1)} = Nx^{(k)} + b. \qquad (8)$$





Different choices of M and N yield different methods.
The aim is to choose M in such a way that it is inexpensive
to solve (8) while M is a good enough approximation
to A that convergence is fast. It is easy to analyze
convergence. Denote by $e^{(k)} = x^{(k)} - x$ the error in the
kth iterate. Subtracting $Mx = Nx + b$ from (8) gives
$M(x^{(k+1)} - x) = N(x^{(k)} - x)$, so

$$e^{(k+1)} = M^{-1}Ne^{(k)} = \cdots = (M^{-1}N)^{k+1}e^{(0)}. \qquad (9)$$

If $\rho(M^{-1}N) < 1$ then $(M^{-1}N)^k \to 0$ as $k \to \infty$ (see Jordan
canonical form [II.22]) and so $x^{(k)}$ converges to
x at a linear rate. In practice, for convergence in a reasonable
number of iterations we need $\rho(M^{-1}N)$ to be
sufficiently less than 1 and the powers of $M^{-1}N$ should
not grow too large initially before eventually decaying;
in other words, $M^{-1}N$ must not be too nonnormal.

Three standard choices of splitting are as follows,
where $D = \operatorname{diag}(A)$ and L and U denote the strictly
lower and strictly upper triangular parts of A, respectively.

• $M = D$, $N = -(L + U)$: Jacobi iteration (illustrated
in methods of solution [I.3 §6]).

• $M = D + L$, $N = -U$: Gauss-Seidel iteration.

• $M = (1/\omega)D + L$, $N = ((1 - \omega)/\omega)D - U$, where
$\omega \in (0, 2)$ is a relaxation parameter: successive
overrelaxation (SOR) iteration.

Sufficient conditions for convergence are that A is
strictly diagonally dominant by rows for the Jacobi
iteration and that A is symmetric positive-definite for
the Gauss-Seidel iteration. How to choose $\omega$ so that
$\rho(M^{-1}N)$ is minimized for the SOR iteration was
elucidated in the landmark 1950 Ph.D. thesis of David
Young.
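The three splittings can be compared on a small system, assuming NumPy. The test matrix, the right-hand side, the iteration count, and the value $\omega = 1.1$ are illustrative choices (the relaxation parameter is not optimized), and the dense `solve` stands in for the diagonal or triangular solves a real implementation would use.

```python
import numpy as np

def stationary_iteration(M, N, b, x0, iters):
    """Iterate M x^{(k+1)} = N x^{(k)} + b from the starting vector x0."""
    x = x0.copy()
    for _ in range(iters):
        x = np.linalg.solve(M, N @ x + b)
    return x

# Symmetric positive-definite and strictly diagonally dominant test matrix.
A = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])
b = np.array([1.0, 2.0, 3.0])
D = np.diag(np.diag(A))
L = np.tril(A, -1)                     # strictly lower triangular part
U = np.triu(A, 1)                      # strictly upper triangular part
omega = 1.1                            # illustrative relaxation parameter

splittings = {
    "Jacobi": (D, -(L + U)),
    "Gauss-Seidel": (D + L, -U),
    "SOR": (D / omega + L, (1 - omega) / omega * D - U),
}

x_exact = np.linalg.solve(A, b)
results = {}
for name, (M, N) in splittings.items():
    # Spectral radius of the iteration matrix M^{-1} N.
    rho = max(abs(np.linalg.eigvals(np.linalg.solve(M, N))))
    x = stationary_iteration(M, N, b, np.zeros(3), iters=50)
    results[name] = (rho, np.linalg.norm(x - x_exact))
```

Each entry of `results` pairs the spectral radius with the error after 50 iterations; a smaller spectral radius corresponds to faster linear convergence, as equation (9) predicts.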

The Google PageRank algorithm [VI.9], which un- 
derlies Google’s ordering of search results, can be inter- 
preted as an application of the Jacobi iteration to a 
certain linear system involving the adjacency matrix 
[II.16] of the graph corresponding to the whole World
Wide Web. However, the most common use of station- 
ary iterative methods is as preconditioners within other 
iterative methods. 

The aim of preconditioning is to convert a given lin- 
ear system Ax = b into one that can be solved more 
cheaply by a particular iterative method. The basic idea 
is to use a nonsingular matrix W to transform the system
to $(W^{-1}A)x = W^{-1}b$ in such a way that (a) the preconditioned
system can be solved in fewer iterations
than the original system and (b) matrix-vector multiplications
with $W^{-1}A$ (which require the solution of a
linear system with coefficient matrix W) are not significantly
more expensive than matrix-vector multiplications
with A. In general, this is a difficult or impossible
task, but in many applications the matrix A has struc- 
ture that can be exploited. For example, many elliptic 
PDE problems lead to a positive-definite matrix A of the 
form 



where $M_1 z = d_1$ and $M_2 z = d_2$ are easy to solve. In
this case it is natural to take $W = \mathrm{diag}(M_1, M_2)$ as the
preconditioner. When A is Hermitian positive-definite
the preconditioned system is written in a way that preserves
the structure. For example, for the Jacobi preconditioner,
$D = \mathrm{diag}(A)$, the preconditioned system
would be written $D^{-1/2} A D^{-1/2} \tilde{x} = \tilde{b}$, where $\tilde{x} = D^{1/2} x$
and $\tilde{b} = D^{-1/2} b$. Here, the matrix $D^{-1/2} A D^{-1/2}$ has unit
diagonal and off-diagonal elements lying between -1
and 1.

The most powerful iterative methods for linear sys- 
tems Ax = b are the Krylov methods. In these methods 
each iterate $x^{(k)}$ is chosen from the shifted subspace
$x^{(0)} + \mathcal{K}_k(A, r^{(0)})$, where

$$\mathcal{K}_k(A, r^{(0)}) = \mathrm{span}\{r^{(0)}, Ar^{(0)}, \dots, A^{k-1}r^{(0)}\}$$

is a KRYLOV SUBSPACE [II.23] of dimension k, with
$r^{(k)} = b - Ax^{(k)}$. Different strategies for choosing
approximations from within the Krylov subspaces yield
different methods. For example, the conjugate gradient
method (CG, for Hermitian positive-definite A) and
the full orthogonalization method (FOM, for general A)
make the residual $r^{(k)}$ orthogonal to the Krylov subspace
$\mathcal{K}_k(A, r^{(0)})$, while the minimal residual method
(MINRES, for Hermitian A) and the generalized min- 
imal residual method (GMRES, for general A) mini- 
mize the 2-norm of the residual over all vectors in the 
Krylov subspace. How to compute the vectors defined 
in these ways is nontrivial. It turns out that CG can 
be implemented with a recurrence requiring just one 
matrix-vector multiplication and three inner products 
per iteration, and MINRES is just a little more expen- 
sive. GMRES, being applicable to non-Hermitian matri- 
ces, is significantly more expensive, and it is also much 
harder to analyze its convergence behavior. For gen- 
eral matrices there are alternatives to GMRES that 
employ short recurrences. We mention just BiCGSTAB, 
which has the distinction that the 1992 paper by Henk 
van der Vorst that introduced it was the most-cited 
paper in mathematics of the 1990s. 
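
To make the short recurrence concrete, here is a minimal sketch of unpreconditioned CG for a real symmetric positive-definite matrix stored as a list of rows. The helper names and the test system are assumptions for illustration. This textbook variant uses one matrix-vector product and two inner products per iteration; other formulations organize the inner products differently.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-12, max_iter=None):
    """Textbook CG for real symmetric positive-definite A, starting from x = 0.
    Returns the approximate solution and the number of iterations taken."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x for the zero starting vector
    p = list(r)          # first search direction
    rr = dot(r, r)
    iters = 0
    for _ in range(max_iter or 10 * n):
        if rr ** 0.5 <= tol:
            break
        Ap = matvec(A, p)                  # the only matrix-vector product
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
        iters += 1
    return x, iters


# Illustrative 3 x 3 system with known solution (1, 2, 3).
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x, iters = conjugate_gradient(A, b)
cg_error = max(abs(xi - ti) for xi, ti in zip(x, [1.0, 2.0, 3.0]))
```

Only the vectors x, r, p and Ap are stored, which is what makes the method attractive for very large sparse systems.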
Theoretically, Krylov methods converge in at most
n iterations for a system of dimension n. However, in 
practical computation rounding errors intervene and 
the methods behave as truly iterative methods not 
having finite termination. Since n is potentially huge, 
a Krylov method would not be used unless a good 
approximate solution was obtained in many fewer than 
n iterations, and preconditioning plays a crucial role 
here. Available error bounds for a method help to guide 
the choice of preconditioner, but care is needed in inter- 
preting the bounds. To illustrate this, consider the CG 
method for Ax = b, where A is Hermitian positive-definite.
In the A-norm, $\|z\|_A = (z^*Az)^{1/2}$, the error
on the kth step satisfies

$$\|x - x^{(k)}\|_A \le 2\,\|x - x^{(0)}\|_A \left( \frac{\kappa_2(A)^{1/2} - 1}{\kappa_2(A)^{1/2} + 1} \right)^{k},$$

where $\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2$. If we can precondition A
so that its 2-norm condition number is very close to 1, 
then fast convergence is guaranteed. However, another 
result says that if A has k distinct eigenvalues then CG 
converges in at most k iterations. A better approach 
might therefore be to choose the preconditioner so 
that the eigenvalues of the preconditioned matrix are 
clustered into a small number of groups. 
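
As a rough guide to what the condition-number bound predicts, the sketch below computes the smallest k for which the factor $2\bigl((\kappa_2^{1/2} - 1)/(\kappa_2^{1/2} + 1)\bigr)^k$ falls below a tolerance. The function names are hypothetical, and the result is only the bound's prediction: as noted above, CG can converge far sooner when the eigenvalues are clustered.

```python
import math


def cg_bound_factor(kappa):
    """Per-step reduction factor in the CG A-norm error bound (kappa > 1)."""
    s = math.sqrt(kappa)
    return (s - 1) / (s + 1)


def iterations_from_bound(kappa, tol):
    """Smallest k with 2 * factor**k <= tol, for kappa > 1 and 0 < tol < 2."""
    f = cg_bound_factor(kappa)
    k = max(1, math.ceil(math.log(tol / 2) / math.log(f)))
    # Guard against rounding in the logarithms.
    while 2 * f ** k > tol:
        k += 1
    while k > 1 and 2 * f ** (k - 1) <= tol:
        k -= 1
    return k


k_well = iterations_from_bound(4.0, 1e-6)      # well-conditioned system
k_ill = iterations_from_bound(1.0e4, 1e-6)     # ill-conditioned system
```

The predicted iteration count grows roughly like the square root of the condition number, which is why reducing $\kappa_2$ by preconditioning pays off.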

Another important class of iterative methods is 
MULTIGRID METHODS [IV.13 §3], which work on a hierarchy
of grids that come from a discretization of
an underlying PDE (geometric multigrid) or are con- 
structed artificially from a given matrix (algebraic 
multigrid). 

An important practical issue is how to terminate 
an iteration. Popular approaches are to stop when the 
residual $r^{(k)} = b - Ax^{(k)}$ (suitably scaled) is small or
when an estimate of the error $x - x^{(k)}$ is small. Complicating
factors include the fact that the preconditioner
can change the norm and a possible desire to match the 
error in the iterations with the discretization error in 
the PDE from which the linear system might have come 
(as there is no point solving the system to greater accu- 
racy than the data warrants). Research in recent years 
has led to good understanding of these issues. 

The ideas of Krylov methods and preconditioners can 
be applied to problems other than linear systems. A 
popular Krylov method for solving the least-squares 
problem (6) is LSQR, which is mathematically equiva- 
lent to applying CG to the normal equations. In large- 
scale eigenvalue problems only a few eigenpairs are 
usually required. A number of methods project the 
original matrix onto a Krylov subspace and then solve 


a smaller eigenvalue problem. These include the Lanc- 
zos method for Hermitian matrices and the Arnoldi 
method for general matrices. Also of much current 
research interest are rational Krylov methods based 
on RATIONAL GENERALIZATIONS OF KRYLOV SUBSPACES 
[II.23].


10 Nonnormality and Pseudospectra 


Normal matrices $A \in \mathbb{C}^{n\times n}$ (defined in section 5.1)
have the property that they are unitarily diagonalizable:
$A = QDQ^*$ for some unitary Q and diagonal
$D = \mathrm{diag}(\lambda_i)$ containing the eigenvalues on its diagonal.
In many respects normal matrices have very predictable
behavior. For example, $\|A^k\|_2 = \rho(A)^k$ and
$\|e^{tA}\|_2 = e^{\alpha(tA)}$, where the spectral abscissa $\alpha(tA)$ is
the largest real part of any eigenvalue of tA. However, 
matrices that arise in practice are often very nonnor- 
mal. The adjective “very” can be quantified in various 
ways, of which one is the Frobenius norm of the strictly 
upper triangular part of the upper triangular matrix T 
in the Schur decomposition A = QTQ*. For example, 
the matrix $\begin{bmatrix} t & \theta \\ 0 & t \end{bmatrix}$ is nonnormal for $\theta \ne 0$ and grows
increasingly nonnormal as $|\theta|$ increases.

Consider the moderately nonnormal matrix

$$A = \begin{bmatrix} -0.97 & 25 \\ 0 & -0.3 \end{bmatrix}. \tag{10}$$

While the powers of A ultimately decay to zero, since
$\rho(A) = 0.97 < 1$, we see from figure 4 that initially they
increase in norm. Likewise, since $\alpha(A) = -0.3 < 0$ the
norm $\|e^{tA}\|_2$ tends to zero as $t \to \infty$, but figure 4 shows
that there is an initial hump in the plot. In stationary
iterations the hump caused by a nonnormal iteration
matrix $M^{-1}N$ can delay convergence, as is clear
from (9). In finite-precision arithmetic it can even happen
that, for a sufficiently large hump, rounding errors
cause the norms of the powers to plateau at the hump 
level and never actually converge to zero. 
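
The hump in the powers can be reproduced with a few lines of code. This is a sketch under stated assumptions: the helper names are invented, and the closed-form 2-norm is specific to real 2 x 2 matrices (it uses the trace and determinant of $X^TX$).

```python
import math


def matmul2(X, Y):
    """Product of two 2 x 2 matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def norm2(X):
    """2-norm of a real 2 x 2 matrix: square root of the largest eigenvalue of X^T X."""
    t = sum(v * v for row in X for v in row)                # trace of X^T X
    d = (X[0][0] * X[1][1] - X[0][1] * X[1][0]) ** 2        # det of X^T X
    return math.sqrt((t + math.sqrt(max(t * t - 4 * d, 0.0))) / 2)


A = [[-0.97, 25.0], [0.0, -0.3]]   # the matrix in (10)

norms = []                          # norms[k] = ||A^k||_2
P = [[1.0, 0.0], [0.0, 1.0]]
for k in range(201):
    norms.append(norm2(P))
    P = matmul2(P, A)

peak_k = max(range(len(norms)), key=norms.__getitem__)
peak = norms[peak_k]
```

With these definitions the norms rise above $\|A\|_2 \approx 25$ for the first few powers before the eventual decay governed by $\rho(A) = 0.97$ sets in.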

How can we predict the shape of the curves in figure 4?
Let us concentrate on $\|A^k\|_2$. Initially it grows
like $\|A\|_2^k$ and ultimately it decays like $\rho(A)^k$, the decay
rate following from (4). The height of the hump is
related to pseudospectra, which have been popularized 
by Nick Trefethen. 

The $\varepsilon$-pseudospectrum of $A \in \mathbb{C}^{n\times n}$ is defined, for a
given $\varepsilon > 0$, to be the set

$$\Lambda_\varepsilon(A) = \{ z \in \mathbb{C} : z \text{ is an eigenvalue of } A + E \text{ for some } E \text{ with } \|E\|_2 < \varepsilon \}, \tag{11}$$

and it can also be represented, in terms of the resolvent
$(zI - A)^{-1}$, as

$$\Lambda_\varepsilon(A) = \{ z \in \mathbb{C} : \|(zI - A)^{-1}\|_2 > \varepsilon^{-1} \}.$$

Figure 4 2-norms of (a) powers and (b) exponentials of the 2 x 2 matrix A in (10).


The 0.001-pseudospectrum, for example, tells us the 
uncertainty in the eigenvalues of A if the elements are 
known only to three decimal places. Pseudospectra pro- 
vide much insight into the effects of nonnormality of 
matrices and (with an appropriate extension of the def- 
inition) linear operators. For nonnormal matrices the 
pseudospectra are much bigger than a perturbation of 
the spectrum by $\varepsilon$. It can be shown that for any $\varepsilon > 0$,

$$\sup_{k \ge 0} \|A^k\| \ge \frac{\rho_\varepsilon(A) - 1}{\varepsilon}, \qquad \|A^k\| \le \frac{\rho_\varepsilon(A)^{k+1}}{\varepsilon},$$

where the pseudospectral radius $\rho_\varepsilon(A) = \max\{|\lambda| : \lambda \in \Lambda_\varepsilon(A)\}$.
For A in (10) and $\varepsilon = 10^{-2}$, these inequalities
give an upper bound of 230 for $\|A^3\|$ and a lower
bound of 23 for $\sup_{k \ge 0} \|A^k\|$, and figure 5 plots the
corresponding $\varepsilon$-pseudospectrum.
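
The resolvent characterization gives a direct membership test, since $\|(zI - A)^{-1}\|_2 > \varepsilon^{-1}$ is equivalent to the smallest singular value of $zI - A$ being less than $\varepsilon$. The sketch below applies this to the matrix in (10); the function names are assumptions, and the singular value formula is specific to 2 x 2 matrices.

```python
import math


def sigma_min_2x2(M):
    """Smallest singular value of a 2 x 2 (possibly complex) matrix."""
    t = sum(abs(v) ** 2 for row in M for v in row)          # sigma_max^2 + sigma_min^2
    det = abs(M[0][0] * M[1][1] - M[0][1] * M[1][0])        # sigma_max * sigma_min
    sigma_max = math.sqrt((t + math.sqrt(max(t * t - 4 * det * det, 0.0))) / 2)
    return det / sigma_max if sigma_max > 0 else 0.0


def in_pseudospectrum(A, z, eps):
    """True if z lies in the eps-pseudospectrum of the 2 x 2 matrix A."""
    M = [[z - A[0][0], -A[0][1]], [-A[1][0], z - A[1][1]]]
    return sigma_min_2x2(M) < eps


A = [[-0.97, 25.0], [0.0, -0.3]]   # the matrix in (10)
eps = 1e-2

inside = in_pseudospectrum(A, -1.2, eps)            # beyond the unit circle, yet inside
outside = in_pseudospectrum(A, -1.3, eps)
at_eigenvalue = in_pseudospectrum(A, -0.97, eps)    # eigenvalues always belong
off_axis = in_pseudospectrum(A, complex(-0.97, 0.2), eps)
```

That the point z = -1.2 belongs to the $10^{-2}$-pseudospectrum is consistent with a pseudospectral radius a little above 1.2, and hence with the lower bound of about 23 quoted above.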


Figure 5 Approximation to the $10^{-2}$-pseudospectrum of A in
(10) comprising eigenvalues of 5000 randomly perturbed
matrices A + E in (11). The eigenvalues of A are marked by
white circles.


11 Structured Matrices

In a wide variety of applications the matrices have
a special structure. The matrix elements might form
a pattern, as for TOEPLITZ OR HAMILTONIAN MATRICES
[I.2 §18], the matrix may satisfy a nonlinear equation
such as $A^*\Sigma A = \Sigma$, where $\Sigma = \mathrm{diag}(\pm 1)$, which
yields the pseudo-unitary matrices A, or the submatrices
may satisfy certain rank conditions (as for quasiseparable
matrices). We discuss here two of the oldest
and most-studied classes of structured matrices, both
of which were historically important in the analysis of
iterative methods for linear systems arising from the
discretization of differential equations.


11.1 Nonnegative Matrices 

A nonnegative matrix is a real matrix all of whose 
entries are nonnegative. A number of important classes 
of matrices are subsets of the nonnegative matrices. 
These include adjacency matrices, STOCHASTIC MATRICES
[II.25], and Leslie matrices (used in population
modeling). Nonnegative matrices have a large body
of theory, which originates with Perron in 1907 and 
Frobenius in 1908. 
To state the celebrated Perron-Frobenius theorem
we need the definition that $A \in \mathbb{R}^{n\times n}$ with $n \ge 2$ is
reducible if there is a permutation matrix P such that

$$P^T A P = \begin{bmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{bmatrix},$$

where $A_{11}$ and $A_{22}$ are square, nonempty submatrices,
and it is irreducible if it is not reducible. A matrix with
positive entries is trivially irreducible. A useful characterization
is that A is irreducible if and only if the
directed graph associated with A (which has n vertices,
with an edge connecting the ith vertex to the jth vertex
if $a_{ij} \ne 0$) is STRONGLY CONNECTED [II.16].

Theorem 3 (Perron-Frobenius). If $A \in \mathbb{R}^{n\times n}$ is nonnegative
and irreducible then

(1) $\rho(A) > 0$,

(2) $\rho(A)$ is an eigenvalue of A,

(3) there is a positive vector x such that $Ax = \rho(A)x$,

(4) $\rho(A)$ is an eigenvalue of algebraic multiplicity 1.

To illustrate the theorem consider the following two
irreducible matrices and their eigenvalues:

$$A = \begin{bmatrix} 8 & 1 & 6 \\ 3 & 5 & 7 \\ 4 & 9 & 2 \end{bmatrix}, \qquad \Lambda(A) = \{15, \pm 2\sqrt{6}\},$$

$$B = \begin{bmatrix} 0 & 0 & 6 \\ \tfrac{1}{2} & 0 & 0 \\ 0 & \tfrac{1}{3} & 0 \end{bmatrix}, \qquad \Lambda(B) = \{1, \tfrac{1}{2}(-1 \pm \sqrt{3}\,\mathrm{i})\}.$$

The Perron-Frobenius theorem correctly tells us that
$\rho(A) = 15$ is a distinct eigenvalue of A and that it has
a corresponding positive eigenvector, which is known
as the Perron vector. The Perron vector of A is the vector
of all ones, as A forms a magic square and $\rho(A)$ is
the magic sum! The Perron vector of B, which is both a
Leslie matrix and a companion matrix, is $[6\ 3\ 1]^T$. There
is one notable difference between A and B: for A, $\rho(A)$
exceeds the other eigenvalues in modulus, but all three
eigenvalues of B have modulus 1. In fact, Perron's original
version of Theorem 3 says that if A has all positive
elements then $\rho(A)$ is not only an eigenvalue of A but
is larger in modulus than every other eigenvalue. Note
that $B^3 = I$, which provides another way to see that the
eigenvalues of B all have modulus 1.

We saw in section 9 that the spectral radius plays an
important role in the convergence of stationary iterative
methods, through $\rho(M^{-1}N)$, where A = M - N is a
splitting. In comparing different splittings we can use
the result that for $A, B \in \mathbb{R}^{n\times n}$, with |A| denoting the
matrix $(|a_{ij}|)$,

$$|a_{ij}| \le b_{ij} \ \ \forall i, j \quad \Rightarrow \quad \rho(A) \le \rho(|A|) \le \rho(B).$$

11.2 M-Matrices

$A \in \mathbb{R}^{n\times n}$ is an M-matrix if it can be written in the form
$A = sI - B$, where B is nonnegative and $s > \rho(B)$. M-matrices
arise in many applications, a classic one being
Leontief's input-output models in economics.

The special sign pattern of an M-matrix— positive 
diagonal elements and nonpositive off-diagonal ele- 
ments— combines with the spectral radius condition to 
give many interesting characterizations and properties. 
For example, a nonsingular matrix A with nonpositive 
off-diagonal elements is an M-matrix if and only if $A^{-1}$
is nonnegative. Another characterization, which makes 
connections with section 1, is that A is an M-matrix 
if and only if A has positive diagonal entries and AD 
is diagonally dominant by rows for some nonsingular 
diagonal matrix D. 
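
The inverse-nonnegativity characterization is easy to test on small examples. The sketch below is illustrative only: the helper names are hypothetical, and the inverse is computed by Gauss-Jordan elimination in exact rational arithmetic, which is suitable for tiny matrices but not for practical computation. The first test matrix is the negative of a 3 x 3 second-difference matrix, $2I - B$ with B the nonnegative matrix having ones on its sub- and superdiagonals, for which $\rho(B) = \sqrt{2} < 2$.

```python
from fractions import Fraction


def inverse(A):
    """Exact inverse by Gauss-Jordan elimination (raises StopIteration if singular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # find a nonzero pivot
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def is_m_matrix(A):
    """Test for a nonsingular A: nonpositive off-diagonal entries and A^{-1} >= 0."""
    n = len(A)
    if any(A[i][j] > 0 for i in range(n) for j in range(n) if i != j):
        return False
    return all(v >= 0 for row in inverse(A) for v in row)


T = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
T_inv = inverse(T)
T_is_m = is_m_matrix(T)

# Right sign pattern, but the inverse has negative entries.
C = [[1, -2], [-2, 1]]
C_is_m = is_m_matrix(C)
```

For C the spectral radius condition fails (C = I - B with $\rho(B) = 2 > 1$), and this shows up as negative entries in $C^{-1}$.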

An important source of M-matrices is discretizations 
of differential equations, and the archetypal example is 
the second-difference matrix, described at the start of 
section 6, which is an M-matrix multiplied by -1. For 
this application it is an important result that when A 
is an M-matrix the Jacobi and Gauss-Seidel iterations 
for Ax = b both converge for any starting vector— a 
result that is part of the more general theory of regular 
splittings. 

Another important property of M-matrices is imme- 
diate from the definition: the eigenvalues all lie in the 
open right half-plane. This means that M-matrices are 
special cases of positive stable matrices, which in turn 
are of great interest due to the fact that the stabil- 
ity of various mathematical processes is equivalent to 
positive (or negative) stability of an associated matrix. 

The class of matrices whose inverses are M-matrices 
is also much studied. To indicate why, we state a result 
about matrix roots. It is known that if A is an M-matrix 
then $A^{1/2}$ is also an M-matrix. But if A is stochastic (that
is, it is nonnegative and has unit row sums), $A^{1/2}$ may
not be stochastic. However, if A is both stochastic and
the inverse of an M-matrix, then $A^{1/p}$ is stochastic for
all positive integers p. 

12 Matrix Inequalities 

There is a large body of work on matrix inequal- 
ities, ranging from classical nineteenth-century and 
early twentieth-century inequalities (some of which are
described in section 5.4) to more recent contributions, 
which are often motivated by applications, notably in 
statistics, physics, and control theory. In this section we 
describe just a few examples, chosen for their interest 
or practical usefulness. 

An important class of inequalities on Hermitian
matrices is expressed using the Löwner (partial) ordering
in which, for Hermitian X and Y, $X \ge Y$ denotes that
X - Y is positive-semidefinite while $X > Y$ denotes that
X - Y is positive-definite. Many inequalities between
real numbers generalize to Hermitian matrices in this
ordering. For example, if A, B, C are Hermitian and A
commutes with B and C, then

$$A \ge 0, \quad B \le C \quad \Rightarrow \quad AB \le AC.$$

A function f is matrix monotone if it preserves the
order, that is, $A \ge B$ implies $f(A) \ge f(B)$, where f(A)
denotes a FUNCTION OF A MATRIX [II.14]. Much is known
about this class of functions, including that $t^{1/2}$ and
$\log t$ are matrix monotone but $t^2$ is not.
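
A 2 x 2 example shows concretely why $t^2$ fails to be matrix monotone. The matrices below are an illustrative choice, not taken from the text, and the positive-semidefiniteness test (nonnegative trace and determinant) is valid only for real symmetric 2 x 2 matrices.

```python
def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def sub2(X, Y):
    return [[X[i][j] - Y[i][j] for j in range(2)] for i in range(2)]


def is_psd_2x2(S):
    """Real symmetric 2 x 2 S is positive-semidefinite iff trace >= 0 and det >= 0."""
    trace = S[0][0] + S[1][1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return trace >= 0 and det >= 0


A = [[2, 1], [1, 1]]
B = [[1, 0], [0, 0]]

a_geq_b = is_psd_2x2(sub2(A, B))                 # A - B = [[1, 1], [1, 1]]
diff_sq = sub2(matmul2(A, A), matmul2(B, B))     # A^2 - B^2 = [[4, 3], [3, 2]]
a2_geq_b2 = is_psd_2x2(diff_sq)
```

Here $A \ge B \ge 0$, yet $A^2 - B^2$ has determinant -1 and so is indefinite. Note that A and B do not commute, in keeping with the commutativity hypothesis in the implication displayed earlier in this section.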

Many matrix inequalities involve norms. One exam- 
ple is 

$$\|\,|A| - |B|\,\|_F \le \sqrt{2}\,\|A - B\|_F,$$

where $A, B \in \mathbb{C}^{m\times n}$ and $|\cdot|$ is the matrix absolute value
defined in section 2. This inequality can be regarded as 
a perturbation result that shows the matrix absolute 
value to be very well-conditioned. 

An example of an inequality that finds use in the 
analysis of convergence of methods in nonlinear optimization
is the Kantorovich inequality, which for Hermitian
positive-definite A with eigenvalues
$\lambda_n \le \cdots \le \lambda_1$ and $x \ne 0$ is

$$\frac{(x^*Ax)(x^*A^{-1}x)}{(x^*x)^2} \le \frac{(\lambda_1 + \lambda_n)^2}{4\lambda_1\lambda_n}.$$

This inequality is attained for some x, and the left-hand 
side is always at least 1. 
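
The Kantorovich inequality can be checked numerically. The sketch below takes A to be diagonal, which loses no generality because a Hermitian matrix is unitarily diagonalizable and the ratio is unchanged by a unitary change of variables. The eigenvalues, the random sampling, and the function name are assumptions for illustration.

```python
import random


def kantorovich_ratio(eigs, x):
    """(x*Ax)(x*A^{-1}x) / (x*x)^2 for A = diag(eigs) and real x."""
    xx = sum(v * v for v in x)
    x_a_x = sum(lam * v * v for lam, v in zip(eigs, x))
    x_ainv_x = sum(v * v / lam for lam, v in zip(eigs, x))
    return x_a_x * x_ainv_x / xx ** 2


eigs = [10.0, 2.0, 1.0]                                      # lambda_1 >= ... >= lambda_n > 0
bound = (eigs[0] + eigs[-1]) ** 2 / (4 * eigs[0] * eigs[-1])  # 121/40

rng = random.Random(0)
ratios = [kantorovich_ratio(eigs, [rng.uniform(-1, 1) for _ in eigs])
          for _ in range(1000)]

# Equal weight on the eigenvectors of the extreme eigenvalues attains the bound.
attained = kantorovich_ratio(eigs, [1.0, 0.0, 1.0])
```

The lower limit of 1 is an instance of the Cauchy-Schwarz inequality, while the upper limit depends only on the ratio $\lambda_1/\lambda_n$, that is, on the 2-norm condition number of A.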

Many inequalities are available that generalize scalar 
inequalities for means. For example, the arithmetic-geometric
mean inequality $(ab)^{1/2} \le \tfrac{1}{2}(a + b)$ for positive
scalars has an analogue for Hermitian positive-definite
A and B in the inequality $A \# B \le \tfrac{1}{2}(A + B)$,
where $A \# B$ is the geometric mean defined as the unique
Hermitian positive-definite solution to $XA^{-1}X = B$. The
geometric mean also satisfies the extremal property

$$A \# B = \max\left\{ X : X = X^*,\ \begin{bmatrix} A & X \\ X & B \end{bmatrix} \ge 0 \right\},$$

which hints at matrix completion problems, in which the 
aim is to choose missing elements of a matrix in order 


to achieve some goal, which could be to satisfy a partic- 
ular matrix property or, as here, to maximize an objec- 
tive function. Another mean for Hermitian positive- 
definite matrices (and applicable more generally) is the 
log-Euclidean mean, $\exp(\tfrac{1}{2}(\log A + \log B))$, where log
is the principal LOGARITHM [II.14], which is used in
image registration, for example. 

Finally, we mention an inequality for the matrix expo- 
nential. Although there is no simple relation between 
$e^{A+B}$ and $e^A e^B$ in general, for Hermitian A and B the
inequality $\mathrm{trace}(e^{A+B}) \le \mathrm{trace}(e^A e^B)$ was proved independently
by S. Golden and J. Thompson in 1965. Originally
of interest in statistical mechanics, the Golden-Thompson
inequality has more recently found use
in RANDOM-MATRIX THEORY [IV.24]. Again for Hermitian
A and B, the related inequalities $\|e^{A+B}\| \le
\|e^{A/2} e^B e^{A/2}\| \le \|e^A e^B\|$ hold for any unitarily invariant
norm. 
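
A small numerical check of the Golden-Thompson inequality is sketched below for a pair of noncommuting real symmetric 2 x 2 matrices. The matrices are an illustrative choice, and `expm_taylor` is a hypothetical helper that sums a truncated Taylor series; that is adequate for matrices of small norm such as these but is not a recommended general-purpose algorithm for the matrix exponential.

```python
import math


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def expm_taylor(X, terms=40):
    """Truncated Taylor series for exp(X); for small-norm matrices only."""
    n = len(X)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in matmul(term, X)]   # X^k / k!
        result = [[r + t for r, t in zip(rrow, trow)]
                  for rrow, trow in zip(result, term)]
    return result


def trace(X):
    return sum(X[i][i] for i in range(len(X)))


A = [[1.0, 0.0], [0.0, -1.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
A_plus_B = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

lhs = trace(expm_taylor(A_plus_B))                    # trace(e^{A+B})
rhs = trace(matmul(expm_taylor(A), expm_taylor(B)))   # trace(e^A e^B)

# Closed forms for this particular pair, used as a cross-check.
lhs_exact = 2 * math.cosh(math.sqrt(2))
rhs_exact = 2 * math.cosh(1) ** 2
```

For this pair the two traces differ noticeably (roughly 4.36 against 4.76), reflecting the fact that A and B do not commute; for commuting matrices the inequality holds with equality.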

13 Library Software 

From the early days of digital computing the benefits
of providing library subroutines for carrying out basic
operations such as the addition of vectors and the formation
of vector inner products were recognized. Over
the ensuing years many matrix computation research 
codes were published, including in the linear algebra 
volume of the Handbook for Automatic Computation 
(1971) and in the Collected Algorithms of the ACM. 
Starting in the 1970s the concept of standardized sub- 
programs was developed in the form of the Basic Lin- 
ear Algebra Subprograms (BLAS), which are specifica- 
tions for vector (level 1), matrix-vector (level 2), and 
matrix-matrix (level 3) operations. The BLAS have been 
widely adopted, and highly optimized implementations 
are available for most machines. The freely available 
LAPACK library of Fortran codes represents the current 
state of the art for solving dense linear equations, least- 
squares problems, and eigenvalue and singular value 
problems. Many modern programming packages and 
environments build on LAPACK. 

It is interesting to note that the TOP500 list
(www.top500.org) ranks the world's fastest computers by
their speed (measured in flops per second) in solving 
a random linear system Ax = b by GE. This benchmark 
has its origins in the 1970s LINPACK project, a precur- 
sor to LAPACK, in which the performance of contempo- 
rary machines was compared by running the LINPACK 
GE code on a 100 x 100 system. 
14 Outlook

Matrix analysis and numerical linear algebra remain 
very active areas of research. Many problems in applied 
mathematics and scientific computing require the solu- 
tion of a matrix problem at some stage, so there is 
always a demand for better understanding of matrix 
problems and faster and more accurate algorithms for 
their solution. As the overarching applications evolve, 
new problem variants are generated, often involving 
new assumptions on the data, different requirements 
on the solution, or new metrics for measuring the suc- 
cess of an algorithm. A further driver of research is 
computer hardware. With the advent of processors with 
many cores, the use of accelerators such as graph- 
ics processing units, and the harnessing of vast num- 
bers of processors for parallel computing, the standard 
algorithms in numerical linear algebra are having to be 
reorganized and possibly even replaced, so we are likely 
to see significant changes in the coming years. 

Further Reading 

Three must-haves for researchers are the influential 
treatment of numerical linear algebra by Golub and 
Van Loan and the two volumes by Horn and Johnson, 
which contain a comprehensive treatment of matrix 
analysis. 
IV.11 Continuous Optimization
(Nonlinear and Linear
Programming)

Stephen J. Wright 


1 Overview 

At the core of any optimization problem is a mathe- 
matical model of a system. This model could be con- 
structed according to physical, economic, behavioral, 
or statistical principles, and it describes relationships 
between variables that define the state of the system; it 
may also place restrictions on the states, in the form of 
constraints on the variables. The model also includes 
an objective function that measures the desirability of 
a given set of variables. The optimization problem is to 
find the set of variables that achieves the best possible 
value of the objective function, among all those values 
that satisfy the given constraints. 

1.1 Examples 

Optimization problems are ubiquitous, as we illustrate 
with some examples. 

(1) A firm wishes to maximize its profit, given con- 
straints on availability of resources (equipment, labor, 
raw materials), production costs, and forecast demand. 

(2) In order to forecast weather, we first need to solve 
a problem to identify the state of the atmosphere a 
few hours ago. This is done by finding the state that 
is most consistent with recent meteorological observa- 
tions (temperature, wind speed, humidity, etc.) taken 
at a variety of locations and times. The model con- 
tains differential equations that describe evolution of 
the atmosphere, statistical elements that describe prior 
knowledge of the atmospheric state, and an objective 
function that measures the consistency between the 
atmospheric state and the observations. 

(3) Computer systems for recognizing handwritten dig- 
its contain models that read the written character, in 
the form of a pixelated image, and output their best 
guess as to the digit that is represented in the image. 
These models can be “trained” by presenting them with 
a (typically large) set of images containing known dig- 
its. An optimization problem is solved to adjust the 
parameters in the model so that the error count on the 
training set is minimized. If the training set is represen- 
tative of the images that the system will see in future, 
this optimized model can be trusted to perform reliable
digit recognition. 

(4) Given a product that is produced in a number of 
cities and consumed in other cities, we wish to find the 
least expensive way to transport the product from sup- 
ply locations to demand locations. Here, the model con- 
sists of a graph that describes the transportation net- 
work, capacity constraints, and the cost of transporting 
one unit of the product between two adjacent locations 
in the network. 

(5) Given a set of possible investments along with the 
means, variances, and correlations of their expected 
returns, an investor wishes to allocate his or her funds 
in a way that balances the expected mean return of the 
portfolio with its variance, in a way that fits his or her 
appetite for risk. 

These examples capture some of the wide variety of 
applications currently seen in the field. As they suggest, 
the mathematical models that underlie optimization 
problems vary widely in size, complexity, and struc- 
ture. They may contain simple algebraic relationships, 
systems of ordinary or partial differential equations, 
models derived from Bayesian statistics, and “black- 
box” models whose internal details are not accessi- 
ble and can be accessed only by supplying inputs and 
observing outputs. 

1.2 Continuous Optimization 

In continuous optimization, the variables in the model 
are nominally allowed to take on a continuous range of 
values, usually real numbers. This feature distinguishes 
continuous optimization from discrete or combinato- 
rial optimization, in which the variables may be binary 
(restricted to the values 0 and 1), integer (for which only 
integer values are allowed), or more abstract objects 
drawn from sets with finitely many elements. (Discrete
optimization [IV.38] is the subject of another article
in this volume.)

The algorithms used to solve continuous optimiza- 
tion problems typically generate a sequence of values 
of the variables, known as iterates, that converge to a 
solution of the problem. In deciding how to step from 
one iterate to the next, the algorithm makes use of 
knowledge gained at previous iterates, and information 
about the model at the current iterate, possibly includ- 
ing information about its sensitivity to perturbations 
in the variables. The continuous nature of the prob- 
lem allows sensitivities to be defined in terms of first 


and second derivatives of the functions that define the 
models. 

1.3 Standard Paradigms 

Research in continuous optimization tends to be orga- 
nized into several paradigms, each of which makes cer- 
tain assumptions about the properties of the objective 
function, variables, and constraints. To define these 
paradigms, we group the variables into a real vector 
x with n components (that is, $x \in \mathbb{R}^n$) and define the
general continuous optimization problem as follows:

$$\min_{x \in \mathbb{R}^n} f(x) \tag{1a}$$

$$\text{subject to} \quad c_i(x) = 0, \quad i \in \mathcal{E}, \tag{1b}$$

$$c_i(x) \ge 0, \quad i \in \mathcal{I}, \tag{1c}$$

where the objective f and the constraints $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$,
are real-valued functions on $\mathbb{R}^n$. To this formulation is
sometimes added a geometric constraint

$$x \in \Omega, \tag{2}$$

where $\Omega \subset \mathbb{R}^n$ is a closed convex set. (Any nonconvexity
in the feasible set is conventionally captured by the
algebraic constraints (1b) and (1c) rather than the geometric
constraint (2).) All functions in (1) are assumed
to at least be continuous. A point x that satisfies all the 
constraints is said to be feasible. 

There is considerable flexibility in the way that a 
given optimization problem can be formulated; the 
choice of formulation has a strong bearing on the 
effectiveness with which the problem can be solved. 
One common reformulation technique is to replace an 
inequality constraint by an equality constraint plus a 
bound by introducing a new “slack” variable: 

$$c_i(x) \ge 0 \quad \Longleftrightarrow \quad c_i(x) + s_i = 0, \quad s_i \le 0.$$

Referring to the general form (1), we distinguish 
several popular paradigms. 

• In linear programming, the objective function and
the constraints are affine functions of x; that is,
they have the form $a^T x + b$ for some $a \in \mathbb{R}^n$ and
$b \in \mathbb{R}$.

• In quadratic programming, we have

$$f(x) = \tfrac{1}{2} x^T Q x + c^T x + d$$

for some n x n symmetric matrix Q, vector $c \in \mathbb{R}^n$,
and scalar $d \in \mathbb{R}$, while all constraints $c_i$ are
linear. When Q is positive-semidefinite, we have a
convex quadratic program.

• In convex programming, the objective f and the
negated inequality constraint functions $-c_i$, $i \in \mathcal{I}$,
are convex functions, while the equality constraints
$c_i$, $i \in \mathcal{E}$, are affine functions. (These
assumptions, along with the convexity and closedness
of $\Omega$ in the case when (2) is included in the
formulation, imply that the set of feasible points
is closed and convex.)

• Conic optimization problems have the form (1), (2),
where the set $\Omega$ is assumed to be a closed, convex
cone that is pointed (that is, it contains no
straight line), while the objective f and equality
constraints $c_i$, $i \in \mathcal{E}$, are assumed to be affine.
There are no inequalities; that is, $\mathcal{I} = \emptyset$.

• In unconstrained optimization, there are not any
constraints (1b), (1c), and (2), while the objective
f is usually assumed to be smooth, with at least
continuous first derivatives. Nonsmooth optimization
allows f to have discontinuous first derivatives,
but it is often assumed that f has some other
structure that can be exploited by the algorithms.

• In nonlinear programming, the functions f and $c_i$,
$i \in \mathcal{E} \cup \mathcal{I}$, are generally nonlinear but smooth, at
least having continuous first partial derivatives on
the region of interest.

An important special class of conic optimization 
problems is semidefinite programming, in which the 
vector x of unknowns contains the elements of a symmetric
m x m matrix X that is required to be positive-semidefinite.
It is natural and useful to write this
problem in terms of the matrix X as follows:

$$\min_{X \in S\mathbb{R}^{m\times m}} C \bullet X \tag{3a}$$

$$\text{subject to} \quad A_i \bullet X = b_i, \quad i = 1, 2, \dots, p, \tag{3b}$$

$$X \succeq 0. \tag{3c}$$

Here, $S\mathbb{R}^{m\times m}$ denotes the set of symmetric m x m
matrices, the matrices C and $A_i$, $i = 1, 2, \dots, p$, all
belong to $S\mathbb{R}^{m\times m}$, and the operator $\bullet$ is defined on pairs
of matrices in $S\mathbb{R}^{m\times m}$ as follows:

$$X \bullet Z = \sum_{i=1}^{m} \sum_{j=1}^{m} x_{ij} z_{ij} = \mathrm{trace}(XZ).$$

The constraint (3c) instantiates the geometric constraint
(2).
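
The identity between the entrywise sum and trace(XZ) relies on the symmetry of the matrices, which a short check makes visible. The sketch below uses hypothetical helper names and small integer matrices chosen for illustration.

```python
def bullet(X, Z):
    """Entrywise inner product: sum over i, j of x_ij * z_ij."""
    return sum(x * z for xrow, zrow in zip(X, Z) for x, z in zip(xrow, zrow))


def trace_of_product(X, Z):
    """trace(XZ) computed without forming the off-diagonal entries of XZ."""
    n = len(X)
    return sum(X[i][k] * Z[k][i] for i in range(n) for k in range(n))


# Symmetric pair: the two expressions agree.
X = [[1, 2], [2, 3]]
Z = [[4, 5], [5, 6]]
sym_bullet = bullet(X, Z)
sym_trace = trace_of_product(X, Z)

# Nonsymmetric pair: they can differ (the general identity uses the transpose of X).
N = [[0, 1], [0, 0]]
nonsym_bullet = bullet(N, N)
nonsym_trace = trace_of_product(N, N)
```

With this inner product the objective (3a) and the constraints (3b) are linear in the entries of X, so the only nonlinearity in the semidefinite program lies in the cone constraint (3c).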

Terminology 

“Mathematical programming” is a historical term that 
encompasses optimization and closely related areas 
such as complementarity problems. Its origins date to 


the 1940s, with the development of the simplex method 
of George Dantzig, the first effective method for lin- 
ear programming (7). The term “programming” origi- 
nally referred to the formalized, systematic mathemat- 
ical procedure by which problems can be solved. Only 
later did “programming” become roughly synonymous 
with “computer programming,” causing some confu- 
sion that optimization researchers have often been 
called on to explain. The more modern term “optimiza- 
tion” is generally preferred, although the term “pro- 
gramming” is still attached (probably forever) to such 
problems as linear programming and integer program- 
ming. 

1.4 Scope of Research 

Research in optimization encompasses 

• study of the mathematical properties of the prob- 
lems themselves; 

• development, testing, and analysis of algorithms 
for solving particular classes of problems (such as 
one of the paradigms described above); and 

• development of models and algorithms for spe- 
cific application areas. 

We give a brief description of each of these aspects. 

One topic of fundamental interest is the characteri- 
zation of solution sets: are there verifiable conditions 
that we can check to determine whether a given point 
is a solution to the optimization problem? Given the 
uncertainty that is present in many practical settings, 
we may also be interested in the sensitivity of the solu- 
tion to perturbations in the data or in the objective 
and constraint functions. Ill-conditioned problems are 
those in which the solution can change significantly 
when the data or functions change slightly. Another 
important fundamental concept is duality. Often, the 
data and functions that define an optimization prob- 
lem can be rearranged to produce a new “dual” prob- 
lem that is related to the original problem in interesting 
ways. The concept of duality can also be of great practi- 
cal importance in designing more efficient formulations 
and algorithms. 

The study of algorithms for optimization problems 
blends theory and practice. The design of algorithms 
that work well on practical problems requires a good 
deal of intuition and testing. Most algorithms in use 
today have a solid theoretical basis, but the theory 
often allows wide latitude in the choice of certain 
parameters, and algorithms are often “engineered” to 
find suitable values for these parameters and to incorporate
other heuristics. Analysis of algorithms tackles
such issues as whether the iterates can be guaranteed 
to converge to a solution (or some other point of inter- 
est); whether there is an upper bound on the number of 
iterations needed, as a function of the size or complex- 
ity of the problem; and the rate of convergence, partic- 
ularly after the iterates enter a certain neighborhood 
of the solution. Algorithmic analysis is typically worst- 
case in nature. It gives important indications about how 
the algorithm will behave in practice, but it does not 
tell the whole story. Famously, the simplex method 
(described in section 3.1) is known to perform badly 
in the worst case — its running time may be exponen- 
tial in the problem size— yet its performance on most 
practical problems is impressively good. 

Development of software that implements efficient 
algorithms is another important activity. High-quality 
codes are available both commercially and in the pub- 
lic domain. Modeling tools— high-level languages that 
serve as a front end to algorithmic software packages — 
have become more popular in recent years. They relieve 
users of much of the burden of transforming their prac- 
tical problem to a set of functions (the objective and 
constraint functions in (1)), allowing the model to be 
expressed in intuitive terms, related more directly to 
the application. 

With the growth in the size and complexity of practi- 
cal optimization problems, issues of modeling, formu- 
lation, and customized algorithm design have become 
more prominent. A particular application can be for- 
mulated as an optimization problem in many different 
ways, and different formulations can lead to very different
solver performance. Experience and testing are often
required to identify the most effective formulation.

Many modern applications cannot be solved effec- 
tively with packaged software for one of the standard 
paradigms of section 1.3. It is necessary to assemble a 
customized algorithm, drawing on a variety of algorith- 
mic elements from the optimization toolbox and also 
on tools from other disciplines in scientific computing. 
This approach allows the particular structure or con- 
text of the problem to be exploited. Examples of special 
context include the following. 

• Low-accuracy solutions may suffice for some prob- 
lems. 

• Algorithms that require less data movement— or 
sampling from a large data set, or the ability to 
handle streaming data — may be essential in other 
settings. 


• Algorithms that produce (possibly suboptimal) 
solutions in real time may be essential in such 
contexts as industrial control. 

1.5 Connections 

Continuous optimization is a highly interconnected 
discipline, having close relationships with other areas 
of mathematics, with scientific computing, and with 
numerous application areas. It also has close connec- 
tions to discrete optimization, which often requires 
continuous optimization problems to be solved as 
subproblems or relaxations. 

In mathematics, continuous optimization relies heav- 
ily on various forms of mathematical analysis, espe- 
cially real analysis and functional analysis. Certain 
types of analysis have been developed in close asso- 
ciation with the discipline of optimization, including 
convex analysis, nonsmooth analysis, and variational 
analysis. The theory of computational complexity also 
plays a role in the study of algorithms. Game theory is 
particularly relevant when we examine duality and opti- 
mality conditions for optimization problems. Control 
theory is also relevant, both for framing problems involving
dynamical models and as an important source of
applications for optimization. Statistics provides vital 
tools for stochastic optimization and for optimization 
in machine learning, in which the model is available 
only through sampling from a data set. 

Continuous optimization also intersects with many 
areas in numerical analysis and scientific computing. 
Numerical linear algebra is vitally important, since 
many optimization algorithms generate a sequence of 
linear approximations, and these must be solved with 
linear algebra tools. Differential equation solvers are 
important counterparts to optimization in applications 
such as data assimilation and distributed parameter 
identification, which involve optimization of ordinary 
differential equation and partial differential equation 
models. The ubiquity of multicore architectures and the 
wide availability of cluster computing have given new 
prominence to parallel algorithms in some areas (such 
as machine learning), requiring the use of software 
tools for parallel computing. 

Finally, we mention some of the many connections 
between optimization and the application areas within 
which it has become deeply embedded. Machine learn- 
ing uses optimization algorithms extensively to per- 
form classification and learning tasks. The challenges 
posed by machine learning applications (e.g., large
data sets) have driven recent developments in stochastic
optimization and large-scale unconstrained optimization.
Compressed sensing, in which sparse signals
are recovered from randomized encodings, also relies 
heavily on optimization formulations and specialized 
algorithms. Engineering control is a rich source of chal- 
lenging optimization problems at many scales, fre- 
quently involving dynamic models of plant processes. 
In these and many other areas, practitioners have made 
important contributions to all aspects of continuous 
optimization. 

2 Basic Principles 

We mention here some basic theory that underpins con- 
tinuous optimization and that serves as a starting point 
for the algorithms outlined in later sections. 

Possibly the most fundamental issues are how we 
define a solution to a problem and how we recognize a 
given point as a solution. The answers become more 
complicated as we expand the classes of functions 
allowed in the formulation. The type of solution most 
amenable to analysis is a local solution. The point $x^*$ is
a local solution of (1) if $x^*$ is feasible and if there is an
open neighborhood $\mathcal{N}$ of $x^*$ such that $f(x) \ge f(x^*)$
for all feasible points $x \in \mathcal{N}$. Furthermore, $x^*$ is a strict
local solution if $f(x) > f(x^*)$ for all feasible $x \in \mathcal{N}$
with $x \ne x^*$. A global solution is a point $x^*$ such that
$f(x) \ge f(x^*)$ for all feasible $x$.

As we see below, we can use the derivatives of the 
objective and constraint functions to construct testable 
conditions that verify that x* is a local solution under 
certain assumptions. It is difficult to verify global opti- 
mality, even when the objective and constraint func- 
tions are smooth, because of the difficulty of gaining 
a global perspective on these functions. However, in 
convex optimization, where the objective $f$ is a convex
function and the set of feasible points is also convex,
all local solutions are global solutions. (Convex
optimization includes linear programming and conic 
optimization as special cases.) 

Global optimization techniques have also been de- 
vised for certain classes of nonconvex problems. It is 
possible to prove results about the performance of 
such methods when the function f satisfies additional 
properties (such as Lipschitz continuity, with known 
Lipschitz constant) and the feasible region is bounded. 
One class of methods for solving the global optimization
problem uses a process of subdividing the feasible
region and using information about $f$ to obtain a
lower bound on the objective in that region, leading to
a branch-and-bound algorithm akin to methods used in
integer programming.

We now turn to characterizations of local solutions
for problems defined by smooth functions, assuming
for simplicity that $f$ and $c_i$, $i \in \mathcal{E} \cup \mathcal{I}$, have continuous
second partial derivatives. We use $\nabla f(x)$ to denote
the gradient of $f$ (the vector $[\partial f/\partial x_i]_{i=1}^n$ of first
partial derivatives) and $\nabla^2 f(x)$ to denote the Hessian of $f$
(the $n \times n$ matrix $[\partial^2 f/\partial x_i \partial x_j]_{i,j=1}^n$ of second partial
derivatives). An important tool, both in the characterization
of solutions for smooth problems and in the
design of algorithms, is Taylor’s theorem. This result
can be used to estimate the value of $f$ by using its
derivative information at a nearby point. For example,
we have

$$f(x + p) = f(x) + \nabla f(x)^T p + o(\|p\|), \tag{4a}$$

$$f(x + p) = f(x) + \nabla f(x)^T p + \tfrac{1}{2} p^T \nabla^2 f(x) p + o(\|p\|^2), \tag{4b}$$

where the notation o(t) indicates a quantity that goes 
to zero faster than t. These formulas can be used to 
construct low-order approximations to (1) that are valid 
in the neighborhood of a current iterate x and can thus 
be used to identify a possibly improved iterate x + p. 
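
As a numerical illustration of (4a) and (4b), the following Python sketch compares a function with its first- and second-order Taylor models along a shrinking step. The test function, the base point, and the helper names (`taylor_errors` and so on) are illustrative assumptions, not taken from the text. If the expansions hold, both printed ratios should shrink as the step shrinks.

```python
import math

def f(x):
    # Illustrative smooth test function (an assumption): exp(x1) + x1 * x2^2
    return math.exp(x[0]) + x[0] * x[1] ** 2

def grad(x):
    return [math.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]]

def hess(x):
    return [[math.exp(x[0]), 2.0 * x[1]],
            [2.0 * x[1], 2.0 * x[0]]]

def taylor_errors(x, p):
    """Return (|first-order error| / ||p||, |second-order error| / ||p||^2)."""
    g, H = grad(x), hess(x)
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    linear = f(x) + sum(gi * pi for gi, pi in zip(g, p))
    quad = linear + 0.5 * sum(p[i] * H[i][j] * p[j]
                              for i in range(2) for j in range(2))
    xp = [xi + pi for xi, pi in zip(x, p)]
    return abs(f(xp) - linear) / norm_p, abs(f(xp) - quad) / norm_p ** 2

x0 = [0.5, 1.0]
for t in (1e-1, 1e-2, 1e-3):
    r1, r2 = taylor_errors(x0, [t, -t])
    print(f"t={t:g}  first-order ratio={r1:.3e}  second-order ratio={r2:.3e}")
```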

For unconstrained optimization of a smooth function
$f$, we have the following necessary condition.

If $x^*$ is a local solution of $\min_x f(x)$, then $\nabla f(x^*) = 0$.

Note that this is only a necessary condition; it is possible
to have $\nabla f(x) = 0$ without $x$ being a minimizer. (An
example is the scalar function $f(x) = x^3$, which has no
minimizer but which has $\nabla f(0) = 0$.) To complement
this result, we have the following sufficient condition.

If $x^*$ is a point such that $\nabla f(x^*) = 0$ with $\nabla^2 f(x^*)$
positive-definite, then $x^*$ is a strict local solution of
$\min_x f(x)$.
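
The two conditions can be checked mechanically on small examples. The Python sketch below does so for an assumed two-variable quadratic (chosen only for illustration) and for the scalar cubic mentioned above; the helper `is_positive_definite_2x2` is a hypothetical name and uses Sylvester's criterion, which applies to symmetric matrices.

```python
def is_positive_definite_2x2(H):
    # Sylvester's criterion for a symmetric 2x2 matrix:
    # both leading principal minors must be positive.
    return H[0][0] > 0 and H[0][0] * H[1][1] - H[0][1] * H[1][0] > 0

# Assumed example: f(x) = (x1 - 1)^2 + x1 * x2 + 2 * x2^2
def grad_f(x):
    return [2.0 * (x[0] - 1.0) + x[1], x[0] + 4.0 * x[1]]

hess_f = [[2.0, 1.0], [1.0, 4.0]]   # constant Hessian of this quadratic

x_star = [8.0 / 7.0, -2.0 / 7.0]    # solves grad_f(x) = 0
stationary = all(abs(gi) < 1e-12 for gi in grad_f(x_star))
strict_local_min = stationary and is_positive_definite_2x2(hess_f)
print("sufficient condition holds:", strict_local_min)

# Scalar counterexample: zero derivative at 0, but 0 is not a minimizer.
g = lambda x: x ** 3
dg = lambda x: 3.0 * x ** 2
print("g'(0) == 0:", dg(0.0) == 0.0, " g(-0.1) < g(0):", g(-0.1) < g(0.0))
```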

Turning to constrained optimization— the general 
form (1), with smooth functions— characterization of 
local solutions becomes somewhat more complex. A 
central role is played by the Lagrangian function, 
defined as follows: 

$$\mathcal{L}(x, \lambda) = f(x) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i c_i(x). \tag{5}$$

This is a linear combination of objective and constraint 
functions, where the weights $\lambda_i$ are called Lagrange
multipliers. The following set of conditions, known as 
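the Karush-Kuhn-Tucker conditions or KKT conditions
after their inventors, are closely related to local optimality
of $x^*$ for the problem (1): there exist $\lambda_i^*$,
$i \in \mathcal{E} \cup \mathcal{I}$, such that

$$\nabla_x \mathcal{L}(x^*, \lambda^*) = 0, \tag{6a}$$

$$c_i(x^*) = 0, \quad i \in \mathcal{E}, \tag{6b}$$

$$c_i(x^*) \ge 0, \quad i \in \mathcal{I}, \tag{6c}$$

$$\lambda_i^* \ge 0, \quad i \in \mathcal{I}, \tag{6d}$$

$$\lambda_i^* c_i(x^*) = 0, \quad i \in \mathcal{I}. \tag{6e}$$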


Condition (6e) is a complementarity condition: for each
inequality constraint, at least one of the constraint
value $c_i(x^*)$ and its Lagrange multiplier $\lambda_i^*$ must be zero.
Roughly speaking, the Lagrange multipliers measure
the sensitivity of the optimal objective value $f(x^*)$ to
perturbations in the constraints $c_i$. When $x^*$ is a local
solution of (1), the KKT conditions will hold, provided 
an additional condition called a constraint qualification 
is satisfied. The constraint qualification requires the 
true geometry of the feasible set near x* to be captured 
by linear approximations to the constraint functions 
around x*. 
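
To make the KKT conditions concrete, the following Python sketch evaluates their residuals for a small assumed problem: minimize $(x_1 - 1)^2 + (x_2 - 2)^2$ subject to the single inequality constraint $c(x) = 1 - x_1 - x_2 \ge 0$, with no equality constraints. The problem and the function names are illustrative assumptions; the sign convention follows the Lagrangian in (5).

```python
def kkt_residuals(x, lam):
    """Residuals of the KKT conditions for the assumed example problem."""
    grad_f = [2.0 * (x[0] - 1.0), 2.0 * (x[1] - 2.0)]
    c = 1.0 - x[0] - x[1]
    grad_c = [-1.0, -1.0]
    # Gradient of the Lagrangian f(x) - lam * c(x) with respect to x
    stationarity = [gf - lam * gc for gf, gc in zip(grad_f, grad_c)]
    return {
        "stationarity": max(abs(r) for r in stationarity),
        "primal_infeasibility": max(0.0, -c),   # c(x) >= 0
        "dual_infeasibility": max(0.0, -lam),   # lam >= 0
        "complementarity": abs(lam * c),        # lam * c(x) = 0
    }

def satisfies_kkt(x, lam, tol=1e-10):
    return all(r <= tol for r in kkt_residuals(x, lam).values())

print(satisfies_kkt([0.0, 1.0], 2.0))   # constrained minimizer, active constraint
print(satisfies_kkt([1.0, 2.0], 0.0))   # unconstrained minimizer, infeasible here
```

The first point is the projection of $(1, 2)$ onto the half-plane $x_1 + x_2 \le 1$; the multiplier is nonzero there because the constraint is active.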

When the functions in (1) are nonsmooth, it becomes
harder to define optimality conditions, as even the concept
of derivative becomes more complicated. We consider
the simplest problem of this type: the unconstrained
problem $\min_x f(x)$, where $f$ is a convex (possibly
nonsmooth) function. The subdifferential of $f$ at
a point $x$ is defined from the collection of supporting
hyperplanes to $f$ at $x$:

$$\partial f(x) := \{ v : f(z) \ge f(x) + v^T (z - x) \text{ for all } z \text{ in the domain of } f \}.$$

For example, the function $f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|$
is nonsmooth, with subdifferential consisting of the
vectors $v$ such that

$$v_i \begin{cases} = +1 & \text{if } x_i > 0, \\ \in [-1, 1] & \text{if } x_i = 0, \\ = -1 & \text{if } x_i < 0. \end{cases}$$

When $f$ is smooth at $x$ in addition to being convex,
we have $\partial f(x) = \{\nabla f(x)\}$. A necessary and sufficient
condition for $x^*$ to be a solution of $\min_x f(x)$ is that
$0 \in \partial f(x^*)$.
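
The subdifferential of the 1-norm is easy to test in code. The sketch below (function names are hypothetical) checks membership componentwise, spot-checks the defining subgradient inequality at a few arbitrary trial points, and confirms that the zero vector is a subgradient at the origin, which is the optimality condition for minimizing the 1-norm.

```python
def l1(x):
    return sum(abs(xi) for xi in x)

def in_l1_subdifferential(v, x, tol=1e-12):
    """Check whether v is a subgradient of the 1-norm at x."""
    for vi, xi in zip(v, x):
        if xi > 0 and abs(vi - 1.0) > tol:
            return False
        if xi < 0 and abs(vi + 1.0) > tol:
            return False
        if xi == 0 and abs(vi) > 1.0 + tol:
            return False
    return True

x = [2.0, 0.0, -3.0]
v = [1.0, 0.4, -1.0]
print(in_l1_subdifferential(v, x))

# Subgradient inequality f(z) >= f(x) + v^T (z - x) at a few trial points
trial_points = [[0.0, 0.0, 0.0], [1.0, -1.0, 1.0], [-2.0, 5.0, -3.5]]
inequality_holds = all(
    l1(z) >= l1(x) + sum(vi * (zi - xi) for vi, zi, xi in zip(v, z, x)) - 1e-12
    for z in trial_points
)
print(inequality_holds)

# 0 is a subgradient at the origin, so the origin minimizes the 1-norm
print(in_l1_subdifferential([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]))
```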

3 Linear Programming 

Consider the problem 

$$\min_x \; c^T x \quad \text{subject to } Ax = b, \; x \ge 0, \tag{7}$$

where $x \in \mathbb{R}^n$ as before, $b \in \mathbb{R}^m$ is the right-hand
side, and $A \in \mathbb{R}^{m \times n}$ is the constraint matrix. Any optimization
problem with an affine objective function and
affine constraints can, after some elementary transformations,
be written in this standard form. As illustrated
by the example in figure 1, the feasible region
for the problem (7) is polyhedral, and the contours of 
the objective function are lines. 

There are three possible outcomes for a linear pro- 
gram. 

(a) The problem is infeasible; that is, there is no point
$x$ that satisfies $Ax = b$ and $x \ge 0$.

(b) The problem is unbounded; that is, there is a
sequence of feasible points $x^k$ such that $c^T x^k \to -\infty$.

(c) The problem has a solution; that is, there is a feasible
point $x^*$ such that $c^T x^* \le c^T x$ for all feasible $x$.

When a solution exists (case (c)), it may not be uniquely 
defined. However, we can note that the set of solutions 
itself forms a polyhedron and that at least one solution 
lies at a vertex of this polyhedron, that is, at a point that 
does not lie in the interior of a line joining any other 
two points in the polyhedron. 
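
The vertex property suggests a brute-force check for tiny problems: enumerate every choice of $m$ columns, solve for the corresponding basic point, discard infeasible ones, and keep the best. The Python sketch below does this for an assumed example with $m = 2$ and $n = 4$, using exact rational arithmetic. It is an illustration only, not the simplex method, and it is valid only when the linear program is known to have a solution (here the feasible region is bounded).

```python
from fractions import Fraction
from itertools import combinations

def solve_2x2(M, rhs):
    """Solve a 2x2 linear system by Cramer's rule; return None if singular."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if det == 0:
        return None
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det]

def best_vertex(c, A, b):
    """Enumerate basic feasible points of Ax = b, x >= 0 (two constraints only)."""
    n = len(c)
    best = None
    for j, k in combinations(range(n), 2):
        basic = solve_2x2([[A[0][j], A[0][k]], [A[1][j], A[1][k]]], b)
        if basic is None or min(basic) < 0:
            continue  # singular basis or infeasible basic point
        x = [Fraction(0)] * n
        x[j], x[k] = basic
        value = sum(ci * xi for ci, xi in zip(c, x))
        if best is None or value < best[0]:
            best = (value, x)
    return best

# Assumed data: minimize -x1 - 2*x2 with two slack variables x3, x4
c = [Fraction(v) for v in (-1, -2, 0, 0)]
A = [[Fraction(v) for v in row] for row in ((1, 1, 1, 0), (1, 3, 0, 1))]
b = [Fraction(4), Fraction(6)]
value, x = best_vertex(c, A, b)
print(value, [str(xi) for xi in x])
```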

By rearranging the data in (7), we obtain another
linear program called the dual:

$$\max_{\lambda, s} \; b^T \lambda \quad \text{subject to } A^T \lambda + s = c, \; s \ge 0. \tag{8}$$

(In discussions of duality, the original problem (7) is
called the primal problem.) The primal and dual problems
are related by a powerful duality theory that has
important practical implications. Weak duality states
that, if $x$ is a feasible point for (7) and $(\lambda, s)$ is a feasible
point for (8), then the primal objective is greater
than or equal to the dual objective. This statement is
easily proved in a single line:

$$c^T x = (A^T \lambda + s)^T x \ge \lambda^T A x = \lambda^T b.$$

The other fundamental duality result, strong duality,
states that there are three possible outcomes for the
pair of problems (7) and (8):

(a) one of the two problems is infeasible and the other
is unbounded;

(b) both problems are infeasible; or

(c) (7) has a solution $x^*$ and (8) has a solution $(\lambda^*, s^*)$
with objective functions equal (that is, $c^T x^* = b^T \lambda^*$).

Figure 1 Feasible region (unshaded), objective function contours
(dashed lines), and optimal vertex $x^*$ for a linear
program in two variables.

Specializing (6) to the case of linear programming, we 
see that the primal and dual problems share a common 
set of KKT conditions: 

$$Ax = b, \quad A^T \lambda + s = c, \tag{9a}$$

$$x \ge 0, \quad s \ge 0, \tag{9b}$$

$$x_i s_i = 0, \quad i = 1, 2, \dots, n. \tag{9c}$$

If $(x^*, \lambda^*, s^*)$ is any vector triple that satisfies these
conditions, $x^*$ is a solution of (7) and $(\lambda^*, s^*)$ is a
solution of (8).
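
These conditions give a cheap certificate of optimality. The Python sketch below takes an assumed small linear program and a candidate primal-dual triple (supplied by hand, not computed by the code), verifies the three KKT conditions, and confirms that the primal and dual objectives agree. It also checks weak duality for a second, non-optimal feasible point. All data and names are illustrative assumptions.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rmatvec(A, lam):
    # Product of the transpose of A with lam
    return [sum(A[i][j] * lam[i] for i in range(len(A)))
            for j in range(len(A[0]))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

c = [-1.0, -2.0, 0.0, 0.0]
A = [[1.0, 1.0, 1.0, 0.0], [1.0, 3.0, 0.0, 1.0]]
b = [4.0, 6.0]

# Candidate triple to be verified
x = [3.0, 1.0, 0.0, 0.0]
lam = [-0.5, -0.5]
s = [ci - ai for ci, ai in zip(c, rmatvec(A, lam))]  # enforces A^T lam + s = c

primal_feasible = matvec(A, x) == b and min(x) >= 0
dual_feasible = min(s) >= 0
complementary = all(xi * si == 0 for xi, si in zip(x, s))
print(primal_feasible, dual_feasible, complementary)
print(dot(c, x), dot(b, lam))   # the two objective values

# Weak duality for another primal feasible point
x_feas = [0.0, 0.0, 4.0, 6.0]
print(dot(c, x_feas) >= dot(b, lam))
```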

We now discuss the two most important classes of 
algorithms for linear programming. 

3.1 The Simplex Method 

The simplex method, devised by George Dantzig in
the 1940s, remains a fundamental approach of practical
and theoretical importance in linear programming.
Geometrically speaking, the simplex method moves
from vertex to neighboring vertex of the feasible set,
decreasing the objective function with each move and
terminating when it cannot find a neighboring vertex
with a lower objective value. The method is implemented
by maintaining a basis: a subset of $m$ out of the
$n$ components of $x$ that are allowed to be nonzero at
the current iteration. The values of these basic components
of $x$ are determined uniquely by the $m$ linear constraints
$Ax = b$. Each step of the simplex method starts
by choosing a nonbasic variable to enter the basis. This
variable is allowed to increase away from zero, a process
that, because of the requirement to maintain feasibility
of the linear constraints $Ax = b$, causes the values
of the existing basic variables to change. The entering
variable is allowed to increase to the point where
one of the basic variables reaches zero, upon which it
leaves the basis, and the iteration is complete.

Efficient implementation of the simplex method 
depends both on good “pricing” strategies (to choose 
which nonbasic variable should enter the basis) and on 
efficient linear algebra (to update the values of the basic 
variables as the entering variable increases away from 
zero). Both topics have seen continued development 
over the years, and highly effective software is avail- 
able, both commercially and in the public domain. Spe- 
cialized, highly efficient versions of the simplex method 
exist for some special cases of linear programming, 
such as those arising from transportation or routing 
over networks. 

The simplex method is an example of an active-set 
method: a subset of the inequality constraints (which 
are the bounds $x \ge 0$) is held to be active at each iteration.
In the simplex method, the active set consists of
the nonbasic variables, those indices $i$ for which $x_i = 0$.
This active set changes only slightly from one iteration 
to the next; in fact, the simplex method changes just a 
single component of the active (nonbasic) set at each 
iteration. 

The theoretical properties of the simplex method 
remain a source of fascination because, despite its prac- 
tical efficiency, its worst-case behavior is poor. In an 
article from 1972, Klee and Minty famously showed 
that the number of steps may be exponential in the 
dimension of the problem. There have been various 
attempts to understand the “average-case” behavior, in 
which the number of iterations required is roughly lin- 
ear in the problem dimensions, that is, the numbers of 
variables and constraints. The “smoothed analysis” of 
Spielman and Teng shows that any linear program for 
which the simplex method behaves badly can be mod- 
ified, with small perturbations, to become a problem 
that requires only polynomially many iterations. 

An algorithm with polynomial complexity (in the 
worst case) was announced in 1979: Khachiyan's ellip- 
soid algorithm. Though of great theoretical interest, it 
was not a practical alternative to simplex. The interior- 
point revolution began with Karmarkar's algorithm 
(in 1984); this is also a polynomial-time approach,
but it has much better computational properties than
the ellipsoid approach. It motivated a new class of
algorithms, primal-dual interior-point methods, that
not only had attractive theoretical properties but were 
also truly competitive with the simplex method on 
practical problems. We describe these next. 


3.2 Interior-Point Methods 


As their name suggests, primal-dual interior-point
methods generate a sequence of iterates $(x^k, \lambda^k, s^k)$,
$k = 1, 2, \dots$, in both primal and dual variables, in
which $x^k$ and $s^k$ contain all positive numbers (that is,
they are strictly feasible with respect to the constraints
$x \ge 0$ and $s \ge 0$ in (9b)). Steps between iterates are
obtained by applying Newton’s method to a perturbed
form of the condition (9c) in which the right-hand side
0 is replaced by a positive quantity $\mu_k > 0$, which is
gradually decreased to zero as $k \to \infty$. The Newton
equations for each step $(\Delta x^k, \Delta \lambda^k, \Delta s^k)$ are obtained
from a linearization of these perturbed KKT conditions:
specifically,

$$\begin{bmatrix} 0 & A^T & I \\ A & 0 & 0 \\ S^k & 0 & X^k \end{bmatrix} \begin{bmatrix} \Delta x^k \\ \Delta \lambda^k \\ \Delta s^k \end{bmatrix} = - \begin{bmatrix} A^T \lambda^k + s^k - c \\ A x^k - b \\ X^k S^k \mathbf{1} - \mu_k \mathbf{1} \end{bmatrix},$$

where $X^k$ is the diagonal matrix whose diagonal elements
come from $x^k$, $S^k$ is defined similarly, and $\mathbf{1}$ is
the vector of length $n$ whose elements are all 1. The
new iterate is obtained by setting

$$(x^{k+1}, \lambda^{k+1}, s^{k+1}) = (x^k + \alpha_k \Delta x^k, \lambda^k + \beta_k \Delta \lambda^k, s^k + \beta_k \Delta s^k),$$

where $\alpha_k$ and $\beta_k$ are step lengths in the range $[0, 1]$ chosen
so as to ensure that $x^{k+1} > 0$ and $s^{k+1} > 0$, among
other goals. Convergence, with polynomial complexity,
can be demonstrated under appropriate schemes
for choosing $\mu_k$ and the step lengths $\alpha_k$ and $\beta_k$.
Clever schemes for choosing these parameters and for
enhancing the search directions using “second-order
corrections” lead to good practical behavior.
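
A minimal Python sketch of such an iteration follows, under strong simplifying assumptions that are not from the text: there is a single equality constraint with all coefficients equal to 1 (so the linear system collapses to one scalar equation), the starting point is strictly feasible for both the primal and the dual, the perturbation parameter is a fixed fraction `sigma` of the average complementarity product, and steps are cut back to 90% of the distance to the boundary. Practical codes use more refined rules and general sparse linear algebra.

```python
def step_to_boundary(v, dv, fraction=0.9):
    """Largest step in [0, 1] that keeps v + step * dv strictly positive."""
    limits = [-vi / dvi for vi, dvi in zip(v, dv) if dvi < 0]
    return min(1.0, fraction * min(limits)) if limits else 1.0

def primal_dual_lp(c, b, x, lam, sigma=0.1, tol=1e-9, max_iter=100):
    """min c^T x  subject to  sum(x) = b, x >= 0  (one constraint, A = [1,...,1])."""
    n = len(c)
    s = [ci - lam for ci in c]                 # dual slack: A^T lam + s = c
    assert abs(sum(x) - b) < 1e-12 and min(x) > 0 and min(s) > 0, \
        "this sketch needs a strictly feasible starting point"
    for _ in range(max_iter):
        gap = sum(xi * si for xi, si in zip(x, s))
        if gap < tol:
            break
        mu = sigma * gap / n                   # perturbation parameter
        r = [mu - xi * si for xi, si in zip(x, s)]   # right-hand side of third block
        # With zero primal and dual residuals the Newton system reduces to
        #   (sum_i x_i / s_i) * dlam = -sum_i r_i / s_i,   ds_i = -dlam,
        #   dx_i = (r_i - x_i * ds_i) / s_i.
        dlam = -sum(ri / si for ri, si in zip(r, s)) / sum(xi / si for xi, si in zip(x, s))
        ds = [-dlam] * n
        dx = [(ri - xi * dsi) / si for ri, xi, dsi, si in zip(r, x, ds, s)]
        alpha = step_to_boundary(x, dx)        # primal step length
        beta = step_to_boundary(s, ds)         # dual step length
        x = [xi + alpha * dxi for xi, dxi in zip(x, dx)]
        lam += beta * dlam
        s = [si + beta * dsi for si, dsi in zip(s, ds)]
    return x, lam, s

x, lam, s = primal_dual_lp(c=[1.0, 2.0], b=1.0, x=[0.5, 0.5], lam=0.0)
print(x, lam, s)
```

For this example the solution places all weight on the cheaper variable, and the iterates approach it from the interior of the feasible region rather than moving between vertices.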

Primal-dual interior-point methods have the addi- 
tional virtue that they are easily extendible to convex 
quadratic programming and monotone linear comple- 
mentarity problems, with only minor changes to the 
algorithm and the convergence theory. As we see in sec- 
tion 6.3, they can also be extended to general nonlinear 
programming, though the modifications in this case are 
more substantial and the convergence guarantees are 
weaker. 


4 Unconstrained Optimization 

Consider the problem of simply minimizing a function 
without constraints: 

$$\min_x f(x),$$

where $f$ has at least continuous first derivatives. This
problem is important in its own right. It also appears 
as a subproblem in many methods for constrained opti- 
mization, and it serves to illustrate several algorithmic 
techniques that can also be applied to the constrained 
case. 

4.1 First-Order Methods 

The Taylor approximation (4a) shows that $f$ decreases
most rapidly in the direction of the negative gradient
vector $-\nabla f(x)$. Steepest-descent methods move in this
direction, each iteration having the form

$$x^{k+1} = x^k - \alpha_k \nabla f(x^k),$$

for some positive step length $\alpha_k$. A suitable value of
$\alpha_k$ can be found by performing (approximately) a
one-dimensional search along the direction $-\nabla f(x^k)$, thus
guaranteeing a decrease in $f$ at every iteration. When
further information about $f$ is available, it may be possible
to choose $\alpha_k$ to guarantee descent in $f$ without
doing a line search. A nonstandard approach, which
first appeared in a 1988 paper by Barzilai and Borwein,
chooses $\alpha_k$ according to a formula that allows $f$ to
increase (sometimes dramatically) on some iterations,
while often achieving better long-term behavior than
standard steepest-descent approaches.
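
The following Python sketch shows steepest descent with one common form of approximate line search, backtracking until a sufficient-decrease (Armijo) condition holds. The backtracking rule, the constant `c1`, and the quadratic test function are assumptions made for illustration; the text does not prescribe a particular search.

```python
def steepest_descent(f, grad, x, iters=200, c1=1e-4):
    """Steepest descent with a backtracking (Armijo) line search."""
    for _ in range(iters):
        g = grad(x)
        g_norm2 = sum(gi * gi for gi in g)
        alpha = 1.0
        # Halve alpha until the sufficient-decrease condition holds.
        for _ in range(60):
            x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
            if f(x_new) <= f(x) - c1 * alpha * g_norm2:
                x = x_new
                break
            alpha *= 0.5
        else:
            return x  # no acceptable step found; stop
    return x

# Assumed test problem: a mildly ill-conditioned convex quadratic
f = lambda x: 0.5 * (x[0] ** 2 + 10.0 * x[1] ** 2)
grad = lambda x: [x[0], 10.0 * x[1]]
x = steepest_descent(f, grad, [1.0, 1.0])
print(x, f(x))
```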

For the case of convex $f$, there has been renewed
focus on accelerated first-order methods that still
require only the calculation of a gradient $\nabla f$ at each
step but that have more attractive convergence rates
than steepest descent, both in theory and in practice.
The common aspect of these methods is a “momentum”
device, in which the step from $x^k$ to $x^{k+1}$ is based
not just on the latest gradient $\nabla f(x^k)$ but also on the
step from the previous iterate $x^{k-1}$ to the current iterate
$x^k$. In heavy-ball and conjugate gradient methods,
the steps have the form

$$x^{k+1} = x^k - \alpha_k \nabla f(x^k) + \beta_k (x^k - x^{k-1})$$

for positive parameters $\alpha_k$ and $\beta_k$, which are chosen in
a variety of ways. Accelerated methods that have been
proposed more recently also make use of momentum
together with the latest gradient $\nabla f(x^k)$, but combine
these factors in different ways. Some methods separate
the steepest-descent steps from the momentum steps,
alternating between these two types of steps to produce
two interleaved sequences of iterates, rather than the
single sequence $\{x^k\}$.
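
A momentum iteration of this form takes only a few lines of Python. In the sketch below the parameters are held constant and set by the classical heavy-ball formulas for a quadratic whose curvature lies between known bounds `m` and `L`; that parameter choice and the test problem are assumptions for illustration, one of the "variety of ways" rather than a recommendation from the text.

```python
import math

def heavy_ball(grad, x, alpha, beta, iters=100):
    """Iterate x_next = x - alpha * grad(x) + beta * (x - x_prev)."""
    x_prev = list(x)   # no momentum on the first step
    for _ in range(iters):
        g = grad(x)
        x_next = [xi - alpha * gi + beta * (xi - xpi)
                  for xi, gi, xpi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x

# Assumed test problem: f(x) = 0.5 * (m * x1^2 + L * x2^2)
m, L = 1.0, 10.0
alpha = 4.0 / (math.sqrt(L) + math.sqrt(m)) ** 2
beta = ((math.sqrt(L) - math.sqrt(m)) / (math.sqrt(L) + math.sqrt(m))) ** 2
x = heavy_ball(lambda x: [m * x[0], L * x[1]], [1.0, 1.0], alpha, beta)
print(x)
```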

For convex $f$, first-order methods are characterized
in some cases by linear convergence (with the error in
$x^k$ decreasing to zero in a geometric sequence) or sublinear
convergence (with the error decreasing to zero
but not geometrically, typically at a rate of $1/k$ or $1/k^2$,
where $k$ is the iteration number).

4.2 Superlinear Methods 

When second derivatives of $f$ are available, we can use
the second-order Taylor approximation (4b) to motivate
Newton’s method, a fundamental algorithm in both
optimization and nonlinear equations. When $\nabla^2 f$ is
positive-definite at the current iterate $x^k$, we can define
the step $p^k$ to be the minimizer of the right-hand side
of (4b) (with the $o(\|p\|^2)$ term omitted), yielding the
formula

$$p^k = -[\nabla^2 f(x^k)]^{-1} \nabla f(x^k). \tag{10}$$

The next iterate is defined by choosing a step length
$\alpha_k > 0$ and setting $x^{k+1} = x^k + \alpha_k p^k$. This method is
characterized by quadratic convergence, in which the
error in $x^{k+1}$ is bounded by a constant multiple of the
square of the error in $x^k$, for all $x^k$ sufficiently close to
the solution $x^*$. (The number of digits of agreement
between $x^k$ and $x^*$ doubles at each of the last few
iterations.)
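
Quadratic convergence is easy to observe numerically. The Python sketch below applies the pure Newton iteration (step length 1) to an assumed scalar function, $f(x) = e^x - 2x$, whose minimizer is $\ln 2$, and prints the error at each iteration. The function and starting point are illustrative choices.

```python
import math

def newton_minimize(df, d2f, x, iters=6):
    """Pure Newton iteration (step length 1) for a scalar function."""
    history = [x]
    for _ in range(iters):
        x = x - df(x) / d2f(x)   # scalar version of the Newton step (10)
        history.append(x)
    return history

# Assumed test function: f(x) = exp(x) - 2x, minimized at x* = ln 2
df = lambda x: math.exp(x) - 2.0
d2f = lambda x: math.exp(x)
xs = newton_minimize(df, d2f, 1.0)
errors = [abs(x - math.log(2.0)) for x in xs]
for k, e in enumerate(errors):
    print(k, f"{e:.3e}")
```

Until rounding error takes over, each printed error is roughly half the square of the previous one.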

Enhancements of the basic approach based on (10)
yield more robust and general implementations. For
example, the Hessian matrix $\nabla^2 f(x^k)$ may be modified
during the computation of $p^k$ to ensure that $p^k$ is
a descent direction for $f$. Another important class of
methods known as quasi-Newton methods avoids the
calculation of second derivatives altogether, instead
replacing $\nabla^2 f(x^k)$ in (10) by an approximation $B_k$ that
is constructed using first-derivative information. The
possibility of such an approximation is a consequence
of another form of Taylor’s theorem, which posits the
following relationship between two successive gradients:

$$\nabla f(x^{k+1}) - \nabla f(x^k) \approx \nabla^2 f(x^k)(x^{k+1} - x^k).$$

In updating the Hessian approximation to $B_{k+1}$ after
the step to $x^{k+1}$ is taken, we ensure that $B_{k+1}$ mimics
this property of the true Hessian; that is, we enforce
the condition

$$\nabla f(x^{k+1}) - \nabla f(x^k) = B_{k+1}(x^{k+1} - x^k).$$

We obtain a variety of quasi-Newton methods by
imposing various other conditions on $B_{k+1}$: closeness
to $B_k$ in some metric, for example, and positive-semidefiniteness.
Limited-memory quasi-Newton methods
store $B_k$ implicitly by means of the difference vectors
between successive iterates and successive gradients
at a limited number of prior iterations (typically
between five and twenty).
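
As one concrete instance, the Python sketch below implements the BFGS formula, a well-known quasi-Newton update that the text does not name explicitly, and checks that the updated matrix reproduces the gradient difference `y` when applied to the step `s`. The vectors used are arbitrary illustrative data satisfying the curvature condition $y^T s > 0$.

```python
def bfgs_update(B, s, y):
    """BFGS update of a symmetric Hessian approximation B (list of lists).

    s is the step between iterates, y the difference of the two gradients.
    """
    n = len(s)
    Bs = [sum(B[i][j] * s[j] for j in range(n)) for i in range(n)]
    sBs = sum(s[i] * Bs[i] for i in range(n))
    ys = sum(y[i] * s[i] for i in range(n))
    if ys <= 0:
        raise ValueError("curvature condition y^T s > 0 violated")
    # B_new = B - (B s)(B s)^T / (s^T B s) + y y^T / (y^T s)
    return [[B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
             for j in range(n)] for i in range(n)]

B0 = [[1.0, 0.0], [0.0, 1.0]]
s = [0.5, -1.0]
y = [1.0, -3.0]
B1 = bfgs_update(B0, s, y)
B1s = [sum(B1[i][j] * s[j] for j in range(2)) for i in range(2)]
print(B1s, y)   # the two vectors should agree up to rounding
```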

4.3 Derivative-Free Methods 

Methods that require the user to supply only function
values $f$ (and not gradients or Hessians) have been
enormously popular for many years. More recently,
they have attracted the attention of optimization researchers,
who have tried to improve their performance,
equip them with a convergence theory, and
customize them to certain specific classes of problems,
such as problems in which $f$ is obtained from
a simulation.

In the absence of gradient or Hessian values, it is 
sometimes feasible to use finite differencing to con- 
struct approximations to these higher-order quantities 
and then apply the methods described above. Another 
possible option is to use algorithmic differentiation to 
obtain derivatives directly from computer code and, 
once again, use them in the algorithms described above. 
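
Finite differencing can be sketched in a few lines of Python. The version below uses central differences with a fixed increment `h`; both the differencing scheme and the value of `h` are assumptions, and in practice the increment must balance truncation error against rounding error.

```python
def fd_gradient(f, x, h=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

# Assumed test function with exact gradient (2*x1 + 3*x2, 3*x1)
f = lambda x: x[0] ** 2 + 3.0 * x[0] * x[1]
g = fd_gradient(f, [1.0, 2.0])
print(g)
```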

Methods that use only function values are usually
best suited to problems of modest dimension $n$. Model-based
methods use interpolation among function values
at recently visited points to construct a model of
the function $f$. This model is used to generate a new
candidate, which is accepted as the next iterate if it
yields a sufficient improvement in the function value
over the best point found so far. The model is updated
by changing the set of points on which the interpolation
is based, replacing older points with higher values
of $f$ by newer points with lower function values.
Pattern-search methods take candidate steps along
a fixed set of directions, shrinking step lengths as
needed to evaluate a new iterate with a lower function
value. After a successful step, the step length may be
increased, to speed future progress. Appropriate maintenance
of the set of search directions is crucial to
efficient implementation and valid convergence theory.
Another derivative-free method is the enormously popular
simplex method of Nelder and Mead from 1965.
This method (unrelated to the method of the same
name for linear programming) maintains a set of $n + 1$
points that form the vertices of a simplex in $\mathbb{R}^n$. At each
iteration, it replaces one of these points with a new one,
by expanding or contracting the simplex along promising
directions or reflecting one of the vertices through
its opposite face. Various attempts have been made
in the years since to improve the performance of this
method and to develop a convergence theory.
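
The simplest pattern-search scheme, often called compass search, can be sketched as follows in Python. It uses the coordinate directions and their negatives as the fixed direction set and halves the step after an unsuccessful sweep. It never increases the step, accepts any improvement rather than requiring a sufficient one, and is tried on an arbitrary quadratic; these are simplifying assumptions for illustration.

```python
def compass_search(f, x, step=1.0, tol=1e-6, max_evals=10000):
    """Pattern search along the fixed directions +e_i and -e_i."""
    n = len(x)
    fx = f(x)
    evals = 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(n):
            for sign in (1.0, -1.0):
                trial = list(x)
                trial[i] += sign * step
                f_trial = f(trial)
                evals += 1
                if f_trial < fx:      # accept the first improving point
                    x, fx, improved = trial, f_trial, True
                    break
            if improved:
                break
        if not improved:
            step *= 0.5               # shrink after an unsuccessful sweep
    return x, fx

x, fx = compass_search(lambda x: (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2, [0.0, 0.0])
print(x, fx)
```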

4.4 Stochastic Gradient Methods 

Important problems have recently been identified for
which evaluation of $\nabla f$ or even $f$ is computationally
expensive but where it is possible to obtain an unbiased
estimate of $\nabla f$ cheaply. Such problems are common in
data analysis, where $f$ typically has the form

$$f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x)$$

for large $N$, where each $f_i$ depends on a single item
in the data set. If $i$ is selected uniformly at random from
$\{1, 2, \dots, N\}$, the vector $g^k = \nabla f_i(x^k)$ is an unbiased
estimate of $\nabla f(x^k)$. For convex $f$, methods that use
this approximate gradient information have been a
focus of work in the optimization and machine learning
communities for some years, and efforts have recently
intensified as their wide applicability has become evident.
The basic iteration has the form $x^{k+1} = x^k - \alpha_k g^k$,
where the choice of $\alpha_k$ may be based on additional
information about $f$, such as lower and upper
bounds on its curvature. (Line searches are not practical
in this setting, as evaluation of $f$ is assumed to be
too expensive.) Additional devices such as averaging of
the iterates $x^k$ or the gradient estimates $g^k$ enhance
the properties of the method in some settings, such
as when $f$ is only weakly convex. Typical convergence
analysis shows that the expected value of the error in
$x^k$, or of the difference between the function value after
$k$ iterations and its optimal value, approaches zero at
a sublinear rate, such as $O(1/k)$ or $O(1/\sqrt{k})$.
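
A minimal Python sketch of the stochastic gradient iteration follows. It assumes a one-dimensional least-squares objective in which each term is $\tfrac{1}{2}(x - a_i)^2$, averaged over the data items $a_i$, so the minimizer is the data mean. The step length $1/(k+1)$ is one simple diminishing schedule; with it, each iterate equals the running average of the sampled items. The data, seed, and schedule are illustrative assumptions.

```python
import random

def sgd(data, x0, iters, seed=0):
    """Stochastic gradient method for the average of 0.5 * (x - a_i)^2."""
    rng = random.Random(seed)
    x = x0
    for k in range(iters):
        a = rng.choice(data)      # sample one data item uniformly
        g = x - a                 # gradient of the sampled term only
        x -= g / (k + 1.0)        # diminishing step length
    return x

data = [1.0, 2.0, 3.0, 4.0, 5.0]
x = sgd(data, x0=0.0, iters=20000)
print(x, sum(data) / len(data))   # iterate versus the exact minimizer
```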

5 Conic Optimization 

Conic optimization problems have the form 

$$\min_x \; c^T x \quad \text{subject to } Ax = b, \; x \in \Omega,$$

where $\Omega$ is a closed, convex, pointed cone. They include
linear programming (7) and semidefinite programming
(3) as special cases. It is possible to design generic algorithms
with good complexity properties for this problem
class provided that we can identify a certain type
of barrier function for $\Omega$. A barrier function $\varphi$ is convex
with domain the interior of $\Omega$, with $\varphi(x) \to \infty$ as $x$
approaches the boundary of $\Omega$. The additional property
required for an efficient algorithm is self-concordancy,
which is the property that for any $x$ in the domain of
$\varphi$, and any $v \in \mathbb{R}^n$, we have that

$$|t'''(0)| \le 2 |t''(0)|^{3/2},$$

where $t(\alpha) := \varphi(x + \alpha v)$. Because the third derivatives
are bounded in terms of the second derivatives, the
function $\varphi$ is well approximated (locally at least) by a
quadratic, so we can derive complexity bounds on Newton’s
method applied to $\varphi$, with a suitable step-length
scheme. We can use this barrier function to define
an interior-point method in which each iterate $x^k$ is
obtained by finding an approximate minimizer of the
following equality-constrained optimization problem:

$$\min_x \; c^T x + \mu_k \varphi(x) \quad \text{subject to } Ax = b,$$

where the positive parameter $\mu_k$ can be decreased gradually
to zero as $k$ increases, as in interior-point methods
for linear programming. One or more steps of Newton’s
method can be used to find the approximate solution
to this subproblem, starting from the previous
iterate.

For linear programming, the cone $\Omega = \{x \mid x \ge 0\}$
admits a self-concordant barrier function
$\varphi(x) = -\sum_{i=1}^n \log x_i$. In semidefinite programming, where $\Omega$
is the cone of positive-semidefinite matrices, we have
$\varphi(X) = -\log\det X$.
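
As a check of the self-concordance inequality $|t'''(0)| \le 2|t''(0)|^{3/2}$ in the simplest case, take the one-dimensional logarithmic barrier $-\log x$ on $x > 0$ (the $n = 1$ instance of the linear programming barrier). For any scalar $v$, differentiating the restriction $t$ three times gives:

```latex
\begin{align*}
t(\alpha) &= -\log(x + \alpha v), \\
t'(\alpha) &= -\frac{v}{x + \alpha v}, \qquad
t''(\alpha) = \frac{v^2}{(x + \alpha v)^2}, \qquad
t'''(\alpha) = -\frac{2 v^3}{(x + \alpha v)^3}, \\
|t'''(0)| &= \frac{2 |v|^3}{x^3}
           = 2 \left( \frac{v^2}{x^2} \right)^{3/2}
           = 2\, |t''(0)|^{3/2}.
\end{align*}
```

So the inequality holds with equality in this case, which suggests that the constant 2 in the definition is the natural normalization for logarithmic barriers.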

In practice, the most successful interior-point meth- 
ods for semidefinite programming are primal-dual 
methods rather than primal methods. These methods 
are (nontrivial) extensions of the linear programming 
approaches of section 3.2. 

6 Nonlinear Programming

Next we turn to methods for nonlinear programming, 
in which the functions $f$ and $c_i$ in (1) are smooth nonlinear
functions. A basic principle used in constructing
algorithms for this problem is successive approximation
of the nonlinear program by simpler problems,
such as quadratic programming or unconstrained opti- 
mization, to which methods from the previous sections 
can be applied. Taylor’s theorem is instrumental in con- 
structing these approximations, using first- or second- 
order expansions of functions around the current iter- 
ate x k and possibly also the current estimates of the 
Lagrange multipliers for the constraints (1 b) and (1 c). 
The optimality conditions described in section 2 also 
play a central role in algorithm design.

6.1 Gradient Projection

Gradient projection is an extension of the steepest- 
descent approach for unconstrained optimization in 
which steps are taken along the negative gradient direc- 
tion but projected onto the feasible set. Considering the 
formulation 

$$\min f(x) \quad \text{subject to } x \in \Omega,$$

the basic gradient projection step is

$$x^{k+1} = P_\Omega\bigl(x^k - \alpha_k \nabla f(x^k)\bigr),$$

where $P_\Omega(\cdot)$ denotes projection onto the closed convex
constraint set $\Omega$. This approach may be practical
if the projection can be computed inexpensively, as is
the case when $\Omega$ is a “box” defined by bounds on the
variables. It is possible to enhance the gradient method 
by using second-order information in a selective way 
(simple projection of the Newton step does not work). 
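
The gradient projection step can be sketched in a few lines for the box-constrained case, where the projection is a componentwise clipping. This is a minimal illustration rather than a practical solver: the fixed step length `alpha`, the iteration count, and the quadratic test objective are assumptions made for the example.

```python
def project_box(x, lower, upper):
    """Project x onto the box {lower <= x <= upper}, componentwise."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def gradient_projection(grad, x0, lower, upper, alpha=0.1, iters=200):
    """Iterate x <- P(x - alpha * grad(x)) with a fixed step length alpha."""
    x = project_box(x0, lower, upper)
    for _ in range(iters):
        g = grad(x)
        trial = [xi - alpha * gi for xi, gi in zip(x, g)]
        x = project_box(trial, lower, upper)
    return x


# Hypothetical test problem: minimize (x0 - 2)^2 + (x1 + 1)^2 over the box [0, 1]^2.
grad = lambda x: [2.0 * (x[0] - 2.0), 2.0 * (x[1] + 1.0)]
x_star = gradient_projection(grad, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
print(x_star)  # the constrained minimizer is the box corner (1, 0)
```

The unconstrained minimizer (2, -1) lies outside the box, so the iterates are pushed against two bounds and stay there.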

6.2 Sequential Quadratic Programming 

In sequential quadratic programming we use Taylor’s 
theorem to form the following approximation of (1) 
around the current point $x^k$:

$$\min_{d \in \mathbb{R}^n} \; \nabla f(x^k)^T d + \tfrac{1}{2} d^T H_k d \qquad (11a)$$

$$\text{subject to } c_i(x^k) + \nabla c_i(x^k)^T d = 0, \quad i \in \mathcal{E}, \qquad (11b)$$

$$c_i(x^k) + \nabla c_i(x^k)^T d \ge 0, \quad i \in \mathcal{I}, \qquad (11c)$$

where $H_k$ is a symmetric matrix. Denoting the solution
of (11) by $d^k$, the next iterate is obtained by setting

$$x^{k+1} = x^k + \alpha_k d^k$$

for some step length $\alpha_k > 0$. The subproblem (11) is
a quadratic program; it can be solved with methods
of active-set or interior-point type. The matrix $H_k$ may
contain second-order information from both objective
and constraint functions; an “ideal” value is the Hessian
of the Lagrangian function defined in (5), that is,
$H_k = \nabla^2_{xx} \mathcal{L}(x^k, \lambda^k)$, where $\lambda^k$ are estimates of the
Lagrange multipliers, obtained for example from the
solution of the subproblem (11) at the previous iteration.
When second derivatives are not readily available,
$H_k$ could be a quasi-Newton approximation to
the Lagrangian Hessian, updated by formulas similar
to those used in unconstrained optimization.

A line search can be performed to find a suitable
value of $\alpha_k$ in the update step. An alternative approach
to stabilizing sequential quadratic programming is to
add a “trust region” to the subproblem (11), in the form
of a constraint $\|d\|_\infty \le \Delta_k$, for some $\Delta_k > 0$.
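
For a problem with equality constraints only, the quadratic subproblem reduces to a linear system in the step and the multiplier, which makes the iteration easy to sketch. The example below is an illustration under stated assumptions: the test problem (minimize $x_0 + x_1$ on the circle $x_0^2 + x_1^2 = 2$), the starting point, the full step $\alpha_k = 1$, and the helper names `solve_linear` and `sqp_equality` are all choices made here. The Lagrangian is taken as $f - \lambda c$, matching the sign convention of (13a), and $H_k$ is its exact Hessian.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def sqp_equality(x, lam, iters=20):
    """SQP for: min x0 + x1  subject to  x0^2 + x1^2 - 2 = 0."""
    for _ in range(iters):
        g = [1.0, 1.0]                        # gradient of f
        c = x[0] ** 2 + x[1] ** 2 - 2.0       # constraint value
        a = [2.0 * x[0], 2.0 * x[1]]          # gradient of c
        h = -2.0 * lam                        # Hessian of f - lam * c is h * I
        # Optimality conditions of the QP: H d - a * lam_new = -g, a^T d = -c.
        K = [[h, 0.0, -a[0]],
             [0.0, h, -a[1]],
             [a[0], a[1], 0.0]]
        d0, d1, lam = solve_linear(K, [-g[0], -g[1], -c])
        x = [x[0] + d0, x[1] + d1]            # full step, alpha_k = 1
    return x, lam


x_sol, lam_sol = sqp_equality([-1.5, -0.5], -1.0)
print(x_sol, lam_sol)  # approaches x = (-1, -1) with multiplier -0.5
```

With inequality constraints present, the linear solve would be replaced by a quadratic programming solver, and a line search or trust region would normally guard the step.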


6.3 Interior-Point Methods 

The interior-point methods for linear programming 
described in section 3.2 can be extended to nonlinear 
programming, and software based on such extensions 
has been highly successful. To avoid notational clutter, 
we consider a formulation of nonlinear programming 
containing nonnegativity constraints on x along with 
equality constraints: 

$$\min f(x) \quad \text{subject to } c_j(x) = 0, \; j \in \mathcal{E}; \quad x \ge 0. \qquad (12)$$

(This problem is no less general than (1); simple trans- 
formations can be used to express (1) in the form (12).) 
Following (6), and introducing an additional vector s in 
the style of (9), we can write the optimality conditions 
for this problem as 


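$$\nabla f(x) - \sum_{j \in \mathcal{E}} \lambda_j \nabla c_j(x) - s = 0, \qquad (13a)$$

$$c_j(x) = 0, \quad j \in \mathcal{E}, \qquad (13b)$$

$$x \ge 0, \quad s \ge 0, \qquad (13c)$$

$$x_i s_i = 0, \quad i = 1, 2, \dots, n. \qquad (13d)$$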

As in linear programming, interior-point methods generate
a sequence of iterates $(x^k, \lambda^k, s^k)$ in which all
components of $x^k$ and $s^k$ are strictly positive. The
basic primal-dual step is obtained by applying Newton’s
method at $(x^k, \lambda^k, s^k)$ to the nonlinear equations
defined by (13a), (13b), and (13d), with the right-hand
side in (13d) replaced by a positive parameter
$\mu_k$, which is reduced to zero gradually as the iterations
progress. The basic approach can be enhanced in vari- 
ous ways: quasi-Newton approximations, line searches 
or trust regions, second-order corrections to the search 
direction, and so on. 

6.4 Augmented Lagrangian Methods 

An approach for solving (12) that was first proposed in
the late 1960s is enjoying renewed popularity because
of its successful use in new application areas. Originally
known as the “method of multipliers,” it is founded on 
the augmented Lagrangian function 

$$\mathcal{L}_A(x, \lambda; \mu) := f(x) + \sum_{j \in \mathcal{E}} \lambda_j c_j(x) + \frac{1}{2\mu} \sum_{j \in \mathcal{E}} c_j^2(x)$$

for some positive parameter $\mu$. The method defines a
sequence of primal-dual iterates $(x^k, \lambda^k)$ for a given
sequence of parameters $\{\mu_k\}$, where each iteration is
defined as follows:

• obtain $x^{k+1}$ by solving (approximately) the problem

$$\min_x \; \mathcal{L}_A(x, \lambda^k; \mu_k) \quad \text{subject to } x \ge 0; \qquad (14)$$

• update Lagrange multipliers

$$\lambda_j^{k+1} = \lambda_j^k + c_j(x^k)/\mu_k, \quad j \in \mathcal{E};$$

• choose $\mu_{k+1} \in (0, \mu_k]$ by some heuristic.

The original problem with nonlinear constraints is 
replaced by a sequence of bound-constrained problems 
(14). Unlike in interior-point methods, it is not necessary
to drive the parameters $\mu_k$ to zero to obtain satisfactory
convergence. Although the motivation for this
approach is perhaps not as clear as for other algorithms,
it can be seen that, if the Lagrange multipliers
$\lambda^k$ happen to be optimal in (14), then the solution of the
original nonlinear program (12) would also be optimal
for this subproblem. Under favorable assumptions, and
provided that the sequence $\{\mu_k\}$ is chosen judiciously,
we find that the sequence $(x^k, \lambda^k)$ converges to a point
satisfying the optimality conditions (13).
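
The iteration can be illustrated on a tiny instance of (12). The sketch below is only an illustration: the test problem (minimize $x_0^2 + x_1^2$ subject to $x_0 + x_1 = 1$, $x \ge 0$), the constant parameter `mu`, and the use of projected gradient steps for the bound-constrained subproblem are assumptions made here. The multiplier update is evaluated at the newly computed iterate, which is the usual choice in implementations.

```python
def augmented_lagrangian(mu=0.1, outer=12, inner=500):
    """Method of multipliers for: min x0^2 + x1^2  s.t.  x0 + x1 - 1 = 0, x >= 0."""
    x = [0.0, 0.0]
    lam = 0.0
    # Step length 1 / (largest Hessian eigenvalue of L_A), here 2 + 2 / mu.
    step = 1.0 / (2.0 + 2.0 / mu)
    for _ in range(outer):
        # Approximately solve the bound-constrained subproblem (14)
        # by projected gradient steps on L_A(x, lam; mu).
        for _ in range(inner):
            c = x[0] + x[1] - 1.0
            grad = [2.0 * xi + lam + c / mu for xi in x]
            x = [max(xi - step * gi, 0.0) for xi, gi in zip(x, grad)]
        # Multiplier update with the constraint value at the new iterate.
        lam += (x[0] + x[1] - 1.0) / mu
    return x, lam


x_al, lam_al = augmented_lagrangian()
print(x_al, lam_al)  # approaches x = (0.5, 0.5) and multiplier -1
```

Note that `mu` is held fixed at 0.1 throughout; the constraint violation still shrinks by a constant factor per outer iteration because the multiplier estimate improves.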

Augmented Lagrangian methods were first proposed 
by Hestenes (in 1969) and Powell (in 1969). A 1982 book 
by Bertsekas was influential in later developments. The 
approach has proved particularly useful in “splitting” 
schemes, where the objective $f$ is decomposed naturally
into a sum of functions, each of which is assigned
its own copy of the variable vector x. Equality of the 
different copies is enforced via equality constraints, 
and the augmented Lagrangian method is applied to 
the resulting equality-constrained problem. The appeal 
of this approach is that minimization with respect to 
each copy of x can be performed independently and 
these individual minimizations may be simpler to per- 
form than minimization of the original function f. 
Moreover, the possibility arises of performing these 
minimizations simultaneously on a parallel computer. 

6.5 Penalty Functions and Filters 

Penalty functions combine the objective and constraint 
functions for a nonlinear program (1) into a single func- 
tion, yielding an alternative problem whose solution 
is an approximate solution to the original constrained 
problem. The augmented Lagrangian function of sec- 
tion 6.4 can be viewed as a penalty function. Another 
important case is the $\ell_1$ penalty function, which is
defined as follows:

$$f(x) + \nu \sum_{i \in \mathcal{E}} |c_i(x)| + \nu \sum_{i \in \mathcal{I}} \max(-c_i(x), 0), \qquad (15)$$

where $\nu > 0$ is a chosen penalty parameter. Note that
each term in the summations is positive if and only if
the corresponding constraint in (1) is violated. Under
certain conditions, we have for $\nu$ sufficiently large that
a local solution of (1) is an exact minimizer of (15).
In other words, we can replace the constrained problem
(1) by the unconstrained, but nonsmooth, problem
of minimizing (15). One possible way to make use
of this observation is to choose $\nu$ and minimize (15)
directly, increasing $\nu$ as needed to ensure that the solutions
of (15) and (1) coincide. More commonly, (15)
is used as a merit function to evaluate the quality 
of proposed steps d k generated by some other algo- 
rithm, such as sequential quadratic programming or 
an interior-point method. Such steps are accepted only 
if they produce a sufficient reduction in (15) or some 
other merit function. 

An alternative device to decide whether proposed
steps are acceptable is a filter, which dispenses with
the penalty parameter $\nu$ in (15) and considers the objective
function $f$ and the constraint violations separately.
Defining the violation measure by

$$h(x) = \sum_{i \in \mathcal{E}} |c_i(x)| + \sum_{i \in \mathcal{I}} \max(-c_i(x), 0),$$

the filter consists of a set of pairs $\{(f_\ell, h_\ell) : \ell \in F\}$ of
objective and constraint values such that no pair dominates
another: that is, we do not have $f_\ell \le f_j$ and
$h_\ell \le h_j$ for any distinct $\ell \in F$ and $j \in F$. An iterate $x^k$ is
acceptable provided that $(f(x^k), h(x^k))$ is not dominated by
any point in the filter $F$. When accepted, the new pair
is added to the filter, and any pairs that are dominated 
by it are removed. (This basic strategy is amended in 
several ways to improve its practical performance and 
to facilitate convergence analysis.) 
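
The basic filter test is simple enough to state as code. The sketch below covers only the plain dominance rule described above, without the practical amendments; the function names `violation`, `dominates`, and `try_accept` and the sample pairs are hypothetical.

```python
def violation(c_eq, c_ineq):
    """Violation measure h: sum |c_i| over equalities plus sum max(-c_i, 0) over inequalities."""
    return sum(abs(c) for c in c_eq) + sum(max(-c, 0.0) for c in c_ineq)


def dominates(a, b):
    """True if pair a = (f, h) is no worse than b in both objective and violation."""
    return a[0] <= b[0] and a[1] <= b[1]


def try_accept(filter_pairs, candidate):
    """Return (accepted, new_filter) for a candidate pair (f, h)."""
    if any(dominates(p, candidate) for p in filter_pairs):
        return False, filter_pairs
    # Accepted: drop every stored pair that the candidate dominates, then add it.
    kept = [p for p in filter_pairs if not dominates(candidate, p)]
    return True, kept + [candidate]


filt = []
ok1, filt = try_accept(filt, (5.0, 2.0))   # empty filter: accepted
ok2, filt = try_accept(filt, (6.0, 3.0))   # worse in both measures: rejected
ok3, filt = try_accept(filt, (4.0, 3.0))   # better objective, worse violation: accepted
ok4, filt = try_accept(filt, (3.0, 1.0))   # dominates both stored pairs
print(ok1, ok2, ok3, ok4, filt)
```

A step is thus accepted if it improves either the objective or the feasibility relative to every stored pair, with no penalty parameter to tune.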

7 Final Remarks 

Our brief description of major problem classes in con- 
tinuous optimization, and algorithms for solving them, 
has necessarily omitted several important topics. We 
mention several of these before closing. 

Stochastic and robust optimization deal with prob- 
lems in which there is uncertainty in the objective func- 
tions or constraints but where the uncertainty can be 
quantified and modeled. In these problems we may 
seek solutions that minimize the expected value of the 
uncertain objective or solutions that are guaranteed to 
satisfy the constraints with a certain specified proba- 
bility. The stochastic gradient method of section 4.4 
provides one tool for solving these problems, but there
are many other relevant techniques from optimization
and statistics that can be brought to bear.

Equilibrium problems are not optimization problems, 
in that there is no objective to be minimized, but they 
use a range of algorithmic techniques that are closely 
related to optimization techniques. The basic formulation
is as follows: given a function $F : \mathbb{R}^n \to \mathbb{R}^n$, find a
vector $x \in \mathbb{R}^n$ such that

$$x \ge 0, \quad F(x) \ge 0, \quad x_i F_i(x) = 0, \quad i = 1, 2, \dots, n.$$

(Note that the KKT conditions in (6) have a similar 
form.) Equilibrium problems arise in economic appli- 
cations and game theory. More recently, applications 
have been identified in contact problems in mechanical 
simulations. 

Nonlinear equations, in which we seek a vector $x \in \mathbb{R}^n$
such that $F(x) = 0$ for some smooth function
$F : \mathbb{R}^n \to \mathbb{R}^n$, arise throughout scientific computing.
Newton’s method, so fundamental in continuous optimization,
is also key here. The Newton step is obtained
by solving

$$\nabla F(x^k) d^k = -F(x^k)$$

(compare this with (10)), where $\nabla F(x) = [\partial F_i / \partial x_j]_{i,j=1}^n$
is the $n \times n$ Jacobian matrix.
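
As a small illustration of the Newton step for nonlinear equations, the sketch below treats a system of two equations in two unknowns and solves the linear system by Cramer's rule. The test system, the starting point, and the name `newton_system` are assumptions made for the example; a general implementation would use a matrix factorization and some safeguard on the step.

```python
def newton_system(F, J, x, tol=1e-12, max_iter=50):
    """Newton's method for F(x) = 0 with x in R^2; J returns the 2x2 Jacobian."""
    for _ in range(max_iter):
        f0, f1 = F(x)
        if max(abs(f0), abs(f1)) < tol:
            break
        (a, b), (c, d) = J(x)
        det = a * d - b * c
        # Solve J(x) step = -F(x) by Cramer's rule (2x2 only).
        s0 = (-f0 * d + b * f1) / det
        s1 = (-a * f1 + c * f0) / det
        x = (x[0] + s0, x[1] + s1)
    return x


# Hypothetical test system: x0^2 + x1^2 = 2 and x0 = x1, with root (1, 1).
F = lambda x: (x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1])
J = lambda x: ((2.0 * x[0], 2.0 * x[1]), (1.0, -1.0))
root = newton_system(F, J, (2.0, 0.5))
print(root)  # approaches (1, 1)
```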

IV.12 Numerical Solution of Ordinary
Differential Equations

Ernst Hairer and Christian Lubich 

1 Introduction: Euler Methods 

Ordinary differential equations are ubiquitous in sci- 
ence and engineering: in geometry and mechanics 
from the first examples onward (Newton, Leibniz, 
Euler, Lagrange), in chemical reaction kinetics, molecu- 
lar dynamics, electronic circuits, population dynamics, 


and in many more application areas. They also arise, 
after semidiscretization in space, in the numerical 
treatment of time-dependent partial differential equa- 
tions, which are even more impressively omnipresent 
in our technologically developed and financially con- 
trolled world. 

The standard initial-value problem is to determine
a vector-valued function $y : [t_0, T] \to \mathbb{R}^d$ with a given
initial value $y(t_0) = y_0 \in \mathbb{R}^d$ such that the derivative
$y'(t)$ depends on the current solution value $y(t)$ at
every $t \in [t_0, T]$ in a prescribed way:

$$y'(t) = f(t, y(t)) \quad \text{for } t_0 \le t \le T, \qquad y(t_0) = y_0.$$

Here, the given function $f$ is defined on an open subset
of $\mathbb{R} \times \mathbb{R}^d$ containing $(t_0, y_0)$ and takes values in $\mathbb{R}^d$.
If $f$ is continuously differentiable, then there exists a
unique solution at least locally on some open interval
containing $t_0$. In many applications, $t$ represents time,
and it will be convenient to refer to t as time in what 
follows. 

In spite of the ingenious efforts of mathematicians 
throughout the eighteenth and nineteenth centuries, in 
most cases the solution of a differential equation can- 
not be given in closed form by functions that can be 
evaluated directly on a computer. This even applies to
linear differential equations $y' = Ay$ with a square
matrix $A$, for which $y(t) = e^{(t - t_0)A} y_0$, as computing
the matrix exponential [II.14] is a notoriously
tricky problem. One must therefore rely on numerical 
methods that are able to approximate the solution of a 
differential equation to any desired accuracy. 

1.1 The Explicit Euler Method 

The ancestor of all the advanced numerical methods 
in use today was proposed by Leonhard Euler in 1768. 
On writing down the first terms in the Taylor expansion
of the solution at $t_0$ and using the prescribed initial
value and the differential equation at $t = t_0$, it
is noted that
$y(t_0 + h) = y(t_0) + h y'(t_0) + \cdots = y_0 + h f(t_0, y_0) + \cdots$.
Choosing a small step size $h > 0$
and neglecting the higher-order terms represented by
the dots, an approximation $y_1$ to $y(t_1)$ at the later time
$t_1 = t_0 + h$ is obtained by setting

$$y_1 = y_0 + h f(t_0, y_0).$$

The next idea is to take $y_1$ as the starting value for a
further step, which then yields an approximation to the
solution at $t_2 = t_1 + h$ as $y_2 = y_1 + h f(t_1, y_1)$. Continuing
in this way, at the $(n+1)$st step we take $y_n \approx y(t_n)$
as the starting value for computing an approximation
at $t_{n+1} = t_n + h$ as

$$y_{n+1} = y_n + h f(t_n, y_n),$$

and after a sufficient number of steps, we reach the
final time $T$. The computational cost of the method lies
in the evaluations of the function $f$. The step size need
not be the same in each step and could be replaced by
$h_n$ in the formula, so that $t_{n+1} = t_n + h_n$.

It is immediate that the quality of the approximation
$y_n$ depends on two aspects: the error made by truncating
the Taylor expansion and the error introduced
by continuing from approximate solution values. These 
two aspects are captured in the notions of consistency 
and stability, respectively, and are fundamental to all 
numerical methods for ordinary differential equations. 
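
The explicit Euler recursion translates directly into a short loop. The sketch below assumes a scalar problem and a constant step size; the function name `explicit_euler` and the test equation $y' = -y$, $y(0) = 1$ are choices made for illustration.

```python
import math


def explicit_euler(f, t0, y0, T, n):
    """Take n explicit Euler steps of equal size from t0 to T and return y_n."""
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = y + h * f(t, y)   # y_{n+1} = y_n + h f(t_n, y_n)
        t = t + h
    return y


approx = explicit_euler(lambda t, y: -y, 0.0, 1.0, 1.0, 1000)
print(approx, math.exp(-1.0))  # the two values agree to about three digits
```

Each step costs one evaluation of `f`, in line with the remark on computational cost above.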



Figure 1 The exact solution (solid line), the implicit Euler
solution ($h = 0.5$, dashed line), and two explicit Euler solutions
($h = 0.038$, dotted line; $h = 0.041$, gray line) for the
problem $y' = -50(y - \cos t)$, $y(0) = 0$.


1.2 The Implicit Euler Method and Stiff Differential 
Equations 

A minor-looking change in the method, already consid- 
ered by Euler in 1768, makes a big difference; taking as 
the argument of / the new value instead of the previous 
one yields 

$$y_{n+1} = y_n + h f(t_{n+1}, y_{n+1}),$$

from which $y_{n+1}$ is now determined implicitly. In
general, the new solution approximation needs to be
computed iteratively, typically by a modified Newton
method such as $y_{n+1}^{(k+1)} = y_{n+1}^{(k)} + \Delta y_{n+1}^{(k)}$, where the
increment is computed by solving a linear system of
equations

$$(I - h J_n) \Delta y_{n+1}^{(k)} = -r_{n+1}^{(k)}$$

with an approximation $J_n$ to the Jacobian matrix
$\partial_y f(t_n, y_n)$ and the residual

$$r_{n+1}^{(k)} = y_{n+1}^{(k)} - y_n - h f(t_{n+1}, y_{n+1}^{(k)}).$$

The computational cost per step has increased dramat- 
ically; whereas the explicit Euler method requires a sin- 
gle function evaluation, we now need to compute the 
Jacobian and then solve a linear system and evaluate f 
on each Newton iteration. 

Why it may nevertheless be preferable to perform the 
computation using the implicit rather than the explicit 
Euler method is evident for the scalar linear example, 
made famous by Germund Dahlquist in 1963, 

$$y' = \lambda y,$$

where the coefficient $\lambda$ is large and negative (or complex
with large negative real part). Here the exact solution
$y(t) = e^{(t - t_0)\lambda} y_0$ decays to zero as time increases, and
so does the numerical solution given by the implicit
Euler method for every step size $h > 0$:

$$y_n^{\mathrm{impl}} = (1 - h\lambda)^{-n} y_0.$$

In contrast, the explicit Euler method yields

$$y_n^{\mathrm{expl}} = (1 + h\lambda)^n y_0,$$

which decays to zero for growing $n$ only when $h$ is so
small that $|1 + h\lambda| < 1$. This imposes a severe step-size
restriction when $\lambda$ is a negative number of large absolute
value (just think of $\lambda = -10^{10}$). For larger step
sizes the numerical solution suffers an instability that 
is manifested in wild oscillations of increasing ampli- 
tude. The problem is that the explicit Euler method and 
the differential equation have completely different sta- 
bility behaviors unless the step size is chosen extremely 
small (see figure 1). 
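
The step-size restriction can be observed on the problem of figure 1, $y' = -50(y - \cos t)$, $y(0) = 0$, for which explicit Euler is stable only when $|1 - 50h| < 1$, that is, $h < 0.04$. The sketch below is an illustration with assumed details: the final time 3, the function names, and the closed-form solve of the implicit equation (possible here because $f$ is linear in $y$, so no Newton iteration is needed).

```python
import math

LAM = -50.0


def exact(t):
    """Exact solution of y' = -50 (y - cos t), y(0) = 0."""
    a, b = 2500.0 / 2501.0, 50.0 / 2501.0
    return a * math.cos(t) + b * math.sin(t) - a * math.exp(LAM * t)


def explicit_euler_error(h, t_end=3.0):
    n = int(t_end / h)
    y = 0.0
    for k in range(n):
        y = y + h * LAM * (y - math.cos(k * h))
    return abs(y - exact(n * h))


def implicit_euler_error(h, t_end=3.0):
    n = int(t_end / h)
    y = 0.0
    for k in range(n):
        t_new = (k + 1) * h
        # Solve y_new = y + h * LAM * (y_new - cos(t_new)) for y_new.
        y = (y - h * LAM * math.cos(t_new)) / (1.0 - h * LAM)
    return abs(y - exact(n * h))


err_expl_small = explicit_euler_error(0.038)   # just inside the stability limit
err_expl_large = explicit_euler_error(0.041)   # just outside: oscillation grows
err_impl = implicit_euler_error(0.5)           # large step, still well behaved
print(err_expl_small, err_expl_large, err_impl)
```

The implicit method takes only six steps here, against more than seventy for the explicit method, and still gives a small error.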

Such behavior is not restricted to the simple scalar 
example considered above but extends to linear sys- 
tems of differential equations in which the matrix has 
some eigenvalues with large negative real part and to 
classes of nonlinear differential equations with a Jacobian
matrix $\partial_y f$ having this property. The explicit and
implicit Euler methods also give rise to very different
behaviors for nonlinear differential equations in which
the function $f(t, y)$ has a large (local) Lipschitz constant
$L$ with respect to $y$, while for some inner product
the inequality

$$\langle f(t, y) - f(t, z), y - z \rangle \le \ell \|y - z\|^2$$

holds for all $t$ and $y$, $z$ with a moderate constant $\ell \ll L$
(called a one-sided Lipschitz constant).

Differential equations for which the numerical solution
using the implicit Euler method is more efficient
than that using the explicit Euler method are called stiff
differential equations. They include important appli- 
cations in the description of processes with multiple 
timescales (e.g., fast and slow chemical reactions) and 
in spatial semidiscretizations of time-dependent par- 
tial differential equations. For example, for the heat 
equation, stable numerical solutions are obtained with 
the explicit Euler method only when temporal step 
sizes are bounded by the square of the spatial grid size, 
whereas the implicit Euler method is unconditionally 
stable. 

1.3 The Symplectic Euler Method and Hamiltonian 
Systems 

An important class of differential equations for which 
neither the explicit nor the implicit Euler method is 
appropriate is Hamiltonian differential equations,

$$p' = -\nabla_q H(p, q), \qquad q' = +\nabla_p H(p, q),$$

which are fundamental to many branches of physics.
Here, the real-valued Hamilton function $H$, defined on
a domain of $\mathbb{R}^{d+d}$, represents the total energy, and
$q(t) \in \mathbb{R}^d$ and $p(t) \in \mathbb{R}^d$ represent the positions and
momenta, respectively, of a conservative system at time
$t$. The total energy is conserved:

$$H(p(t), q(t)) = \text{const.}$$

along any solution $(p(t), q(t))$ of the Hamiltonian system.
It turns out that a partitioned method obtained
by applying the explicit Euler method to the position 
variables and the implicit Euler method to the momen- 
tum variables (or vice versa) behaves much better than 
either Euler method applied to the system as a whole. 
The symplectic Euler method reads 

$$p_{n+1} = p_n - h \nabla_q H(p_{n+1}, q_n),$$
$$q_{n+1} = q_n + h \nabla_p H(p_{n+1}, q_n).$$

For a separable Hamiltonian $H(p, q) = T(p) + V(q)$ the
method is explicit.

Figure 2 illustrates the qualitative behavior of the 
three Euler methods applied to the differential equa- 
tions of the mathematical pendulum, 

$$p' = -\sin q, \qquad q' = p,$$

which are Hamiltonian with $H(p, q) = \tfrac{1}{2} p^2 - \cos q$. The
energy of the implicit Euler solution decreases, while 
that of the explicit Euler solution increases. The sym- 
plectic Euler method nearly conserves the energy over 
extremely long times. 
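
The different energy behavior is easy to reproduce for the pendulum, where the Hamiltonian is separable and the symplectic Euler method is therefore explicit. The sketch below compares the largest energy deviation over 1000 steps for the explicit and symplectic Euler methods; the step size $h = 0.3$ follows figure 2, while the common initial value, the number of steps, and the function names are assumptions made for the example.

```python
import math


def energy(p, q):
    """Pendulum energy H(p, q) = p^2 / 2 - cos q."""
    return 0.5 * p * p - math.cos(q)


def explicit_euler_step(p, q, h):
    return p - h * math.sin(q), q + h * p


def symplectic_euler_step(p, q, h):
    # Implicit in p in general; explicit here because H is separable.
    p_new = p - h * math.sin(q)
    return p_new, q + h * p_new


def max_energy_drift(step, p, q, h, n):
    """Largest |H(p_k, q_k) - H(p_0, q_0)| over n steps."""
    e0 = energy(p, q)
    drift = 0.0
    for _ in range(n):
        p, q = step(p, q, h)
        drift = max(drift, abs(energy(p, q) - e0))
    return drift


drift_explicit = max_energy_drift(explicit_euler_step, 0.0, 1.5, 0.3, 1000)
drift_symplectic = max_energy_drift(symplectic_euler_step, 0.0, 1.5, 0.3, 1000)
print(drift_explicit, drift_symplectic)
```

The only difference between the two step functions is which value of `p` enters the update of `q`, yet the symplectic variant keeps the energy error bounded while the explicit one lets it grow.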


[Figure 2 panels: Implicit, Symplectic, Explicit; axis label p.]

Figure 2 The pendulum equation: Euler polygons with step
size $h = 0.3$; initial value $p(0) = 0$ and $q(0) = 1.7$ for the
explicit Euler method, $q(0) = 1.5$ for the symplectic Euler
method, and $q(0) = 1.3$ for the implicit Euler method. The
solid lines are solution curves for the differential equations.

2 Basic Notions 

In this section we describe some of the mechanisms 
that lead to the different behaviors of the various 
methods. 

2.1 Local Error 

For the explicit Euler method, the error after one step 
of the method starting from the exact solution, called 
the local error, is given as 

$$d_{n+1} = \bigl(y(t_n) + h f(t_n, y(t_n))\bigr) - y(t_n + h).$$

By estimating the remainder term in the Taylor expansion
of $y(t_n + h)$ at $t_n$, we can bound $d_{n+1}$ by

$$\|d_{n+1}\| \le C h^2 \quad \text{with } C = \tfrac{1}{2} \max_{t_0 \le t \le T} \|y''(t)\|,$$

provided that the solution is twice continuously differentiable,
which is the case if $f$ is continuously
differentiable.

2.2 Error Propagation 

Since the method advances in each step with the computed
values $y_n$ instead of the exact solution values
$y(t_n)$, it is important to know how errors, once introduced,
are propagated by the method. Consider explicit
Euler steps starting from different starting values:

$$u_{n+1} = u_n + h f(t_n, u_n),$$
$$v_{n+1} = v_n + h f(t_n, v_n).$$

When $f$ is (locally) Lipschitz continuous with Lipschitz
constant $L$, the difference is controlled by the stability
estimate

$$\|u_{n+1} - v_{n+1}\| \le (1 + hL) \|u_n - v_n\|.$$

2.3 Lady Windermere’s Fan 

The above estimates can be combined to study the 
error accumulation, as illustrated by the fan of figure 3 
(named by Gerhard Wanner in the 1980s after a play
by Oscar Wilde). Each arrow from left to right repre- 
sents a step of the numerical method, with different 
starting values. The fat vertical bars represent the local 
errors, whose propagation by the numerical method is 
controlled using the stability estimate repeatedly from 
step to step; the global error 

$$e_n = y_n - y(t_n)$$

is the sum of the propagated local errors (represented
as the distances between two adjacent arrowheads ending
at $t_n$ in figure 3). The contribution of the first
local error $d_1$ to the global error $e_n$ is bounded by
$(1 + hL)^{n-1} \|d_1\|$, as is seen by applying the stability
estimate $n - 1$ times following the numerical solutions
starting from $y_1$ and $y(t_1)$. The contribution of the second
local error $d_2$ is bounded by $(1 + hL)^{n-2} \|d_2\|$, and
so on. Since the local errors are bounded by $Ch^2$, the
global error is thus bounded by

$$\|e_n\| \le \sum_{j=0}^{n-1} (1 + hL)^j C h^2 = \frac{(1 + hL)^n - 1}{(1 + hL) - 1} C h^2 \le \frac{e^{nhL} - 1}{L} C h.$$

With $M = (e^{(T - t_0)L} - 1) C / L$, the global error satisfies

$$\|e_n\| \le M h \quad \text{for } t_n \le T.$$

The numerical method thus converges to the exact
solution as $h \to 0$ with $nh$ fixed, but only at first order,
that is, with an error bound proportional to $h$. We will
later turn to higher-order numerical methods, with an
error bound proportional to $h^p$ with $p > 1$.
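
First-order convergence and the bound $Mh$ can be checked numerically on a simple example. The sketch below uses $y' = -y$, $y(0) = 1$ on $[0, 1]$, for which $L = 1$ and $C = \tfrac12 \max |y''| = \tfrac12$; the test problem, the step counts, and the variable names are assumptions made for illustration.

```python
import math


def euler_error(n):
    """Global error at T = 1 of explicit Euler with n steps for y' = -y, y(0) = 1."""
    h = 1.0 / n
    y = 1.0
    for _ in range(n):
        y += h * (-y)
    return abs(y - math.exp(-1.0)), h


L_CONST = 1.0   # Lipschitz constant of f(t, y) = -y
C = 0.5         # (1/2) max |y''| on [0, 1], since |y''(t)| = e^{-t} <= 1
M = (math.exp(1.0 * L_CONST) - 1.0) * C / L_CONST

errors = [euler_error(n) for n in (100, 200, 400)]
ratios = [errors[i][0] / errors[i + 1][0] for i in range(2)]
print(errors, ratios)  # halving h roughly halves the error
```

For this problem the observed error is several times smaller than $Mh$, which is typical: the bound is a worst-case estimate.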

2.4 Stiff Differential Equations 

The above error bound becomes meaningless for stiff
problems, where $L$ is large. The implicit Euler method
admits an analogous error analysis in which only the
one-sided Lipschitz constant $\ell$ appears in the stability
estimate; provided that $h\ell < 1$,

$$\|u_{n+1} - v_{n+1}\| \le \frac{1}{1 - h\ell} \|u_n - v_n\|$$

holds for the results of two Euler steps starting from
$u_n$ and $v_n$. For stiff problems with $\ell \ll L$ this is
much more favorable than the stability estimate of the
explicit Euler method in terms of $L$. It leads to an error
bound $\|e_n\| \le mh$, in which $m$ is essentially of the
same form as $M$ above but with the Lipschitz constant
$L$ replaced by the one-sided Lipschitz constant $\ell$.

Figure 3 Lady Windermere’s fan.

The above arguments explain the convergence behav- 
ior of the explicit and implicit Euler methods and their 
fundamentally different behavior for large classes of 
stiff differential equations. They do not explain the 
favorable behavior of the symplectic Euler method for 
Hamiltonian systems. This requires another concept, 
backward analysis, which is treated next. 

2.5 Backward Analysis 

Much insight into numerical methods is obtained by 
interpreting the numerical result after a step as the 
(almost) exact solution of a modified differential equa- 
tion. Properties of the numerical method can then be 
inferred from properties of a differential equation. For
each of the Euler methods applied to $y' = f(y)$ an
asymptotic expansion

$$\tilde f(y) = f(y) + h f_2(y) + h^2 f_3(y) + \cdots$$

can be uniquely constructed recursively such that, up
to arbitrarily high powers of $h$,

$$y_1 = \tilde y(h),$$

where $\tilde y(t)$ is the solution of the modified differential
equation $\tilde y' = \tilde f(\tilde y)$ with initial value $y_0$. The remarkable
feature is that, when the symplectic Euler method
is applied to a Hamiltonian system, the modified differential
equation is again Hamiltonian. The modified
Hamilton function has an asymptotic expansion

$$\tilde H = H + h H_2 + h^2 H_3 + \cdots.$$

The symplectic Euler method therefore conserves the
modified energy $\tilde H$ (up to arbitrarily high powers of $h$),
which is close to the exact energy $H$. This conservation
of the modified energy prevents the linearly growing
drift in the energy that is present along numerical
solutions of the explicit and implicit Euler methods. For 
these two methods the modified differential equation is 
no longer Hamiltonian. 

3 Nonstiff Problems 

3.1 Higher-Order Methods 

A method is said to have order $p$ if the local error
(recall that this is the error after one step of the method
starting from the exact solution) is bounded by $Ch^{p+1}$,
where $h$ is the step size and $C$ depends only on bounds
of derivatives of the solution $y(t)$ and of the function
$f$. As for the Euler method in section 2.1, the order
is determined by comparing the Taylor expansions of
the exact solution and the numerical solution, which for
a method of order $p$ should agree up to and including
the $h^p$ term.

A drawback of the Euler methods is that they are
only of order 1. There are different ways to increase the
order: using additional, auxiliary function evaluations
in passing from $y_n$ to $y_{n+1}$ (one-step methods); using
previously computed solution values $y_{n-1}, y_{n-2}, \dots$
and/or their function values (multistep methods); or
using both (general linear methods). For nonstiff initial-value
problems the most widely used methods are
explicit Runge-Kutta methods of orders up to 8, in the
class of one-step methods, and Adams-type multistep
methods up to order 12. For very stringent accuracy
requirements of 10 or 100 digits, high-order extrapolation
methods or high-order Taylor series expansions
of the solution (when higher derivatives of $f$ are available
with automatic differentiation software) are sometimes
used.

3.2 Explicit Runge-Kutta Methods 

Two ideas underlie Runge-Kutta methods. First, the 
integral in

$$y(t_0 + h) = y(t_0) + h \int_0^1 y'(t_0 + \theta h) \, d\theta,$$

with $y'(t) = f(t, y(t))$, is approximated by a quadrature
formula with weights $b_i$ and nodes $c_i$:

$$y_1 = y_0 + h \sum_{i=1}^s b_i Y_i', \qquad Y_i' = f(t_0 + c_i h, Y_i).$$

Second, the internal stage values $Y_i \approx y(t_0 + c_i h)$
are determined by another quadrature formula for the
integral from 0 to $c_i$:

$$Y_i = y_0 + h \sum_{j=1}^s a_{ij} Y_j', \qquad i = 1, \dots, s,$$

with the same function values $Y_j'$ as for $y_1$. If the coefficients
satisfy $a_{ij} = 0$ for $j \ge i$, then the above
sum actually extends only from $j = 1$ to $i - 1$, and
hence $Y_1, Y_1', Y_2, Y_2', \dots, Y_s, Y_s'$ can be computed explicitly
one after the other. The methods are named after
Carl Runge, who in 1895 proposed two- and three-stage
methods of this type, and Wilhelm Kutta, who
in 1901 proposed what is now known as the classical
Runge-Kutta method of order 4, which extends the
Simpson quadrature rule from integrals to differential
equations:

$$\begin{array}{c|cccc} 0 & & & & \\ \tfrac12 & \tfrac12 & & & \\ \tfrac12 & 0 & \tfrac12 & & \\ 1 & 0 & 0 & 1 & \\ \hline & \tfrac16 & \tfrac26 & \tfrac26 & \tfrac16 \end{array}$$

Using Lady Windermere’s fan as in section 2, one
finds that the global error $y_n - y(t_n)$ of a $p$th-order
Runge-Kutta method over a bounded time interval
is $O(h^p)$.
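
The classical fourth-order Runge-Kutta method, with nodes $(0, \tfrac12, \tfrac12, 1)$ and weights $(\tfrac16, \tfrac26, \tfrac26, \tfrac16)$, can be written out as follows. This is a minimal sketch for a scalar problem with constant step size; the function names and the test equation $y' = -y$ are assumptions made for illustration.

```python
import math


def rk4_step(f, t, y, h):
    """One step of the classical Runge-Kutta method of order 4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def rk4(f, t0, y0, T, n):
    """Integrate from t0 to T with n equal steps."""
    h = (T - t0) / n
    t, y = t0, y0
    for _ in range(n):
        y = rk4_step(f, t, y, h)
        t += h
    return y


f = lambda t, y: -y
err_10 = abs(rk4(f, 0.0, 1.0, 1.0, 10) - math.exp(-1.0))
err_20 = abs(rk4(f, 0.0, 1.0, 1.0, 20) - math.exp(-1.0))
print(err_10, err_20, err_10 / err_20)  # ratio near 2**4 = 16
```

Halving the step size reduces the error by roughly a factor of 16, as expected for a global error of size $O(h^4)$.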

The order conditions of general Runge-Kutta meth- 
ods were elegantly derived by John Butcher in 1964 
using a tree model for the derivatives of f and their 
concatenations by the chain rule, as they appear in the 
Taylor expansions of the exact solution and the numer- 
ical solution. This enabled the construction of meth- 
ods of even higher order, among which excellently con- 
structed methods of orders 5 and 8 by Dormand and 
Prince (from 1980) have found widespread use. These 
methods are equipped with lower-order error indica- 
tors from embedded formulas that use the same func- 
tion evaluations. These error indicators are used for an 
adaptive selection of the step size that is intended to 
keep the local error close to a given error tolerance in 
each time step.

3.3 Extrapolation Methods

A systematic, if suboptimal, construction of explicit 
Runge-Kutta methods of arbitrarily high order is provided
by Richardson extrapolation of the results of
the explicit Euler method obtained with different step
sizes. This technique makes use of an asymptotic
expansion of the error,

$$y(t, h) - y(t) = e_1(t) h + e_2(t) h^2 + \cdots,$$

where $y(t, h)$ is the explicit Euler approximation at $t$
obtained with step size $h$. At $t = t_0 + H$, the error
expansion coefficients up to order $p$ can be eliminated
by evaluating at $h = 0$ (extrapolating) the interpolation
polynomial through the Euler values $y(t_0 + H, h_j)$
for $j = 1, \dots, p$ corresponding to different step sizes
$h_j = H/j$. This gives a method of order $p$, which formally
falls into the general class of Runge-Kutta methods.
Instead of using the explicit Euler method as the
basic method, it is preferable to use Gragg’s method
(from 1964), which uses the explicit midpoint rule

$$y_{n+1} = y_{n-1} + 2h f(t_n, y_n), \quad n \ge 1,$$

and an explicit Euler starting step to compute $y_1$. This
method has an error expansion in powers of $h^2$ (instead
of $h$, above) at even $n$, and with the elimination of each
error coefficient one therefore gains a power of $h^2$.
Extrapolation methods have built-in error indicators 
that can be used for order and step-size control. 
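
Extrapolation of the explicit Euler method over one large step $H$ can be sketched with the Aitken-Neville recursion for the interpolation polynomial evaluated at $h = 0$. The example below uses the step sizes $h_j = H/j$ from the text; the function names, the scalar test problem, and the omission of any order or step-size control are assumptions made for illustration.

```python
import math


def euler_substeps(f, t0, y0, H, n):
    """Explicit Euler over [t0, t0 + H] with n substeps of size H / n."""
    h = H / n
    t, y = t0, y0
    for _ in range(n):
        y += h * f(t, y)
        t += h
    return y


def extrapolated_euler(f, t0, y0, H, p):
    """One step of size H of the order-p extrapolation method, h_j = H / j."""
    ns = list(range(1, p + 1))
    # First column: Euler values for the different step sizes.
    T = [[euler_substeps(f, t0, y0, H, n)] for n in ns]
    # Aitken-Neville: each new column eliminates one more error coefficient.
    for j in range(1, p):
        for k in range(1, j + 1):
            ratio = ns[j] / ns[j - k]
            T[j].append(T[j][k - 1] + (T[j][k - 1] - T[j - 1][k - 1]) / (ratio - 1.0))
    return T[p - 1][p - 1]


f = lambda t, y: -y
exact = math.exp(-0.5)
err_plain = abs(euler_substeps(f, 0.0, 1.0, 0.5, 4) - exact)
err_extrap = abs(extrapolated_euler(f, 0.0, 1.0, 0.5, 4) - exact)
print(err_plain, err_extrap)
```

For this linear test problem the extrapolated value coincides with the degree-4 Taylor polynomial of $e^{-H}$, which is what one expects from a fourth-order method.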

3.4 Adams Methods 

The methods introduced by astronomer John Couch 
Adams in 1855 were the first of high order that 
used only function evaluations, and in their cur- 
rent variable-order, variable-step-size implementations 
they are among the most efficient methods for general 
nonstiff initial-value problems. 

When $k$ function values $f_{n-j} = f(t_{n-j}, y_{n-j})$ ($j = 0, \dots, k-1$)
have already been computed, the integrand
in

$$y(t_{n+1}) = y(t_n) + h \int_0^1 f(t_n + \theta h, y(t_n + \theta h)) \, d\theta$$

is approximated by the interpolation polynomial $P(t)$
through $(t_{n-j}, f_{n-j})$, $j = 0, \dots, k-1$, yielding the
explicit Adams method of order $k$,

$$y_{n+1} = y_n + h \int_0^1 P(t_n + \theta h) \, d\theta,$$

which, upon inserting the Newton interpolation formula,
becomes (for constant step size $h$)

$$y_{n+1} = y_n + h f_n + h \sum_{i=1}^{k-1} \gamma_i \nabla^i f_n,$$

with the backward differences $\nabla f_n = f_n - f_{n-1}$,
$\nabla^i f_n = \nabla^{i-1} f_n - \nabla^{i-1} f_{n-1}$, and with coefficients
$(\gamma_i) = (\tfrac{1}{2}, \tfrac{5}{12}, \tfrac{3}{8}, \tfrac{251}{720}, \dots)$. The method thus corrects the explicit Euler
method by adding differences of previous function 
values. 

Especially for higher orders, the accuracy of the
approximation suffers from the fact that the interpolation
polynomial is used outside the interval of interpolation.
This is avoided if the (as yet) unknown value
$(t_{n+1}, f_{n+1})$ is added to the interpolation points. Let
$P^*(t)$ denote the corresponding interpolation polynomial,
which is now used to replace the integrand
$f(t, y(t))$. This yields the implicit Adams method of
order $k + 1$, which takes the form

$$y_{n+1} = y_n + h f_{n+1} + h \sum_{i=1}^{k} \gamma_i^* \nabla^i f_{n+1},$$

with $(\gamma_i^*) = (-\tfrac{1}{2}, -\tfrac{1}{12}, -\tfrac{1}{24}, -\tfrac{19}{720}, \dots)$. The equation
for $y_{n+1}$ is solved approximately by one or at most
two fixed-point iterations, taking the result from the 
explicit Adams method as the starting iterate (the pre- 
dictor) and inserting its function value on the right- 
hand side of the implicit Adams method (the corrector). 
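
A constant-step predictor-corrector scheme in backward-difference form can be sketched as follows. The sketch assumes a scalar problem, a single corrector iteration followed by a final function evaluation, and a start-up phase in which the order grows from 1 to `k`; the function names and the test equation are choices made for the example, and no step-size or order control is attempted.

```python
import math

GAMMA = [1 / 2, 5 / 12, 3 / 8, 251 / 720]            # explicit Adams coefficients
GAMMA_STAR = [-1 / 2, -1 / 12, -1 / 24, -19 / 720]   # implicit Adams coefficients


def backward_differences(fs, order):
    """Return [nabla^1 f, ..., nabla^order f] taken at the last entry of fs."""
    diffs, row = [], list(fs)
    for _ in range(order):
        row = [b - a for a, b in zip(row, row[1:])]
        diffs.append(row[-1])
    return diffs


def adams_pece(f, t0, y0, T, n, k=2):
    """Predict with explicit Adams (order m), correct once with implicit Adams (order m + 1)."""
    h = (T - t0) / n
    t, y = t0, y0
    fs = [f(t, y)]                         # stored values, oldest first, at most k
    for _ in range(n):
        m = len(fs)                        # current order, grows from 1 up to k
        d = backward_differences(fs, m - 1)
        y_pred = y + h * fs[-1] + h * sum(g * di for g, di in zip(GAMMA, d))
        f_pred = f(t + h, y_pred)          # evaluate at the predicted value
        d_star = backward_differences(fs + [f_pred], m)
        y = y + h * f_pred + h * sum(g * di for g, di in zip(GAMMA_STAR, d_star))
        t += h
        fs = (fs + [f(t, y)])[-k:]         # final evaluation, keep the last k values
    return y


f = lambda t, y: -y
err_20 = abs(adams_pece(f, 0.0, 1.0, 1.0, 20) - math.exp(-1.0))
err_40 = abs(adams_pece(f, 0.0, 1.0, 1.0, 40) - math.exp(-1.0))
print(err_20, err_40)
```

With `k = 1` the scheme reduces to an Euler predictor followed by a trapezoidal corrector, which is also what the first step of the start-up phase uses.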

In a variable-order, variable-step-size implementa- 
tion, the required starting values are built up by start- 
ing with the methods of increasing order 1,2,3,..., one 
after the other. Strategies for increasing or lowering 
the order are based on monitoring the backward differ- 
ence terms. Changing the step size is computationally 
more expensive, since it requires a recalculation of all 
method coefficients. It is facilitated by passing information
from one step to the next by the Nordsieck vector,
which collects the values of the interpolation polynomial
and all its derivatives at $t_n$ scaled by powers of
the step size.

3.5 Linear Multistep Methods 

Both explicit and implicit Adams methods (with con- 
stant step size) belong to the class of linear multistep 
methods 

$$\sum_{j=0}^{k} \alpha_j y_{n+j} = h \sum_{j=0}^{k} \beta_j f_{n+j},$$

with $f_i = f(t_i, y_i)$ and $\alpha_k \neq 0$. This class also includes

important methods for stiff problems, in particular the 
backward differentiation formulas to be described in 
the next section. The theoretical study of linear mul- 
tistep methods was initiated by Dahlquist in 1956. He 
showed that for such methods 

$$\text{consistency} + \text{stability} \iff \text{convergence},$$

which, together with the contemporaneous Lax equivalence
theorem for discretizations of partial differential
equations, forms a basic principle of numerical analy- 
sis. What this means here is described in more detail 
below. 

In contrast to one-step methods, having high order 
does not by itself guarantee that a multistep method 
converges as $h \to 0$. In fact, choosing the method coefficients
in such a way that the order is maximized for
a given $k$ leads to a method that produces wild oscillations,
which increase in magnitude with decreasing
step size. One requires in addition a stability condition,
which can be phrased as saying that all solutions
to the linear difference equation $\sum_{j=0}^{k} \alpha_j y_{n+j} = 0$ stay
bounded as $n \to \infty$, or equivalently:

All roots of the polynomial $\sum_{j=0}^{k} \alpha_j \zeta^j$ are in the complex
unit disk, and those on the unit circle are simple.

If this stability condition is satisfied and the method 
is of order $p$, then the error satisfies $y_n - y(t_n) = O(h^p)$
on bounded time intervals, provided that the
error in the starting values is $O(h^p)$.

Dahlquist also proved order barriers: the order of a
stable $k$-step method cannot exceed $k + 2$ if $k$ is even,
$k + 1$ if $k$ is odd, and $k$ if the method is explicit ($\beta_k = 0$).
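
The root condition is easy to test numerically. The following sketch handles two-step methods only, so that the roots of $\alpha_2 \zeta^2 + \alpha_1 \zeta + \alpha_0$ come from the quadratic formula; the function name and the tolerances are assumptions of this illustration. The second example is the classical explicit two-step method of maximal order 3, $y_{n+2} + 4y_{n+1} - 5y_n = h(4f_{n+1} + 2f_n)$, whose root $\zeta = -5$ explains the growing oscillations mentioned above.

```python
import cmath

def satisfies_root_condition(a2, a1, a0, tol=1e-12):
    """Check the stability (root) condition for a2*z^2 + a1*z + a0, a2 != 0."""
    disc = cmath.sqrt(a1 * a1 - 4 * a2 * a0)
    r1 = (-a1 + disc) / (2 * a2)
    r2 = (-a1 - disc) / (2 * a2)
    # All roots must lie in the closed unit disk ...
    if abs(r1) > 1 + tol or abs(r2) > 1 + tol:
        return False
    # ... and a root on the unit circle must be simple.
    if abs(r1 - r2) <= 1e-8 and abs(abs(r1) - 1) <= 1e-8:
        return False
    return True

print(satisfies_root_condition(1, -1, 0))          # two-step Adams: roots 1 and 0
print(satisfies_root_condition(1, -4 / 3, 1 / 3))  # BDF2: roots 1 and 1/3
print(satisfies_root_condition(1, 4, -5))          # maximal order: roots 1 and -5
print(satisfies_root_condition(1, -2, 1))          # double root at 1
```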

3.6 General Linear Methods 

Predictor-corrector Adams methods fall neither into 
the class of multistep methods, since they use the pre- 
dictor as an internal stage, nor into the class of Runge- 
Kutta methods, since they use previous function values. 
Linear multistep methods and Runge-Kutta methods 
are extreme cases of a more general class of methods 

$$u_{n+1} = S u_n + h \Phi(t_n, u_n, h),$$

where $u_n$ is a vector (usually of dimension a multiple
of the dimension of the differential equation) from
which the solution approximation $y_n \approx y(t_n)$ can be
obtained by a linear mapping, $S$ is a square matrix,
and $\Phi$ depends on function values of $f$. (For example,
for predictor-corrector methods we would have
$u_n = (y_n, y_n^{\mathrm{pred}}, y_{n-1}, \dots, y_{n-k+1})$ in this framework.)

More general methods like these have been studied 
since the mid-1960s with the objective of looking for 
the “greatest good as a mean between extremes” (in 
the words of Aristotle and John Butcher). They include 
a number of methods of potential interest for both 
nonstiff and stiff problems, such as two-step Runge- 
Kutta methods or general linear methods with inherent 
Runge-Kutta stability, but as of now do not appear to
have found their way into applications via competitive
software.

4 Stiff Problems 

We saw in the introduction that for important classes 
of differential equations, called stiff equations, the 
implicit Euler method yields a drastic improvement 
over the explicit Euler method. Are there higher-order 
methods with similarly good properties? 

4.1 Backward Differentiation Formula Methods 

The k-step implicit Adams methods, though natu- 
rally extending the implicit Euler method, perform 
disappointingly on stiff problems for k > 1. Multi- 
step methods from another extension of the implicit 
Euler method, which is based on numerical differen- 
tiation rather than integration, turn out to be better 
for stiff problems. Suppose that $k$ solution approximations
$y_{n-k+1}, \dots, y_n$ have already been computed,
and consider the interpolation polynomial $u(t)$ passing
through $y_{n+1-j}$ at $t_{n+1-j}$ for $j = 0, \dots, k$, including the
as yet unknown approximation $y_{n+1}$. We then require
the collocation condition

$$u'(t) = f(t, u(t)) \quad \text{at } t = t_{n+1},$$

or equivalently, in the case of a constant step size $h$,

$$\sum_{j=1}^{k} \frac{1}{j} \nabla^j y_{n+1} = h f_{n+1}.$$

This backward differentiation formula (BDF) is an 
implicit linear multistep method of order k, which is 
found to be unstable for $k > 6$. Methods for smaller $k$,
however, up to $k \le 5$, are currently the most widely
used methods for stiff problems and are implemented
in numerous computer codes. The usefulness
of these methods was first observed by Curtiss and 
Hirschfelder in 1952, who also coined the notion of 
stiff differential equations. Bill Gear's BDF code “DIF- 
SUB” from 1971 was the first widely used code for stiff 
problems. It brought BDF methods (“Gear’s method”) 
to the attention of practitioners in many fields. 
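
For $k = 2$ the BDF formula reads $\tfrac32 y_{n+1} - 2y_n + \tfrac12 y_{n-1} = h f_{n+1}$. The sketch below applies it to a stiff scalar test problem chosen for this illustration, $y' = \lambda (y - \cos t) - \sin t$ with $y(0) = 1$, whose exact solution is $\cos t$. Because $f$ is linear in $y$, the implicit equation is solved in closed form; a general code would use Newton's method here. The function name and the implicit Euler starting step are assumptions of the sketch.

```python
import math

def bdf2_linear(lam, h, n_steps):
    """BDF2 for y' = lam*(y - cos t) - sin t, y(0) = 1 (exact solution: cos t)."""
    def g(t):
        # Inhomogeneous part of f(t, y) = lam*y + g(t).
        return -lam * math.cos(t) - math.sin(t)

    y_prev = 1.0
    t = h
    # Starting step: implicit Euler, (1 - h*lam) * y_1 = y_0 + h*g(t_1).
    y = (y_prev + h * g(t)) / (1.0 - h * lam)
    for _ in range(n_steps - 1):
        t += h
        # (3/2 - h*lam) * y_{n+1} = 2*y_n - y_{n-1}/2 + h*g(t_{n+1})
        y_new = (2.0 * y - 0.5 * y_prev + h * g(t)) / (1.5 - h * lam)
        y_prev, y = y, y_new
    return t, y

t_end, y_end = bdf2_linear(-1.0e6, 0.01, 100)
print("BDF2 error:", abs(y_end - math.cos(t_end)))
```

With $h\lambda = -10^4$ the explicit Euler method would amplify errors by a factor of about $10^4$ in every step, whereas the BDF2 step remains accurate with the same step size.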

4.2 A-Stability and Related Notions 

Which properties make BDF methods successful for 
stiff problems? In 1963, Dahlquist systematically stud- 
ied the behavior of multistep methods on the scalar 
linear differential equation 

$$y' = \lambda y \quad \text{with } \lambda \in \mathbb{C},\ \operatorname{Re} \lambda \le 0,$$

whose use as a test equation can be justified by linearization
of the differential equation and diagonalization
of the Jacobian matrix. The behavior of a numerical
method on this deceptively simple scalar linear differential
equation gives much insight into its usefulness
for more general stiff problems, as is shown by both
numerical experience and theory.

Clearly, the exact solution $y(t) = e^{t\lambda} y_0$ remains
bounded for $t \to +\infty$ when $\operatorname{Re} \lambda \le 0$. Following
Dahlquist, a method is called A-stable if for every $\lambda \in \mathbb{C}$
with $\operatorname{Re} \lambda \le 0$, the numerical solution $y_n$ stays bounded
as $n \to \infty$ for every step size $h > 0$ and every choice
of starting values. The implicit Euler method and the 
second-order BDF method are A-stable, but the BDF 
methods of higher order are not. Dahlquist’s second- 
order barrier states that the order of an A-stable lin- 
ear multistep method cannot exceed 2. This funda- 
mental, if negative, theoretical result has led to much 
work aimed at circumventing the barrier by using other 
methods or weaker notions of stability. 

The stability region $S$ is the set of all complex $z = h\lambda$
such that every numerical solution of the method
applied to $y' = \lambda y$ with step size $h$ stays bounded.
The stability regions of explicit and implicit $k$-step
Adams methods with $k > 1$ are bounded, which leads to
step-size restrictions when $\lambda$ has large absolute value.
The BDF methods up to order 6 are A($\alpha$)-stable; that
is, the stability region contains an unbounded sector
$|\arg(-z)| \le \alpha$ with $\alpha = 90^\circ, 90^\circ, 86^\circ, 73^\circ, 51^\circ, 17^\circ$ for
$k = 1, \dots, 6$, respectively. The higher-order BDF methods
therefore perform well for differential equations
where the Jacobian has large eigenvalues near the neg- 
ative real half-axis, but they behave poorly when there 
are large eigenvalues near the imaginary axis. 

4.3 Implicit Runge-Kutta Methods 

It turns out that there is no order barrier for A-stable 
Runge-Kutta methods. 

Explicit Runge-Kutta methods cannot be A-stable 
because application of such a method to the linear test 
equation yields $y_{n+1} = P(h\lambda) y_n$, where $P$ is a polynomial
of degree $s$, the number of stages. The stability
region of such a method is necessarily bounded, since
$|P(z)| \to \infty$ as $|z| \to \infty$.

On the other hand, an implicit Runge-Kutta method 
has 

$$y_{n+1} = R(h\lambda) y_n$$

with a rational function $R(z)$, called the stability function
of the method, which is an approximation to the
exponential at the origin, $R(z) = e^z + O(z^{p+1})$ as $z \to 0$.
The method is A-stable if $|R(z)| \le 1$ for $\operatorname{Re} z \le 0$. The
subtle interplay between order and stability is clarified
by the theory of order stars, developed by Wanner,
Hairer, and Nørsett in 1978. In particular, this theory
shows that among the Padé approximants [IV.9 §2.4]
$R_{k,j}(z)$ to the exponential (the rational approximations
of numerator degree $k$ and denominator degree
$j$ of highest possible order $p = j + k$), precisely
those with $k \le j \le k + 2$ are A-stable. Optimal-order
implicit Runge-Kutta methods having the diagonal
Padé approximants $R_{s,s}$ as stability function are
the collocation methods based on the Gauss quadrature
nodes, while those having the subdiagonal Padé
approximants $R_{s-1,s}$ are the collocation methods based
on the right-hand Radau quadrature nodes. We turn to
these important implicit Runge-Kutta methods next. 

4.4 Gauss and Radau Methods 

A collocation method based on the nodes $0 \le c_1 < \dots < c_s \le 1$
determines a polynomial $u(t)$ of degree
at most $s$ such that $u(t_0) = y_0$ and the differential
equation is satisfied at the $s$ points $t_0 + c_i h$:

$$u'(t) = f(t, u(t)) \quad \text{at } t = t_0 + c_i h,\ i = 1, \dots, s.$$

The solution approximation at the endpoint is then

$$y_1 = u(t_0 + h),$$

which is taken as the starting value for the next step.
As was shown by Ken Wright (in 1970), such a collocation
method is equivalent to an implicit Runge-Kutta
method, the order of which is equal to the order
of the underlying interpolatory quadrature with nodes
$c_i$. The highest order $p = 2s$ is thus obtained with
Gauss nodes. Nevertheless, Gauss methods have found
little use in stiff initial-value problems (as opposed
to boundary-value problems; see section 6). The reason
for this is that the stability function here satisfies
$|R(z)| \to 1$ as $z \to \infty$, whereas $e^z \to 0$ as $z \to -\infty$.

The desired damping property at infinity is obtained 
for the subdiagonal Pade approximant. This is the sta- 
bility function for the collocation method at Radau 
points, which are the nodes of the quadrature formula 
of order $p = 2s - 1$ with $c_s = 1$. Let us collect the basic
properties: the $s$-stage Radau method is an implicit
Runge-Kutta method of order $p = 2s - 1$; it is A-stable
and has $R(\infty) = 0$.
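
The difference in damping can be seen directly from the stability functions. The sketch below evaluates the diagonal Padé approximant $R_{1,1}(z) = (1 + z/2)/(1 - z/2)$ (one-stage Gauss method) and the subdiagonal Padé approximant $R_{1,2}(z) = (1 + z/3)/(1 - 2z/3 + z^2/6)$ (two-stage Radau method); the function names are chosen for this illustration only.

```python
import cmath

def R_gauss1(z):
    """Diagonal Pade approximant R_{1,1}: stability function of the 1-stage Gauss method."""
    return (1 + z / 2) / (1 - z / 2)

def R_radau2(z):
    """Subdiagonal Pade approximant R_{1,2}: stability function of the 2-stage Radau method."""
    return (1 + z / 3) / (1 - 2 * z / 3 + z * z / 6)

# Behavior far out on the negative real axis: no damping versus strong damping.
for z in (-1e2, -1e4, -1e8):
    print(z, abs(R_gauss1(z)), abs(R_radau2(z)))

# Approximation of the exponential near the origin (orders 2 and 3).
z = 0.01
print(abs(R_gauss1(z) - cmath.exp(z)), abs(R_radau2(z) - cmath.exp(z)))
```

For large negative $z$ the Gauss value approaches modulus 1, so stiff components are propagated almost undamped with alternating sign, while the Radau value tends to 0 like $2/|z|$.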

The Radau methods have some more remarkable fea- 
tures: they are nonlinearly stable with the so-called 
algebraic stability property; their last internal stage
equals the starting value for the next step (this property
is useful for very stiff and for differential-algebraic
equations); and their internal stages all have order $s$.

The last property does indeed hold for every collocation
method with $s$ nodes. It is important because of
the phenomenon of order reduction ; in the application 
of an implicit Runge-Kutta method to stiff problems, 
the method may have only the stage order, or stage 
order + 1, with stiffness-independent error constants, 
instead of the full classical order p that is obtained with 
nonstiff problems. 

The implementation of Radau methods by Hairer and 
Wanner (from 1991) is known for its robustness in 
dealing with stiff problems and differential-algebraic 
equations of the type My' = f(t,y) with a singular 
matrix M. 

4.5 Linearly Implicit Methods 

BDF and implicit Runge-Kutta methods are fully im- 
plicit, and the resulting systems of nonlinear equations 
need to be solved by variants of Newton’s method. To 
reduce the computational cost while retaining favor- 
able linear stability properties, linearly implicit meth- 
ods have been proposed, such as the linearly implicit 
Euler method, in which only a single iteration of New- 
ton’s method is done in each step: 

$$(I - h J_n)(y_{n+1} - y_n) = h f_n,$$

where $J_n \approx \partial_y f(t_n, y_n)$. Thus just one linear system
of equations is solved in each time step. The method is 
identical to the implicit Euler method for linear prob- 
lems and therefore inherits its A-stability. Higher-order 
linearly implicit methods can be obtained by Richard- 
son extrapolation of the linearly implicit Euler method, 
or they are specially constructed Rosenbrock meth- 
ods. Like explicit Runge-Kutta methods, these methods 
determine the solution approximation as 
$$y_1 = y_0 + h \sum_{i=1}^{s} b_i Y_i', \qquad Y_i = y_0 + h \sum_{j=1}^{i-1} \alpha_{ij} Y_j',$$

but compute the derivative stages consecutively by
solving $s$ linear systems of equations (written here for
an autonomous problem, $f(t,y) = f(y)$ and $J = \partial_y f(y_0)$):

$$(I - \gamma h J) Y_i' = f(Y_i) + h J \sum_{j=1}^{i-1} \gamma_{ij} Y_j'.$$

Such methods are easy to implement, and they have 
gained popularity in the numerical integration of 


spatial semidiscretizations of partial differential equa- 
tions. For large problems, the dominating numeri- 
cal cost is in the solution of the systems of linear 
equations, using either direct sparse solvers or itera- 
tive methods such as preconditioned Krylov subspace 
methods. 
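
As a small illustration of the linearly implicit Euler method, consider the scalar stiff problem $y' = -1000\,(y^2 - 1)$ with $y(0) = 2$, whose solution decays rapidly to the equilibrium $y = 1$. The problem, the step size, and the function names are assumptions of this sketch; in the scalar case the linear system reduces to a single division.

```python
def f(y):
    return -1000.0 * (y * y - 1.0)

def dfdy(y):
    return -2000.0 * y

def linearly_implicit_euler(f, dfdy, y0, h, n_steps):
    """Scalar version of (I - h*J_n)(y_{n+1} - y_n) = h*f(y_n)."""
    y = y0
    for _ in range(n_steps):
        J = dfdy(y)                        # Jacobian at the current point
        y = y + h * f(y) / (1.0 - h * J)   # one linear solve per step
    return y

def explicit_euler(f, y0, h, n_steps):
    y = y0
    for _ in range(n_steps):
        y = y + h * f(y)
    return y

h = 0.01
print("linearly implicit Euler:", linearly_implicit_euler(f, dfdy, 2.0, h, 100))
print("explicit Euler, 3 steps:", explicit_euler(f, 2.0, h, 3))
```

With this step size the explicit Euler iterates leave the neighborhood of the solution after the first step, whereas the linearly implicit iterates approach the equilibrium monotonically.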

4.6 Exponential Integrators 

While it appears an obvious idea to use the exponen- 
tial of the Jacobian in a numerical method, this was for 
a long time considered impractical, and particularly so 
for large problems. This attitude changed, however, in 
the mid-1990s when it was realized that Krylov sub- 
space methods for approximating a matrix exponential 
times a vector, $e^{\gamma h J} v$, show superlinear convergence,
whereas there is generally only linear convergence for
solving linear systems $(I - \gamma h J)x = v$. Unless a good
preconditioner for the linear system is available, computing
the action of the matrix exponential is therefore
computationally less expensive than solving a corresponding
linear system. This fact led to a revival of
methods using the exponential or related functions like
$\varphi(z) = (e^z - 1)/z$, such as the exponential Euler method

$$y_{n+1} = y_n + h \varphi(h J_n) f_n.$$

The method is exact for linear $f(y) = Jy + c$. It differs
from the linearly implicit Euler method in that
the entire function $\varphi(z)$ replaces the rational function
$1/(1 - z)$. Higher-order exponential methods of one-step
and multistep type have also been constructed.
Exponential integrators have proven useful for large- 
scale problems in physics and for nonlinear parabolic 
equations, as well as for highly oscillatory problems 
like those considered in section 5.6. 
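
The exactness of the exponential Euler method for linear problems can be checked on the scalar equation $y' = ay + c$, where the Jacobian is simply $a$. The sketch below is an illustration under that scalar assumption; for systems, $\varphi(hJ)$ applied to a vector would be approximated, for example, by a Krylov subspace method. The function names are hypothetical.

```python
import math

def phi(z):
    """phi(z) = (exp(z) - 1) / z, with the limit value 1 at z = 0."""
    return math.expm1(z) / z if z != 0.0 else 1.0

def exponential_euler_linear(a, c, y0, h, n_steps):
    """Exponential Euler for the scalar problem y' = a*y + c (Jacobian J = a)."""
    y = y0
    for _ in range(n_steps):
        y = y + h * phi(h * a) * (a * y + c)
    return y

def exact(a, c, y0, t):
    """Exact solution of y' = a*y + c, y(0) = y0, for a != 0."""
    return (y0 + c / a) * math.exp(a * t) - c / a

for a, h, n in ((-3.0, 0.5, 4), (-1000.0, 0.1, 10)):
    y_num = exponential_euler_linear(a, 2.0, 1.0, h, n)
    print(a, abs(y_num - exact(a, 2.0, 1.0, h * n)))
```

The error is at the level of rounding for both the mildly and the strongly stiff coefficient, independently of the step size.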

4.7 Chebyshev Methods 

For moderately stiff problems one can avoid numeri- 
cal linear algebra altogether by using explicit Runge- 
Kutta methods of low order (2 or 4) and high stage 
number, which are constructed to have a large stability 
domain covering a strip near the negative real semi- 
axis. The stability function of such methods is a high- 
degree polynomial related to Chebyshev polynomials. 
The stage number is chosen adaptively to include the 
product of the step size with the dominating eigen- 
values of the Jacobian in the stability domain. With s 
stages, one can cover intervals on the negative real axis 
of a length proportional to $s^2$. The quadratic growth
of the stability interval with the stage number makes
these methods suitable for problems with large negative
real eigenvalues of the Jacobian, such as spatial
semidiscretizations of parabolic partial differential
equations.
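
The quadratic growth can be observed in the simplest case, which is used here only as an illustration: the first-order, undamped stability polynomial $R_s(z) = T_s(1 + z/s^2)$, where $T_s$ is the Chebyshev polynomial of degree $s$. It satisfies $|R_s(z)| \le 1$ exactly for $-2s^2 \le z \le 0$. The second- and fourth-order methods mentioned above use related but more elaborate polynomials.

```python
def chebyshev_T(s, x):
    """Chebyshev polynomial T_s(x) via the three-term recurrence."""
    t_prev, t_curr = 1.0, x
    if s == 0:
        return t_prev
    for _ in range(s - 1):
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return t_curr

def stability_function(s, z):
    """R_s(z) = T_s(1 + z / s^2) for real z."""
    return chebyshev_T(s, 1.0 + z / (s * s))

s = 10
inside = max(abs(stability_function(s, -2.0 * s * s * i / 1000)) for i in range(1001))
print("max |R_s| on [-2 s^2, 0]:", inside)
print("|R_s| just outside:", abs(stability_function(s, -2.1 * s * s)))
```

For comparison, $s$ explicit Euler steps of size $h/s$ have the stability function $(1 + z/s)^s$, which is bounded by 1 only on the interval $[-2s, 0]$.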

5 Structure-Preserving Methods 

The methods discussed so far are designed for general 
differential equations, and a distinction was drawn only 
between nonstiff and stiff problems. There are, how- 
ever, important classes of differential equations with a 
special, often geometric, structure, whose preservation 
in the numerical discretization leads to substantially 
better methods, especially when integrating over long 
times. The most prominent of these are Hamiltonian 
systems, which are all-important in physics. Their flow 
has the geometric property of being symplectic. 

In respecting the phase space geometry under dis- 
cretization and analyzing its effect on the long-time 
behavior of a numerical method, there is a shift of view- 
point from concentrating on the approximation of a 
single solution trajectory to considering the numerical 
method as a discrete dynamical system that approxi- 
mates the flow of a differential equation. 

While many differential equations have interest- 
ing structures to preserve under discretization— and 
much work has been done in devising and analyz- 
ing appropriate numerical methods for doing so— we 
will restrict our attention to Hamiltonian systems here. 
Their numerical treatment has been an active research 
area for the past two decades. 

5.1 Symplectic Methods 

The time-$t$ flow of a differential equation $y' = f(y)$
is the map $\varphi_t$ that associates with an initial value $y_0$
at time 0 the solution value at time $t$: $\varphi_t(y_0) = y(t)$.
Consider a Hamiltonian system

$$p' = -\nabla_q H(p,q), \qquad q' = \nabla_p H(p,q),$$

or equivalently, for $y = (p,q)$,

$$y' = J^{-1} \nabla H(y) \quad \text{with } J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}.$$

The flow $\varphi_t$ of a Hamiltonian system is symplectic; that
is, the derivative $D\varphi_t$ with respect to the initial value
satisfies

$$D\varphi_t(y)^{\mathsf T} J\, D\varphi_t(y) = J$$

for all $y$ and $t$ for which $\varphi_t(y)$ exists. This is a quadratic
relation formally similar to orthogonality, with $J$
in place of the identity matrix I, but it is related to 
the preservation of areas rather than lengths in phase 
space. 

A numerical one-step method $y_{n+1} = \Phi_h(y_n)$ is
called symplectic if the numerical flow $\Phi_h$ is a symplectic
map:

$$D\Phi_h(y)^{\mathsf T} J\, D\Phi_h(y) = J.$$

Such methods exist; the “symplectic Euler method”
of section 1.3 is indeed symplectic. Great interest
in symplectic methods was spurred when, in 1988,
Lasagni, Sanz-Serna, and Suris independently characterized
symplectic Runge-Kutta methods as those
whose coefficients satisfy the condition

$$b_i a_{ij} + b_j a_{ji} - b_i b_j = 0 \quad \text{for all } i, j.$$

Gauss methods (see section 4.4) were already known 
to satisfy this condition and were thus found to be 
symplectic. Soon after those discoveries it was realized 
that a numerical method is symplectic if and only if 
the modified differential equation of backward analy- 
sis (section 2.5) is again Hamiltonian. This made it 
possible to prove rigorously the almost-conservation 
of energy over times that are exponentially long in 
the inverse step size, as well as further favorable 
long-time properties such as the almost-preservation 
of KAM (Kolmogorov-Arnold-Moser) tori of perturbed 
integrable systems over exponentially long times. 

5.2 The Störmer-Verlet Method

By the time symplecticity was entering the field of 
numerical analysis, scientists in molecular simulation 
had been doing symplectic computations for more than 
20 years without knowing it; the standard integrator 
of molecular dynamics, the method used successfully 
ever since Luc Verlet introduced it to the field in 1967, 
is symplectic. For a Hamiltonian $H(p,q) = \tfrac12 p^{\mathsf T} M^{-1} p + V(q)$
with a symmetric positive-definite mass matrix $M$,
the method is explicit and is given by the formulas

$$p_{n+1/2} = p_n - \tfrac12 h \nabla V(q_n),$$
$$q_{n+1} = q_n + h M^{-1} p_{n+1/2},$$
$$p_{n+1} = p_{n+1/2} - \tfrac12 h \nabla V(q_{n+1}).$$

Such a method was also formulated by the astronomer
Störmer in 1907, and in fact it can even be traced back
to Newton’s Principia from 1687, where it was used as
a theoretical tool in the proof of the preservation of
angular momentum in the two-body problem (Kepler’s
second law), which is indeed preserved by this method.
Given that there are already sufficiently many Newton
methods in numerical analysis, it is fair to refer to
the method as the Störmer-Verlet method (the Verlet
method and the leapfrog method are also often-used 
names). As will be discussed in the next three subsec- 
tions, the symplecticity of this method can be under- 
stood in various ways by relating the method to dif- 
ferent classes of methods that have proven useful in a 
variety of applications. 
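
A minimal sketch of the method for a scalar problem with unit mass, applied to the pendulum $H(p,q) = \tfrac12 p^2 - \cos q$, illustrates the long-time energy behavior. The test problem, step size, and function names are assumptions of this illustration.

```python
import math

def stormer_verlet(grad_V, p, q, h, n_steps):
    """Stormer-Verlet for H(p, q) = p^2/2 + V(q) (scalar, unit mass).

    Returns the list of (p_n, q_n). For clarity the gradient is evaluated
    twice per step; a production code would reuse the second evaluation.
    """
    traj = [(p, q)]
    for _ in range(n_steps):
        p_half = p - 0.5 * h * grad_V(q)       # half step in the momentum
        q = q + h * p_half                     # full step in the position
        p = p_half - 0.5 * h * grad_V(q)       # second half step in the momentum
        traj.append((p, q))
    return traj

def pendulum_energy(p, q):
    return 0.5 * p * p - math.cos(q)

# Pendulum: V(q) = -cos q, so grad V(q) = sin q.
traj = stormer_verlet(math.sin, 0.0, 1.0, 0.05, 20000)
H0 = pendulum_energy(*traj[0])
drift = max(abs(pendulum_energy(p, q) - H0) for p, q in traj)
print("maximal energy deviation over t in [0, 1000]:", drift)
```

The energy error stays small and shows no systematic drift over the whole interval, in contrast to what one observes for the explicit Euler method with the same step size.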


5.3 Composition Methods 


Let us denote the one-step map $(p_n, q_n) \mapsto (p_{n+1}, q_{n+1})$
of the Störmer-Verlet method by $\Phi_h^{\mathrm{SV}}$, that of the symplectic
Euler method of section 1.3 by $\Phi_h^{\mathrm{SE}}$, and that
of the adjoint symplectic Euler method by $\Phi_h^{\mathrm{SE}*}$, where
instead of the argument $(p_{n+1}, q_n)$ one uses $(p_n, q_{n+1})$.
The second-order Störmer-Verlet method can then be
interpreted as the composition of the first-order symplectic
Euler methods with halved step size:

$$\Phi_h^{\mathrm{SV}} = \Phi_{h/2}^{\mathrm{SE}*} \circ \Phi_{h/2}^{\mathrm{SE}}.$$

Since the composition of symplectic maps is symplectic,
this shows the symplecticity of the Störmer-Verlet
method. We further note that the method is time
reversible (or symmetric): $\Phi_{-h} \circ \Phi_h = \mathrm{id}$, or equivalently
$\Phi_h = \Phi_h^*$ with the adjoint method $\Phi_h^* := \Phi_{-h}^{-1}$.
This is known to be another favorable property for 
conservative systems. 
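
Both statements, the composition of two symplectic Euler half steps and the time reversibility, can be verified numerically. The sketch below does so for a scalar separable Hamiltonian with unit mass; the pendulum potential and the function names are choices made for this illustration.

```python
import math

def grad_V(q):
    # Pendulum potential V(q) = -cos q (an illustrative choice).
    return math.sin(q)

def symplectic_euler(h, p, q):
    """Symplectic Euler with argument (p_{n+1}, q_n)."""
    p_new = p - h * grad_V(q)
    q_new = q + h * p_new
    return p_new, q_new

def adjoint_symplectic_euler(h, p, q):
    """Adjoint symplectic Euler with argument (p_n, q_{n+1})."""
    q_new = q + h * p
    p_new = p - h * grad_V(q_new)
    return p_new, q_new

def stormer_verlet(h, p, q):
    p_half = p - 0.5 * h * grad_V(q)
    q_new = q + h * p_half
    p_new = p_half - 0.5 * h * grad_V(q_new)
    return p_new, q_new

h, p0, q0 = 0.1, 0.3, 1.2
composed = adjoint_symplectic_euler(h / 2, *symplectic_euler(h / 2, p0, q0))
direct = stormer_verlet(h, p0, q0)
back = stormer_verlet(-h, *direct)     # a step with -h undoes a step with h

print("composition vs. direct step:", composed, direct)
print("after stepping back:", back)
```

The composed and the direct step agree up to rounding, and the step with $-h$ returns to the initial point $(p_0, q_0)$.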

Moreover, from first-order methods we have obtained
a second-order method. More generally, starting
from a low-order method $\Phi_h$, methods of arbitrary
order can be constructed by suitable compositions

$$\Phi_{c_s h} \circ \dots \circ \Phi_{c_1 h} \quad \text{or} \quad \Phi_{b_s h}^* \circ \Phi_{a_s h} \circ \dots \circ \Phi_{b_1 h}^* \circ \Phi_{a_1 h}.$$

The first to give systematic approaches to high-order 
compositions were Suzuki and Yoshida in 1990. When- 
ever the base method is symplectic, so is the composed 
method. The coefficients can be chosen such that the 
resulting method is symmetric. 


5.4 Splitting Methods 


Splitting the Hamiltonian $H(p,q) = T(p) + V(q)$ into
its kinetic energy $T(p) = \tfrac12 p^{\mathsf T} M^{-1} p$ and its potential
energy $V(q)$, we have that the flows $\varphi_t^T$ and $\varphi_t^V$ of the
systems with Hamiltonians $T$ and $V$, respectively, are
obtained by solving the trivial differential equations

$$\varphi_t^T: \begin{cases} p' = 0, \\ q' = M^{-1} p, \end{cases} \qquad \varphi_t^V: \begin{cases} p' = -\nabla V(q), \\ q' = 0. \end{cases}$$

We then note that the Störmer-Verlet method can be
interpreted as a composition of the exact flows of the
split differential equations:

$$\Phi_h^{\mathrm{SV}} = \varphi_{h/2}^V \circ \varphi_h^T \circ \varphi_{h/2}^V.$$

Since the flows $\varphi_{h/2}^V$ and $\varphi_h^T$ are symplectic, so is their
composition.

Splitting the vector field of a differential equation and 
composing the flows of the subsystems is a structure- 
preserving approach that yields methods of arbitrary 
order and is useful in the time integration of a variety 
of ordinary and partial differential equations, such as 
linear and nonlinear Schrodinger equations. 


5.5 Variational Integrators 

For the Hamiltonian $H(p,q) = \tfrac12 p^{\mathsf T} M^{-1} p + V(q)$, the
Hamilton equations of motion $\dot p = -\nabla V(q)$, $\dot q = M^{-1} p$
can be combined to give the second-order differential
equation

$$M \ddot q = -\nabla V(q),$$

which can be interpreted as the Euler-Lagrange equations
for minimizing the action integral

$$\int_{t_0}^{t_N} L(q(t), \dot q(t))\,dt \quad \text{with } L(q, \dot q) = \tfrac12 \dot q^{\mathsf T} M \dot q - V(q)$$

over all paths $q(t)$ with fixed endpoints. In the Störmer-Verlet
method, eliminating the momenta yields the second-order
difference equations

$$M(q_{n+1} - 2 q_n + q_{n-1}) = -h^2 \nabla V(q_n),$$

which are the discrete Euler-Lagrange equations for
minimizing the discretized action integral

$$\sum_{n=0}^{N-1} \frac{h}{2} \left( L\left(q_n, \frac{q_{n+1} - q_n}{h}\right) + L\left(q_{n+1}, \frac{q_{n+1} - q_n}{h}\right) \right),$$

which results from a trapezoidal rule approximation
to the action integral and piecewise-linear approximation
to $q(t)$. The Störmer-Verlet method can thus be
interpreted as resulting from the direct discretization 
of the Hamilton variational principle. Such an interpre- 
tation can in fact be given for every symplectic method. 
Conversely, symplectic methods can be constructed by 
minimizing a discrete action integral. In particular, 
approximating the action integral by a quadrature for- 
mula and the positions q(t) by a piecewise polynomial 
leads to a symplectic partitioned Runge-Kutta method, 
which in general uses different Runge-Kutta formulas
for positions and momenta. With Gauss quadrature
one reinterprets in this way the Gauss methods
of section 4.4, and with higher-order Lobatto quadrature
formulas one obtains higher-order relatives of the
Störmer-Verlet method.

5.6 Oscillatory Problems 

Highly oscillatory solution behavior in Hamiltonian systems
typically arises when the potential is a multiscale
sum $V = V^{[\mathrm{slow}]} + V^{[\mathrm{fast}]}$, where the Hessian of
$V^{[\mathrm{fast}]}$ has positive eigenvalues that are large compared
with those of $V^{[\mathrm{slow}]}$. (Here we assume that $M = I$
for simplicity.) With standard methods such as the
Störmer-Verlet method, very small time steps would
be required, for reasons of both accuracy and stability.
Various numerical methods have been devised with
the aim of overcoming this limitation. We describe
just one such method here: a multiple-time-stepping
method that reduces the computational work significantly
when the slow force $f^{[\mathrm{slow}]} = -\nabla V^{[\mathrm{slow}]}$ is
far more expensive to evaluate than the fast force
$f^{[\mathrm{fast}]} = -\nabla V^{[\mathrm{fast}]}$. A basic principle is to rely on
averages instead of pointwise force evaluations. In the
averaged-force method, the force $f_n = -\nabla V(q_n)$ in
the Störmer-Verlet method is replaced by an averaged
force $\bar f_n$ as follows. We freeze the slow force at $q_n$ and
consider the auxiliary differential equation

$$\ddot u = f^{[\mathrm{slow}]}(q_n) + f^{[\mathrm{fast}]}(u)$$

with initial values $u(0) = q_n$, $\dot u(0) = 0$. We then define
the averaged force as

$$\bar f_n = \int_{-1}^{1} (1 - |\theta|) \left( f^{[\mathrm{slow}]}(q_n) + f^{[\mathrm{fast}]}(u(\theta h)) \right) d\theta,$$

which equals

$$\bar f_n = \frac{1}{h^2} \left( u(h) - 2u(0) + u(-h) \right).$$

The value $u(h)$ is computed approximately using
smaller time steps, noting that $u(-h) = u(h)$.

The argument of $f^{[\mathrm{slow}]}$ might preferably be replaced
with an averaged value $\bar q_n$ in order to mitigate the
adverse effect of possible step-size resonances that
appear when the product of $h$ with an eigenfrequency
of the Hessian is close to an integral multiple of $\pi$.

If the fast potential is quadratic, $V^{[\mathrm{fast}]}(q) = \tfrac12 q^{\mathsf T} A q$,
the auxiliary differential equation can be solved exactly
in terms of trigonometric functions of the matrix $h^2 A$.
The resulting method can then be viewed as an exponential
integrator, as considered in section 4.6.


6 Boundary-Value Problems 

In a two-point boundary-value problem, the differential 
equation is coupled with boundary conditions of the 
same dimension: 

$$y'(t) = f(t, y(t)), \quad a \le t \le b,$$
$$r(y(a), y(b)) = 0.$$

As an important class of examples, such problems arise
as the Euler-Lagrange equations of variational problems,
typically with separated boundary conditions
$r_a(y(a)) = 0$, $r_b(y(b)) = 0$.

6.1 The Sensitivity Matrix 

The problem of existence and uniqueness of a solu- 
tion is more subtle than for initial-value problems. For 
a linear boundary-value problem 

$$y'(t) = C(t) y(t) + g(t), \quad a \le t \le b,$$
$$A y(a) + B y(b) = q,$$

a unique solution exists if and only if the sensitivity
matrix $E = A + B\,U(b,a)$ is invertible, where $U(t,s)$
is the propagation matrix yielding $v(t) = U(t,s) v(s)$
for every solution of the linear differential equation
$v'(t) = C(t) v(t)$.

A solution of a nonlinear boundary-value problem is 
locally unique if the linearization along this solution 
has an invertible sensitivity matrix. 

6.2 Shooting 

Just as Newton’s method replaces a nonlinear system 
of equations with a sequence of linear systems, the 
shooting method replaces a boundary-value problem 
with a sequence of initial-value problems. The objec- 
tive is to find an initial value x such that the solution of 
the differential equation with this initial value, denoted 
$y(t; x)$, satisfies the boundary conditions

$$F(x) := r(x, y(b; x)) = 0.$$

Newton’s method is now applied to this nonlinear system
of equations; starting from an initial guess $x^0$, one
iterates

$$x^{k+1} = x^k + \Delta x^k \quad \text{with} \quad DF(x^k) \Delta x^k = -F(x^k).$$

Here, the derivative matrix $DF(x^k)$ turns out to be
the sensitivity matrix $E^k$ of the linearization of the
boundary-value problem along $y(t; x^k)$. In the $k$th iteration,
one solves numerically the initial-value problem
with initial value $x^k$ together with its linearization

$$(Y^k)'(t) = \partial_y f(t, y(t; x^k)) Y^k(t), \quad Y^k(a) = I.$$
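
The shooting iteration can be sketched for a small example chosen for this illustration: $y'' = \tfrac32 y^2$ with $y(0) = 4$, $y(1) = 1$, which has the solution $y(t) = 4/(1+t)^2$ with $y'(0) = -8$. Since $y(0)$ is prescribed, the only unknown is the initial slope $s = y'(0)$, and $F(s) = y(1; s) - 1$ is scalar. The initial-value problem and its linearization are integrated together with the classical fourth-order Runge-Kutta method; the integrator, the step number, and the initial guess are assumptions of the sketch.

```python
def rk4_step(F, t, u, h):
    """One classical Runge-Kutta step for u' = F(t, u), with u a list."""
    k1 = F(t, u)
    k2 = F(t + h / 2, [ui + h / 2 * ki for ui, ki in zip(u, k1)])
    k3 = F(t + h / 2, [ui + h / 2 * ki for ui, ki in zip(u, k2)])
    k4 = F(t + h, [ui + h * ki for ui, ki in zip(u, k3)])
    return [ui + h / 6 * (a + 2 * b + 2 * c + d)
            for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]

def augmented_rhs(t, u):
    """y'' = 1.5*y^2 as a first-order system, together with its linearization."""
    y, v, Y, W = u          # Y = dy/ds and W = dv/ds for the unknown slope s
    return [v, 1.5 * y * y, W, 3.0 * y * Y]

def shoot(s, n=200):
    """Return F(s) = y(1; s) - 1 and its derivative F'(s) = Y(1)."""
    u, h = [4.0, s, 0.0, 1.0], 1.0 / n
    for i in range(n):
        u = rk4_step(augmented_rhs, i * h, u, h)
    return u[0] - 1.0, u[2]

s = -7.0                    # initial guess, assumed reasonably close to the solution
for _ in range(20):
    F, dF = shoot(s)
    s -= F / dF             # Newton update
    if abs(F) < 1e-12:
        break
print("computed initial slope:", s)
```

This boundary-value problem has a second solution with a much steeper initial slope; which one Newton's method finds depends on the initial guess, which illustrates the sensitivity discussed in the next subsection.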

6.3 Multiple Shooting

The conceptual elegance of the shooting method— that 
it reduces everything to the solution of initial-value 
problems over the whole interval — can easily turn into 
its computational obstruction. Newton’s method may
be very sensitive to the choice of the initial value $x^0$.
The norms of the matrices $E^{-1}$ and $U(b,a)$, which
determine the effect of perturbations in the boundary-value
problem and the initial-value problem, respectively,
are unrelated and may differ widely.

The problem can be avoided by subdividing the interval
$a = t_0 < t_1 < \dots < t_N = b$, shooting on every
subinterval, and requiring continuity of the solution at
the nodes $t_n$. With $y(t; t_n, x_n)$ denoting the solution
of the differential equation that starts at $t_n$ with initial
value $x_n$, this approach leads to a larger nonlinear
system with the continuity conditions

$$F_n(x_{n-1}, x_n) := y(t_n; t_{n-1}, x_{n-1}) - x_n = 0$$

for $n = 1, \dots, N$ together with the boundary conditions

$$F_0(x_0, x_N) := r(x_0, x_N) = 0.$$

Newton’s method is now applied to this system of equations.
In each iteration one solves initial-value problems
on the subintervals together with their linearization,
and then a linear system with a large sparse matrix is
solved for the increments in $(x_0, \dots, x_N)$.

6.4 Collocation 

In the collocation approach to the boundary-value 
problem, one determines an approximation u(t) that is 
a continuous, piecewise polynomial of degree at most $s$
and that satisfies the boundary conditions and the differential
equation at a finite number of collocation
points $t_{n,i} = t_{n-1} + c_i (t_n - t_{n-1})$ (for $n = 1, \dots, N$ and
$i = 1, \dots, s$):

$$u'(t) = f(t, u(t)) \quad \text{at } t = t_{n,i},$$
$$r(u(a), u(b)) = 0.$$

The method can be interpreted, and implemented, as a 
multiple-shooting method in which a single step with a 
collocation method for initial-value problems, as con- 
sidered in section 4.4, is made to approximate the solu- 
tion in each subinterval. The most common choice, as 
first implemented by Ascher, Christiansen, and Russell 
in 1979, is collocation at Gauss nodes, which has good 
stability properties in the forward and backward directions.
The order of approximation at the grid points
$t_n$ is $p = 2s$. Moreover, if the boundary-value problem
results from a variational problem, then Gauss
collocation can be interpreted as a direct discretization
of the variational problem (see section 5.5).

7 Summary 

The numerical solution of ordinary differential equa- 
tions is an area driven by both applications and theory, 
with efficient computer codes alongside beautiful the- 
orems, both relying on the insight and knowledge of 
the researchers that are active in this field. It is an 
area that interacts with neighboring fields in computational
mathematics (numerical linear algebra
[IV.10], the numerical solution of partial differential
equations [IV.13], and optimization [IV.11]),
with the theory of differential equations [IV.2] and
dynamical systems [IV.20], and time and again with
the application areas in science and engineering in 
which numerical methods for differential equations are 
used.

IV.13 Numerical Solution of Partial
Differential Equations

Endre Süli


1 Introduction 

Numerical solution of partial differential equations 
(PDEs) is a rich and active field of modern applied 
mathematics. The subject’s steady growth is stimulated 
by ever-increasing demands from the natural sciences, 
from engineering, and from economics to provide accu- 
rate and reliable approximations to mathematical mod- 
els involving PDEs whose exact solutions are either too 
complicated to determine in closed form or (as is the 
case for many) are not known to exist. While the his- 
tory of numerical solution of ordinary differential equa- 
tions is firmly rooted in eighteenth- and nineteenth- 
century mathematics, the mathematical foundations 
of the field of numerical solution of PDEs are much 
more recent; they were first formulated in the landmark
paper “Über die partiellen Differenzengleichungen
der mathematischen Physik” (“On the partial difference
equations of mathematical physics”) by Richard
Courant, Karl Friedrichs, and Hans Lewy, which was 
published in 1928. Today, there is a vast array of pow- 
erful numerical techniques for specific PDEs, including 
the following: 

• level set and fast-marching methods for front- 
tracking and interface problems; 

• numerical methods for PDEs on (possibly evolving) 
manifolds; 

• immersed boundary methods; 

• mesh-free methods; 

• particle methods; 

• vortex methods; 

• various numerical homogenization methods and 
specialized numerical techniques for multiscale 
problems; 

• wavelet-based multiresolution methods; 

• sparse finite-difference and finite-element meth- 
ods, greedy algorithms, and tensorial methods for 
high-dimensional PDEs; 

• domain-decomposition methods for geometrically 
complex problems; and 

• numerical methods for PDEs with stochastic coefficients
that arise in a variety of applications,
including uncertainty quantification [II.34]
problems.


Our brief review cannot do justice to this huge and 
rapidly evolving subject. We shall therefore confine 
ourselves to the most standard and well-established 
techniques for the numerical solution of PDEs: finite- 
difference methods, finite-element methods, finite-vol- 
ume methods, and spectral methods. Before embarking 
on our survey, it is appropriate to take a brief excursion 
into the theory of PDEs in order to fix the relevant nota- 
tional conventions and to describe some typical model 
problems. 

2 Model Partial Differential Equations 

A linear partial differential operator $L$ of order $m$ with
real-valued coefficients $a_\alpha = a_\alpha(x)$, $|\alpha| \leq m$, on a
domain $\Omega \subset \mathbb{R}^d$, defined by

\[ L := \sum_{|\alpha| \leq m} a_\alpha(x)\, \partial^\alpha, \quad x \in \Omega, \]

is called elliptic if, for every $x := (x_1, \dots, x_d) \in \Omega$ and
every nonzero $\xi := (\xi_1, \dots, \xi_d) \in \mathbb{R}^d$,

\[ Q_m(x, \xi) := \sum_{|\alpha| = m} a_\alpha(x)\, \xi^\alpha \neq 0. \]

Here, $\alpha := (\alpha_1, \dots, \alpha_d)$ is a $d$-component vector with
nonnegative integer entries, called a multi-index,
$|\alpha| := \alpha_1 + \cdots + \alpha_d$ is the length of the multi-index $\alpha$,
$\partial^\alpha := \partial_{x_1}^{\alpha_1} \cdots \partial_{x_d}^{\alpha_d}$, with
$\partial_{x_j} := \partial/\partial x_j$, and
$\xi^\alpha := \xi_1^{\alpha_1} \cdots \xi_d^{\alpha_d}$. In
the case of complex-valued coefficients $a_\alpha$, the definition
above is modified by demanding that $|Q_m(x, \xi)| \neq 0$
for all $x \in \Omega$ and all nonzero $\xi \in \mathbb{R}^d$. A typical
example of a first-order elliptic operator with complex
coefficients is the Cauchy-Riemann operator
$\partial_{\bar z} := \tfrac{1}{2}(\partial_x + \mathrm{i}\,\partial_y)$, where $\mathrm{i} := \sqrt{-1}$.
With this general definition of ellipticity, even-order operators can exhibit
some rather disturbing properties. For example, the
Bitsadze equation $\partial_{xx} u + 2\mathrm{i}\,\partial_{xy} u - \partial_{yy} u = 0$ admits
infinitely many solutions on the unit disk $\Omega$ in $\mathbb{R}^2$ centered
at the origin, all of which vanish on the boundary
$\partial\Omega$ of $\Omega$. Indeed, with $z = x + \mathrm{i}y$,
$u(x, y) = (1 - |z|^2) f(z)$ is a solution that vanishes on $\partial\Omega$ for any
complex analytic function $f$. A stronger requirement,
referred to as uniform ellipticity, is therefore frequently
imposed: for real-valued coefficients $a_\alpha$, $|\alpha| \leq m$, and
$m = 2k$, where $k$ is a positive integer, uniform ellipticity
demands the existence of a constant $C > 0$ such that
$(-1)^k Q_{2k}(x, \xi) \geq C |\xi|^{2k}$ for all $x \in \Omega$ and all nonzero
$\xi \in \mathbb{R}^d$.
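
As a quick numerical sanity check of the Bitsadze example (not a proof), the following Python sketch approximates the left-hand side of the Bitsadze equation by central differences for the solution (1 - |z|^2) f(z). The choice f(z) = z^2, the function names, and the step size 1e-3 are assumptions made purely for illustration; any complex analytic f could be substituted.

```python
import math

def u(x, y):
    """Candidate solution (1 - |z|^2) f(z) with the illustrative choice f(z) = z^2."""
    z = complex(x, y)
    return (1.0 - abs(z) ** 2) * z ** 2

def bitsadze_residual(x, y, h=1e-3):
    """Central-difference approximation of u_xx + 2i u_xy - u_yy at (x, y)."""
    u_xx = (u(x + h, y) - 2 * u(x, y) + u(x - h, y)) / h ** 2
    u_yy = (u(x, y + h) - 2 * u(x, y) + u(x, y - h)) / h ** 2
    u_xy = (u(x + h, y + h) - u(x + h, y - h)
            - u(x - h, y + h) + u(x - h, y - h)) / (4 * h ** 2)
    return u_xx + 2j * u_xy - u_yy

def boundary_value(theta):
    """Value of u at the point of the unit circle with polar angle theta."""
    return u(math.cos(theta), math.sin(theta))

if __name__ == "__main__":
    print(abs(bitsadze_residual(0.3, -0.4)))  # small: only discretization error remains
    print(abs(boundary_value(1.0)))           # zero up to rounding
```

The residual is small but not exactly zero because the difference quotients carry an O(h^2) discretization error; the point of the sketch is only that a nontrivial function solves the equation and vanishes on the boundary, which is what makes the Dirichlet problem ill-posed here.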

The archetypal linear second-order uniformly elliptic
PDE is $-\Delta u + c(x) u = f(x)$, $x \in \Omega$. Here, $c$
and $f$ are real-valued functions defined on $\Omega$, and
$\Delta := \sum_{i=1}^{d} \partial_{x_i}^2$ is the Laplace operator. When $c < 0$ the
equation is called the Helmholtz equation. In the special
case when $c(x) \equiv 0$ the equation is referred to as
Poisson's equation, and when $c(x) \equiv 0$ and $f(x) \equiv 0$
it is referred to as Laplace's equation. Elliptic PDEs
arise in a range of mathematical models in continuum
mechanics, physics, chemistry, biology, economics, and
finance. For example, in a two-dimensional flow of an
incompressible fluid with flow velocity $u = (u_1, u_2, 0)$,
the stream function $\psi$, related to $u$ by $u = \nabla \times (0, 0, \psi)$,
satisfies Laplace's equation. The potential $\Phi$ of a gravitational
field, due to an attracting massive object of
density $\rho$, satisfies Poisson's equation $\Delta \Phi = 4\pi G \rho$,
where $G$ is the universal gravitational constant.

More generally, one can consider fully nonlinear
second-order PDEs:

\[ F(x, u, \nabla u, D^2 u) = 0, \]

where $F$ is a real-valued function defined on the set
$\Upsilon := \Omega \times \mathbb{R} \times \mathbb{R}^d \times \mathbb{R}^{d \times d}_{\mathrm{symm}}$, with a typical element
$\upsilon := (x, z, p, R)$, where $x \in \Omega$, $z \in \mathbb{R}$, $p \in \mathbb{R}^d$, and
$R \in \mathbb{R}^{d \times d}_{\mathrm{symm}}$; $\Omega$ is an open set in $\mathbb{R}^d$; $D^2 u$ denotes the Hessian
matrix of $u$; and $\mathbb{R}^{d \times d}_{\mathrm{symm}}$ is the $d(d+1)/2$-dimensional
linear space of real symmetric $d \times d$ matrices, $d \geq 2$.
An equation of this form is said to be elliptic on $\Upsilon$
if the $d \times d$ matrix whose entries are $\partial F/\partial R_{ij}$, $i, j = 1, \dots, d$,
is positive-definite at each $\upsilon \in \Upsilon$. An important
example, encountered in connection with optimal
transportation problems, is the Monge-Ampère equation:
$\det D^2 u = f(x)$ with $x \in \Omega$. For this equation to
be elliptic it is necessary to demand that the twice
continuously differentiable function $u$ be uniformly convex
at each point of $\Omega$, and for such a solution to exist
we must also have $f$ positive.

Parabolic and hyperbolic PDEs typically arise in mathematical
models in which one of the independent physical
variables is time, denoted by $t$. For example, the
PDEs

\[ \partial_t u + Lu = f \quad \text{and} \quad \partial_{tt} u + Lu = f, \]

where $L$ is a uniformly elliptic partial differential operator
of order $2m$ and $u$ and $f$ are functions of
$(t, x_1, \dots, x_d)$, are uniformly parabolic and uniformly
hyperbolic, respectively. The simplest examples are
the (uniformly parabolic) unsteady heat equation and
the (uniformly hyperbolic) second-order wave equation,
where

\[ Lu := -\sum_{i,j=1}^{d} \partial_{x_j}\bigl(a_{ij}(t, x)\, \partial_{x_i} u\bigr), \]

and where $a_{ij}(t, x) = a_{ij}(t, x_1, \dots, x_d)$, $i, j = 1, \dots, d$,
are the entries of a $d \times d$ matrix, which is positive-definite,
uniformly with respect to $(t, x_1, \dots, x_d)$.


Not all PDEs are of a certain fixed type. For example,
the following PDEs are mixed elliptic-hyperbolic; they
are elliptic for $x > 0$ and hyperbolic for $x < 0$:

\[ \partial_{xx} u + \operatorname{sign}(x)\, \partial_{yy} u = 0 \quad \text{(Lavrentiev equation)}, \]
\[ \partial_{xx} u + x\, \partial_{yy} u = 0 \quad \text{(Tricomi equation)}, \]
\[ x\, \partial_{xx} u + \partial_{yy} u = 0 \quad \text{(Keldysh equation)}. \]
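
The change of type in these three equations can be checked pointwise with the classical discriminant test for a second-order equation a u_xx + 2b u_xy + c u_yy = 0 in two variables: it is elliptic where b^2 - ac < 0 and hyperbolic where b^2 - ac > 0. The test itself is standard but is not stated in the text above, and the function names below are hypothetical. A minimal sketch:

```python
def pde_type(a, b, c):
    """Type of a*u_xx + 2*b*u_xy + c*u_yy = 0 at a point, from the sign of b^2 - a*c."""
    disc = b * b - a * c
    if disc < 0:
        return "elliptic"
    if disc > 0:
        return "hyperbolic"
    return "degenerate (parabolic)"

def sign(x):
    """Sign function with sign(0) = 0."""
    return (x > 0) - (x < 0)

def lavrentiev(x):
    # u_xx + sign(x) u_yy = 0
    return pde_type(1.0, 0.0, float(sign(x)))

def tricomi(x):
    # u_xx + x u_yy = 0
    return pde_type(1.0, 0.0, x)

def keldysh(x):
    # x u_xx + u_yy = 0
    return pde_type(x, 0.0, 1.0)

if __name__ == "__main__":
    for name, eq in [("Lavrentiev", lavrentiev), ("Tricomi", tricomi), ("Keldysh", keldysh)]:
        print(name, eq(0.5), eq(-0.5))
```

On the line x = 0 the discriminant vanishes for all three equations, which is where the type changes.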

Stochastic analysis is a fertile source of PDEs of
nonnegative characteristic form, such as

\[ \partial_t u - \sum_{i,j=1}^{d} \partial_{x_j}\bigl(a_{ij}\, \partial_{x_i} u\bigr) + \sum_{i=1}^{d} b_i\, \partial_{x_i} u + c u = f, \]

where $b_i$, $c$, and $f$ are real-valued functions of
$(t, x_1, \dots, x_d)$, and $a_{ij} = a_{ij}(t, x_1, \dots, x_d)$, $i, j = 1, \dots, d$,
are the entries of a positive-semidefinite matrix; since
the $a_{ij}$ are dependent on the temporal variable $t$, the
equation is, potentially, of changing type. An important
special case is when the $a_{ij}$ are all identically equal to
zero, resulting in the following first-order hyperbolic
equation, which is also referred to as the advection (or
transport) equation:

\[ \partial_t u + \sum_{i=1}^{d} b_i(t, x)\, \partial_{x_i} u + c(t, x) u = f(t, x). \]

The nonlinear counterpart of this equation,

\[ \partial_t u + \sum_{i=1}^{d} \partial_{x_i} [f(t, x, u)] = 0, \]


plays an important role in compressible fluid dynamics,
traffic flow models, and flow in porous media. Special
cases include the Burgers equation $\partial_t u + \partial_x(\tfrac{1}{2} u^2) = 0$
and the Buckley-Leverett equation
$\partial_t u + \partial_x\bigl(u^2/(u^2 + \tfrac{1}{4}(1 - u)^2)\bigr) = 0$.

PDEs are rarely considered in isolation; additional
information is typically supplied in the form of boundary
conditions (imposed on the boundary $\partial\Omega$ of the
domain $\Omega \subset \mathbb{R}^d$ in which the PDE is studied) or, in the
case of parabolic and hyperbolic equations, as initial
conditions at $t = 0$. The PDE in tandem with the
boundary/initial conditions is referred to as a boundary-value
problem/initial-value problem or, when both boundary
and initial data are supplied, as an initial-boundary-value
problem.

3 Finite-Difference Methods 

We begin by considering finite-difference methods
for elliptic boundary-value problems. The basic idea
behind the construction of finite-difference methods is
to discretize the closure, $\bar\Omega$, of the (bounded) domain
of definition $\Omega \subset \mathbb{R}^d$ of the solution (the so-called
analytical solution) to the PDE by approximating it with a
finite set of points in $\mathbb{R}^d$, called the mesh points or grid
points, and replacing the partial derivatives of the
analytical solution appearing in the equation by divided
differences (difference quotients) of a grid function, i.e.,
a function that is defined at all points in the
finite-difference grid. The process results in a finite set of
equations with a finite number of unknowns, the values
of the grid function representing the finite-difference
approximation to the analytical solution over the
finite-difference grid. We illustrate the construction by
considering a simple second-order uniformly elliptic PDE
subject to a homogeneous Dirichlet boundary condition:

\[ -\Delta u + c(x, y) u = f(x, y) \quad \text{in } \Omega, \tag{1} \]
\[ u = 0 \quad \text{on } \partial\Omega \tag{2} \]

on the unit square $\Omega := (0, 1)^2$; here, $c$ and $f$ are real-valued
functions that are defined and continuous on $\bar\Omega$,
and $c \geq 0$ on $\bar\Omega$. Let us suppose for simplicity that the
grid points are equally spaced. Thus we take $h := 1/N$,
where $N \geq 2$ is an integer. The corresponding
finite-difference grid is then $\bar\Omega_h := \{(x_i, y_j) : i, j = 0, \dots, N\}$,
where $x_i := ih$ and $y_j := jh$, $i, j = 0, \dots, N$. We also
define $\Omega_h := \bar\Omega_h \cap \Omega$ and $\partial\Omega_h := \bar\Omega_h \setminus \Omega_h$.

It is helpful to introduce the following notation for
first-order divided differences:

\[ D_x^+ u(x_i, y_j) := \frac{u(x_{i+1}, y_j) - u(x_i, y_j)}{h} \]

and

\[ D_x^- u(x_i, y_j) := \frac{u(x_i, y_j) - u(x_{i-1}, y_j)}{h}, \]

with $D_y^+ u(x_i, y_j)$ and $D_y^- u(x_i, y_j)$ defined analogously.
Then,

\[ D_x^2 u(x_i, y_j) := D_x^- D_x^+ u(x_i, y_j), \]
\[ D_y^2 u(x_i, y_j) := D_y^- D_y^+ u(x_i, y_j) \]

are referred to as the second-order divided difference of
$u$ in the $x$-direction and the $y$-direction, respectively,
at $(x_i, y_j) \in \Omega_h$.

Assuming that $u \in C^4(\bar\Omega)$ (i.e., that $u$ and all of its
partial derivatives up to and including those of fourth
order are defined and continuous on $\bar\Omega$), we have that,
at any $(x_i, y_j) \in \Omega_h$,

\[ D_x^2 u(x_i, y_j) = \frac{\partial^2 u}{\partial x^2}(x_i, y_j) + O(h^2) \tag{3} \]

and

\[ D_y^2 u(x_i, y_j) = \frac{\partial^2 u}{\partial y^2}(x_i, y_j) + O(h^2) \tag{4} \]

as $h \to 0$. Omission of the $O(h^2)$ terms in (3) and (4)
above yields that

\[ D_x^2 u(x_i, y_j) \approx \frac{\partial^2 u}{\partial x^2}(x_i, y_j), \qquad
   D_y^2 u(x_i, y_j) \approx \frac{\partial^2 u}{\partial y^2}(x_i, y_j), \]

where the symbol $\approx$ signifies approximate equality in
the sense that as $h \to 0$ the expression on the left of the
symbol converges to the expression on its right. Hence,

\[ -\bigl(D_x^2 u(x_i, y_j) + D_y^2 u(x_i, y_j)\bigr) + c(x_i, y_j)\, u(x_i, y_j)
   \approx f(x_i, y_j) \quad \text{for all } (x_i, y_j) \in \Omega_h, \tag{5} \]
\[ u(x_i, y_j) = 0 \quad \text{for all } (x_i, y_j) \in \partial\Omega_h. \tag{6} \]

It is instructive to note the similarity between (1) and
(5), and between (2) and (6). Motivated by the form of
(5) and (6), we seek a grid function $U$ whose value at
the grid point $(x_i, y_j) \in \bar\Omega_h$, denoted by $U_{ij}$,
approximates $u(x_i, y_j)$, the unknown exact solution to the
boundary-value problem (1), (2) evaluated at $(x_i, y_j)$,
$i, j = 0, \dots, N$. We define $U$ to be the solution to the
following system of linear algebraic equations:

\[ -\bigl(D_x^2 U_{ij} + D_y^2 U_{ij}\bigr) + c(x_i, y_j)\, U_{ij}
   = f(x_i, y_j) \quad \text{for all } (x_i, y_j) \in \Omega_h, \tag{7} \]
\[ U_{ij} = 0 \quad \text{for all } (x_i, y_j) \in \partial\Omega_h. \tag{8} \]

As each equation in (7) involves five values of the
grid function $U$ (namely, $U_{i,j}$, $U_{i-1,j}$, $U_{i+1,j}$, $U_{i,j-1}$,
$U_{i,j+1}$), the finite-difference method (7) is called the
five-point difference scheme. The matrix of the linear
system (7), (8) is sparse, symmetric, and positive-definite,
and for given functions $c$ and $f$ the system can be
efficiently solved by iterative techniques from numerical
linear algebra [IV.10], including Krylov subspace [II.23]
type methods (e.g., the conjugate gradient
method) and multigrid methods. Multigrid methods
were developed in the 1970s and 1980s and are widely
used as the iterative solver of choice for large systems
of linear algebraic equations that arise from finite-difference
and finite-element approximations in many
industrial applications. The key objective of a multigrid
method is to accelerate the convergence of standard
iterative methods (such as Jacobi iteration and successive
over-relaxation) by using a hierarchy of coarser-to-finer
grids. A multigrid method with an intentionally
reduced convergence tolerance can also be used as
an efficient preconditioner for a Krylov subspace iteration.
The preconditioner $P$ for a nonsingular matrix $A$
is an approximation of $A^{-1}$ whose purpose is to ensure
that $PA$ is a good approximation of the identity matrix,
thereby ensuring that iterative algorithms for the solution
of the preconditioned version, $PAx = Pb$, of the
system of linear algebraic equations $Ax = b$ exhibit
rapid convergence.
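
The construction of the five-point scheme and its solution by a Krylov subspace method can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: it applies the five-point operator matrix-free on the unit square, uses a plain (unpreconditioned) conjugate gradient iteration, and stores the grid function as a list of lists; the function names, the tolerance, and the iteration cap are all hypothetical choices, not part of the text.

```python
import math

def apply_operator(U, c, N):
    """Apply the five-point operator -(D_x^2 + D_y^2) U + c U at interior grid points.

    U is an (N+1) x (N+1) list of lists, U[i][j] ~ value at (i*h, j*h), zero on the boundary.
    """
    h = 1.0 / N
    AU = [[0.0] * (N + 1) for _ in range(N + 1)]
    for i in range(1, N):
        for j in range(1, N):
            neighbours = U[i - 1][j] + U[i + 1][j] + U[i][j - 1] + U[i][j + 1]
            AU[i][j] = (4.0 * U[i][j] - neighbours) / h ** 2 + c(i * h, j * h) * U[i][j]
    return AU

def inner(V, W):
    """Euclidean inner product of two grid functions."""
    return sum(v * w for row_v, row_w in zip(V, W) for v, w in zip(row_v, row_w))

def solve_five_point(c, f, N, tol=1e-10, max_iter=5000):
    """Conjugate gradient iteration for the five-point scheme with zero Dirichlet data."""
    h = 1.0 / N
    U = [[0.0] * (N + 1) for _ in range(N + 1)]
    # Initial residual equals the right-hand side because the initial guess is zero.
    R = [[f(i * h, j * h) if 0 < i < N and 0 < j < N else 0.0
          for j in range(N + 1)] for i in range(N + 1)]
    P = [row[:] for row in R]
    rr = inner(R, R)
    for _ in range(max_iter):
        if math.sqrt(rr) <= tol:
            break
        AP = apply_operator(P, c, N)
        alpha = rr / inner(P, AP)
        for i in range(1, N):
            for j in range(1, N):
                U[i][j] += alpha * P[i][j]
                R[i][j] -= alpha * AP[i][j]
        rr_new = inner(R, R)
        beta = rr_new / rr
        for i in range(1, N):
            for j in range(1, N):
                P[i][j] = R[i][j] + beta * P[i][j]
        rr = rr_new
    return U

if __name__ == "__main__":
    # Manufactured test problem (an assumption): u = sin(pi x) sin(pi y), c = 1.
    N = 16
    f = lambda x, y: (2 * math.pi ** 2 + 1) * math.sin(math.pi * x) * math.sin(math.pi * y)
    U = solve_five_point(lambda x, y: 1.0, f, N)
    err = max(abs(U[i][j] - math.sin(math.pi * i / N) * math.sin(math.pi * j / N))
              for i in range(N + 1) for j in range(N + 1))
    print("max nodal error:", err)
```

Conjugate gradients is applicable here because the matrix is symmetric and positive-definite; a production code would typically add a preconditioner (for example a multigrid cycle) and a sparse-matrix data structure rather than nested lists.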

One of the central questions in the numerical analysis
of PDEs is the mathematical study of the approximation
properties of numerical methods. We will illustrate this
by considering the finite-difference method (7), (8). The
grid function $T$ defined on $\Omega_h$ by

\[ T_{ij} := -\bigl(D_x^2 u(x_i, y_j) + D_y^2 u(x_i, y_j)\bigr)
   + c(x_i, y_j)\, u(x_i, y_j) - f(x_i, y_j) \tag{9} \]

is called the truncation error of the finite-difference
method (7), (8). Assuming that $u \in C^4(\bar\Omega)$, it follows
from (3)-(5) that, at each grid point $(x_i, y_j) \in \Omega_h$,
$T_{ij} = O(h^2)$ as $h \to 0$. The exponent of $h$ in the
statement $T_{ij} = O(h^2)$ (which, in this case, is equal to 2) is
called the order of accuracy (or order of consistency) of
the method.
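
The second-order accuracy of the second divided difference can be observed numerically. The following sketch is a one-dimensional illustration (an assumption made to keep it short): it compares the divided difference of a smooth function with the exact second derivative for step sizes h and h/2 and reports the observed order. The function names and the test function sin are hypothetical choices.

```python
import math

def second_difference(u, x, h):
    """Second-order divided difference (u(x+h) - 2u(x) + u(x-h)) / h^2."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / h ** 2

def observed_order(u, d2u, x, h):
    """Estimate p in error = O(h^p) by comparing step sizes h and h/2."""
    e1 = abs(second_difference(u, x, h) - d2u(x))
    e2 = abs(second_difference(u, x, h / 2.0) - d2u(x))
    return math.log(e1 / e2, 2)

if __name__ == "__main__":
    # u = sin, so u'' = -sin; the printed value should be close to 2.
    print(observed_order(math.sin, lambda x: -math.sin(x), 1.0, 0.1))
```

Halving h divides the error by roughly four, which is the practical signature of the exponent 2.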

It can be shown that there exists a positive constant
$c_0$, independent of $h$, $U$, and $f$, such that

\[ \Bigl( h^2 \sum_{i=1}^{N} \sum_{j=1}^{N-1} |D_x^- U_{ij}|^2
   + h^2 \sum_{i=1}^{N-1} \sum_{j=1}^{N} |D_y^- U_{ij}|^2
   + h^2 \sum_{i=1}^{N-1} \sum_{j=1}^{N-1} |U_{ij}|^2 \Bigr)^{1/2}
   \leq c_0 \Bigl( h^2 \sum_{i=1}^{N-1} \sum_{j=1}^{N-1} |f(x_i, y_j)|^2 \Bigr)^{1/2}. \tag{10} \]

Such an inequality is called a stability inequality: it expresses
the fact that the numerical solution $U \in S_{h,0}$ is bounded by
the data (in this case $f \in S_h$) uniformly with respect to the
grid size $h$. Here $S_{h,0}$ denotes the linear space of all grid
functions defined on $\bar\Omega_h$ that vanish on $\partial\Omega_h$, and $S_h$
is the linear space of all grid functions defined on $\Omega_h$.
The smallest real number $c_0 > 0$ for which (10) holds is called
the stability constant of the method. It follows in particular from (10)
that if $f_{ij} = 0$ for all $i, j = 1, \dots, N-1$, then $U_{ij} = 0$ for
all $i, j = 0, \dots, N$. Therefore, the matrix of the system
of linear equations (7), (8) is nonsingular, which then
implies the existence of a unique solution $U$ to (7), (8)
for any $h = 1/N$, $N \geq 2$. Consider the difference operator
$L_h : U \in S_{h,0} \mapsto f = L_h U \in S_h$ defined by (7), (8). The
left-hand side of (10) is sometimes denoted by $\|U\|_{1,h}$,
and the norm of the data on the right-hand side by $\|f\|_{0,h}$;
hence, the stability inequality (10) can be rewritten as

\[ \|U\|_{1,h} \leq c_0 \|f\|_{0,h} \]

with $f = L_h U$, and stability can then be seen to be
demanding the existence of the inverse to the linear
finite-difference operator $L_h : S_{h,0} \to S_h$, and its boundedness,
uniformly with respect to the discretization
parameter $h$. The mapping $U \in S_{h,0} \mapsto \|U\|_{1,h} \in \mathbb{R}$ is a
norm on $S_{h,0}$, called the discrete (Sobolev) $H^1(\Omega)$-norm,
and the mapping $f \in S_h \mapsto \|f\|_{0,h} \in \mathbb{R}$ is a norm on
$S_h$, called the discrete $L^2(\Omega)$-norm. It should be noted
that the stability properties of finite-difference methods
depend on the choice of norm for the data and for
the associated solution.

In order to quantify the closeness of the approximate
solution $U$ to the analytical solution $u$ at the grid
points, we define the global error $e$ of the method (7),
(8) by $e_{ij} := u(x_i, y_j) - U_{ij}$. Clearly, the grid function
$e = u - U$ satisfies (7), (8) if $f(x_i, y_j)$ on the right-hand
side of (7) is replaced by $T_{ij}$. Hence, by the stability
inequality, $\|u - U\|_{1,h} = \|e\|_{1,h} \leq c_0 \|T\|_{0,h}$. Under
the assumption that $u \in C^4(\bar\Omega)$ we thus deduce that
$\|u - U\|_{1,h} \leq c_1 h^2$, where $c_1$ is a positive constant,
independent of $h$. The exponent of $h$ on the right-hand side
(which is 2 in this case) is referred to as the order of
convergence of the finite-difference method and is equal
to the order of accuracy. Indeed, the fundamental idea
that stability and consistency together imply convergence
is a recurring theme in the analysis of numerical
methods for differential equations.
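
The order of convergence can be observed in practice by solving the same problem on two grids and comparing the errors. The sketch below uses a one-dimensional analog of the scheme, -u'' + u = f on (0, 1) with zero boundary values, which is an assumption made to keep the example short; the manufactured solution sin(pi x), the tridiagonal (Thomas) solver, and all function names are likewise illustrative choices. The error is measured in the one-dimensional counterpart of the discrete H^1 norm.

```python
import math

def solve_1d(N):
    """Three-point scheme for -u'' + u = f on (0,1), u(0) = u(1) = 0, with h = 1/N.

    The right-hand side is chosen so that the exact solution is sin(pi x).
    Returns the grid values U_0, ..., U_N.
    """
    h = 1.0 / N
    n = N - 1                                   # number of interior unknowns
    diag = [2.0 / h ** 2 + 1.0] * n
    off = -1.0 / h ** 2
    rhs = [(math.pi ** 2 + 1.0) * math.sin(math.pi * i * h) for i in range(1, N)]
    # Thomas algorithm: forward elimination ...
    for i in range(1, n):
        m = off / diag[i - 1]
        diag[i] -= m * off
        rhs[i] -= m * rhs[i - 1]
    # ... and back substitution (unknown U_i is stored at array index i - 1).
    U = [0.0] * (N + 1)
    U[n] = rhs[n - 1] / diag[n - 1]
    for i in range(n - 1, 0, -1):
        U[i] = (rhs[i - 1] - off * U[i + 1]) / diag[i - 1]
    return U

def discrete_h1_error(N):
    """Discrete H^1 norm of the global error e_i = u(x_i) - U_i."""
    h = 1.0 / N
    U = solve_1d(N)
    e = [math.sin(math.pi * i * h) - U[i] for i in range(N + 1)]
    grad = sum(((e[i] - e[i - 1]) / h) ** 2 for i in range(1, N + 1))
    vals = sum(e[i] ** 2 for i in range(1, N))
    return math.sqrt(h * grad + h * vals)

def observed_convergence_order(N):
    """Observed order from the errors on grids with N and 2N subintervals."""
    return math.log(discrete_h1_error(N) / discrete_h1_error(2 * N), 2)

if __name__ == "__main__":
    print(observed_convergence_order(16))   # expected to be close to 2
```

An observed order close to 2 matches the order of accuracy, as the combination of stability and consistency predicts.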

The five-point difference scheme can be generalized
in various ways. For example, instead of using the same
grid size $h$ in both coordinate directions, one could
have used a grid size $\Delta x = 1/M$ in the $x$-direction
and a possibly different grid size $\Delta y = 1/N$ in the
$y$-direction, where $M, N \geq 2$ are integers. One can
also consider boundary-value problems on more complicated
polygonal domains $\Omega$ in $\mathbb{R}^2$ such that each
edge of $\Omega$ is parallel with one of the coordinate axes,
for example, the L-shaped domain $(-1, 1)^2 \setminus [0, 1]^2$.
The construction above can be extended to domains
with curved boundaries in any number of dimensions;
at grid points that are on (or next to) the boundary,
divided differences with unequally spaced grid points
are then used.

In the case of nonlinear elliptic boundary-value
problems, such as the Monge-Ampère equation on a
bounded open set $\Omega \subset \mathbb{R}^d$, subject to the nonhomogeneous
Dirichlet boundary condition $u = g$ on $\partial\Omega$,
a finite-difference approximation is easily constructed
by replacing at each grid point $(x_i, y_j) \in \Omega_h$ the value
$u(x_i, y_j)$ of the analytical solution $u$ (and its partial
derivatives) in the PDE with the numerical solution $U_{ij}$
(and its divided differences) and by then imposing the
numerical boundary condition $U_{ij} = g(x_i, y_j)$ for all
$(x_i, y_j) \in \partial\Omega_h$. Unfortunately, such a simple-minded
method does not explicitly demand the convexity of $U$
in any sense, and this can lead to instabilities. In fact,
there is no reason why the sequence of finite-difference
solutions should converge to the (convex) analytical
solution of the Monge-Ampère equation as $h \to 0$. Even
in two space dimensions the resulting method may
have multiple solutions, and special iterative solvers
need to be used to select the convex solution. Enforcing
convexity of the finite-difference solution in higher
dimensions is much more difficult. A recent development
in this field has been the construction of so-called
wide-stencil finite-difference methods, which are monotone,
and the convergence theory of Barles and Souganidis
therefore ensures convergence of the sequence of
numerical solutions, as $h \to 0$, to the unique viscosity
solution of the Monge-Ampère equation.

We close this section on finite-difference methods
with a brief discussion about their application to
time-dependent problems. A key result is the Lax equivalence
theorem, which states that, for a finite-difference
method that is consistent with a well-posed initial-value
problem for a linear PDE, stability of the method
implies convergence of the sequence of grid functions
defined by the method on the grid to the analytical solution
as the grid size converges to zero, and vice versa.
Consider the unsteady heat equation $u_t - \Delta u + u = 0$
for $t \in (0, T]$, with $T > 0$ given, and $(x, y)$ in the unit
square $\Omega = (0, 1)^2$, subject to the homogeneous Dirichlet
boundary condition $u = 0$ on $(0, T] \times \partial\Omega$ and the initial
condition $u(0, x, y) = u_0(x, y)$, $(x, y) \in \Omega$, where
$u_0$ is a given real-valued continuous function.
The computational domain $[0, T] \times \bar\Omega$ is discretized
by the grid $\{t_m = m\Delta t : m = 0, \dots, M\} \times \bar\Omega_h$, where
$\Delta t = T/M$, $M \geq 1$, and $h = 1/N$, $N \geq 2$. We consider
the $\theta$-method

\[ \frac{U_{ij}^{m+1} - U_{ij}^{m}}{\Delta t}
   - \bigl(D_x^2 U_{ij}^{m+\theta} + D_y^2 U_{ij}^{m+\theta}\bigr) + U_{ij}^{m+\theta} = 0 \]

for all $i, j = 1, \dots, N-1$ and $m = 0, \dots, M-1$, supplemented
with the initial condition $U_{ij}^{0} = u_0(x_i, y_j)$,
$i, j = 0, \dots, N$, and the boundary condition $U_{ij}^{m+1} = 0$,
$m = 0, \dots, M-1$, for all $(i, j)$ such that $(x_i, y_j) \in \partial\Omega_h$.
Here, $\theta \in [0, 1]$ and

\[ U_{ij}^{m+\theta} := (1 - \theta)\, U_{ij}^{m} + \theta\, U_{ij}^{m+1}, \]

with $U_{ij}^{m}$ and $U_{ij}^{m+1}$ representing the approximations
to $u(t_m, x_i, y_j)$ and $u(t_{m+1}, x_i, y_j)$, respectively. The
values $\theta = 0, \tfrac{1}{2}, 1$ are particularly relevant; the corresponding
finite-difference methods are called the forward
(or explicit) Euler method, the Crank-Nicolson
method, and the backward (or implicit) Euler method,
respectively; their truncation errors are defined by

\[ T_{ij}^{m+1} := \frac{u(t_{m+1}, x_i, y_j) - u(t_m, x_i, y_j)}{\Delta t}
   - (1 - \theta)\bigl(D_x^2 u(t_m, x_i, y_j) + D_y^2 u(t_m, x_i, y_j)\bigr)
   - \theta\bigl(D_x^2 u(t_{m+1}, x_i, y_j) + D_y^2 u(t_{m+1}, x_i, y_j)\bigr)
   + (1 - \theta)\, u(t_m, x_i, y_j) + \theta\, u(t_{m+1}, x_i, y_j) \]

for $i, j = 1, \dots, N-1$, $m = 0, \dots, M-1$. Assuming
that $u$ is sufficiently smooth, Taylor series expansion
yields that $T_{ij}^{m+1} = O(\Delta t + h^2)$ for $\theta \neq \tfrac{1}{2}$ and
$T_{ij}^{m+1} = O((\Delta t)^2 + h^2)$ for $\theta = \tfrac{1}{2}$. Thus, in particular, the
forward and backward Euler methods are first-order accurate
with respect to the temporal variable $t$ and second-order
accurate with respect to the spatial variables $x$
and $y$, whereas the Crank-Nicolson method is second-order
accurate with respect to both the temporal variable
and the spatial variables. The stability properties
of the $\theta$-method are also influenced by the choice of
$\theta \in [0, 1]$; we have that


\[ \max_{1 \leq m \leq M} \|U^{m}\|_{0,h}^2
   + \Delta t \sum_{m=0}^{M-1} \|U^{m+\theta}\|_{1,h}^2
   \leq \|U^{0}\|_{0,h}^2 \]

for $\theta \in [0, \tfrac{1}{2})$, provided that $2d(1 - 2\theta)\Delta t \leq h^2$, with
$d = 2$ (space dimensions) in our case; and for $\theta \in [\tfrac{1}{2}, 1]$,
irrespective of the choice of $\Delta t$ and $h$. Thus, in particular,
the forward (explicit) Euler method is conditionally
stable, the condition being that $2d\,\Delta t \leq h^2$, with $d = 2$
here, while the Crank-Nicolson and backward (implicit)
Euler methods are unconditionally stable.
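
The difference between conditional and unconditional stability is easy to see in a small experiment. The sketch below is a one-dimensional analog (an assumption, so that the implicit step reduces to a tridiagonal solve): it applies the theta-method to u_t - u_xx + u = 0 on (0, 1) with zero boundary values, where the stability condition for theta < 1/2 reads 2(1 - 2 theta) dt <= h^2. The initial data (equal to 1 at interior nodes), the Thomas solver, and the function names are illustrative choices.

```python
import math

def thomas(off, diag, rhs):
    """Solve a symmetric tridiagonal system with constant off-diagonal entry `off`."""
    n = len(rhs)
    d, r = list(diag), list(rhs)
    for i in range(1, n):
        m = off / d[i - 1]
        d[i] -= m * off
        r[i] -= m * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (r[i] - off * x[i + 1]) / d[i]
    return x

def theta_method(theta, N, dt, steps):
    """March the 1D theta-method; returns the grid function after `steps` time steps."""
    h = 1.0 / N
    U = [0.0] + [1.0] * (N - 1) + [0.0]          # rough initial data excites all modes
    for _ in range(steps):
        # Explicit part: (I - (1 - theta) dt A) U^m, where A U = -D_x^2 U + U.
        rhs = []
        for i in range(1, N):
            AU = -(U[i - 1] - 2.0 * U[i] + U[i + 1]) / h ** 2 + U[i]
            rhs.append(U[i] - (1.0 - theta) * dt * AU)
        # Implicit part: (I + theta dt A) U^{m+1} = rhs.
        diag = [1.0 + theta * dt * (2.0 / h ** 2 + 1.0)] * (N - 1)
        off = -theta * dt / h ** 2
        U = [0.0] + thomas(off, diag, rhs) + [0.0]
    return U

def l2_norm(U, N):
    """Discrete L^2 norm (h * sum of squares)^(1/2)."""
    return math.sqrt(sum(v * v for v in U) / N)

if __name__ == "__main__":
    N = 20
    h2 = (1.0 / N) ** 2
    print("forward Euler, dt = 0.4 h^2:", l2_norm(theta_method(0.0, N, 0.4 * h2, 200), N))
    print("forward Euler, dt = h^2:    ", l2_norm(theta_method(0.0, N, h2, 200), N))
    print("Crank-Nicolson, dt = h^2:   ", l2_norm(theta_method(0.5, N, h2, 200), N))
    print("backward Euler, dt = h^2:   ", l2_norm(theta_method(1.0, N, h2, 200), N))
```

With the time step above the threshold the forward Euler iterates grow without bound, while the Crank-Nicolson and backward Euler iterates remain bounded by the initial data for the same time step.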

A finite-difference method approximates the analyt- 
ical solution using a grid function that is defined over 
a finite-difference grid contained in the computational 
domain. Next we will consider finite-element methods, 
which involve piecewise polynomial approximations of 
the analytical solution, defined over the computational 
domain. 


4 Finite-Element Methods 

Finite-element methods (FEMs) are a powerful and gen- 
eral class of techniques for the numerical solution of 
PDEs. Their historical roots can be traced back to a 
paper by Richard Courant, published in 1943, that pro- 
posed the use of continuous piecewise affine approxi- 
mations for the numerical solution of variational prob- 
lems. This represented a significant advance from a 
practical point of view over earlier techniques by Ritz
and Galerkin from the early 1900s, which were based 
on the use of linear combinations of smooth functions 
(e.g., eigenfunctions of the differential operator under 
consideration). The importance of Courant’s contribu- 
tion was, unfortunately, not recognized at the time and 
the idea was forgotten until the early 1950s, when it 
was rediscovered by engineers. FEMs have since been 
developed into an effective and flexible computational 
tool with a firm mathematical foundation. 

4.1 FEMs for Elliptic PDEs

Suppose that $\Omega \subset \mathbb{R}^d$ is a bounded open set in $\mathbb{R}^d$ with
a Lipschitz-continuous boundary $\partial\Omega$. We will denote by
$L^2(\Omega)$ the space of square-integrable functions (in the
sense of Lebesgue) equipped with the norm
$\|v\|_0 := \bigl(\int_\Omega |v|^2 \, dx\bigr)^{1/2}$. Let $H^m(\Omega)$ denote the Sobolev space
consisting of all functions $v \in L^2(\Omega)$ whose (weak) partial
derivatives $\partial^\alpha v$ belong to $L^2(\Omega)$ for all $\alpha$ such that
$|\alpha| \leq m$. $H^m(\Omega)$ is equipped with the norm
$\|v\|_m := \bigl(\sum_{|\alpha| \leq m} \|\partial^\alpha v\|_0^2\bigr)^{1/2}$. We denote by $H^1_0(\Omega)$ the set of all
functions $v \in H^1(\Omega)$ that vanish on $\partial\Omega$.

Let $a$ and $c$ be real-valued functions, defined and continuous
on $\bar\Omega$, and suppose that there exists a positive
constant $c_0$ such that $a(x) \geq c_0$ for all $x \in \bar\Omega$. Assume
further that $b_i$, $i = 1, \dots, d$, are continuously differentiable
real-valued functions defined on $\bar\Omega$ such that
$c - \tfrac{1}{2} \nabla \cdot b \geq c_0$ on $\bar\Omega$, where $b := (b_1, \dots, b_d)$, and let
$f \in L^2(\Omega)$. Consider the boundary-value problem

\[ -\nabla \cdot (a(x) \nabla u) + b(x) \cdot \nabla u + c(x) u = f(x) \]

for $x \in \Omega$, with $u|_{\partial\Omega} = 0$. The construction of the
finite-element approximation of this boundary-value
problem commences by considering the following weak
formulation of the problem: find $u \in H^1_0(\Omega)$ such that

\[ B(u, v) = \ell(v) \quad \forall v \in H^1_0(\Omega), \tag{11} \]

where the bilinear form $B(\cdot, \cdot)$ is defined by

\[ B(w, v) := \int_\Omega \bigl[ a(x) \nabla w \cdot \nabla v + b(x) \cdot \nabla w \, v + c(x) w v \bigr] \, dx \]

and $\ell(v) := \int_\Omega f v \, dx$, with $w, v \in H^1_0(\Omega)$. If $u$ is
sufficiently smooth (for example, if $u \in H^2(\Omega) \cap H^1_0(\Omega)$),
then integration by parts in (11) implies that $u$ is a
strong solution of the boundary-value problem, i.e.,
$-\nabla \cdot (a(x) \nabla u) + b(x) \cdot \nabla u + c(x) u = f(x)$ almost
everywhere in $\Omega$, and $u|_{\partial\Omega} = 0$. More generally, in
the absence of such an additional assumption about
smoothness, the function $u \in H^1_0(\Omega)$ satisfying (11) is
called a weak solution of this elliptic boundary-value
problem. Under our assumptions on $a$, $b$, $c$, and $f$, the
existence of a unique weak solution follows from the
Lax-Milgram theorem.

Figure 1 Finite-element triangulation of the computational
domain $\Omega$, a polygonal region of $\mathbb{R}^2$. Vertices on $\partial\Omega$ are
denoted by solid dots, and vertices internal to $\Omega$ by circled
solid dots.

We will consider the finite-element approximation of
(11) in the special case when $\Omega$ is a bounded open
polygonal domain in $\mathbb{R}^2$. The first step in the construction
of the FEM is to define a triangulation of $\bar\Omega$. A triangulation
of $\bar\Omega$ is a tessellation of $\bar\Omega$ into a finite number
of closed triangles $T_i$, $i = 1, \dots, M$, whose interiors are
pairwise disjoint, and for each $i, j \in \{1, \dots, M\}$, $i \neq j$,
for which $T_i \cap T_j$ is nonempty, $T_i \cap T_j$ is either a common
vertex or a common edge of $T_i$ and $T_j$ (see figure 1).
The vertices in the triangulation are also referred to as
nodes.

Let $h_T$ denote the longest edge of a triangle $T$ in the
triangulation, and let $h$ be the largest among the $h_T$.
Furthermore, let $S_h$ denote the linear space of all real-valued
continuous functions $v_h$ defined on $\bar\Omega$ such that
the restriction of $v_h$ to any triangle in the triangulation
is an affine function, and define $S_{h,0} := S_h \cap H^1_0(\Omega)$. The
finite-element approximation of the problem (11) is as
follows: find $u_h$ in the finite-element space $S_{h,0}$ such
that

\[ B(u_h, v_h) = \ell(v_h) \quad \forall v_h \in S_{h,0}. \tag{12} \]

Let us denote by $x_i$, $i = 1, \dots, L$, the set of all vertices
(nodes) in the triangulation (see figure 1), and let $N =
N(h)$ denote the dimension of the finite-element space
$S_{h,0}$. We will assume that the vertices $x_i$, $i = 1, \dots, L$,
are numbered so that $x_i$, $i = 1, \dots, N$, are within $\Omega$ and
the remaining $L - N$ vertices are on $\partial\Omega$. Furthermore,
let $\{\varphi_j : j = 1, \dots, N\} \subset S_{h,0}$ denote the so-called nodal
basis for $S_{h,0}$, where the basis functions are defined
by $\varphi_j(x_i) = \delta_{ij}$, $i = 1, \dots, L$, $j = 1, \dots, N$. A typical
piecewise-linear nodal basis function is shown in figure 2.
Thus, there exists a vector $U = (U_1, \dots, U_N)^{\mathrm{T}} \in \mathbb{R}^N$
such that

\[ u_h(x) = \sum_{j=1}^{N} U_j \varphi_j(x). \tag{13} \]

Figure 2 A typical piecewise-linear nodal basis function.
The basis function is identically zero outside the patch of
triangles surrounding the central node, at which the height
of the function is equal to 1.

Substitution of this expansion into (12) and taking $v_h =
\varphi_k$, $k = 1, \dots, N$, yields the following system of $N$ linear
algebraic equations in the $N$ unknowns, $U_1, \dots, U_N$:

\[ \sum_{j=1}^{N} B(\varphi_j, \varphi_k)\, U_j = \ell(\varphi_k), \quad k = 1, \dots, N. \tag{14} \]

By recalling the definition of $B(\cdot, \cdot)$, we see that the
matrix $A := \bigl([B(\varphi_j, \varphi_k)]_{j,k=1}^{N}\bigr)^{\mathrm{T}}$ of this system of linear
equations is sparse and positive-definite (and, if $b$ is
identically zero, then the matrix is symmetric as well).
The unique solution $U = (U_1, \dots, U_N)^{\mathrm{T}} \in \mathbb{R}^N$ of the linear
system can be computed using numerical algorithms for sparse
linear systems [IV.10 §6], including Krylov subspace type methods
and multigrid methods; upon substitution into (13), it yields the
computed approximation $u_h$ to the analytical solution $u$ on
the given triangulation of the computational domain $\Omega$.
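
The assembly of the linear system from the nodal basis can be sketched in a one-dimensional analog, which is an assumption made to keep the example short: the problem -u'' + u = f on (0, 1) with zero boundary values (so a = c = 1 and b = 0), piecewise-linear hat functions on a uniform mesh, Simpson's rule on each element for the load integrals, and a direct tridiagonal solve in place of an iterative solver. The function name and the manufactured right-hand side are hypothetical.

```python
import math

def fem_1d(f, N):
    """P1 finite elements for -u'' + u = f on (0,1), u(0) = u(1) = 0, mesh size h = 1/N.

    Returns the nodal values U_0, ..., U_N of the finite-element solution.
    """
    h = 1.0 / N
    n = N - 1                                    # number of interior nodes
    # Entries B(phi_j, phi_k) for hat functions: stiffness (2/h, -1/h) plus mass (2h/3, h/6).
    diag = [2.0 / h + 2.0 * h / 3.0] * n
    off = -1.0 / h + h / 6.0
    # Load l(phi_k): Simpson's rule on the two elements that meet at node x_k = k h.
    load = [h / 3.0 * (f(k * h - h / 2.0) + f(k * h) + f(k * h + h / 2.0))
            for k in range(1, N)]
    # Tridiagonal (Thomas) solve: forward elimination, then back substitution.
    for k in range(1, n):
        m = off / diag[k - 1]
        diag[k] -= m * off
        load[k] -= m * load[k - 1]
    U = [0.0] * (N + 1)
    U[n] = load[n - 1] / diag[n - 1]
    for k in range(n - 1, 0, -1):
        U[k] = (load[k - 1] - off * U[k + 1]) / diag[k - 1]
    return U

if __name__ == "__main__":
    # Manufactured problem (an assumption): exact solution u = sin(pi x).
    N = 16
    U = fem_1d(lambda x: (math.pi ** 2 + 1.0) * math.sin(math.pi * x), N)
    err = max(abs(U[k] - math.sin(math.pi * k / N)) for k in range(N + 1))
    print("max nodal error:", err)
```

The matrix is tridiagonal here because each hat function overlaps only its two neighbours; on a triangulation in two dimensions the same locality of the basis functions is what makes the matrix sparse.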

As $S_{h,0}$ is a (finite-dimensional) linear subspace of
$H^1_0(\Omega)$, $v = v_h$ is a legitimate choice in (11). By
subtracting (12) from (11), with $v = v_h$, we deduce that

\[ B(u - u_h, v_h) = 0 \quad \forall v_h \in S_{h,0}, \tag{15} \]

which is referred to as the Galerkin orthogonality property
of the FEM. Hence, for any $v_h \in S_{h,0}$,

\[ c_0 \|u - u_h\|_1^2 \leq B(u - u_h, u - u_h)
   = B(u - u_h, u - v_h)
   \leq c_1 \|u - u_h\|_1 \|u - v_h\|_1, \]

where

\[ c_1 := (M_a^2 + M_b^2 + M_c^2)^{1/2}, \]

with $M_v := \max_{x \in \bar\Omega} |v(x)|$, $v \in \{a, b, c\}$. We therefore
have that

\[ \|u - u_h\|_1 \leq \frac{c_1}{c_0} \min_{v_h \in S_{h,0}} \|u - v_h\|_1. \tag{16} \]

This result is known as Céa's lemma, and it is an important
tool in the analysis of FEMs. Suppose, for example,
that $u \in H^2(\Omega) \cap H^1_0(\Omega)$, and denote by $I_h u$ the
finite-element interpolant of $u$ defined by

\[ I_h u(x) := \sum_{j=1}^{N} u(x_j)\, \varphi_j(x). \]

It follows from (16) that $\|u - u_h\|_1 \leq (c_1/c_0) \|u - I_h u\|_1$.
Assuming further that the triangulation is shape regular
in the sense that there exists a positive constant $c_*$,
independent of $h$, such that for each triangle in the
triangulation the ratio of the longest edge to the radius
of the inscribed circle is bounded above by $c_*$, arguments
from approximation theory imply the existence
of a positive constant $\hat{c}$, independent of $h$, such that
$\|u - I_h u\|_1 \leq \hat{c}\, h \|u\|_2$. Hence, the following a priori
error bound holds in the $H^1$-norm:

\[ \|u - u_h\|_1 \leq (c_1/c_0)\, \hat{c}\, h \|u\|_2. \]

We deduce from this inequality that, as the triangulation
is refined by letting $h \to 0$, the sequence of
finite-element approximations $u_h$ computed on successively
refined triangulations converges to the analytical
solution $u$ in the $H^1$-norm. It is also possible to
derive a priori error bounds in other norms, such as
the $L^2$-norm.
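
The first-order decay of the interpolation error in the H^1 norm, which drives the a priori bound, can be observed numerically. The sketch below is again a one-dimensional illustration (an assumption): it measures the H^1 norm of u minus its piecewise-linear interpolant on a uniform mesh of (0, 1), using Simpson's rule on each element to approximate the integrals, and compares meshes with N and 2N elements. The function names and the test function sin(pi x) are hypothetical choices.

```python
import math

def interpolation_error_h1(u, du, N):
    """Approximate H^1 norm of u - I_h u for the piecewise-linear interpolant, h = 1/N."""
    h = 1.0 / N
    total = 0.0
    for k in range(N):
        a, b = k * h, (k + 1) * h
        mid = 0.5 * (a + b)
        slope = (u(b) - u(a)) / h                # derivative of the interpolant on [a, b]

        def integrand(x, a=a, slope=slope):
            value_err = u(x) - (u(a) + slope * (x - a))
            grad_err = du(x) - slope
            return value_err ** 2 + grad_err ** 2

        # Simpson's rule on the element [a, b].
        total += h / 6.0 * (integrand(a) + 4.0 * integrand(mid) + integrand(b))
    return math.sqrt(total)

def observed_rate(u, du, N):
    """Observed exponent of h from the errors on meshes with N and 2N elements."""
    coarse = interpolation_error_h1(u, du, N)
    fine = interpolation_error_h1(u, du, 2 * N)
    return math.log(coarse / fine, 2)

if __name__ == "__main__":
    u = lambda x: math.sin(math.pi * x)
    du = lambda x: math.pi * math.cos(math.pi * x)
    print(observed_rate(u, du, 16))   # expected to be close to 1
```

Halving the mesh size roughly halves the error, consistent with a bound proportional to h; the gradient part of the norm dominates, since the error in function values alone decays like h^2.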

The inequality (16) of Céa's lemma can be seen to
express the fact that the approximation $u_h \in S_{h,0}$ to
the solution $u \in H^1_0(\Omega)$ of (11) delivered by the FEM
(12) is a near-best approximation to $u$ from the linear
subspace $S_{h,0}$ of $H^1_0(\Omega)$. Clearly, $c_1/c_0 \geq 1$. When the
constant $c_1/c_0 \gg 1$, the numerical solution $u_h$ supplied
by the FEM is typically a poor approximation to
$u$ in the $\|\cdot\|_1$-norm, unless $h$ is very small; for example,
if $a(x) \equiv c(x) \equiv \varepsilon$ and $b(x) \equiv (1, 1)^{\mathrm{T}}$, then
$c_1/c_0 = \sqrt{2}\,(1 + \varepsilon^2)^{1/2}/\varepsilon \gg 1$ if $0 < \varepsilon \ll 1$. Such non-self-adjoint
elliptic boundary-value problems arise in mathematical
models of diffusion-advection-reaction, where advection
dominates diffusion and reaction in the sense that
$|b(x)| \gg a(x) > 0$ and $|b(x)| \gg c(x) > 0$ for all
$x \in \bar\Omega$. The stability and approximation properties of
the classical FEM (12) for such advection-dominated
problems can be improved by modifying, in a consistent
manner, the definitions of $B(\cdot, \cdot)$ and $\ell(\cdot)$ through
the addition of “stabilization terms” or by enriching the
finite-element space with special basis functions that
are designed so as to capture sharp boundary and interior
layers exhibited by typical solutions of advection-dominated
problems. The resulting FEMs are generally
referred to as stabilized finite-element methods. A typical
example is the streamline-diffusion finite-element
method, in which the bilinear form of the standard FEM
is supplemented with an additional numerical diffusion
term, which acts in the streamwise direction only, i.e.,
in the direction of the vector $b$, in which classical FEMs
tend to exhibit undesirable numerical oscillations.

If, on the other hand, $b$ is identically zero on $\bar\Omega$,
then $B(\cdot, \cdot)$ is a symmetric bilinear form, in the sense
that $B(w, v) = B(v, w)$ for all $w, v \in H^1_0(\Omega)$. The
norm $\|\cdot\|_B$ defined by $\|v\|_B := [B(v, v)]^{1/2}$ is called
the energy norm on $H^1_0(\Omega)$ associated with the elliptic
boundary-value problem (11). In fact, (11) can then
be restated as the following (equivalent) variational
problem: find $u \in H^1_0(\Omega)$ such that

\[ J(u) \leq J(v) \quad \forall v \in H^1_0(\Omega), \]

where

\[ J(v) := \tfrac{1}{2} B(v, v) - \ell(v). \]

Analogously, the FEM (12) can then be restated equivalently
as follows: find $u_h \in S_{h,0}$ such that $J(u_h) \leq
J(v_h)$ for all $v_h \in S_{h,0}$. Furthermore, Céa's lemma, in
terms of the energy norm, $\|\cdot\|_B$, becomes
$\|u - u_h\|_B = \min_{v_h \in S_{h,0}} \|u - v_h\|_B$. Thus, in the case when the function
$b$ is identically zero the numerical solution $u_h \in
S_{h,0}$ delivered by the FEM is the best approximation to
the analytical solution $u \in H^1_0(\Omega)$ in the energy norm
$\|\cdot\|_B$.

We illustrate the extension of these ideas to nonlinear
elliptic PDEs through a simple model problem. For a
real number $p \in (1,\infty)$, let
$L^p(\Omega) := \{v : \int_\Omega |v|^p \, dx < \infty\}$ and
$W^{1,p}(\Omega) := \{v \in L^p(\Omega) : |\nabla v| \in L^p(\Omega)\}$.
Furthermore, let $W^{1,p}_0(\Omega)$ denote the set of all
$v \in W^{1,p}(\Omega)$ such that $v|_{\partial\Omega} = 0$. For $f \in L^q(\Omega)$, where
$1/p + 1/q = 1$, $p \in (1,\infty)$, consider the problem of
finding the minimizer $u \in W^{1,p}_0(\Omega)$ of the functional

$$J(v) := \frac{1}{p} \int_\Omega |\nabla v|^p \, dx - \int_\Omega f v \, dx, \quad v \in W^{1,p}_0(\Omega).$$

With $S_{h,0}$ as above, the finite-element approximation of
the problem then consists of finding $u_h \in S_{h,0}$ that
minimizes $J(v_h)$ over all $v_h \in S_{h,0}$. The existence and
uniqueness of the minimizers $u \in W^{1,p}_0(\Omega)$ and
$u_h \in S_{h,0}$ in the respective problems is a direct consequence
of the convexity of the functional $J$. Moreover, as $h \to 0$,
$u_h$ converges to $u$ in the norm of the Sobolev space
$W^{1,p}(\Omega)$.

Problems in electromagnetism and continuum me- 
chanics are typically modeled by systems of PDEs in- 
volving several dependent variables, which may need to 
be approximated from different finite-element spaces 
because of the disparate physical nature of the variables
and the different boundary conditions that they
may be required to satisfy. The resulting finite-element
methods are called mixed finite-element methods. In 
order for a mixed FEM to possess a unique solution and 
for the method to be stable, the finite-element spaces 
from which the approximations to the various compo- 
nents of the vector of unknowns are sought cannot be 
chosen arbitrarily; they need to satisfy a certain com- 
patibility condition, usually referred to as the inf-sup 
condition. 

FEMs of the kind described in this section, where the
finite-element space containing the approximate solution
is a subset of the function space in which the
weak solution to the problem is sought, are called
conforming finite-element methods. Otherwise, the FEM is
called nonconforming. Nonconforming FEMs are nec- 
essary in some application areas because in certain 
problems (such as div-curl problems from electromag- 
netism and variational problems exhibiting a Lavren- 
tiev phenomenon, for example), conforming FEMs may 
converge to spurious solutions. Discontinuous Galerkin 
finite-element methods (DGFEMs) are extreme instances 
of nonconforming FEMs, in the sense that pointwise 
interelement continuity requirements in the piecewise 
polynomial approximation are completely abandoned, 
and the analytical solution is approximated by dis- 
continuous piecewise polynomial functions. FEMs have 
several advantages over finite-difference methods: the 
concept of higher-order discretization is inherent to 
FEMs; it is, in addition, particularly convenient from the 
point of view of adaptivity that FEMs can easily accom- 
modate very general tessellations of the computational 
domain, with local polynomial degrees in the approxi- 
mation that may vary from element to element. Indeed, 
the notion of adaptivity is a powerful and important 
idea in the field of numerical approximation of PDEs, 
and it is this that we will further elaborate on in the 
context of finite-element methods below. 

4.2 A Posteriori Error Analysis and Adaptivity 

Provided that the analytical solution is sufficiently 
smooth, a priori error bounds guarantee that, as the 
grid size h tends to 0, the corresponding sequence of 
numerical approximations converges to the exact solu- 
tion of the boundary-value problem. In practice, one 
may unfortunately only be able to afford to compute 
on a small number of grids/triangulations, the mini- 
mum grid size attainable being limited by the compu- 
tational resources available. A further practical consid- 
eration is that the regularity of the analytical solution 
may exhibit large variations over the computational
domain, with singularities localized at particular points 
(e.g., at corners and edges of the domain) or low- 
dimensional manifolds in the interior of the domain 
(e.g., shocks and contact discontinuities in nonlinear 
conservation laws, or steep internal layers in advection- 
dominated diffusion equations). The error between the 
unknown analytical solution and numerical solutions 
computed on locally refined grids, which are best suited 
for such problems, cannot be accurately quantified by 
typical a priori error bounds and asymptotic conver- 
gence results that presuppose uniform refinement of 
the computational grid as the grid size tends to 0. 
The alternative is to perform a computation on a cho- 
sen computational grid/triangulation and use the com- 
puted approximation to the exact solution to quan- 
tify the approximation error a posteriori, and also to 
identify parts of the computational domain in which 
the grid size was inadequately chosen, necessitating 
local, so-called adaptive, refinement or coarsening of 
the computational grid/triangulation (h-adaptivity). In
FEMs it is also possible to locally vary the degree of
the piecewise polynomial function in the finite-element
space (p-adaptivity). Finally, one may also make adjustments
to the computational grid/triangulation by
moving/relocating the grid points (r-adaptivity). The
adaptive loop for an h-adaptive FEM has the following
form:

SOLVE → ESTIMATE → MARK → REFINE.

Thus, a finite-element approximation is first computed 
on a certain fixed, typically coarse, triangulation of the 
computational domain. Then, in the second step, an 
a posteriori error bound is used to estimate the error 
in the computed solution; a typical a posteriori error 
bound for an elliptic boundary-value problem $Lu = f$
(where $L$ is a second-order uniformly elliptic operator
and $f$ is a given right-hand side) is of the form
$\|u - u_h\|_1 \le C_* \|R(u_h)\|_*$, where $C_*$ is a (computable)
constant; $\|\cdot\|_*$ is a certain norm, depending on the
problem; and $R(u_h) = f - Lu_h$ is the (computable)
residual, which measures the extent to which the computed
numerical solution $u_h$ fails to satisfy the PDE
$Lu = f$. In the third step, on the basis of the a posteriori
error bound, selected triangles in the triangulation
are marked as having an inadequate size (i.e., too
large or too small, relative to a fixed local tolerance, 
which is usually chosen as a suitable fraction of the 
prescribed overall tolerance, TOL). Finally, in the fourth 
step, the marked triangles are refined or coarsened,
as the case may be. This four-step adaptive loop is
repeated either until a certain termination criterion is
reached (e.g., $C_* \|R(u_h)\|_* < \mathrm{TOL}$) or until the
computational resources are exhausted. A similar adaptive
loop can be used in p-adaptive FEMs, except that the
step REFINE is then interpreted as adjustment (i.e., an
increase or decrease) of the local polynomial degree,
which may then vary from triangle to triangle instead
of being a fixed integer over the entire triangulation. It
is also possible to combine different adaptive strategies;
for example, simultaneous h- and p-adaptivity
is referred to as hp-adaptivity. Thanks to the simple
communication at the boundaries of adjacent elements
in the subdivision of the computational domain,
hp-adaptivity is particularly easy to incorporate into
DGFEMs (see figure 3).
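
The SOLVE, ESTIMATE, MARK, REFINE loop can be sketched in one space dimension. The following is only a minimal illustration under stated assumptions, not a method taken from the text: the model problem $-u'' = f$ on $(0,1)$ with $u(0) = u(1) = 0$, piecewise-linear elements, the element indicator $\eta_K = h_K \|f\|_{L^2(K)}$ (for piecewise-linear $u_h$ the interior residual $f + u_h''$ reduces to $f$), a maximum-based marking rule with a hypothetical parameter `theta`, and refinement by bisection without coarsening. All function names are illustrative.

```python
import numpy as np

def solve(nodes, f):
    """SOLVE: P1 finite elements for -u'' = f on (0, 1), u(0) = u(1) = 0."""
    n = len(nodes)
    h = np.diff(nodes)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for k in range(n - 1):
        # Element stiffness matrix and midpoint-rule load vector.
        A[k:k + 2, k:k + 2] += np.array([[1.0, -1.0], [-1.0, 1.0]]) / h[k]
        b[k:k + 2] += 0.5 * h[k] * f(0.5 * (nodes[k] + nodes[k + 1]))
    u = np.zeros(n)
    u[1:-1] = np.linalg.solve(A[1:-1, 1:-1], b[1:-1])  # boundary values stay 0
    return u

def estimate(nodes, f):
    """ESTIMATE: indicators eta_K = h_K * ||f||_{L2(K)} (midpoint rule).

    Assumption: u_h is piecewise linear, so the residual on each element is f.
    """
    h = np.diff(nodes)
    mid = 0.5 * (nodes[:-1] + nodes[1:])
    return h * np.sqrt(h) * np.abs(f(mid))

def mark(eta, theta=0.5):
    """MARK: flag elements whose indicator exceeds theta * max indicator."""
    return eta > theta * eta.max()

def refine(nodes, marked):
    """REFINE: bisect every marked element (no coarsening in this sketch)."""
    mids = 0.5 * (nodes[:-1] + nodes[1:])[marked]
    return np.sort(np.concatenate([nodes, mids]))

def adapt(f, tol, nodes, max_iter=30):
    """Repeat the loop until the estimate drops below tol or max_iter is hit."""
    for it in range(max_iter):
        u = solve(nodes, f)
        eta = estimate(nodes, f)
        total = float(np.sqrt(np.sum(eta**2)))
        if total < tol or it == max_iter - 1:
            return nodes, u, total
        nodes = refine(nodes, mark(eta))

# Right-hand side with an integrable singularity at x = 0 (illustrative choice).
f = lambda x: x ** (-0.4)
nodes, u, total = adapt(f, tol=0.05, nodes=np.linspace(0.0, 1.0, 5))
```

The termination test plays the role of the criterion $C_* \|R(u_h)\|_* < \mathrm{TOL}$ above, with the constant absorbed into the indicator; a production code would use a proven estimator and a more careful marking strategy.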


5 Finite-Volume Methods 


Finite-volume methods have been developed for the
numerical solution of PDEs in divergence form, such
as conservation laws [II.6] that arise in continuum
mechanics. Consider, for example, the following system
of nonlinear PDEs:

$$\frac{\partial u}{\partial t} + \nabla \cdot f(u) = 0, \tag{17}$$

where $u := (u_1, \dots, u_n)^T$ is an $n$-component vector
function of the variables $t \ge 0$ and $x := (x_1, \dots, x_d)$;
the vector function $f(u) := (f_1(u), \dots, f_d(u))^T$ is the
corresponding flux function. The PDE (17) is supplemented
with the initial condition $u(0,x) = u_0(x)$,
$x \in \mathbb{R}^d$. Suppose that $\mathbb{R}^d$ has been tessellated into
disjoint closed simplexes $\kappa$ (intervals if $d = 1$, triangles
if $d = 2$, and tetrahedrons if $d = 3$), whose union
is the whole of $\mathbb{R}^d$ and such that each pair of distinct
simplexes from the tessellation is either disjoint
or has only closed simplexes of dimension less than
or equal to $d - 1$ in common. In the theory of
finite-volume methods the simplexes $\kappa$ are usually referred
to as cells (rather than elements). For each particular
cell $\kappa$ in the tessellation of $\mathbb{R}^d$ the PDE (17) is integrated
over $\kappa$, which gives

$$\int_\kappa \frac{\partial u}{\partial t} \, dx + \int_\kappa \nabla \cdot f(u) \, dx = 0. \tag{18}$$

By defining the volume average

$$\bar{u}_\kappa(t) := \frac{1}{|\kappa|} \int_\kappa u(t,x) \, dx, \quad t \ge 0,$$

where $|\kappa|$ is the measure of $\kappa$, and applying the divergence
theorem, we deduce that

$$\frac{d\bar{u}_\kappa}{dt} + \frac{1}{|\kappa|} \oint_{\partial\kappa} f(u) \cdot \nu \, dS = 0,$$

where $\partial\kappa$ is the boundary of $\kappa$ and $\nu$ is the unit outward
normal vector to $\partial\kappa$.

Figure 3 An hp-adaptive finite-element grid: (a) in a discontinuous
Galerkin finite-element approximation of the compressible
Euler equations of gas dynamics with local polynomial
degrees ranging from 1 to 7; and (b) the approximate
density on the grid. (Figure courtesy of Paul Houston.)

In the present construction the constant volume average
is assigned to the barycenter of a cell, and the
resulting finite-volume method is therefore referred to
as a cell-center finite-volume method. In the theory of
finite-volume methods the local region $\kappa$ over which the
PDE is integrated is called a control volume. In the case
of cell-center finite-volume methods, the control volumes
therefore coincide with the cells in the tessellation.
An alternative choice, resulting in vertex-centered
finite-volume methods, is that for each vertex in the
computational grid one considers the patch of cells
surrounding the vertex and assigns to the vertex a control
volume contained in the patch of elements (e.g.,
in the case of $d = 2$, the polygonal domain defined
by connecting the barycenters of cells that surround
a vertex).

Thus far, no approximation has taken place. In order
to construct a practical numerical method, the integral
over $\partial\kappa$ is rewritten as a sum of integrals over all
$(d-1)$-dimensional open faces contained in $\partial\kappa$, and
the integral over each face is approximated by replacing
the normal flux $f(u) \cdot \nu$ over the face, appearing
as the integrand, by interpolation or extrapolation of
control volume averages. This procedure can be seen
as a replacement of the exact normal flux over a face of
a control volume with a numerical flux function. Thus,
for example, denoting by $e_{\kappa\lambda}$ the $(d-1)$-dimensional
face of the control volume $\kappa$ that is shared with a
neighboring control volume $\lambda$, we have that

$$\oint_{\partial\kappa} f(u) \cdot \nu \, dS \approx \sum_{\lambda \,:\, e_{\kappa\lambda} \subset \partial\kappa} g_{\kappa\lambda}(\bar{u}_\kappa, \bar{u}_\lambda),$$

where the numerical flux function $g_{\kappa\lambda}$ is required to
possess the following two crucial properties.

Conservation ensures that fluxes from adjacent control
volumes that share a mutual interface exactly
cancel when summed. This is achieved by demanding
that the numerical flux satisfies the identity

$$g_{\kappa\lambda}(u,v) = -g_{\lambda\kappa}(v,u)$$

for each pair of neighboring control volumes $\kappa$ and
$\lambda$.

Consistency ensures that, for each face of each control
volume, the numerical flux with identical state
arguments reduces to the true total flux of that same
state passing through the face, i.e.,

$$g_{\kappa\lambda}(u,u) = \int_{e_{\kappa\lambda}} f(u) \cdot \nu \, dS$$

for each pair of neighboring control volumes $\kappa$ and
$\lambda$ with common face $e_{\kappa\lambda} := \kappa \cap \lambda$.

The resulting spatial discretization of the nonlin- 
ear conservation law is then further discretized with 
respect to the temporal variable t by time stepping, in 
steps of $\Delta t$, starting from the given initial datum $u_0$, the
simplest choice being to use the explicit Euler method. 
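
As a rough illustration of the construction just described, the sketch below applies a cell-center finite-volume method with explicit Euler time stepping to a scalar problem in one space dimension. Everything specific here is an assumption made for the example rather than something prescribed by the text: the Burgers flux $f(u) = u^2/2$, periodic boundary conditions on $(0,1)$, a uniform grid, and the local Lax-Friedrichs (Rusanov) numerical flux. In one dimension the faces are points, so consistency reduces to $g(u,u) = f(u)$, and conservation holds because a single flux value is shared by the two cells adjacent to each face.

```python
import numpy as np

def flux(u):
    """Physical flux f(u) = u^2 / 2 (Burgers equation, assumed for illustration)."""
    return 0.5 * u**2

def numerical_flux(ul, ur):
    """Local Lax-Friedrichs flux through a face with left/right states ul, ur."""
    a = np.maximum(np.abs(ul), np.abs(ur))  # local bound on the wave speed |f'(u)|
    return 0.5 * (flux(ul) + flux(ur)) - 0.5 * a * (ur - ul)

def step(u, dx, dt):
    """One explicit Euler step for the cell averages on a periodic grid."""
    g_right = numerical_flux(u, np.roll(u, -1))  # flux through right face of cell i
    g_left = np.roll(g_right, 1)                 # same value, seen from cell i + 1
    return u - dt / dx * (g_right - g_left)

def evolve(u0, dx, t_end, cfl=0.4):
    """Advance the cell averages to time t_end with a fixed time step.

    The step is chosen from the initial data, assuming max |u| does not grow.
    """
    dt_max = cfl * dx / np.abs(u0).max()
    n_steps = int(np.ceil(t_end / dt_max))
    dt = t_end / n_steps
    u = u0.copy()
    for _ in range(n_steps):
        u = step(u, dx, dt)
    return u

n_cells = 200
dx = 1.0 / n_cells
x = (np.arange(n_cells) + 0.5) * dx            # cell barycenters
u0 = 1.0 + 0.5 * np.sin(2.0 * np.pi * x)       # initial cell averages (approximate)
u = evolve(u0, dx, t_end=0.3)
```

Because the face fluxes cancel in pairs, the sum of the cell averages is preserved up to rounding error, which is the discrete counterpart of the conservation property.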

The historical roots of this construction date back 
to the work of Sergei Godunov in 1959 on the gas 
dynamics equations; Godunov used piecewise-constant 
solution representations in each control volume with 
value equal to the average over the control volume 
and calculated a single numerical flux from the local 
solution of the Riemann problem posed at the interfaces.
A Riemann problem (named after Bernhard Riemann)
consists of a conservation law together with
piecewise-constant data having a single discontinuity.
Additional resolution beyond the first-order accuracy
of the Godunov scheme can be attained by
reconstruction/recovery from the computed cell averages (as in
the MUSCL (monotonic upstream-centered scheme for
conservation laws) scheme of Van Leer, which is based
on piecewise-linear reconstruction, or by piecewise
quadratic reconstruction, as in the piecewise parabolic
method of Colella and Woodward), by exactly evolving
discontinuous piecewise-linear states instead of
piecewise-constant states, or by completely avoiding
the use of Riemann solvers (as in the Nessyahu-Tadmor
and Kurganov-Tadmor central difference methods).

Thanks to their built-in conservation properties, 
finite-volume methods have been widely and success- 
fully used for the numerical solution of both scalar non- 
linear conservation laws and systems of nonlinear con- 
servation laws, including the compressible Euler equa- 
tions of gas dynamics. There is a satisfactory conver- 
gence theory of finite-volume methods for scalar mul- 
tidimensional conservation laws; efforts to develop a 
similar body of theory for multidimensional systems 
of nonlinear conservation laws are, however, hampered 
by the incompleteness of the theory of well-posedness 
for such PDE systems. 

6 Spectral Methods 

While finite-difference methods provide approximate 
solutions to PDEs at the points of the chosen com- 
putational grid, and finite-element and finite-volume 
methods supply continuous or discontinuous piece- 
wise polynomial approximations on tessellations of 
the computational domain, spectral methods deliver 
approximate solutions in the form of polynomials of a 
certain fixed degree, which are, by definition, smooth 
functions over the entire computational domain. If 
the solution to the underlying PDE is a smooth func- 
tion, a spectral method will provide a highly accurate 
numerical approximation to it. 

Spectral approximations are typically sought as linear
combinations of orthogonal polynomials [II.29]
over the computational domain. Consider a nonempty
open interval $(a,b)$ of the real line and a nonnegative
weight function $w$, which is positive on $(a,b)$, except
perhaps at countably many points in $(a,b)$, and such
that

$$\int_a^b w(x) |x|^k \, dx < \infty \quad \forall k \in \{0, 1, 2, \dots\}.$$

Furthermore, let $L^2_w(a,b)$ denote the set of all real-valued
functions $v$ defined on $(a,b)$ such that

$$\|v\|_w := \left( \int_a^b w(x) |v(x)|^2 \, dx \right)^{1/2} < \infty.$$

Then $\|\cdot\|_w$ is a norm on $L^2_w(a,b)$, induced by the inner
product $(u,v)_w := \int_a^b w(x) u(x) v(x) \, dx$. We say that
$\{P_k\}_{k=0}^{\infty}$ is a system of orthogonal polynomials on $(a,b)$
if $P_k$ is a polynomial of exact degree $k$ and $(P_m, P_n)_w = 0$
when $m \ne n$. For example, if $(a,b) = (-1,1)$ and
$w(x) = (1-x)^\alpha (1+x)^\beta$, with $\alpha, \beta \in (-1,1)$ fixed, then
the resulting system of orthogonal polynomials are
the Jacobi polynomials, special cases of which are the
Gegenbauer (or ultraspherical) polynomials ($\alpha = \beta \in (-1,1)$),
Chebyshev polynomials of the first kind ($\alpha = \beta = -\tfrac{1}{2}$),
Chebyshev polynomials of the second kind
($\alpha = \beta = \tfrac{1}{2}$), and Legendre polynomials ($\alpha = \beta = 0$). On
a multidimensional domain $\Omega \subset \mathbb{R}^d$, $d \ge 2$, that is the
Cartesian product of nonempty open intervals $(a_k, b_k)$,
$k = 1, \dots, d$, of the real line and a multivariate weight
function $w$ of the form $w(x) = w_1(x_1) \cdots w_d(x_d)$,
where $x = (x_1, \dots, x_d)$ and $w_k$ is a univariate weight
function of the variable $x_k \in (a_k, b_k)$, $k = 1, \dots, d$,
orthogonal polynomials with respect to the inner product
$(\cdot,\cdot)_w$ defined by $(u,v)_w = \int_\Omega w(x) u(x) v(x) \, dx$
are simply products of univariate orthogonal polynomials
with respect to the weights $w_k$, defined on the
intervals $(a_k, b_k)$, $k = 1, \dots, d$, respectively.

Spectral Galerkin methods for PDEs are based on
transforming the PDE problem under consideration
into a suitable weak form by multiplication with a test
function, integration of the resulting expression over
the computational domain $\Omega$, and integration by parts,
if necessary, in order to incorporate boundary conditions.
As in the case of finite-element methods, an
approximate solution $u_N$ to the analytical solution $u$
is sought from a finite-dimensional linear space
$S_N \subset L^2_w(\Omega)$, which is now, however, spanned by the first
$(N+1)^d$ elements of a certain system of orthogonal
polynomials with respect to the weight function $w$. The
function $u_N$ is required to satisfy the same weak formulation
as the analytical solution, except that the test
functions are confined to the finite-dimensional linear
space $S_N$. In order to exploit the orthogonality properties
of the chosen system of orthogonal polynomials,
the weight function $w$ has to be incorporated into the
weak formulation of the problem, which is not always
easy, unless of course the weight function $w$ already
appears as a coefficient in the differential equation, or if
the orthogonal polynomials in question are the Legendre
polynomials (since then $w(x) \equiv 1$). We describe
the construction for a uniformly elliptic PDE subject to
a homogeneous Neumann boundary condition:

$$-\Delta u + u = f(x), \quad x \in \Omega := (-1,1)^d,$$

$$\frac{\partial u}{\partial \nu} = 0 \quad \text{on } \partial\Omega,$$

where $f \in L^2(\Omega)$ and $\nu$ denotes the unit outward normal
vector to $\partial\Omega$ (or, more precisely, to the $(d-1)$-dimensional
open faces contained in $\partial\Omega$). Let us consider
the finite-dimensional linear space

$$S_N := \operatorname{span}\{ L_\alpha := L_{\alpha_1} \cdots L_{\alpha_d} : 0 \le \alpha_k \le N, \ k = 1, \dots, d \},$$

where $L_{\alpha_k}$ is the univariate Legendre polynomial of
degree $\alpha_k$ of the variable $x_k \in (-1,1)$, $k = 1, \dots, d$.

The Legendre-Galerkin spectral approximation of the
boundary-value problem is defined as follows: find
$u_N \in S_N$ such that

$$B(u_N, v_N) = \ell(v_N) \quad \forall v_N \in S_N, \tag{19}$$

where the linear functional $\ell(\cdot)$ and the bilinear form
$B(\cdot,\cdot)$ are defined by

$$\ell(v) := \int_\Omega f v \, dx$$

and

$$B(w,v) := \int_\Omega (\nabla w \cdot \nabla v + w v) \, dx,$$

respectively, with $w, v \in H^1(\Omega)$. As $B(\cdot,\cdot)$ is a symmetric
bilinear form and $S_N$ is a finite-dimensional linear
space, the task of determining $u_N$ is equivalent to solving
a system of linear algebraic equations with a symmetric
square matrix $A \in \mathbb{R}^{K \times K}$ with $K := \dim(S_N) = (N+1)^d$.
Since $B(V,V) = \|V\|_1^2 > 0$ for all $V \in S_N \setminus \{0\}$,
where, as before, $\|\cdot\|_1$ denotes the $H^1(\Omega)$-norm, the
matrix $A$ is positive-definite, and therefore invertible.
Thus we deduce the existence and uniqueness of a solution
to (19). Céa's lemma (see (16)) for (19) takes the
form

$$\|u - u_N\|_1 = \min_{v_N \in S_N} \|u - v_N\|_1. \tag{20}$$

If we assume that $u \in H^s(\Omega)$, $s > 1$, results from
approximation theory imply that the right-hand side
of (20) is bounded by a constant multiple of $N^{1-s} \|u\|_s$,
and we thus deduce the error bound

$$\|u - u_N\|_1 \le C N^{1-s} \|u\|_s, \quad s > 1.$$

Furthermore, if $u \in C^\infty(\overline{\Omega})$ (i.e., all partial derivatives
of $u$ of any order are continuous on $\overline{\Omega}$), then $\|u - u_N\|_1$
will converge to zero at a rate that is faster than
any algebraic rate of convergence; such a superalgebraic
convergence rate is usually referred to as spectral
convergence and is the hallmark of spectral methods.
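
Spectral convergence can be observed numerically in the simplest setting. The sketch below is an illustration under assumptions of our own choosing: $d = 1$, the manufactured solution $u(x) = \cos(\pi x)$ (which satisfies $u'(\pm 1) = 0$) with $f = (\pi^2 + 1)\cos(\pi x)$, and Gauss-Legendre quadrature with a hypothetical default of 64 nodes to evaluate $B(L_i, L_j)$ and $\ell(L_j)$. It uses NumPy's `numpy.polynomial.legendre` module; the function names `legendre_galerkin` and `h1_error` are illustrative.

```python
import numpy as np
from numpy.polynomial import legendre as leg

def legendre_galerkin(f, N, nquad=64):
    """Legendre-Galerkin coefficients for -u'' + u = f on (-1, 1), u'(+-1) = 0.

    The Neumann condition is natural, so no constraint is imposed on S_N.
    """
    x, w = leg.leggauss(nquad)
    I = np.eye(N + 1)
    # V[k, q] = L_k(x_q) and dV[k, q] = L_k'(x_q) at the quadrature nodes.
    V = np.array([leg.legval(x, I[k]) for k in range(N + 1)])
    dV = np.array([leg.legval(x, leg.legder(I[k])) for k in range(N + 1)])
    A = (dV * w) @ dV.T + (V * w) @ V.T   # A[i, j] = B(L_i, L_j)
    b = (V * w) @ f(x)                    # b[j] = l(L_j)
    return np.linalg.solve(A, b)

def h1_error(U, u, du, nquad=64):
    """H^1(-1, 1) norm of u - u_N, computed by Gauss-Legendre quadrature."""
    x, w = leg.leggauss(nquad)
    e = u(x) - leg.legval(x, U)
    de = du(x) - leg.legval(x, leg.legder(U))
    return float(np.sqrt(np.sum(w * (e**2 + de**2))))

u = lambda x: np.cos(np.pi * x)
du = lambda x: -np.pi * np.sin(np.pi * x)
f = lambda x: (np.pi**2 + 1.0) * np.cos(np.pi * x)

errors = [h1_error(legendre_galerkin(f, N), u, du) for N in (4, 8, 16)]
```

Since this $u$ is infinitely differentiable, the entries of `errors` should fall much faster than any fixed power of $N$ as $N$ is doubled, until rounding error takes over.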

Since $u_N \in S_N$, there exist $U_\alpha \in \mathbb{R}$, with multi-indices
$\alpha = (\alpha_1, \dots, \alpha_d) \in \{0, \dots, N\}^d$, such that

$$u_N(x) = \sum_{\alpha \in \{0,\dots,N\}^d} U_\alpha L_\alpha(x).$$

Substituting this expansion into (19) and taking $v_N = L_\beta$,
with $\beta = (\beta_1, \dots, \beta_d) \in \{0, \dots, N\}^d$, we obtain the
system of linear algebraic equations

$$\sum_{\alpha \in \{0,\dots,N\}^d} B(L_\alpha, L_\beta) U_\alpha = \ell(L_\beta), \quad \beta \in \{0, \dots, N\}^d, \tag{21}$$

for the unknowns $U_\alpha$, $\alpha \in \{0, \dots, N\}^d$, which is
reminiscent of the system of linear equations (14)
encountered in connection with finite-element methods.
There is, however, a fundamental difference:
whereas the matrix of the linear system (14) was symmetric,
positive-definite, and sparse, the one appearing
in (21) is symmetric, positive-definite, and full. It has to
be noted that because

$$B(L_\alpha, L_\beta) = \int_\Omega \nabla L_\alpha \cdot \nabla L_\beta \, dx + \int_\Omega L_\alpha L_\beta \, dx,$$

in order for the matrix of the system to become diagonal,
instead of Legendre polynomials one would need
to use a system of polynomials that are orthogonal in
the energy inner product $(u,v)_B := B(u,v)$ induced
by $B$.

If the homogeneous Neumann boundary condition
considered above is replaced with a 1-periodic boundary
condition in each of the $d$ coordinate directions
and the function $f$ appearing on the right-hand side
of the PDE $-\Delta u + u = f(x)$ on $\Omega = (0,1)^d$ is a
1-periodic function in each coordinate direction, then one
can use trigonometric polynomials instead of Legendre
polynomials in the expansion of the numerical
solution. This will then result in what is known as
a Fourier-Galerkin spectral method. Because trigonometric
polynomials are orthogonal in both the $L^2(\Omega)$
and the $H^1(\Omega)$ inner product, the matrix of the resulting
system of linear equations will be diagonal, which
greatly simplifies the solution process. Having said this,
the presence of (periodic) nonconstant coefficients in
the PDE will still destroy orthogonality in the associated
energy inner product $(\cdot,\cdot)_B$, and the matrix of the
resulting system of linear equations will then, again,
become full. Nevertheless, significant savings can be
made in spectral computations through the use of fast
transform methods, such as the fast Fourier transform
or the fast Chebyshev transform, and this has contributed
to the popularity of Fourier and Chebyshev
spectral methods.

Spectral collocation methods seek a numerical solution
$u_N$ from a certain finite-dimensional space $S_N$,
spanned by orthogonal polynomials, just as spectral
Galerkin methods do, except that after expressing $u_N$
as a finite linear combination of orthogonal polynomials
and substituting this linear combination into the
differential equation $Lu = f$ under consideration, one
demands that $Lu_N(x_k) = f(x_k)$ at certain carefully
chosen points $x_k$, $k = 1, \dots, K$, called the collocation
points. Boundary and initial conditions are enforced
analogously. A trivial requirement in selecting the
collocation points is that one ends up with as many equations
as the number of unknowns, which is, in turn,
equal to the dimension of the linear space $S_N$.
We illustrate the procedure by considering the parabolic
equation

$$\partial_t u - \partial_{xx} u = 0, \quad (t,x) \in (0,\infty) \times (-1,1),$$

subject to the initial condition $u(0,x) = u_0(x)$ with
$x \in [-1,1]$ and the homogeneous Dirichlet boundary
conditions $u(t,-1) = 0$, $u(t,1) = 0$, $t \in (0,\infty)$.
A numerical approximation $u_N$ is sought in the form
of the finite linear combination

$$u_N(t,x) = \sum_{k=0}^{N} a_k(t) T_k(x)$$

with $(t,x) \in [0,\infty) \times [-1,1]$, where

$$T_k(x) := \cos(k \arccos(x)), \quad x \in [-1,1],$$

is the Chebyshev polynomial (of the first kind) of degree
$k \ge 0$. Note that there are $N + 1$ unknowns, the coefficients
$a_k(t)$, $k = 0, 1, \dots, N$. We thus require the same
number of equations. The function $u_N$ is substituted
into the PDE and it is demanded that, for $t \in (0,\infty)$
and $k = 1, \dots, N-1$,

$$\partial_t u_N(t,x_k) - \partial_{xx} u_N(t,x_k) = 0.$$

It is further demanded that $u_N(t,-1) = 0$ and
$u_N(t,1) = 0$ for $t \in (0,\infty)$ and that $u_N(0,x_k) = u_0(x_k)$
for $k = 0, \dots, N$, where the $(N+1)$ collocation points
are defined by $x_k := \cos(k\pi/N)$, $k = 0, \dots, N$; these
are the $(N+1)$ points of extrema of $T_N$ on the interval
$[-1,1]$. By writing $u_k(t) := u_N(t,x_k)$, after some calculation
based on properties of Chebyshev polynomials
one arrives at the following set of ordinary differential
equations:

$$\frac{du_k(t)}{dt} = \sum_{l=1}^{N-1} (D_N^2)_{kl} \, u_l(t), \quad k = 1, \dots, N-1,$$

where $D_N^2$ is the spectral differentiation matrix of second
order, whose entries $(D_N^2)_{kl}$ can be explicitly calculated.
One can then use any standard numerical method for a
system of ordinary differential equations to evolve the
values $u_k(t) = u_N(t,x_k)$ of the approximate solution
$u_N$ at the collocation points $x_k$, $k = 1, \dots, N-1$,
contained in $(-1,1)$, from the values of the initial datum
$u_0$ at the same points.
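
The collocation procedure for this heat equation can be sketched as follows. The code is a minimal illustration under assumptions that go beyond the text: the first-order Chebyshev differentiation matrix is built from the standard explicit formula on the points $x_k = \cos(k\pi/N)$, the second-order matrix is taken as its square, the interior system is advanced with the Crank-Nicolson method (one possible choice of "standard numerical method"), and the initial datum $u_0(x) = \cos(\pi x/2)$ is chosen because the exact solution $e^{-\pi^2 t/4}\cos(\pi x/2)$ is then known. The function names are illustrative.

```python
import numpy as np

def cheb(N):
    """Chebyshev differentiation matrix on the points x_k = cos(k*pi/N)."""
    x = np.cos(np.pi * np.arange(N + 1) / N)
    c = np.ones(N + 1)
    c[0] = c[N] = 2.0
    c *= (-1.0) ** np.arange(N + 1)
    dX = x[:, None] - x[None, :]
    D = np.outer(c, 1.0 / c) / (dX + np.eye(N + 1))  # off-diagonal entries
    D -= np.diag(D.sum(axis=1))                      # diagonal: negative row sums
    return D, x

def heat_collocation(u0, N, t_end, n_steps):
    """Evolve u_k(t) = u_N(t, x_k), k = 1..N-1, with u_N(t, +-1) = 0."""
    D, x = cheb(N)
    D2 = (D @ D)[1:N, 1:N]          # second-order matrix, interior points only
    dt = t_end / n_steps
    I = np.eye(N - 1)
    # Crank-Nicolson propagator: (I - dt/2 D2) u_new = (I + dt/2 D2) u_old.
    propagator = np.linalg.solve(I - 0.5 * dt * D2, I + 0.5 * dt * D2)
    u = u0(x[1:N])
    for _ in range(n_steps):
        u = propagator @ u
    full = np.zeros(N + 1)          # boundary values remain zero
    full[1:N] = u
    return x, full

x, uN = heat_collocation(lambda s: np.cos(0.5 * np.pi * s), N=16,
                         t_end=0.1, n_steps=100)
exact = np.exp(-0.25 * np.pi**2 * 0.1) * np.cos(0.5 * np.pi * x)
max_error = float(np.max(np.abs(uN - exact)))
```

An implicit time integrator is used here because the eigenvalues of the second-order differentiation matrix grow rapidly with $N$, which makes explicit schemes subject to a severe time-step restriction.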

7 Concluding Remarks 

We have concentrated on four general and widely appli- 
cable families of numerical methods: finite-difference, 
finite-element, finite-volume, and spectral methods. For 
additional details the reader is referred to the books 
listed below and to the rich literature on numeri- 
cal methods for PDEs for the construction and analy- 
sis of other important techniques for specialized PDE 
problems. 
IV.14 Applications of Stochastic Analysis

Bernt Øksendal and Agnès Sulem


1 Introduction 

Stochastic analysis is a relatively young mathematical
discipline, characterized by a unification of probability
theory (stochastics) and mathematical analysis.
Key elements of stochastic analysis are, therefore,
integration and differentiation of random/stochastic functions.
Stochastic analysis is currently applied in
a number of different areas,
including finance, physics, engineering, biology, and
also within other fields of mathematics itself. We can
mention only some of these applications here, and we 
refer the reader to the list of further reading at the end 
of the article for more information. 

Probability theory and mathematical analysis were 
traditionally disjoint mathematical disciplines that had 
little or no interaction. In 1923 Norbert Wiener gave 
a mathematically rigorous description of the erratic 
motion of pollen grains in water, a phenomenon that 
was observed by Robert Brown in 1828. This motion, 
called Brownian motion, was modeled by Wiener as a 
stochastic process called the Wiener process, denoted 
by $W(t) = W(t,\omega)$ or $B(t) = B_t = B_t(\omega) = B(t,\omega)$.
Heuristically, one could say that the position of a pollen
grain at time $t$ is represented by $B(t,\omega)$. Here, $\omega$ is a
“scenario” parameter that could represent, for exam- 
ple, a pollen grain or an experiment, depending on the 
model setup. 

Wiener showed that $t \mapsto B(t,\omega)$ is continuous but
nowhere differentiable for almost all $\omega \in \Omega$ (i.e., for
all $\omega$ except on a set of $P$-measure zero, where $P$ is the
probability law of $B(\cdot)$). Nevertheless, he showed that
it is possible to define what was later called the Wiener
integral, namely,

$$I(f)(\omega) := \int_0^T f(t) \, dB(t,\omega) \quad (T > 0 \text{ fixed}), \tag{1}$$

as a square-integrable element with respect to $P$ for
all deterministic square-integrable functions $f$ with
respect to the Lebesgue measure on $[0,T]$. This construction
represents the first combination of probability
and analysis, and as such marks the birth of
stochastic analysis.

Subsequently, in 1942 Kiyoshi Ito extended Wiener's
construction to include stochastic integrands $\varphi(t,\omega)$
that are adapted (to $\mathcal{F}_t$), in the sense that for each $t$
the value of $\varphi(t,\omega)$ can be expressed in terms of the
history $\mathcal{F}_t$ of $B(t,\omega)$, i.e., in terms of previous values of
$B(s,\omega)$, $s \le t$. He showed that in this case the integral
can in some sense be represented as a limit of Riemann
sums:

$$\int_0^T \varphi(t,\omega) \, dB(t,\omega) = \lim_{N \to \infty} \sum_{i=0}^{N-1} \varphi(t_i) (B_{t_{i+1}} - B_{t_i}) \tag{2}$$

($0 = t_0 < t_1 < \cdots < t_N = T$ being a partition of $[0,T]$)
if $\varphi$ satisfies

$$\int_0^T \varphi^2(t,\omega) \, dt < \infty \quad \text{a.s.},$$

and then the limit in (2) exists in a weak sense (in probability
with respect to $P$; "a.s." is an abbreviation for
"almost surely," i.e., with probability 1).
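
The role of the evaluation point in the Riemann sums can be checked on a simulated path. The sketch below takes the adapted integrand $\varphi = B$ itself, for which the Ito integral is known to equal $\tfrac{1}{2}(B_T^2 - T)$; the choice of integrand, the uniform partition, the number of steps, and the random seed are all assumptions made for illustration. It compares the left-endpoint sums (the Ito choice) with right-endpoint sums, which differ from them by the sum of squared increments and hence by approximately $T$.

```python
import numpy as np

def riemann_sums(T=1.0, n=100_000, seed=0):
    """Left- and right-endpoint Riemann sums for the integral of B dB on [0, T]."""
    rng = np.random.default_rng(seed)
    dB = rng.normal(0.0, np.sqrt(T / n), size=n)   # increments B_{t_{i+1}} - B_{t_i}
    B = np.concatenate([[0.0], np.cumsum(dB)])     # path values B_{t_0}, ..., B_{t_n}
    left = float(np.sum(B[:-1] * dB))              # integrand evaluated at t_i
    right = float(np.sum(B[1:] * dB))              # integrand evaluated at t_{i+1}
    return float(B[-1]), left, right

B_T, left, right = riemann_sums()
ito_value = 0.5 * (B_T**2 - 1.0)   # known closed form for T = 1
```

On a fine partition `left` should be close to `ito_value`, whereas `right` exceeds `left` by roughly $T = 1$, which shows that, unlike for ordinary Riemann sums, the choice of evaluation point affects the limit.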

Ito proceeded to study such integrals and their prop- 
erties in a series of papers in the 1950s. He proved 
a useful chain rule, now called the Ito formula, for 
processes that are sums of Ito integrals and integrals 
with respect to Lebesgue measure. Such processes are 
today known as Ito processes. Moreover, he studied
corresponding stochastic differential equations of the
form

$$dX(t) = b(t, X(t)) \, dt + \sigma(t, X(t)) \, dB_t, \quad 0 \le t \le T; \qquad X(0) = x \in \mathbb{R}, \tag{3}$$

where $b : [0,T] \times \mathbb{R} \to \mathbb{R}$ and $\sigma : [0,T] \times \mathbb{R} \to \mathbb{R}$ are
given functions satisfying certain conditions, and he
proved the existence and uniqueness of solutions $X(\cdot)$
of such equations, later known as (Ito) stochastic differential
equations (SDEs). Note that (3) is just a shorthand
notation for the stochastic integral equation

$$X(t) = x + \int_0^t b(s, X(s)) \, ds + \int_0^t \sigma(s, X(s)) \, dB_s, \quad 0 \le t \le T. \tag{4}$$

For several years the work of Ito was considered an 
interesting theoretical construction, but one without 
any applications. But this changed in the late 1960s and
early 1970s when Henry McKean published a book with
a useful introduction to the stochastic analysis of Ito
and Samuelson proposed modeling the price of a stock
in a financial market by the solution $X(t)$ of an SDE of
the form

$$dX(t) = \alpha X(t) \, dt + \beta X(t) \, dB(t), \quad 0 \le t \le T,$$

$$X(0) = x > 0,$$

with $\beta > 0$ and $\alpha$ constants. This was the beginning
of a wide range of applications of stochastic analysis. 
The amazing subsequent development of the subject 
can be seen as the result of fruitful interplay between 
applications and theory. We think this strong two- 
way communication between problems and concepts 
in applications and the development of corresponding 
mathematical theory is unique to stochastic analysis. 
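
Samuelson's SDE for the stock price can be simulated directly. The sketch below uses the Euler-Maruyama scheme, which is a standard discretization but is not discussed in the text, and compares the result at time $T$ with the known closed-form solution $X(T) = x \exp((\alpha - \tfrac{1}{2}\beta^2)T + \beta B(T))$ evaluated on the same Brownian increments. The parameter values, the number of steps, and the seed are illustrative assumptions.

```python
import numpy as np

def simulate_gbm(x0, alpha, beta, T, n, seed=0):
    """Euler-Maruyama path for dX = alpha*X dt + beta*X dB, X(0) = x0."""
    rng = np.random.default_rng(seed)
    dt = T / n
    dB = rng.normal(0.0, np.sqrt(dt), size=n)   # Brownian increments
    X = np.empty(n + 1)
    X[0] = x0
    for i in range(n):
        # The coefficient is evaluated at the left endpoint, as in the Ito sums.
        X[i + 1] = X[i] + alpha * X[i] * dt + beta * X[i] * dB[i]
    # Closed-form solution at time T driven by the same increments.
    exact_T = x0 * np.exp((alpha - 0.5 * beta**2) * T + beta * dB.sum())
    return X, float(exact_T)

X, exact_T = simulate_gbm(x0=1.0, alpha=0.1, beta=0.2, T=1.0, n=4096)
pathwise_error = abs(X[-1] - exact_T)
```

The term $-\tfrac{1}{2}\beta^2$ in the exponent is the correction produced by the Ito formula; dropping it would give the answer suggested by the ordinary chain rule, which does not match the simulated path.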

In the following sections we will briefly review some 
examples of this interaction. 

2 Applications to Finance 

The application of stochastic analysis to finance is 
undoubtedly one of the most spectacular examples of 
the application of mathematics in our society. Before 
stochastic analysis entered the field, finance was, with 
a few exceptions, an area almost devoid of mathemat- 
ics. Today, though, finance is strongly influenced by 
the theory and methods of stochastic analysis. Here we 
give just a brief introduction to this area of applica- 
tion, while we refer to the articles on financial math- 
ematics [V.9] and portfolio theory [V.10] elsewhere 
in this volume for more information on this huge and 
active research topic. 

The breakthrough came in the late 1960s/early 
1970s, when 

• Jan Mossin studied a discrete-time financial model 
and solved the problem of finding the portfolio 
that maximizes the expected utility of the terminal 
wealth, using dynamic programming, 

• Robert Merton used the (continuous-time) finan- 
cial market model of Samuelson (see above) and 
solved the optimal-portfolio problem by using the 
Hamilton-Jacobi-Bellman equation, and 

• Fischer Black and Myron Scholes developed their 
famous Black-Scholes option-pricing formula. 

Merton and Scholes were awarded the Nobel Memorial 
Prize in Economic Sciences for these achievements in 
1997 (Black died in 1995). 



To see why stochastic analysis is such a natural tool 
for finance, let us consider a simple financial market 
model with two investment possibilities. 

(1) A risk-free investment with constant unit price
$S_0(t) = 1$.

(2) A risky investment, with unit price $S_1(t)$ given by

$$dS_1(t) = S_1(t) [\alpha \, dt + \beta \, dB(t)], \quad t \ge 0,$$

$$S_1(0) > 0,$$

as in Samuelson's model.

If $\varphi(t) = (\varphi_0(t), \varphi_1(t))$ is a portfolio, giving the number of units held at time $t$ of the risk-free and risky investments, respectively, then

$$X^{\varphi}(t) := \varphi_0(t)S_0(t) + \varphi_1(t)S_1(t)$$

is the total value of the portfolio at time $t$. We say that the portfolio is self-financing if the infinitesimal change in the wealth at a given time $t$, $dX^{\varphi}(t)$, comes from the change in the market prices only, i.e., if

$$dX^{\varphi}(t) = \varphi_0(t)\,dS_0(t) + \varphi_1(t)\,dS_1(t).$$

We assume from now on that all portfolios are self-financing. Since $dS_0(t) = 0$, this gives the integral representation

$$X_x^{\varphi}(t) = x + \int_0^t \varphi_1(s)\,dS_1(s) = x + \int_0^t \varphi_1(s)\alpha S_1(s)\,ds + \int_0^t \varphi_1(s)\beta S_1(s)\,dB(s), \tag{6}$$

where $x$ is the initial value and the last integral is the Itô integral.
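
As a minimal numerical sketch of the representation (6), assuming an Euler discretization of the price equation and a hypothetical helper name `simulate_wealth` (neither is taken from the text), the wealth can be accumulated step by step: the holding in the risky asset is fixed first, and only then does the price move.

```python
import math
import random


def simulate_wealth(x, alpha, beta, s0, T, n_steps, holdings, seed=0):
    """Approximate the self-financing wealth in (6) on a uniform time grid.

    holdings(t, s) returns the number of risky units held over [t, t + dt);
    it is evaluated at the left endpoint, before the price change is drawn.
    """
    rng = random.Random(seed)
    dt = T / n_steps
    s, wealth = s0, x
    for k in range(n_steps):
        t = k * dt
        phi = holdings(t, s)                  # decided before the price moves
        dB = rng.gauss(0.0, math.sqrt(dt))    # Brownian increment over [t, t + dt)
        dS = s * (alpha * dt + beta * dB)     # Euler step for dS = S(alpha dt + beta dB)
        wealth += phi * dS                    # gain comes from the price change only
        s += dS
    return wealth, s


# Buy-and-hold one unit: the wealth should track x + S(T) - S(0).
final_wealth, final_price = simulate_wealth(
    10.0, 0.05, 0.2, 100.0, 1.0, 1000, lambda t, s: 1.0
)
```

With a constant holding of one unit, the sum telescopes, so the final wealth equals the initial wealth plus the total price change, up to rounding.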

Note that this integral representation is natural because of the interpretation (2) of the Itô integral as a limit of Riemann sums, where the portfolio choice $\varphi_1(t_j)$ is taken at the left-hand endpoint of the partitioning intervals $[t_j, t_{j+1})$ of $[0, T]$. Heuristically, first we decide on the portfolio, then comes the price change. With this mathematical setup the breakthroughs of Mossin and Merton can be formulated more precisely, as follows.

2.1 The Portfolio-Optimization Problem 
(Mossin/Merton) 

Given a utility function (i.e., a continuous, increasing, concave function) $U\colon [0, \infty) \to \mathbb{R}$, find a portfolio $\varphi^* \in \mathcal{A}$ (the family of admissible portfolios) such that

$$\sup_{\varphi \in \mathcal{A}} E[U(X^{\varphi}(T))] = E[U(X^{\varphi^*}(T))], \tag{7}$$

where $T > 0$ is a given terminal time and $E[\,\cdot\,]$ denotes expectation with respect to $P$. This may be regarded as a stochastic control problem. With specific choices of utility functions, e.g., $U(x) = \ln x$ or $U(x) = (1/\gamma)x^{\gamma}$ for some constant $\gamma \in (-\infty, 1) \setminus \{0\}$, this problem can be solved explicitly by dynamic programming, as Merton did (see section 4).

2.2 The Option-Pricing Problem (Black-Scholes) 

Suppose that at time $t = 0$ you are offered a contract that pays $(S(T) - K)^+$ $(= \max\{S(T) - K, 0\})$ at a specified future time $T$. Here, $K > 0$ is a constant, also specified in the contract. Such a contract is called a European call option, and $K$ is called the exercise price. This contract is equivalent to a contract that gives the owner the right, but not the obligation, to buy one unit of the risky asset at time $T$ at the price $K$. To see this equivalence, we argue as follows. If the market price $S(T)$ turns out to be greater than $K$, then according to the second contract the owner can buy one unit at the price $K$ and then immediately sell it at $S(T)$, giving a profit of $S(T) - K$. If, however, $S(T) \le K$, the contract is worthless and the payoff is 0. This leads to the contract payoff $(S(T) - K)^+$ in general, which is in agreement with the first contract. The question is, then, how much would you be willing to pay at time $t = 0$ for such a contract? If you are a careful person, you might argue as follows.

If I pay an amount $z$ for the contract, I start with an initial wealth $-z$, and with this initial wealth it should be possible for me to find a portfolio $\varphi \in \mathcal{A}$ (in particular, a self-financing portfolio) such that the sum of the corresponding terminal wealth $X_{-z}^{\varphi}(T)$ and the payoff of the contract is, almost surely, nonnegative.

This gives the following expression for the buyer’s price 
of a call option: 

$$P_{\text{buyer}} := \sup\{z;\ \text{there exists } \varphi \in \mathcal{A} \text{ such that } X_{-z}^{\varphi}(T) + (S(T) - K)^+ \ge 0 \text{ a.s.}\}.$$

This problem of finding the option price $P_{\text{buyer}}$ is of a different type from the Merton problem in (7). The Black-Scholes option-pricing formula states that

$$P_{\text{buyer}} = E_Q[(S(T) - K)^+], \tag{8}$$

where $E_Q$ means expectation with respect to the measure $Q$, defined to be the unique probability measure $Q$ equivalent to $P$ such that $S_1(t)$ is a martingale under $Q$, i.e.,

$$E_Q[S_1(t) \mid \mathcal{F}_s] = S_1(s) \quad \text{for all } 0 \le s \le t \le T.$$

Such a measure is called an equivalent martingale measure. In this financial market there is only one equivalent martingale measure, and it is given by

$$dQ(\omega) = \exp\Bigl(-\frac{\alpha}{\beta}B(T) - \frac{\alpha^2}{2\beta^2}T\Bigr)\,dP(\omega) \quad \text{on } \mathcal{F}_T.$$

Substituting this into (8) and using the probabilistic properties of Brownian motion we get an explicit formula for the call option price $P_{\text{buyer}}$ in terms of $K$, $\beta$, and $T$. A surprising feature of this formula is that it does not contain $\alpha$, and we are therefore led to the important conclusion that the option price does not depend on the drift coefficient $\alpha$.
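
The expectation in (8) can be evaluated in a few lines. The sketch below is illustrative only: it assumes the zero-interest market above (risk-free price identically 1), under which $S(T) = S(0)\exp(\beta B_Q(T) - \tfrac12\beta^2 T)$ for a $Q$-Brownian motion $B_Q$, and it uses the standard closed form of the Black-Scholes formula for that case. The function names and the Monte Carlo cross-check are assumptions added for illustration, not part of the text.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def call_price(s0, K, beta, T):
    """Closed form of E_Q[(S(T) - K)^+] when the interest rate is zero.

    Note that the drift alpha is not an argument: it does not enter the price.
    """
    d1 = (math.log(s0 / K) + 0.5 * beta**2 * T) / (beta * math.sqrt(T))
    d2 = d1 - beta * math.sqrt(T)
    return s0 * normal_cdf(d1) - K * normal_cdf(d2)


def call_price_mc(s0, K, beta, T, n_paths=100_000, seed=0):
    """Monte Carlo estimate of the same expectation, sampling S(T) under Q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        b = rng.gauss(0.0, math.sqrt(T))                      # B_Q(T)
        s_T = s0 * math.exp(beta * b - 0.5 * beta**2 * T)     # martingale under Q
        total += max(s_T - K, 0.0)
    return total / n_paths


exact = call_price(100.0, 100.0, 0.2, 1.0)
estimate = call_price_mc(100.0, 100.0, 0.2, 1.0)
```

The two values should agree up to Monte Carlo error, which gives a quick sanity check on the change of measure.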

The ground-breaking results in sections 2.1 and 2.2 
motivated a lot of research activity. Other models were 
introduced and studied, and at the same time other 
areas of applications were found (see section 7). 

We now proceed to present some applications that 
are not necessarily connected to finance. 

3 Backward Stochastic Differential Equations 

Returning to the equation (6) for the wealth $X_x^{\varphi}(t)$ corresponding to a given (self-financing) portfolio $\varphi(t) = (\varphi_0(t), \varphi_1(t))$, one may ask the following question. Given a random variable $F(\omega)$, assumed to depend only on the history of $B(t, \omega)$ up to time $T$, does there exist a portfolio $\varphi$ such that $X_x^{\varphi}(T) = F$ a.s.?

If we substitute

$$Z(t) := \beta\varphi_1(t)S_1(t)$$

into (6), we see that this question can be formulated as follows. Given $F$, find $X(t)$ and $Z(t)$ such that

$$dX(t) = \frac{\alpha}{\beta}Z(t)\,dt + Z(t)\,dB(t), \quad 0 \le t \le T, \qquad X(T) = F \text{ a.s.}$$

This is called a backward stochastic differential equation (BSDE) in the two unknown (adapted) processes $X$ and $Z$. In contrast to (3), it is the terminal value $F$ of $X$ that is given, not the initial value. More generally, given a function $g(t, y, z, \omega)\colon [0, T] \times \mathbb{R} \times \mathbb{R} \times \Omega \to \mathbb{R}$ and a random variable $F$, as above, the corresponding BSDE in the two unknown (adapted) processes $Y(t)$, $Z(t)$ has the form

$$dY(t) = -g(t, Y(t), Z(t), \omega)\,dt + Z(t)\,dB(t), \quad 0 \le t \le T, \qquad Y(T) = F \text{ a.s.}$$

The process $g$ is sometimes called the driver.

The original motivation for studying BSDEs came from stochastic control theory (see section 4), but a number of other applications have since been found, such as the probabilistic representations of solutions of nonlinear parabolic partial differential equations developed by Pardoux and Peng in 1990. In the 1990s, Duffie and Epstein used BSDEs to introduce the concept of the recursive utility of a consumption process, and around ten to fifteen years ago BSDEs were used to define convex risk measures as a model for the risk of a financial position.

4 Optimal Stochastic Control 

The portfolio-optimization problem presented in section 2 may be regarded as a special case of a general optimal stochastic control problem of the following type. Suppose that the state $X(t) = X^{s,x}(t)$ of a system at time $t$ is described by a stochastic differential equation of the form

$$dX(t) = b(t, X(t), u(t), \omega)\,dt + \sigma(t, X(t), u(t), \omega)\,dB(t) \tag{9}$$

for $s \le t \le T$, with $X(s) = x$ ($x \in \mathbb{R}$ and $0 \le s \le T$ given), where $b\colon [0, T] \times \mathbb{R} \times V \times \Omega \to \mathbb{R}$ and $\sigma\colon [0, T] \times \mathbb{R} \times V \times \Omega \to \mathbb{R}$ are given functions such that $b(\cdot, x, v, \cdot)$ and $\sigma(\cdot, x, v, \cdot)$ are adapted processes for each given $x \in \mathbb{R}$ and $v \in V$, a given set of control values. The process $u(t) = u(t, \omega)$ is our control process. To be admissible it is required that $u$ be adapted and that $u(t, \omega) \in V$ for all $t \in [0, T]$ and almost all $\omega \in \Omega$. The set of admissible control processes is denoted by $\mathcal{A}$. With each choice of $u \in \mathcal{A}$, $(s, x) \in [0, T] \times \mathbb{R}$, we associate a performance $J^u(s, x)$ given by

$$J^u(s, x) = E\Bigl[\int_s^T f(t, X^{s,x}(t), u(t), \omega)\,dt + g(X^{s,x}(T), \omega)\Bigr], \tag{10}$$

where $f\colon [0, T] \times \mathbb{R} \times V \times \Omega \to \mathbb{R}$ and $g\colon \mathbb{R} \times \Omega \to \mathbb{R}$ are given functions. Sometimes $f$ is called the profit rate and $g$ is called the bequest or salvage function. We assume that $f(\cdot, x, v, \cdot)$ is adapted and that $g(x, \cdot)$ depends only on the history $\mathcal{F}_T$ of the underlying Brownian motion $B(t, \omega)$, $s \le t \le T$, for each $x$ and $v$.

The problem is to find a control $u^* \in \mathcal{A}$ and a function $\Phi(s, x)$ such that

$$\Phi(s, x) = \sup_{u \in \mathcal{A}} J^u(s, x) = J^{u^*}(s, x), \quad (s, x) \in [0, T] \times \mathbb{R}. \tag{11}$$

Such a process $u^*$ (if it exists) is called an optimal control, and $\Phi$ is called the value function.

For example, if we represent the portfolio in (6) by the fraction $\pi(t)$ of the total wealth $X(t)$ invested in the risky asset at time $t$, then

$$\pi(t) = \frac{\varphi_1(t)S_1(t)}{X(t)}$$

and (6) can be written

$$dX(t) = \pi(t)X(t)[\alpha\,dt + \beta\,dB(t)], \quad 0 \le t \le T, \qquad X(0) = x > 0. \tag{12}$$

Comparing (12) and (7) with (9)-(11), we see that the optimal-portfolio problem is an example of an optimal stochastic control problem, as claimed.

There are two main solution methods for stochastic 
control problems: the dynamic programming principle 
and the maximum principle. 

4.1 The Dynamic Programming Principle 

The dynamic programming principle, introduced by R. Bellman in the 1950s, applies only to Markovian problems. These are problems where the coefficients $b(s, x, v)$, $\sigma(s, x, v)$, $f(s, x, v)$, and $g(x)$, for fixed values of $s$, $x$, and $v$, do not depend on $\omega$. Moreover, the control process $u(t)$ must be Markovian, in the sense that $u(t) = u_0(t, X(t))$ for some deterministic function $u_0\colon [0, T] \times \mathbb{R} \to V$. (Such a control is called a feedback control.) In this case, the dynamic programming principle leads to the Hamilton-Jacobi-Bellman equation, which (under some conditions) states the following. Suppose that $\varphi$ is a smooth function such that

$$f(s, x, v) + A^v\varphi(s, x) \le 0, \quad 0 \le s \le T,\ x \in \mathbb{R},\ v \in V,$$

and that

$$\varphi(T, x) = g(x).$$

Then

$$\varphi(s, x) \ge \Phi(s, x), \quad (s, x) \in [0, T] \times \mathbb{R}.$$

Moreover, assume that, in addition, for all $(s, x) \in [0, T] \times \mathbb{R}$ there exists $u_0(s, x) \in V$ such that

$$f(s, x, u_0(s, x)) + A^{u_0(s,x)}\varphi(s, x) = 0. \tag{13}$$

Then

$$\varphi(s, x) = \Phi(s, x), \quad (s, x) \in [0, T] \times \mathbb{R},$$

and

$$u^*(t) := u_0(t, X(t))$$

is an optimal (Markovian) control.

Here, $A^v$ denotes the generator of the Markov process $X^{s,x}(t)$ obtained by using the constant control $u(t) := v \in V$. It is a parabolic second-order partial differential operator given by

$$A^v\varphi(s, x) := \frac{\partial\varphi}{\partial s}(s, x) + b(s, x, v)\frac{\partial\varphi}{\partial x}(s, x) + \tfrac12\sigma^2(s, x, v)\frac{\partial^2\varphi}{\partial x^2}(s, x).$$

For example, for the problem (7) with (12) we get

$$A^v\varphi(s, x) := \frac{\partial\varphi}{\partial s}(s, x) + \alpha v x\frac{\partial\varphi}{\partial x}(s, x) + \tfrac12\beta^2 v^2 x^2\frac{\partial^2\varphi}{\partial x^2}(s, x),$$

which is maximal when

$$v = -\frac{\alpha\varphi'(s, x)}{\beta^2 x\varphi''(s, x)},$$

where

$$\varphi'(s, x) = \frac{\partial\varphi}{\partial x}(s, x) \quad\text{and}\quad \varphi''(s, x) = \frac{\partial^2\varphi}{\partial x^2}(s, x).$$

Substituting into (13) we get

$$\frac{\partial\varphi}{\partial s}(s, x) - \frac{\alpha^2(\varphi')^2(s, x)}{2\beta^2\varphi''(s, x)} = 0.$$

In particular, if

$$U(x) = g(x) = \frac{1}{\gamma}x^{\gamma}, \quad x > 0,$$

for some parameter $\gamma \in (-\infty, 1) \setminus \{0\}$, then we see that the above equation holds for

$$\varphi(s, x) = h(s)x^{\gamma},$$

where

$$h'(s) - \frac{\alpha^2\gamma}{2\beta^2(\gamma - 1)}h(s) = 0, \quad 0 \le s \le T, \qquad h(T) = 1/\gamma.$$

This gives an optimal control (portfolio)

$$u^*(t) = \pi^*(t) = -\frac{\alpha h(t)\gamma x^{\gamma - 1}}{\beta^2 x\gamma(\gamma - 1)h(t)x^{\gamma - 2}} = \frac{\alpha}{\beta^2(1 - \gamma)},$$

which is one of the classical results of Mossin and Merton.


4.2 The Maximum Principle

The maximum principle goes back to L. Pontryagin and his group, who developed this method for deterministic control problems in the 1950s. It was adapted to stochastic control by J.-M. Bismut and subsequently extended further by A. Bensoussan, E. Pardoux, S. Peng, and others throughout the 1970s, 1980s, and 1990s. Basically, the maximum principle approach to the stochastic control problem (11) is as follows.

First, we define the associated Hamiltonian $H\colon \mathbb{R}^5 \to \mathbb{R}$ by

$$H(t, x, u, p, q) = f(t, x, u) + b(t, x, u)p + \sigma(t, x, u)q. \tag{14}$$

Here, $p$, $q$ are adjoint variables, somehow related to Lagrange multipliers.

Next, associated with the Hamiltonian we consider, for given $u \in \mathcal{A}$, the BSDE

$$dp(t) = -\frac{\partial H}{\partial x}(t, X(t), u(t), p(t), q(t))\,dt + q(t)\,dB(t), \quad 0 \le t \le T, \qquad p(T) = g'(X(T)) \tag{15}$$

in the two unknown adjoint processes $p(t) = p^{(u)}(t)$, $q(t) = q^{(u)}(t)$, where $X(t) = X^{(u)}(t)$ is the solution of (9) corresponding to the control $u$.

The maximum principle relates the maximization of $J^u(s, x)$ in (10) to the maximization of the Hamiltonian. For example, under some concavity assumptions we have the following result.

The sufficient maximum principle (assumes concavity). Suppose that $\hat u \in \mathcal{A}$, with associated solution $\hat X(t) = X^{(\hat u)}(t)$, $\hat p(t) = p^{(\hat u)}(t)$, and $\hat q(t) = q^{(\hat u)}(t)$ of the forward-backward SDE system (9), (15). Suppose that for each $t$, $u = \hat u(t)$ maximizes the Hamiltonian, in the sense that

$$u \mapsto H(t, \hat X(t), u, \hat p(t), \hat q(t))$$

is maximal for $u = \hat u(t)$. Then, $\hat u(\cdot)$ is an optimal control for the problem (11).

To illustrate how this result works, let us apply it to once again solve the problem (12), (7); in this case, the Hamiltonian becomes, with $u = \pi$,

$$H(t, x, \pi, p, q) = \pi x\alpha p + \pi x\beta q \tag{16}$$

and

$$dp(t) = -\pi(t)(\alpha p(t) + \beta q(t))\,dt + q(t)\,dB(t), \qquad p(T) = U'(X(T)). \tag{17}$$

Since $H$ is linear in $\pi$, the only possibility for the existence of a maximum of

$$\pi \mapsto H(t, X(t), \pi, p(t), q(t))$$

is that

$$\alpha p(t) + \beta q(t) = 0,$$

i.e.,

$$q(t) = -\frac{\alpha}{\beta}p(t). \tag{18}$$

Substituting this into (17) we get

$$dp(t) = -\frac{\alpha}{\beta}p(t)\,dB(t),$$

which has the solution

$$p(t) = p(0)\exp\Bigl(-\frac{\alpha}{\beta}B(t) - \frac{\alpha^2}{2\beta^2}t\Bigr). \tag{19}$$

Now assume, as above, that

$$U(x) = \frac{1}{\gamma}x^{\gamma}. \tag{20}$$

The requirement

$$p(T) = U'(X(T)) \tag{21}$$

then gives the equation

$$p(0)\exp\Bigl(-\frac{\alpha}{\beta}B(T) - \frac{\alpha^2}{2\beta^2}T\Bigr) = x^{\gamma - 1}\exp\Bigl((\gamma - 1)\int_0^T \pi(t)\beta\,dB(t) + (\gamma - 1)\int_0^T \bigl(\pi(t)\alpha - \tfrac12\pi^2(t)\beta^2\bigr)\,dt\Bigr). \tag{22}$$

From (22), it is natural to try $\pi$ such that

$$(\gamma - 1)\pi(t)\beta = -\frac{\alpha}{\beta},$$

i.e.,

$$\pi(t) = \frac{\alpha}{\beta^2(1 - \gamma)}. \tag{23}$$

Since, by (19), $p(t)$ is a martingale, from (21) we obtain $p(0) = E[U'(X(T))]$, and substituting this into (22) we verify that (22) holds with $\pi$ as in (23).

This confirms the result we found in section 4.1.
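
The value $\alpha/(\beta^2(1 - \gamma))$ can also be checked numerically under a simplifying assumption that is not made in the text: restrict attention to constant fractions $\pi$. Then (12) gives a lognormal terminal wealth, and the lognormal moment formula yields $E[U(X(T))] = (x^{\gamma}/\gamma)\exp\bigl(\gamma T(\pi\alpha - \tfrac12\pi^2\beta^2(1 - \gamma))\bigr)$. The sketch below maximizes this expression over a grid; the function names, the grid, and the parameter values are hypothetical choices made for illustration.

```python
import math


def expected_utility(pi, x, alpha, beta, gamma, T):
    """E[(1/gamma) X(T)^gamma] for a constant fraction pi in (12)."""
    growth = pi * alpha - 0.5 * pi**2 * beta**2 * (1.0 - gamma)
    return (x**gamma / gamma) * math.exp(gamma * T * growth)


def best_constant_fraction(x, alpha, beta, gamma, T, lo=-1.0, hi=2.0, n=3001):
    """Grid search for the constant fraction with the largest expected utility."""
    grid = [lo + (hi - lo) * k / (n - 1) for k in range(n)]
    return max(grid, key=lambda pi: expected_utility(pi, x, alpha, beta, gamma, T))


alpha, beta, gamma = 0.05, 0.25, -1.0
numerical = best_constant_fraction(1.0, alpha, beta, gamma, 1.0)
closed_form = alpha / (beta**2 * (1.0 - gamma))   # the value in (23)
```

The grid maximizer should coincide with the closed-form fraction up to the grid spacing, for either sign of $\gamma$.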


5 Optimal Stopping 

To describe the problem of optimal stopping of a 
stochastic process S(t), we first need to explain the 
crucial concept of a stopping time. 

A random time $\tau\colon \Omega \to [0, \infty]$ is called a stopping time (with respect to the history of $S(\cdot)$) if, for any time $t$, the decision of whether or not to stop at time $t$ is based only on the history of $S(s)$ for $s \le t$.

For example, if $S(t)$ is the position at time $t$ of a car driving along a road in a city, then

$\tau_1$ := the first time we come to a traffic light

will be a stopping time because this instant can be decided simply by recording the history of $S$ up to that time. On the other hand, if we define

$\tau_2$ := the last time we come to a traffic light,

then $\tau_2$ will not be a stopping time because we would need to know the future in order to decide whether or not a given traffic light is the last one.

Thus, since we cannot assume knowledge about the 
future, we see that stopping times are the natural ran- 
dom times to consider in applications: the optimal time 
to sell (or buy) an object with a stochastic price process, 
the optimal time to start a new business, the optimal 
time to close down a factory, and so on. 

If the state $Y(t) \in \mathbb{R}^k$ at time $t$ is described by an SDE of the form

$$dY(t) = b(Y(t))\,dt + \sigma(Y(t))\,dB(t), \quad t \ge 0, \qquad Y(0) = y \in \mathbb{R}^k,$$

then the associated optimal stopping problem is to find a stopping time $\tau^*$ (with respect to $Y(\cdot)$), called an optimal stopping time, such that

$$\Phi(y) := \sup_{\tau \in \mathcal{T}} E^y\Bigl[\int_0^{\tau} f(Y(t))\,dt + g(Y(\tau))\Bigr] = E^y\Bigl[\int_0^{\tau^*} f(Y(t))\,dt + g(Y(\tau^*))\Bigr], \tag{25}$$

where $\mathcal{T}$ is the set of all $Y(\cdot)$ stopping times and we interpret $g(Y(\tau))$ as 0 if $\tau = \infty$. Here, $f$ and $g$ are given functions, and $E^y$ denotes the expectation assuming that $Y(0) = y$. The function $\Phi$ is called the value function of the optimal stopping problem.

As in stochastic control problems, it often turns out that in order to find an optimal stopping time $\tau^*$, it helps to simultaneously try to find the value function $\Phi$. In fact, under some technical conditions one can prove the following.

The optimal stopping theorem. Let $A$ be the generator of $Y(\cdot)$ (see section 4).

(a) Then

(i) $A\Phi(y) + f(y) \le 0$ for all $y$ and

(ii) $\Phi(y) \ge g(y)$ for all $y$.

Moreover, at all points $y$, at least one of the two inequalities holds with equality. We can therefore combine (i) and (ii) into the equation

$$\max\{A\Phi(y) + f(y),\ g(y) - \Phi(y)\} = 0 \quad \text{for all } y. \tag{26}$$

(b) Let $D$ be the continuation region, defined by $D = \{y;\ g(y) < \Phi(y)\}$. Then it is optimal to stop the first time that $Y(t)$ exits from $D$, i.e.,

$$\tau^* = \inf\{t > 0;\ Y(t) \notin D\}. \tag{27}$$

(c) Moreover, $\Phi$ and $g$ meet smoothly at $\partial D$, i.e.,

$$\nabla\Phi(y) = \nabla g(y) \quad \text{for } y \in \partial D \tag{28}$$

(this is called the high contact or smooth fit principle).

This theorem shows that there is a close connection between optimal stopping problems (which are stochastic) and variational inequality problems and free boundary problems, which are nonstochastic classical analysis problems.

Regarding variational inequality problems, the equation (26) is an example of a classical variational inequality. The operator $A$ and the functions $f$ and $g$ are given, and the problem is to find $\Phi$ such that (26) holds. A classical example in which this problem appears is the problem of finding the wet region in a porous sand wall of a dam.

A free boundary problem is a problem of the following type. Given a differential operator $A$ and functions $f$ and $g$, find a function $\Phi$ and a domain $D$ such that

(i) $A\Phi(y) + f(y) = 0$ for $y \in D$,

(ii) $\Phi(y) = g(y)$ for $y \in \partial D$, and

(iii) $\nabla\Phi(y) = \nabla g(y)$ for $y \in \partial D$,

where $\nabla$ denotes the gradient operator. Applications of this type of problem include the problem of finding the boundary of a melting ice cube.

The link between the optimal stopping problem and the free boundary problem is provided by the high contact principle (28).

It can also be shown that optimal stopping problems are related to reflected BSDEs. In a reflected BSDE we are given a driver $g$ as in section 3 as well as a lower barrier process $L(t)$, and we want to find processes $Y(t)$, $Z(t)$ and a nondecreasing process $K(t)$ (all of them adapted) with $K(0) = 0$ such that the following hold:

• $dY(t) = -g(t, Y(t), Z(t), \omega)\,dt + Z(t)\,dB(t) - dK(t)$, and

• $Y(t) \ge L(t)$ for all $t$.

6 Filtering Theory 

Suppose we model the size of a population, e.g., the fish population in a lake, by a stochastic logistic differential equation of the form

$$dX(t) = X(t)(K - X(t))(\alpha\,dt + \beta\,dB(t)), \quad t \ge 0, \tag{29}$$

where $\alpha$, $\beta$, and $K > 0$ are constants. $K$ is called the carrying capacity of the lake; it represents the maximal population size that the lake can sustain. Heuristically, the "noisy" factor $\alpha + \beta(dB(t)/dt)$ represents the nutritional quality of the lake, subject to random fluctuations; the quantity $dB(t)/dt$, called white noise, can be rigorously defined as a stochastic distribution.


Even if there are good theoretical reasons for choosing such a model, the problem of how to estimate the coefficients $\alpha$, $\beta$, and $K$, and also $X(t)$ itself, remains. To simplify matters, let us assume that $\alpha$, $\beta$, and $K$ are known. How do we find $X(t)$?

The problem is that we do not know the precise value of $X(t_0)$ for any $t_0$. To compensate for this, we make (necessarily imprecise, or "noisy") observations $Y(t)$ on $X(t)$, i.e., we observe

$$Y(t) = X(t) + \gamma\frac{dB_1(t)}{dt}$$

or

$$dY(t) = X(t)\,dt + \gamma\,dB_1(t), \quad t \ge t_0,$$

where $\gamma > 0$ is a known constant and $B_1(\cdot)$ is another Brownian motion, usually assumed to be independent of $B(\cdot)$.

The filtering problem is as follows: what is the best estimate, $\hat X(t)$, of $X(t)$ based on the observations $Y(s)$, $s \le t$?

By saying that an estimate $\hat X(t)$ is based on $Y(s)$, $s \le t$, we mean that it should be possible to express $\hat X(t)$ by means of the values $Y(s)$, $s \le t$, only. In other words, $\hat X(\cdot)$ should be adapted to the history (filtration) $\{\mathcal{F}_t^Y\}_{t \ge 0}$ generated by $Y(t)$, $t \ge 0$. By saying that $\hat X(t)$ is the best estimate based on $Y$, we mean best in the sense of minimal mean square error, i.e.,

$$E[(X(t) - \hat X(t))^2] = \inf_{\tilde X \in \mathcal{Y}} E[(X(t) - \tilde X(t))^2], \tag{30}$$

where $\mathcal{Y}$ is the set of all estimates based on $Y$. The solution of the problem (30) can be expressed as

$$\hat X(t) = E[X(t) \mid \mathcal{F}_t^Y], \tag{31}$$

where the right-hand side denotes the conditional expectation of $X(t)$ with respect to $\mathcal{F}_t^Y$. The filtering problem is therefore the problem of finding this conditional expectation of $X(t)$.

The filtering problem is difficult in general, and an explicit solution is known only in special cases. The most famous solvable case is the linear case, in which we have

$$dX(t) = F(t)X(t)\,dt + C(t)\,dB(t) \quad \text{(signal process)},$$

$$dY(t) = G(t)X(t)\,dt + D(t)\,dB_1(t) \quad \text{(observations)},$$

where $F(t)$, $C(t)$, $G(t)$, and $D(t) > 0$ are known, bounded deterministic functions, $D(t)$ being bounded away from 0.

The Kalman-Bucy filtering theorem then states that the best estimate $\hat X(t)$ solves the SDE

$$d\hat X(t) = \Bigl[F(t) - \frac{G^2(t)S(t)}{D^2(t)}\Bigr]\hat X(t)\,dt + \frac{G(t)S(t)}{D^2(t)}\,dY(t), \qquad \hat X(0) = E[X(0)], \tag{32}$$

where $S(t) := E[(X(t) - \hat X(t))^2]$ (the minimal mean square error) satisfies the (deterministic) Riccati equation

$$\frac{dS(t)}{dt} = 2F(t)S(t) - \frac{G^2(t)}{D^2(t)}S^2(t) + C^2(t), \quad t \ge 0, \qquad S(0) = E[(X(0) - E[X(0)])^2]. \tag{33}$$

Note that in (32), $\hat X(t)$ is indeed expressed in terms of the observations $Y(s)$, $s \le t$, only, assuming that the initial distribution of $X(0)$ is Gaussian, with given mean $E[X(0)]$ and variance $S(0)$.
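
The filter (32) and the Riccati equation (33) lend themselves to a short simulation. The following is a sketch only, assuming a plain Euler discretization with step `dt` and a hypothetical function name `kalman_bucy`; it simulates the signal and observation increments and advances the estimate and the error variance side by side.

```python
import math
import random


def kalman_bucy(F, C, G, D, mean0, var0, T, n_steps, seed=0):
    """Euler discretization of the signal, the observations, (32), and (33).

    F, C, G, D are functions of time; returns a list of tuples
    (t, X(t), X_hat(t), S(t)) on the time grid.
    """
    rng = random.Random(seed)
    dt = T / n_steps
    x = rng.gauss(mean0, math.sqrt(var0))   # hidden initial state, Gaussian
    x_hat, S = mean0, var0                  # initial conditions in (32), (33)
    path = [(0.0, x, x_hat, S)]
    for k in range(n_steps):
        t = k * dt
        f, c, g, d = F(t), C(t), G(t), D(t)
        dB = rng.gauss(0.0, math.sqrt(dt))    # signal noise increment
        dB1 = rng.gauss(0.0, math.sqrt(dt))   # independent observation noise
        dY = g * x * dt + d * dB1             # observation increment
        gain = g * S / d**2
        x_hat += (f - g * gain) * x_hat * dt + gain * dY                 # (32)
        S += (2.0 * f * S - (g**2 / d**2) * S**2 + c**2) * dt           # (33)
        x += f * x * dt + c * dB              # signal step
        path.append(((k + 1) * dt, x, x_hat, S))
    return path


# Constant coefficients F = -1, C = G = D = 1 (illustrative values).
path = kalman_bucy(lambda t: -1.0, lambda t: 1.0, lambda t: 1.0,
                   lambda t: 1.0, 0.0, 1.0, 10.0, 10_000)
```

For these constant coefficients the stationary solution of (33) solves $-2S - S^2 + 1 = 0$, so the computed variance should settle near $\sqrt{2} - 1$. The estimate uses only the increments `dY`, never the hidden state `x` directly.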


7 Outlook 

In this short article we have been able to discuss only 
the classical stochastic calculus based on Brownian 
motion and some of the well-known (and spectacu- 
lar) applications of this theory. But we should point 
out that in recent years there has been rapid research 
development both in the mathematical foundations of 
stochastic analysis and in further applications of the 
subject. 

• For example, the theory of stochastic integration
has been extended to other integrator processes,
such as Lévy processes, Poisson random measures,
and, more generally, semimartingales.

• Stochastic differential equations and more general 
stochastic integral or stochastic functional differ- 
ential equations, and even stochastic partial dif- 
ferential equations driven by such processes, are 
being studied. In particular, mean-field SDEs are 
used in connection with the modeling of systemic 
risk in finance. 

• Stochastic optimization methods (stochastic con- 
trol, singular control, impulse control, optimal 
stopping) for such systems are being developed 
accordingly, with associated generalized BSDEs. 

• Various ways of representing model uncertainty 
in applications have been introduced and stud- 
ied, including nonlinear expectation theory and the 
theory of G-Brownian motion. 


• An axiomatic approach to the concept of convex 
risk measures has been developed. Such risk mea- 
sures can be represented either by a BSDE or by a 
dual approach, using a family of probability mea- 
sures that are absolutely continuous with respect 
to P. 

• The relationship between performance and avail- 
able information is being studied. In particular, 
what is the optimal performance of a controller 
for which only partial (e.g., delayed) information 
is available? This topic may also combine optimal 
control with filtering. 

• In the opposite direction, anticipative stochastic
calculus, i.e., stochastic calculus with nonadapted
integrands, has been developed. Combined with
the recently developed stochastic calculus of variations
(Malliavin calculus) and the Hida white noise
calculus, this gives an efficient mathematical tool
for investigating the actions of insiders in financial
markets.

• Stochastic differential games involving players
with asymmetric information are another hot topic
with obvious applications in several areas, including
biology, engineering, investment theory, and
finance. This is a challenging area of research in
which the full force of the mathematical machinery
discussed above is needed.


IV.15 Inverse Problems

Fadil Santosa and William W. Symes 


1 What Is an Inverse Problem? 

In inverse problems one is interested in determining 
parameters of a system from measurements. Much of 
applied mathematics is about modeling and under- 
standing phenomena that occur in the world. The mod- 
els are often based on first principles: physical laws, 
empirical laws, etc. A common use of models is for pre- 
diction; if we know all the parameters of a model, we 
can predict how the model will respond to a given exci- 
tation. In inverse problems we perform experiments in 
which the response of a model to prescribed excitations 
is measured. The goal is to determine the unknown 
parameters of the model from the measured data. 

1.1 Background and History 

Perhaps the oldest formally studied inverse problem,
dating back to the work of Herglotz (1907) and Wiechert
(1910), is that of determining properties of the Earth's
subsurface from measurement of seismological data.
This problem falls into the category of geophysical
inverse problems, which includes techniques for imaging
the near subsurface for deposits of oil and other
minerals.

In physics the classical inverse problem involves finding
the potential in the Schrödinger equation from scattering
experiments. Still in the same family are scattering
problems involving geometrical scatterers where
one is interested in determining the geometrical shape
from measurements of the acoustic or electromagnetic
response of the object. In military applications such
problems fall under the study of sonar and radar
[VII.17].

Medical imaging has been a driving force in the devel- 
opment of inverse problems. The ability to visualize 
and characterize the internal structures of a patient is 
of great value in diagnostics. A very important imaging
modality, X-ray computed tomography [VII.9, VII.19]
(abbreviated X-ray CT and also known as computer-assisted
tomography), can be viewed as an inverse
problem. The conductivities of biological tissues can


also be a target of medical imaging and could poten- 
tially provide further diagnostic capabilities that com- 
plement other imaging modalities. 

2 Language and Concepts 

One of the most influential researchers in the area of 
inverse problems was Pierre Sabatier, who is responsi- 
ble for formalizing its study. He provided much of the 
vocabulary of the subject, which we describe in some 
detail below. 

2.1 The Forward Map 

In inverse problems, we are interested in determining 
the properties of a system that we call a model, m. The 
model is related to observations that we call data, d. 
To be mathematically precise, we call M the set of all 
possible models under consideration. We would nor- 
mally attach mathematical properties to this set that 
are consistent with the physics. The space of data D 
also needs to be characterized. The forward map is 
the mathematical relationship between a model and its 
associated measurement. Let us indicate this by F(m). 
In an inverse problem we are given data $d$ from which we wish to find, to the extent possible, the unknown model $m$. Thus, we wish to "solve" $F(m) \approx d$ for a given $d$. We use the approximation symbol because noise is inherent in any measurement. In most inverse problems it is unlikely that the data $d$ is in the range of the forward map, i.e., that $F(m) = d$ holds exactly for some $m$ in $M$.

To illustrate the above concept let us consider the inverse problem of X-ray CT. The model in question will be X-ray attenuation, described by an attenuation coefficient $p(x)$ that is a function of position $x$. A possible space of functions for the model is a set of nonnegative functions with an upper bound, namely $L^{\infty}(\Omega)$, where $\Omega$ is the domain being scanned. The data are input and output X-ray intensities, $I_0$ and $I$, respectively, parametrized by X-ray trajectory $L$ (see figure 1), usually given as the logarithm of their ratio: $\log(I(L)/I_0(L))$. The forward map, based on Beer's law, is

$$R(p; I_0, L) = -\int_L p\,d\ell,$$

that is, up to sign, $R(p; I_0, L)$ is the line integral of $p(x)$ along line $L$. The inverse problem of a computer-assisted tomography scan is to determine $p(x)$ from the data $\log(I(L)/I_0(L))$. The found $p(x)$ is often displayed as a set of images, aiding in diagnostics.

Figure 1 Schematic of a generic X-ray CT setup. The intensity of the X-ray entering the body is $I_0$; the intensity at exit is $I$. The data value is $\log(I/I_0)$ for the ray $L$. The forward map is the line integral of the attenuation coefficient $p(x)$ along the line $L$. The inversion of $p(x)$ depends on the measurement setup and is particularly straightforward in two dimensions, as described in section 3.1.


2.2 Inversion and Data Fitting 

There are instances where the solution to the equa- 
tion F(m) = d can be written down explicitly. That is, 
we have a formula for m in terms of the given data 
d. An example is when F is a linear operator whose 
inverse can be calculated. Another popular approach 
is to view the inverse problem as a data-fitting prob- 
lem. In this case we seek a model m that minimizes 
the misfit $\|F(m) - d\|$. The choice of norm is critical
and depends on what additional information is avail- 
able about the map F and the data d. The most impor- 
tant information is the regularity of the map and the 
statistical properties of the noise embedded in the data. 

2.3 Linear and Nonlinear Inverse Problems 

In a linear inverse problem, the forward map satisfies 
the property 

$$F(m_1 + m_2) = F(m_1) + F(m_2).$$

In practical terms, linearity often means that the prob- 
lem is easy to solve. If both m and d are finite dimen- 
sional, then a linear forward map can be represented 
as a matrix. In this instance, the linear inverse problem 
amounts to a linear system of equations. 

For nonlinear inverse problems, the above relation- 
ship does not hold. Nonlinear inverse problems are usu- 
ally more difficult to analyze and to solve, and iterative 
techniques are often employed for solving them. 


One can linearize a nonlinear inverse problem by assuming that the unknown model can be written as

$$m = m_0 + \Delta m,$$

where $m_0$ is known and $\Delta m$ is small. The concept of smallness can be made precise mathematically. Then, if the forward map $F(\cdot)$ associated with the problem is differentiable, we approximate it by a two-term Taylor series:

$$F(m) \approx F(m_0) + G\,\Delta m.$$

Here, $G$ is a linear operator representing the derivative (Jacobian) of the forward map $F(\cdot)$. Thus, given data $d$, the linearized inverse problem seeks to determine $\Delta m$ from the equation

$$G\,\Delta m = d - F(m_0).$$

In some situations we can solve a nonlinear inverse 
problem by successively seeking solutions to a se- 
quence of linearized inverse problems. Such an ap- 
proach, when applied to a least-squares formulation, 
may be viewed as a Gauss-Newton method. 
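
The successive-linearization idea can be sketched for a model with two parameters. Everything specific below is an assumption made for illustration: the toy forward map $F(m) = (m_1,\ m_2 + m_1^2)$ was chosen so that the iteration can be followed by hand, and the linearized problem is solved in the least-squares sense through the $2 \times 2$ normal equations.

```python
def gauss_newton(forward, jacobian, d, m0, n_iter=10):
    """Repeatedly solve the linearized problem G dm = d - F(m) by least squares."""
    m1, m2 = m0
    for _ in range(n_iter):
        F = forward(m1, m2)
        G = jacobian(m1, m2)                      # rows [dF_i/dm1, dF_i/dm2]
        r = [di - Fi for di, Fi in zip(d, F)]     # residual d - F(m)
        # Normal equations (G^T G) dm = G^T r for the two unknowns.
        a11 = sum(g[0] * g[0] for g in G)
        a12 = sum(g[0] * g[1] for g in G)
        a22 = sum(g[1] * g[1] for g in G)
        b1 = sum(g[0] * ri for g, ri in zip(G, r))
        b2 = sum(g[1] * ri for g, ri in zip(G, r))
        det = a11 * a22 - a12 * a12
        if det == 0.0:                            # linearized problem is singular
            break
        m1 += (b1 * a22 - a12 * b2) / det
        m2 += (a11 * b2 - a12 * b1) / det
    return m1, m2


def forward(m1, m2):
    """Toy nonlinear forward map (hypothetical)."""
    return [m1, m2 + m1**2]


def jacobian(m1, m2):
    """Derivative G of the toy forward map at (m1, m2)."""
    return [[1.0, 0.0], [2.0 * m1, 1.0]]


d = forward(1.5, -0.5)                            # noise-free synthetic data
recovered = gauss_newton(forward, jacobian, d, (0.0, 0.0))
```

For this particular map the first step fixes $m_1$ and the second fixes $m_2$, so the true model is recovered after two linearized solves. Realistic problems usually need damping or a line search, which this sketch omits.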

2.4 Ill-Posedness and Ill-Conditioning 

Hadamard defined a mathematical problem as well- 
posed if it has a unique solution that depends contin- 
uously on the data. Another way to put this is to say 
that a problem is well-posed if (i) a solution to the problem
exists, (ii) the solution is unique, and (iii) the solution
depends continuously on the data. A mathematical
problem that does not possess these three attributes is
called ill-posed. There are instances in which a problem
is well-posed but is unstable to perturbations in
the data. That is, the continuity in (iii) is weak. Such
problems are often called ill-conditioned or unstable.
It turns out that most inverse problems of interest are
ill-posed, or at least ill-conditioned.

To put these concepts in the context of an inverse problem, violating (i) is equivalent to saying that, given data $d$ and a forward map $F(\cdot)$, the equation $F(m) = d$ does not have a solution. Violating condition (ii) is equivalent to the existence of two models, $m_1$ and $m_2$, distinct from each other, such that

$$F(m_1) = F(m_2).$$

Finally, violating condition (iii) is equivalent to saying that there are $m_1$ and $m_2$ such that

$\|m_1 - m_2\|$ is arbitrarily large

even if

$\|F(m_1) - F(m_2)\|$ is finite.




This notion is best understood in terms of limits of 
sequences. 

These ideas are very easy to see in the case of a
finite-dimensional linear inverse problem where the
mapping $F(\cdot)$ is a matrix operation: $F(m) = Gm$ for
some matrix $G$. Consider the inverse problem of finding
$m$ in $Gm = d$. Nonexistence is equivalent to having
data $d$ that are not in the range of $G$. Nonuniqueness
amounts to $G$ having a nontrivial null space, i.e., nonzero
solutions to $Gx = 0$. If $x$ is such a null vector and
$m_1 - m_2 = \alpha x$, then $Gm_1 - Gm_2 = 0$ even as
$\alpha \to \infty$, thus violating condition (iii). Finally,
ill-conditioning means that the inverse of $F$ has a large
norm, a circumstance that can be succinctly described
by the singular value decomposition [II.32].
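
These three failure modes can be checked numerically for small matrices. The sketch below uses NumPy's singular value decomposition; the matrices `G` and `G_ill`, the data vector `d_bad`, and the scale factor `1e6` are hypothetical values chosen only to make each effect visible.

```python
import numpy as np

# Hypothetical 3 x 3 matrix whose third column is the sum of the first two,
# so it has a one-dimensional null space and its range is a plane.
G = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
U, s, Vt = np.linalg.svd(G)
null_vec = Vt[-1]                  # right singular vector for the zero singular value

# Nonuniqueness: two models far apart along the null vector give the same data.
m1 = np.array([1.0, 2.0, 3.0])
m2 = m1 + 1e6 * null_vec
same_data = np.allclose(G @ m1, G @ m2, atol=1e-6)

# Nonexistence: d_bad is not in the range of G, so the best fit leaves a residual.
d_bad = np.array([1.0, 0.0, 0.0])
m_ls, *_ = np.linalg.lstsq(G, d_bad, rcond=1e-10)
residual = np.linalg.norm(G @ m_ls - d_bad)

# Ill-conditioning: invertible, but the inverse has norm 1e6.
G_ill = np.array([[1.0, 0.0],
                  [0.0, 1e-6]])
cond = np.linalg.cond(G_ill)       # ratio of largest to smallest singular value

print(s, same_data, residual, cond)
```

The explicit `rcond=1e-10` tells `lstsq` to treat the rounding-level singular value as zero, which is the numerical counterpart of acknowledging the null space.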

The study of inverse problems is often about mathe- 
matically overcoming ill-posedness or ill-conditioning. 
Various techniques can help ameliorate its disastrous 
effects, as we now explain. 

2.5 A Priori Information and Preferences 

We often have prior information about the unknown in 
an inverse problem, and such information can be valu- 
able in solving the problem. We have previously alluded 
to situations where the model is of the form 

$$m = m_0 + \Delta m.$$

Under such circumstances, we can view $m_0$ as a priori
information. There are also situations in which we
know some characteristics of the unknown. An exam- 
ple is in imaging, where the model is an image. Suppose 
we know that the image is piecewise constant or that 
the image is sparse. We can then formulate the prob- 
lem such that the solution has the desired properties. 
The technique to be described next is often employed 
to enforce a preference. 

2.6 Regularization 

Tikhonov regularization is an approach for solving an 
ill-posed or ill-conditioned problem. It involves intro- 
ducing auxiliary terms that make the problem well- 
posed. The classical inverse problem considered by 
Tikhonov is a linear inverse problem. Here, the forward 
map is a linear operator acting on the model parame- 
ters, given by Gm. We are given data d and wish to find 
m. Instead of posing an equation for m, we seek the 
minimizer to 

$$\|Gm - d\|^2 + \lambda\|Bm\|^2. \tag{1}$$


Here, we have used the $L^2$-norm. The second term in (1)
is often referred to as the penalty term. The linear
operator $B$ is called the regularizing operator. Some
mathematical properties are imposed on $B$. Examples include
$B = I$ (the identity), which enforces smallness in $m$,
and $B$ equal to the equivalent of the spatial derivative,
which enforces smoothness in $m$ when $m$ represents
parameters distributed over space.
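
For a finite-dimensional problem, the minimizer of (1) solves the normal equations $(G^T G + \lambda B^T B)m = G^T d$. The following Python sketch illustrates this under stated assumptions: the function names `tikhonov` and `first_difference`, the Hilbert-matrix test problem, the noise level, and the value of the penalty weight `lam` are all hypothetical choices made for the illustration.

```python
import numpy as np

def tikhonov(G, d, lam, B=None):
    """Minimize ||G m - d||^2 + lam * ||B m||^2 via the normal equations."""
    n = G.shape[1]
    if B is None:
        B = np.eye(n)              # B = I penalizes the size of m
    A = G.T @ G + lam * (B.T @ B)
    return np.linalg.solve(A, G.T @ d)

def first_difference(n):
    """(n-1) x n forward-difference matrix, a discrete spatial derivative."""
    return np.diff(np.eye(n), axis=0)

# Hypothetical ill-conditioned test problem: an 8 x 8 Hilbert matrix.
n = 8
idx = np.arange(n)
G = 1.0 / (idx[:, None] + idx[None, :] + 1.0)
m_true = np.ones(n)
rng = np.random.default_rng(0)
d = G @ m_true + 1e-4 * rng.standard_normal(n)   # noisy data

m_naive = np.linalg.solve(G, d)                  # no regularization
m_small = tikhonov(G, d, lam=1e-6)               # B = I
m_smooth = tikhonov(G, d, lam=1e-6, B=first_difference(n))

err_naive = np.linalg.norm(m_naive - m_true)
err_small = np.linalg.norm(m_small - m_true)
err_smooth = np.linalg.norm(m_smooth - m_true)
print(err_naive, err_small, err_smooth)
```

Forming $G^T G$ squares the condition number, so for larger problems one would more likely solve the equivalent stacked least-squares system; the normal equations are used here only because they mirror (1) directly.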

The point of Tikhonov theory is a notion of “consis- 
tency.” To get a sense of its mathematical significance, 
suppose that $G$ is not invertible. Let $m_0$ be the true
model; thus, $d_0 = Gm_0$ is noiseless data for the inverse
problem. We use $e = d - d_0$ to represent the noise in
the data for a given experiment. Denote by $m$ the
minimizer of (1) for a given $d$. Tikhonov theory states that,
under the right mathematical assumptions on $G$ and
$B$, there exists a sequence of values for $\lambda$ such that if
there is a sequence of data $d \to d_0$, then the minimizer
$m \to m_0$.

While such a theory is of limited practical value, 
the use of regularization is very powerful for solving 
ill-conditioned inverse problems. The main challenge 
in its use is choosing the penalty parameter $\lambda$. There
are two main methods used for setting $\lambda$: the L-curve
method and the method of cross-validation. The former 
is intuitive, whereas the latter is based on statistical 
assumptions on the noise in the data. 
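
The raw material for the L-curve method is a table of residual norms against solution norms as the penalty parameter varies. The sketch below computes such a table for the case $B = I$ using the singular value decomposition. The function name `l_curve_points`, the test matrix, the noise level, and the grid of parameter values are assumptions made for illustration; locating the "corner" of the curve is left to inspection.

```python
import numpy as np

def l_curve_points(G, d, lambdas):
    """Residual and solution norms of the Tikhonov minimizer (B = I) for each lambda."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    beta = U.T @ d
    residual_norms, solution_norms = [], []
    for lam in lambdas:
        m = Vt.T @ (s / (s**2 + lam) * beta)     # filtered SVD solution
        residual_norms.append(np.linalg.norm(G @ m - d))
        solution_norms.append(np.linalg.norm(m))
    return np.array(residual_norms), np.array(solution_norms)

# Hypothetical test problem: 8 x 8 Hilbert matrix with noisy data.
n = 8
idx = np.arange(n)
G = 1.0 / (idx[:, None] + idx[None, :] + 1.0)
rng = np.random.default_rng(0)
d = G @ np.ones(n) + 1e-4 * rng.standard_normal(n)

lambdas = np.logspace(-10, 0, 11)
res, sol = l_curve_points(G, d, lambdas)
for lam, r, s_norm in zip(lambdas, res, sol):
    print(f"lambda={lam:.0e}  residual={r:.3e}  solution norm={s_norm:.3e}")
```

Plotting `sol` against `res` on logarithmic axes typically gives the L-shaped curve from which the method takes its name: increasing the parameter trades a larger residual for a smaller solution norm.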

It should be pointed out that different choices of 
the regularizing operator B and the norms used can 
give very different answers. For example, when $B$ is the
derivative operator and the $L^1$-norm is used, we get the
well-known total variation regularization. When $B = I$
and the $L^1$-norm is again used, we get what is now
known as compressed sensing [VII.10].
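
The effect of the norm is easiest to see in the simplest possible setting, $G = I$ and $B = I$, where both penalized problems have closed-form minimizers. This is a deliberately reduced illustration and not compressed sensing proper, which involves an underdetermined $G$; the data vector `d`, the weight `lam`, and the function names are hypothetical.

```python
import numpy as np

def l2_shrink(d, lam):
    # Minimizer of ||m - d||^2 + lam * ||m||_2^2: uniform scaling toward zero.
    return d / (1.0 + lam)

def l1_soft_threshold(d, lam):
    # Minimizer of ||m - d||^2 + lam * ||m||_1: soft thresholding at lam / 2.
    return np.sign(d) * np.maximum(np.abs(d) - lam / 2.0, 0.0)

d = np.array([3.0, -0.2, 0.1, -2.0, 0.05])   # two large entries, three small ones
lam = 1.0
m_l2 = l2_shrink(d, lam)
m_l1 = l1_soft_threshold(d, lam)
print(m_l2)
print(m_l1)
```

The squared-norm penalty shrinks every entry by the same factor and leaves all of them nonzero, whereas the $L^1$ penalty sets the small entries exactly to zero. That is the mechanism by which an $L^1$ penalty expresses a preference for sparse solutions.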

2.7 A Statistical Approach 

Bayesian statistics offer a different point of view when 
solving inverse problems. In this statistical approach 
we view the inverse problem as one where we have a 
prior distribution on the unknown model m. The data 
d is used to update the prior. 

To be more precise, instead of a model m and an 
observation d, we consider random variables M and D 
representing models and data. By having a prior for the 
model, we mean that we have a probability distribution 
$\pi(m)$. For example, $m$ could be finite dimensional and
$\pi(m)$ could be a multivariate normal distribution. In
the statistical framework, data $d$ is observed and we
wish to know the conditional probability distribution
$\pi(m \mid d)$. The question becomes that of updating the
prior probability distribution for $m$ given that we have
observed $d$.

The update is done using Bayes's rule [V.11 §1]. This
requires a model for the joint probability distribution
$\pi(m, d)$, which may be as simple as

$$\pi(m, d) = f(Gm - d),$$

depending on the assumptions. 
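
In one special case the update can be written down in closed form. Assume, purely for illustration, a zero-mean Gaussian prior with covariance $\sigma_m^2 I$ and independent Gaussian noise with standard deviation $\sigma_d$ on the data; the posterior is then Gaussian, and its mean coincides with a Tikhonov solution with $B = I$ and penalty weight $(\sigma_d/\sigma_m)^2$. The function name `gaussian_posterior` and all numerical values below are hypothetical.

```python
import numpy as np

def gaussian_posterior(G, d, sigma_d, sigma_m):
    """Posterior mean and covariance for d = G m + noise, under the Gaussian assumptions above."""
    n = G.shape[1]
    precision = G.T @ G / sigma_d**2 + np.eye(n) / sigma_m**2
    cov = np.linalg.inv(precision)
    mean = cov @ (G.T @ d) / sigma_d**2
    return mean, cov

# Hypothetical nearly singular 2 x 2 problem.
G = np.array([[1.0, 1.0],
              [1.0, 1.001]])
d = np.array([2.0, 2.001])
sigma_d, sigma_m = 0.1, 1.0

mean, cov = gaussian_posterior(G, d, sigma_d, sigma_m)
half_width = 1.96 * np.sqrt(np.diag(cov))    # approximate 95% interval half-widths

# The same point estimate from Tikhonov regularization with B = I.
lam = (sigma_d / sigma_m) ** 2
m_tik = np.linalg.solve(G.T @ G + lam * np.eye(2), G.T @ d)
print(mean, half_width, m_tik)
```

What the statistical view adds is the covariance: the interval half-widths quantify how poorly the data constrain each component, which a point estimate alone does not reveal.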

Computationally, statistical methods can be very 
costly. However, the payoff can be large because one 
is often able to get an answer to an inverse prob- 
lem with confidence intervals. The method of apply- 
ing the Bayesian framework to inverse problems is 
often referred to as inversion with uncertainty quan- 
tification. Aside from the computational cost, a cri- 
tique sometimes leveled against this approach is that 
the prior probability distribution for the model $\pi(m)$
is very difficult to obtain in practice. Nevertheless, 
Bayesian approaches have become a major tool for 
solving inverse problems. 



Figure 2 Schematic of X-ray CT in two dimensions. The
dashed line represents the path of the X-ray. The projection
is parametrized by the angle $\theta$. Data at this angle are the
function $P_\theta(s)$, where $s$ parametrizes the ray displacement
as indicated.


3 Selected Examples of Inverse Problems 

3.1 X-Ray Computed Tomography 

X-ray CT is a method by which the X-ray absorption 
coefficients of a body are estimated. We have already 
briefly mentioned the mathematical problem associ- 
ated with CT in section 2.1. We provide more detail 
here. 

We consider the problem in two dimensions. In figure 2
the path of the X-ray is indicated by the dashed
line, denoted by $L_{\theta,s}$. The angle $\theta$ is one of the
parameters of the ray, the other being the displacement $s$.
The ray path $L_{\theta,s}$ is given by

$$x\cos\theta + y\sin\theta - s = 0.$$

Letting $p(x,y)$ be the attenuation and letting $d\ell$ be the
length element along the path, the forward map for the
inverse problem of CT is

$$R_\theta[p](s) = \int_{L_{\theta,s}} p(x,y)\,d\ell.$$

In the inverse problem, we are given data $P_\theta(s)$ for
$0 \le \theta < 2\pi$ and $a \le s \le b$, and we solve

$$R_\theta[p](s) = P_\theta(s)$$

for p(x,y). We may assume that p(x,y) is compactly 
supported. Modern CT can be attributed to the work of 
Cormack and Hounsfield, who shared the 1979 Nobel 
Prize in Physiology or Medicine for their work. However,
the mathematics that makes CT a reality dates back to
the work of Radon, who in 1917 showed that a function
could be reconstructed from its projections. One can
use Radon’s theory to obtain an inversion formula for
$p(x,y)$ given $P_\theta(s)$, which is given by

$$p(x,y) = \frac{1}{2\pi^2} \int_0^{\pi} \int_{-\infty}^{\infty} \frac{P'_\theta(s)}{x\cos\theta + y\sin\theta - s}\,ds\,d\theta,$$

although the formula is of limited practical use. Here,
$P'_\theta(s)$ is the derivative of $P_\theta(s)$ with respect to $s$.

Several practical computational approaches are avail- 
able for solving the inverse problem of X-ray CT. Some 
are based on Fourier transforms, while others are based 
on solving a system of linear equations. Modern CT 
scanners are designed for high-speed data acquisition 
and low X-ray dosage. Algorithms are designed for each 
machine in order to optimize computational efficiency 
and accuracy. 
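
The linear-system view can be illustrated on a toy discretization. Assume, for illustration only, a 2 x 2 pixel image in which each ray crosses whole pixels with unit intersection length, so that each measurement is a plain sum of pixel values. The function name `ray_sum_matrix` and the image values are hypothetical.

```python
import numpy as np

def ray_sum_matrix(n):
    """One row per ray: n horizontal rays followed by n vertical rays."""
    rows = []
    for i in range(n):                  # horizontal rays, one per image row
        mask = np.zeros((n, n))
        mask[i, :] = 1.0
        rows.append(mask.ravel())
    for j in range(n):                  # vertical rays, one per image column
        mask = np.zeros((n, n))
        mask[:, j] = 1.0
        rows.append(mask.ravel())
    return np.array(rows)

image = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
A = ray_sum_matrix(2)
data = A @ image.ravel()

# With only two projection angles the system is rank deficient:
# a checkerboard pattern is invisible to every row and column sum.
rank_two_angles = np.linalg.matrix_rank(A)
checkerboard = np.array([1.0, -1.0, -1.0, 1.0])

# Adding a single diagonal ray removes the ambiguity in this tiny example.
diag_ray = np.eye(2).ravel()
A_aug = np.vstack([A, diag_ray])
data_aug = A_aug @ image.ravel()
recovered, *_ = np.linalg.lstsq(A_aug, data_aug, rcond=None)
print(rank_two_angles, recovered.reshape(2, 2))
```

Even this small case shows why data from many angles are needed: too few projections leave a null space, which is the nonuniqueness discussed in section 2.4.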

3.2 Seismic Travel-Time Tomography 

In travel-time tomography, the data consist of records 
of the time taken for a wave packet to leave a source, 
travel through the Earth’s interior, and arrive at a point 
on the surface of the Earth. Such data are called travel- 
time data and can be extracted from seismic records. 
In the simplest model, the Earth is flat and two dimen- 
sional. The sound speed depends only on depth, and 
the source is located on the surface. Using geometrical
optics we follow a ray emanating from the source and
trace its path back to the surface (see figure 3). A ray is
parametrized by the angle it makes with the surface at
the source.

Figure 3 Schematic of a travel-time inverse problem. A wave
source is located at the origin. A ray emanates from the
source, and its trajectory is determined by the take-off angle
and the sound speed profile $c(z)$. It returns to the surface
at $x = X(p)$, where $p$ is the ray parameter. The inverse
problem is to find $c(z)$ from $X(p)$.

Suppose that the unknown sound speed profile as a
function of depth is $c(z)$. Let us locate the source at
the origin. A ray leaves the origin, making an angle $i_0$
with respect to the vertical. According to geometrical
optics, the quantity

$$p = \frac{\sin i}{c(z)}$$

is constant along the ray, where $i$ is the angle the ray
makes at depth $z$, measured clockwise from the vertical.
This quantity is called the ray parameter. Fermat’s
principle governs the ray trajectory, and the point at
which a ray with parameter $p$ arrives on the surface,
$X(p)$, is given by

$$X(p) = 2\int_0^{Z(p)} \frac{p\,dz}{\sqrt{c(z)^{-2} - p^2}}.$$

The function $Z(p)$ is the maximum depth the ray
reaches in its trajectory, and it is the solution to

$$c(Z(p)) = 1/p,$$

since the ray’s angle $i$ is $\pi/2$ at this point. In the inverse
problem, we are given travel-time data $X(p)$ for a range
of values of $p$, and the goal is to recover $c(z)$ to a
maximum depth possible.

This problem was considered by Herglotz (1907) and 
Wiechert (1910), who provided a formula for the solu- 
tion. It is related to the well-known Abel problem in 
which one attempts to find the shape of a hill given the 
return times of a particle that goes up the hill at fixed 
velocities. Bôcher (1909) studied the Abel problem and
arrived at the mathematical conditions under which
the hill can be constructed uniquely. The techniques
of Bôcher can be applied to the geophysical travel-time
problem. A solution that gives $z$ as a function of $c$ is
available for this problem:

$$z(c) = \frac{1}{\pi} \int_{1/c}^{1/c_0} \frac{X(p)}{\sqrt{p^2 - c^{-2}}}\,dp,$$

where $c_0 = c(0)$ is the sound speed at the surface.
The formula is valid when $X(p)$ has continuous derivatives.
In particular, it is not valid when $X(p)$ is multivalued,
which occurs when $c(z)$ has a “low-velocity
zone,” i.e., an interval in which $c(z)$ dips below an
otherwise-increasing trend.
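
An inversion formula of this type can be checked numerically on a profile for which everything is known in closed form. The sketch below assumes a linear profile $c(z) = c_0 + gz$, for which rays are circular arcs and $X(p) = 2\sqrt{1 - p^2 c_0^2}/(pg)$, with $X(p)$ counting both the downgoing and upgoing legs of the ray. It also assumes that the inversion integral runs from $p = 1/c$ to $p = 1/c_0$. The constants `C0` and `GRAD`, the function names, and the number of quadrature nodes are hypothetical.

```python
import numpy as np

C0, GRAD = 2.0, 0.5   # hypothetical surface speed and velocity gradient

def X_linear(p):
    # Surface return distance for c(z) = C0 + GRAD * z (rays are circular arcs).
    return 2.0 * np.sqrt(np.maximum(1.0 - (p * C0) ** 2, 0.0)) / (p * GRAD)

def depth_from_X(c, X, c0, n_nodes=200):
    """Depth at which the speed equals c, recovered from X(p).

    Evaluates z(c) = (1/pi) * int_{1/c}^{1/c0} X(p) / sqrt(p^2 - c^-2) dp.
    The substitution p = cosh(t) / c removes the endpoint singularity, giving
    z(c) = (1/pi) * int_0^{arccosh(c/c0)} X(cosh(t) / c) dt.
    """
    t_max = np.arccosh(c / c0)
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    t = 0.5 * t_max * (nodes + 1.0)          # map [-1, 1] to [0, t_max]
    return 0.5 * t_max * np.sum(weights * X(np.cosh(t) / c)) / np.pi

c_query = 3.0
z_est = depth_from_X(c_query, X_linear, C0)
z_exact = (c_query - C0) / GRAD              # depth where c(z) = c_query
print(z_est, z_exact)
```

With measured rather than analytic $X(p)$ the same quadrature applies, but the data would need to be interpolated in $p$, and noise in $X(p)$ propagates into the recovered depths.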


3.3 Geophysical Inversion for Near-Surface 
Properties 

In an attempt to determine the near-surface (several 
kilometers in depth) properties of the Earth, geophysi- 
cists perform experiments on the surface to gather data 
from which they hope to infer the properties of the sub- 
surface. One such experiment is seismic exploration, 
wherein elastic waves are used to probe the Earth. 
As these waves propagate into the Earth, the inhomo- 
geneities in the Earth diffract and reflect the waves in 
a process called scattering. The scattered waves are 
measured on the surface using geophones. The inverse 
problem is to determine the material properties of the 
Earth from the scattered data. 

In a typical measurement, a localized source, in 
the form of heavy equipment that “thumps” the sur- 
face, is introduced. An array of geophones collects the 
response of the Earth to the source. Data collection 
is over a given time window. Such a measurement is 
called a “shot.” The measurement is then repeated at 
another source location. The totality of the data
consists of geophone readings from a number of shots (see
figure 4). 

Geophysicists have developed a number of approxi- 
mation methods to interpret such data. These methods 
have been successful in situations where the wave phe- 
nomena are well modeled by the approximations. Cur- 
rent research focuses on more accurate modeling of 
the wave phenomena, the development of problem for- 
mulations that are computationally feasible, and com- 
putational methods that exploit the power of modern 
computers. 

In the simplest useful approximation, the Earth is
modeled as an acoustic fluid occupying a two- or
three-dimensional half-space, with spatially variable bulk
modulus $\kappa(x)$ and constant density $\rho$. Despite the
neglect of significant physics, this model underlies
the vast bulk of contemporary seismic imaging and
inversion technology.

Figure 4 Data collection in seismic exploration. The
vibrator truck is capable of “thumping” the surface and produces
waves that travel into the interior of the Earth. These waves
are reflected by the inhomogeneities in the Earth and return
to the surface. Geophones record the reflected waves. This
“shot” is repeated after the source and the receiver array
have been moved to a new location.

The examples that follow are two dimensional. The
state of the acoustic Earth is captured by the pressure
field $u(x,t)$: $x = (x_1, x_2)$ are spatial coordinates
with $x_2 > 0$, and $t$ is time. The acoustic wave equation
satisfied by $u(x,t)$ is

$$\frac{1}{\kappa}\frac{\partial^2 u}{\partial t^2} = \frac{1}{\rho}\Delta u + f^{(i)}(x,t),$$

where $f^{(i)}(x,t)$ is a time-varying and spatially localized
source of energy and $f^{(i)}(x,t) = 0$ for $t < 0$. The index
$i$ refers to the location of the source. On the boundary
$\{x : x_2 = 0\}$, representing the Earth’s surface, we
assume the Dirichlet boundary condition u = 0. This 
is a reasonable approximation, as pressures in the air 
are orders of magnitude smaller than pressures in the 
water or rock. 

Since the seismic experiment consists of several
shots, it produces several pressure fields $u^{(i)}(x,t)$,
caused by several source fields $f^{(i)}(x,t)$ representing
varying placements of the “thumping” machinery,
$i = 1, 2, \dots, n$. The data recorded by the geophones
is the pressure at a number of geophone locations
$\{x_j^{(i)} : j = 1, 2, \dots, m\}$ near the surface (but not on it:
there, $u = 0$!). Note that the geophone locations may
depend on the shot index i; the entire apparatus of the 
survey may move from shot to shot, not just the energy 
source equipment. 


The seismic inverse problem can be posed as a
data-fitting optimization problem, as explained in
section 2.2. In the terminology of that section, the model
$m$ is the bulk modulus $\kappa(x)$ and the data $d$ are the
vectors of functions of time, one for each shot and receiver
location. The value of the forward map $F$ is the
sampling of the pressure field predicted via solution of the
wave equation:

$$F[\kappa] = \{u^{(i)}(x_j^{(i)}, t) : i = 1, \dots, n,\ j = 1, \dots, m,\ 0 \le t \le T\}.$$

The inverse problem is to determine $\kappa(x)$ (insofar as
possible) from the data
$\{d_j^{(i)}(t) : i = 1, \dots, n,\ j = 1, \dots, m,\ 0 \le t \le T\}$:

$$F[\kappa] \approx d.$$

The most commonly used optimization formulation of
this problem seeks to choose $\kappa$ to minimize the
mean-square residual

$$\|F[\kappa] - d\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \int_0^T |u^{(i)}(x_j^{(i)}, t) - d_j^{(i)}(t)|^2\,dt.$$

In the current seismic literature, estimation of $\kappa$ (or
other mechanical parameter fields) by minimization of 
the mean-square residual is known as full waveform 
inversion (FWI). This approach to extracting informa- 
tion about the Earth from seismic data was first stud- 
ied in the 1980s. Early implementations were gener- 
ally unsuccessful, partly because at that time only two- 
dimensional simulation was feasible. Since the turn 
of the twenty-first century, though, advances in algo- 
rithms and hardware performance have enabled three- 
dimensional simulation and iterative minimization of 
the mean-square residual for three-dimensional
distributions of $\kappa$ and similar parameters. While the
technology is still in the early stages of development, it has
already become clear that FWI can produce information 
about the geometry of subsurface rock that is far bet- 
ter in terms of both quality and resolution than that 
obtainable with older methods based on more drastic 
approximation. 

While plenty of FWI success stories can be found, 
we see failures too: failures driven by a fundamental 
mathematical difficulty. This issue was discovered in 
the 1980s, the first period of active research on FWI, 
and remains the main impediment to its widespread 
use. 

The problem is easy to illustrate with a two-dimensional
example. Figure 5(a) displays an example
two-dimensional bulk modulus $\kappa$ that is widely used in
research on FWI: the Marmousi model. We choose one
source location and solve the wave equation numerically
to produce the simulated data depicted in figure 5(b).
The simulated data consist of pressure readings
at a set of geophones located equidistant from
each other, starting some distance from the source. The
figure displays the recordings for a single shot, at
horizontal position 6 km from the left edge of the model
in figure 5(a), at a depth of 6 m. The horizontal axis in
figure 5(b) is receiver index ($j = 1, \dots, m$, $m = 96$) and
the vertical axis is time. The pressure reading is given
a gray level value.

Figure 5 (a) The Marmousi model for the bulk modulus.
(b) The simulated seismic data associated with the model.

To demonstrate the sensitivity of the cost function
to a small change in the bulk modulus, we made a
convex combination of the bulk modulus in figure 5(a)
(95%) with a constant background of $\kappa_0 = 2.25$ GPa
(5%). Let $\kappa_1$ be the bulk modulus in figure 5(a), and let
$\kappa_2 = 0.95\kappa_1 + 0.05\kappa_0$. We will measure the difference in
the forward maps for these two bulk moduli. The forward
map at $\kappa_2$ is the simulated pressure field shown in
figure 6(a), and the residual, the difference between the
two maps, $F(\kappa_2) - F(\kappa_1)$, is shown in figure 6(b). The
difference $F(\kappa_2) - F(\kappa_1)$ is very visible. In fact, the
$L^2$-norm of the difference is almost twice as large (184%)
as the norm of the simulated data $F(\kappa_1)$.

Figure 6 (a) The simulated seismic data with a different
model, consisting of 0.95 times the model in figure 5(a)
added to 0.05 times a constant bulk modulus of 2.25 GPa;
the net difference is well under 5% root mean square.
(b) The residual, which is the difference between the data
in figure 5(b) and part (a) of this figure.

From this example we can already see that the pre- 
dicted data may change very rapidly as a function of 
the model. The reason for this rapid change is that the 
oscillatory signal has shifted in time by a large fraction 
of a wavelength, which yields a large mean-square ($L^2$)
change. This “cycle skipping” phenomenon arises from 
the influence of the bulk modulus on the speed of the 
waves embedded in the solution of the wave equation.

The size of the predicted data (figure 6(a)) has not
changed much, however. For a variety of reasons,
principally conservation of acoustic energy, the overall size
of the data (measured with the $L^2$-norm, for example)
depends only weakly on the model ($\kappa$). This
combination of very rapid change while staying the same
size suggests that the predicted data cannot continue 
to “run away” from the “observed” data; the distance 
between them must oscillate. 

In another numerical experiment we made a convex
combination of 90% Marmousi bulk modulus and
10% constant bulk modulus, i.e., $\kappa_3 = 0.9\kappa_1 + 0.1\kappa_0$.
We found $\|F(\kappa_3) - F(\kappa_1)\| / \|F(\kappa_1)\|$ to be 144%. We
note that this is smaller than $\|F(\kappa_2) - F(\kappa_1)\| / \|F(\kappa_1)\|$.
That is, we have observed that the least-squares objective
function (the mean-square residual) has local minima
other than its global minimum, at least when
restricted to the line segment through the target model
in figure 5(a) and the homogeneous $\kappa = \kappa_0$.
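
The mechanism behind these local minima can be reproduced without solving a wave equation. In the sketch below, a change in the model is caricatured as a pure time shift of an oscillatory pulse, and the least-squares misfit is evaluated as a function of that shift. The pulse shape, the frequency `FREQ`, the envelope width, and the range of shifts are all hypothetical; this is a one-dimensional caricature, not a simulation of the Marmousi experiment.

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 2001)   # time axis in seconds
FREQ = 10.0                        # hypothetical dominant frequency in Hz

def signal(shift):
    # Gaussian-windowed oscillation, delayed by `shift` seconds.
    tau = t - shift
    return np.exp(-(tau / 0.2) ** 2) * np.cos(2.0 * np.pi * FREQ * tau)

observed = signal(0.0)
shifts = np.linspace(0.0, 0.3, 301)
misfit = np.array([np.sum((signal(s) - observed) ** 2) for s in shifts])
relative = np.sqrt(misfit / np.sum(observed ** 2))   # relative L2 residual

# Interior grid points where the misfit is lower than at both neighbors.
is_local_min = (misfit[1:-1] < misfit[:-2]) & (misfit[1:-1] < misfit[2:])
local_min_shifts = shifts[1:-1][is_local_min]

print("largest relative residual:", relative.max())
print("shifts at spurious local minima:", local_min_shifts)
```

A shift of half a period puts the two signals in antiphase, so the relative residual approaches 200% even though neither signal has changed size. A shift of a full period realigns the oscillations and produces a local minimum away from the true answer, which is where a local search method can stall.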

This is a very serious obstacle because the compu- 
tational size of these problems is so large that only 
variants of local search algorithms (mainly, Newton’s 
method) are (barely) feasible, and these find only local 
extrema. 

The problem of local minima turns up with consider- 
able regularity in research into, and field applications 
of, FWI. Since, in the field, we do not know the answer 
a priori, a frequent result is “zombie inversion”: a fail- 
ure to accurately estimate the structure of the Earth 
without any effective indication of failure. Quality con- 
trol of FWI is a current research topic of great interest. 
Acquisition of unusually low-frequency data can lead to 
an escape from the local-minimum problem, and this 
is another topic of great current interest. At the time 
of writing, there is still no mathematical justification 
of the effectiveness of low-frequency inversion. Finally, 
considerable effort is being put into alternative objec- 
tive functions with better convexity properties than the 
least-squares function studied here. 

3.4 Electrical Impedance Tomography 

In electrical impedance tomography we are given an 
object whose spatially dependent conductivity is un- 
known (see figure 7). Electrostatic measurements are 
taken on the boundary of the object. The objec- 
tive is to determine the unknown conductivity from 
these boundary measurements. This problem is often 
referred to as the “Calderón problem” because Alberto
Calderón was the first to study it mathematically,
in 1980. 



Figure 7 A schematic of electrical impedance tomography.
The body $\Omega$ has variable conductivity in its interior. The
inverse problem is to determine the conductivity distribution
from boundary measurements. In practice, the measurement
is done by attaching electrodes to the boundary
of $\Omega$, indicated in the figure by black circles. Current
is assigned to the electrodes, while voltages are measured
at all the electrodes. The data so collected represent a
sampling of the Dirichlet-to-Neumann map.


Let $\sigma(x)$, where $x$ is the spatial variable in two or
three dimensions, represent the conductivity of the
object $\Omega$. Potential theory, which follows from Ohm’s
law, states that the electrical potential in $\Omega$ satisfies

$$\nabla \cdot (\sigma(x) \nabla u) = 0$$

if there are no sources in the interior of $\Omega$. In the
idealized mathematical problem, we are allowed to
prescribe any boundary value $g$ to $u(x)$, i.e.,

$$u|_{\partial\Omega} = g.$$

For each $g$ we measure the normal derivative of the
potential $u(x)$ at the boundary:

$$\left.\frac{\partial u}{\partial \nu}\right|_{\partial\Omega} = f.$$

We therefore have an $f$ for every $g$. The totality of these
pairs of functions is our data. The mathematical name
for such data is the Dirichlet-to-Neumann map. It is
clear that it depends on $\sigma(x)$. Denoting this map by $\Gamma_\sigma$,
we have $f = \Gamma_\sigma g$. The inverse problem is to determine
$\sigma(x)$ given $\Gamma_\sigma$.
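
A discrete analogue helps to make the Dirichlet-to-Neumann map concrete. Replace the body by a network of resistors: boundary nodes play the role of the boundary, edge conductances play the role of the conductivity, and Kirchhoff's laws replace the differential equation. The map from boundary voltages to boundary currents is then a Schur complement of the network's Kirchhoff matrix. This is a sketch of the discrete analogue only, not of the continuum problem; the function names and the small star-shaped network are hypothetical.

```python
import numpy as np

def kirchhoff_matrix(n_nodes, edges):
    """Weighted graph Laplacian; `edges` holds (node_i, node_j, conductance)."""
    K = np.zeros((n_nodes, n_nodes))
    for i, j, sigma in edges:
        K[i, i] += sigma
        K[j, j] += sigma
        K[i, j] -= sigma
        K[j, i] -= sigma
    return K

def dirichlet_to_neumann(K, boundary):
    """Map boundary voltages to boundary currents by eliminating interior nodes."""
    interior = [k for k in range(K.shape[0]) if k not in boundary]
    K_bb = K[np.ix_(boundary, boundary)]
    K_bi = K[np.ix_(boundary, interior)]
    K_ii = K[np.ix_(interior, interior)]
    return K_bb - K_bi @ np.linalg.solve(K_ii, K_bi.T)

# Hypothetical star network: interior node 3 joined to boundary nodes 0, 1, 2.
boundary = [0, 1, 2]
edges = [(0, 3, 1.0), (1, 3, 2.0), (2, 3, 3.0)]
Lam = dirichlet_to_neumann(kirchhoff_matrix(4, edges), boundary)

# Changing one conductance changes the boundary map.
edges_changed = [(0, 3, 1.0), (1, 3, 2.0), (2, 3, 6.0)]
Lam_changed = dirichlet_to_neumann(kirchhoff_matrix(4, edges_changed), boundary)
print(Lam)
```

The matrix is symmetric and each of its rows sums to zero, reflecting the fact that a constant boundary voltage drives no current. The discrete inverse problem is to recover the conductances from this matrix.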

In practice, we are not given the Dirichlet-to-Neu- 
mann map in its entirety. Rather, we are given a finite 
collection of pairs

$$(g_i, f_i), \quad i = 1, 2, \dots, n.$$


A practical approach to measurement is to attach a
number of electrodes to the surface of the object.
Current is made to flow between a pair of electrodes,
while voltage is measured on all the electrodes.
Therefore, in practice, we do not have the functions
$(g_i, f_i)$ themselves but rather samples of their values
at the electrodes.

The Calderón problem has received a lot of attention
in the past thirty years due to its practical importance.
Barber and Brown developed a practical medical
imaging device using this principle in 1984. The
same problem crops up in other applications such as 
nondestructive testing and geophysical prospecting. 

The mathematical question of unique determination
of $\sigma(x)$ from $\Gamma_\sigma$ is a well-developed area. The results
depend on the number of dimensions. In two dimensions,
the earliest uniqueness result (from 1984) is due
to Kohn and Vogelius, who showed that if $\sigma(x)$ is analytic
then it is uniquely determined by $\Gamma_\sigma$. The next
seminal result on this subject was due to Sylvester
and Uhlmann in 1987. They proved global uniqueness
results in dimensions three and higher. Their approach
is based on the powerful method of complex geometrical
optics. Global uniqueness was proved a few years
later by Nachman in two dimensions using a method
called $\partial$-bar.

Recent work has focused on how rough $\sigma(x)$ can be
while retaining uniqueness and on the case where $\sigma(x)$
is a matrix function. The latter study has led to the
discovery of transformation optics [VI.1], which has
provided a strategy for cloaking and invisibility.

While uniqueness can be established, it is very difficult
to reconstruct $\sigma(x)$ from actual measurements
in practice. This is due to the fact that the problem
of determining $\sigma(x)$ from $\Gamma_\sigma$ is very ill-conditioned.
There have been several successful approaches. They
are based on linearization of the relationship between
$\Gamma_\sigma$ and $\sigma$, on least-squares fitting, and on a direct
reconstruction method called the $\partial$-bar method. The
last approach involves synthesizing data for a scattering
problem from the measured data, which in itself is
ill-conditioned.

4 Related Problems 

There are problems that could be viewed as inverse 
problems but do not go by that name. One set arises 
in the study of control theory for distributed parame- 
ter systems. Statisticians view an inverse problem as a 
problem of parameter estimation from data. This point 
of view precedes the more recent statistical approaches 
to inverse problems. 


5 Outlook 

Inverse problems is an active research field with sev- 
eral devoted journals and a community of researchers 
coming from many disciplines. Many of the prob- 
lems arise in engineering and scientific applications 
but the core language and tools are mathematical in 
nature. Progress in new imaging technologies, includ- 
ing functional magnetic resonance imaging and pho- 
toacoustic tomography, has been made possible by 
advances in inverse problems. Inverse problems tech- 
niques have also made important contributions to engi- 
neering fields such as nondestructive evaluation of 
materials and structures.


IV.16 Computational Science

David E. Keyes 


1 Definitions

Computational science, the systematic study of the
structure and behavior of natural and human-engineered
systems accomplished by computational means,
embraces the domains of mathematical modeling,
mathematical analysis, numerical analysis, and computing.
In order to reach the desired degrees of predictive
power, fidelity, and speed of turnaround, computational
science often stretches the state of the art
of computing in terms of both hardware and software
architecture. Because each decision made along
the way, such as the choice between a partial differential
equation (PDE) and a particle representation
for the model, between a structured and an unstructured
grid for the spatial discretization, between an
implicit and an explicit time integrator, or between a
direct and an iterative method for the linear solver,
can greatly affect computational performance, computational
science is a highly interdisciplinary field of
endeavor.

Definitions of computational science and the related 
discipline of scientific computing are not universal, but 
we may usefully distinguish between them by consid- 
ering computational science to be the vertically inte- 
grated union of models and data, mathematical tech- 
nique, and computational technique, while scientific 
computing is the study of techniques in the inter- 
section of many discipline-specific fields of compu- 
tational science. Computational chemistry, computa- 
tional physics, computational biology, computational 
finance, and all of the flavors of computational engi- 
neering (chemical, civil, electrical, mechanical, etc.) 
depend on a common set of techniques connecting 
the conceptualization of a system to its realization 
on a computational platform. These techniques, such
as representing continuous governing equations with
a discrete set of basis functions, encoding this representation
for digital hardware, integrating or solving
the discrete system, estimating the error in the result,
adapting the computational approximation, visualizing
functionals of the results, performing sensitivity
analyses, or performing optimization or control,
span diverse applications from science and engineering
and are the elements of scientific computing. Chain- 
ing together such techniques to form a simulation 
to address a specific application constitutes computa- 
tional science. Practitioners often use the more inclu- 
sive terms “computational science and engineering” 
and “scientific and engineering computing,” but in this 
article the “engineering” applications are understood 
to be subsumed in the “scientific.” 

Practitioners also often distinguish between com- 
putational science executed by means of simulation 
and that executed by “mining” data, the latter with- 
out necessarily possessing a model. Simulation is often 
referred to as the “third paradigm” of science and data 
analytics as the “fourth paradigm.” These are in appo- 
sition to the “first paradigm” of theory, which is mil- 
lennia old, and the “second paradigm” of controlled 
experiment or observation, which is centuries old. 
The interplay of theoretical hypothesis and controlled
experimentation defines the modern scientific method,
in the era since, say, Galileo. Recently, the same stan- 
dards of reproducibility [VIII.5] and reporting have 
been applied to simulation, and the interplay among 
the first three paradigms has become highly produc- 
tive. Until recently, science generally tended to be 
“data poor,” but now most scientific campaigns are 
data rich, with many “drowning” in data, so a contem- 
porary challenge of computational science is to inte- 
grate the third and fourth paradigms. When a model 
is given, its simulation is called the “forward prob- 
lem.” The availability of data allows aspects of the 
underlying model to be inferred or improved upon; this 
is the domain of inverse problems [IV.15]. Whereas
forward problems are by design generally well-posed, 
inverse problems are often unavoidably ill-posed, due 
to ill-conditioning or nonuniqueness. There are many 
elements of scientific computing in this intersecting 
domain of the third and fourth paradigms, includ- 
ing data assimilation, parameter inversion, optimiza- 
tion and control, and uncertainty quantification 
[II.34]. These “post-forward problem” techniques are
today applied throughout the computational sciences 
and they drive a considerable amount of research in 
applied and computational mathematics. Simulation 
and data analytics have attained peer status with theory 
and experiment in many areas of science. 

Computer simulation and data analytics enhance or 
leapfrog theoretical and experimental progress in many 
areas of science critical to society, such as advanced 
energy systems (e.g., fuel cells, fusion), biotechnology 
(e.g., genomics, drug design), nanotechnology (e.g., sen- 
sors, storage devices), and environmental modeling 
(e.g., climate prediction, pollution remediation). Sim- 
ulation and analytics also offer promising near-term 
hope for progress in answering a number of scientific 
questions in such areas as the fundamental structure 
of matter, the origin of the universe, and the functions 
of proteins and enzymes. 

2 Historical Trends 

The strains on theory were apparent to John von Neu- 
mann (1903-57) and drove him to pioneer compu- 
tational fluid dynamics and computational radiation 
transport, and to contribute to supporting fields, espe- 
cially numerical analysis and digital computer architec- 
ture. Models of fluid and transport phenomena, when 
expressed as mathematical equations, are inevitably 
nonlinear, while the bulk of mathematical theory (for
algebraic, differential, and integral equations) is linear.
Computation was to von Neumann, and remains
today, the best means of making systematic progress in 
transport phenomena and many other scientific arenas. 
Breakthroughs in the theory of nonlinear systems come 
only occasionally, but computational understanding 
gains steadily with increases in speed and resolution. 

The strains on experimentation— the gold standard 
of scientific truth— have grown along with expectations 
for it. Unfortunately, many systems and many ques- 
tions are all but inaccessible to experiment. (We sub- 
sume under “experiment” both experiments designed 
and conducted by scientists and observations of natu- 
ral phenomena, such as celestial events, that are out of 
the direct control of scientists.) The experiments that 
scientists need to perform to answer pressing ques- 
tions are sometimes deemed unethical (e.g., because of 
their impact on living beings), hazardous (e.g., because 
of their impact on our planetary environment), unten- 
able (e.g., because they are prohibited by treaties), inac- 
cessible (e.g., those in astrophysics or core geodynam- 
ics), difficult (e.g., those requiring measurements that 
are too rapid or numerous to be instrumentable), too 
time-consuming (in that a decision would need to be 
made before the experimental results could be attained, 
e.g., certification of a system or a material lifetime), 
or simply too expensive (including detectors, the Large 
Hadron Collider facility cost over $10bn and the International
Thermonuclear Experimental Reactor is projected
to cost well over $20bn). There is thus a strong
incentive to narrow the regimes that must be inves- 
tigated experimentally using predictive computational 
simulation and data analytics. 

As experimental means approach practical limits, for 
more than two decades computational performance 
on real applications has improved at a rate of more 
than three orders of magnitude per decade (see high- 
performance computing [VII.12]). Over the same 
period, the acquisition cost of a high-performance com- 
puter system designed for scientific applications has 
fallen by almost three orders of magnitude per decade 
per unit of performance, together with the electrical 
power required per unit of performance, so that after 
a decade a system a thousand times more powerful 
costs about the same to own and operate. These trends 
in performance and cost are well documented in the 
history of the Gordon Bell Prizes and of the annual 
Top 500 lists (see www.top500.org). Increased computational
power can be invested in fine resolution of multiscale
phenomena, high fidelity, full dimensionality,
integration of multiple interacting models in complex 
systems, and running large ensembles of forward prob- 
lems to gain scientific understanding. Unfortunately, 
however, after more than two decades of riding Moore’s 
law, physical barriers to extrapolating the favorable 
energy trends in computing are approaching, and these 
barriers are now significant drivers in computational 
science research. 

Contemporary computational science stands at the 
confluence of four independently and fruitfully devel- 
oping quests: the mathematization of nature, epit- 
omized by Newton; numerical analysis, epitomized 
by von Neumann; high-performance computer per- 
formance, epitomized by Cray; and scientific soft- 
ware engineering, represented by numerous contempo- 
raries. A predigital computer vision of computational 
science was proposed by L. F. Richardson in his 1922 
monograph on numerical weather prediction. Richard- 
son's parallel computer was an army of human cal- 
culators arrayed at latitudinal and longitudinal incre- 
ments along the interior surface of a mammoth globe. 
Richardson's monograph appeared prior not only to the
existence of any digital computer but also to
the stability analysis of finite-difference methods for
PDEs, which was addressed by Courant, Friedrichs, and
Lewy in 1928. Since computers now commit $O(10^{15})$
floating-point rounding errors per second on simulations
that execute for days, numerical analysis has a
central role.

Until von Neumann and Goldstine’s landmark 1947
work on the stability of Gaussian elimination, it was not 
clear how large a system (in this case, of linear equa- 
tions) could be a candidate for computational solution 
using floating-point arithmetic, due to the accumula- 
tion of rounding errors. We can now make use of dig- 
ital computers to solve systems of equations billions 
of times larger than the authors considered. Thanks 
to continued advances in numerical analysis, and par- 
ticularly in error analysis and optimal algorithms, the 
size of simulations in many fields continues to grow 
to take advantage of all of the resolution that com- 
puters can provide. Of course, for systems that are 
fundamentally “chaotic” in continuous form — meaning 
that solution trajectories that started infinitesimally far 
apart can diverge exponentially in finite time— the sci- 
entific benefit of greater resolution must be carefully 
considered in relation to the exponent of divergence. In 
1950, von Neumann (with Charney and Fjørtoft) introduced
a form of PDE stability analysis that was complementary
to that of Courant, Friedrichs, and Lewy. This
analysis clarified a fundamental bound on the size of
the time step of a simulation relative to the fastest sig- 
nal speed of the phenomenon being modeled and the 
granularity of the spatial sampling. The 1950s, 1960s, 
and 1970s witnessed advances in discretization and 
equation solving, the development of high-level programming
languages [VII.11], and improvements in
hardware that allowed ambition for computational sim- 
ulation to soar. By 1979, computational fluid dynami- 
cists were able to claim dramatic reductions in wind 
tunnel and flight testing of the Boeing 767 thanks to 
simulations performed to narrow the parameter spaces 
of design and testing. These simulations, heroic at the 
time though extremely modest by contemporary stan- 
dards, gave practical and economic embodiment to the 
dream of computational science. 
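
The time-step bound described in this paragraph is often summarized in the following form. This is a sketch in notation assumed here rather than taken from the text: c is the fastest signal speed, Δx the spatial grid spacing, Δt the time step, and C_max a scheme-dependent constant (typically of order one for explicit schemes).

```latex
\[
  \frac{c\,\Delta t}{\Delta x} \le C_{\max}
  \qquad\Longleftrightarrow\qquad
  \Delta t \le C_{\max}\,\frac{\Delta x}{c}.
\]
```

Under this bound, halving the grid spacing forces the time step to be halved as well, so spatial refinement also increases the number of time steps.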

Two landmarks defining the ambitions and culture 
of computational science were the articles by Peter 
Lax in 1986 championing simulation as the “third 
leg” of the scientific platform and by Ken Wilson in 
1989 on “grand challenges” in computational science. 
These articles were influential in unlocking funding 
for computational science campaigns in government, 
academia, and industry and in democratizing access 
to high-performance computers, which had previously 
been the province of relatively few pioneers at national 
and industrial laboratories. 

The formation of the U.S. federal Networking and 
Information Technology Research and Development 
organization in 1992 coordinated investments in re- 
search, training, and infrastructure from ten federal 
agencies in high-performance computing and com- 
munications and provided recognition for the inter- 
dependent ecology of applications, algorithms, soft- 
ware, and architecture. The U.S. Department of Energy’s 
Accelerated Strategic Computing Initiative of 1997 
enshrined predictive simulation as a substitute for 
weapon detonations under a nuclear test ban in the 
United States and dramatically expanded investment 
in training computational scientists and engineers. The 
Department of Energy’s Scientific Discovery through 
Advanced Computing (SciDAC) program, established in 
2001, extended the culture of the Accelerated Strate- 
gic Computing Initiative program throughout mission 
space, from astrophysics and geophysics to quan- 
tum chemistry and molecular biology. The document 
that founded the SciDAC program declared the com- 
puter, after tuning for accuracy and performance, to 
be a reliable scientific instrument like any microscope,
telescope, beamline, or spectrometer and of more gen- 
eral purpose. 

As the Internet expanded to universities, and as the 
World Wide Web (developed at CERN for the sharing 
of experimental data sets among theoretical physicists) 
connected scientists and engineers globally, computa- 
tional science and engineering forged a global iden- 
tity. Today, scientific professional societies such as the 
Society for Industrial and Applied Mathematics, the 
Institute of Electrical and Electronics Engineers, the 
Association for Computing Machinery, and the Amer- 
ican Physical Society sponsor activities and publica- 
tions that promote the core enabling technologies of 
scientific computing. The expectations of large-scale 
simulation and data analytics to guide scientific dis- 
covery, engineering design, and corporate and pub- 
lic policy have never been greater. Nor have the cost- 
effectiveness and power of computing hardware, which 
are now driven by commercial market forces far beyond 
the scientific yearnings that gave birth to computing. 
Computational science and scientific computing span 
the gap between ever more demanding applications and 
ever more complex architectures. These fields provide 
enormous opportunities for applied and computational 
mathematicians.

3 Synergies and Hurdles 

Simulation has aspects in common with both theory 
and experiment. First, it is fundamentally theoretical, 
in that it starts with a model, typically a set of equa- 
tions. A powerful simulation capability breathes new 
life into theory by creating a demand for improve- 
ments in mathematical models. Simulation is also fun- 
damentally experimental, in that upon constructing 
and implementing a model, one may systematically 
observe the transformation of inputs (or controls) into 
outputs (or observables). Simulation effectively bridges 
theory and experiment by allowing the execution of 
“theoretical experiments” on systems, including those 
that could never exist in the physical world, such as a 
fluid without viscosity or a perfectly two-dimensional 
form of turbulence. Computation also bridges theory 
and experiment by virtue of the computer, which serves 
as a universal and versatile data host. Once data have 
been digitized, they can be compared side by side 
with simulated results in visualization systems built 
for the simulations. They may also be reliably trans- 
mitted and retrievably archived. Moreover, simulation 
and experiment can complement each other by allow- 
ing a more complete picture of a system than either 
can provide on its own. Some data may be unmeasurable
even with the best experimental techniques available,
and some mathematical models may be too sensitive
to unknown parameters to invoke with confidence.
Simulation can be used to “fill in” the missing exper- 
imental fields, using experimentally measured fields 
as input data. Data assimilation can also be system- 
atically employed throughout a simulation to keep it 
from “drifting” from measurements, thus overcoming 
the effect of modeling uncertainty. Many data-mining 
techniques in effect build statistical models from data 
that may be as predictive as executing the dynamics of 
a traditional model. 

Some of the ingredients that are required for suc- 
cess in computational science include insights and 
models from scientists; theory, methods, and algo- 
rithms from mathematicians; and software and hard- 
ware infrastructure from computer scientists. There 
are numerous ways to translate a physical model into 
mathematical algorithms and to implement a compu- 
tational program on a given computer. Decisions made 
without considering upstream and downstream stages 
may cut off highly productive options. A bidirectional 
dialogue, up and down the hierarchy at each stage, can 
ensure that the best resources and technologies are 
employed. To bridge the algorithmic gap between the 
application and the architecture, mathematicians must 
consider implications of both. For example, in repre- 
senting a field that is anticipated to be smooth, one 
may select a high-order discretization. In anticipating 
a port to a hybrid architecture emphasizing SIMDiza- 
tion (SIMD stands for single-instruction-multiple-data), 
one may produce uniformly structured data aggre- 
gates even if this over-resolves the modeled phenom- 
ena somewhere relative to the required accuracy. If one 
requires the adjoint of an operator to perform opti- 
mization in the presence of PDE constraints, one may 
pay an extra price in the forward problem to employ 
an implicit Newton method, which creates a Jacobian 
matrix that is then available for adjoint use. However, 
this Jacobian matrix may be the largest single working 
set of data in the problem, and its layout may dominate
implementation decisions. 

While the promise of computational science is pro- 
found, so are its limitations. The limitations often come 
down to a question of resolution. Though of vast size, 
computers are triply finite: they represent individual 
quantities only to a finite precision, they keep track 
of only a finite number of such quantities, and they 
operate at a finite rate. Although all matter is, in fact,
composed of a finite number of atoms, the number of 
atoms in a macroscopic sample of matter, on the scale 
of Avogadro’s number, places simulations at macro- 
scopic scales from “first principles” (i.e., from the quan- 
tum theory of electronic structure) well beyond any 
conceivable digital computational capability. Similar 
problems arise when timescales are considered. For 
example, the range of timescales in protein folding is 
at least twelve orders of magnitude, since a process 
that takes milliseconds to complete occurs in molec- 
ular dance steps (bond vibrations) that occur in fem- 
toseconds (a trillion times shorter); again, this is far too 
wide a range to routinely simulate using first principles. 

Some simulations of systems adequately described 
by PDEs of the macroscopic continuum (fluids such as 
air or water) can be daunting even for the most powerful
computers foreseeably available. Today’s computers
are capable of execution rates in the tens of
petaflop/s (1 petaflop/s is $10^{15}$ arithmetic operations
per second) and can cost tens to hundreds of mil- 
lions of dollars to purchase and millions of dollars per 
year to operate. However, to simulate fluid mechani- 
cal turbulence in the boundary and wake regions of 
a typical vehicle using “first principles” of continuum 
modeling (Navier-Stokes) would tie up such a computer
for months, which makes this level of simulation
too expensive and too slow for routine use. To
attempt first-principles modeling based on Boltzmann 
kinetics would be far worse. Practitioners must decide 
whether upscaled continuum models such as
Reynolds-averaged Navier-Stokes or large-eddy simulation will
be adequate or whether discrete particle methods such
as lattice Boltzmann or direct simulation Monte Carlo
would be more efficient to resolve certain scientific 
questions on certain architectures. 

The “curse of dimensionality” leads to the phe- 
nomenon whereby increasing the resolution of a sim- 
ulation in each relevant dimension quickly eats up 
any increases in processor power. For explicitly time- 
discretized problems in three dimensions, the com- 
putational complexity of a simulation grows like the 
fourth power of the resolution in one spatial dimen- 
sion. Therefore, an increase in computer performance 
by a factor of 100 provides an increase in resolution 
in each spatial and temporal dimension by a factor of 
only slightly more than 3. The “blessing of dimensional- 
ity” refers to an entirely different observation: complex 
dynamics of real systems can often be represented in 
a relatively small number of principal components. If 
one represents the dynamics in terms of this efficient 
basis, rather than the nodal basis that is implicit in tabulating
solution values on a mesh, one’s predictive ability
may effectively advance by the equivalent of several
computer generations with a few strokes of a pen. 
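
The fourth-power argument above can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that work grows like the resolution raised to the power (spatial dimensions + 1) for an explicit time-stepping scheme; the function name `resolution_gain` is hypothetical.

```python
def resolution_gain(performance_factor: float, spatial_dims: int = 3) -> float:
    """Resolution increase per dimension bought by a given speedup.

    Assumes work ~ r**(spatial_dims + 1): r**spatial_dims mesh points
    times a number of time steps proportional to r.
    """
    return performance_factor ** (1.0 / (spatial_dims + 1))


if __name__ == "__main__":
    for factor in (10, 100, 1000):
        gain = resolution_gain(factor)
        print(f"{factor:>5}x faster machine -> {gain:.2f}x finer resolution")
```

For a factor of 100 in three dimensions this gives the fourth root of 100, which is the "slightly more than 3" quoted in the text.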

A standard means of bringing more computational 
power to bear on a problem is to divide the work to 
be done over many processors by means of domain 
decomposition that partitions mesh cells or particles 
and employs message-passing between the computa- 
tional nodes. Typically, the same program runs on each 
node, but it may execute different instruction streams 
on different nodes due to encountering different data. 
This programming model is variously known as “single- 
program/multiple-data,” “bulk synchronous process- 
ing,” or “communicating sequential processes.” Par- 
allelism in computer architecture can provide addi- 
tional factors of hundreds of thousands, or more, to 
the aggregate performance available for the solution 
of a given problem, beyond the factor available from 
Moore’s law (a doubling of transistor density about 
every eighteen months) applied to individual proces- 
sors alone. However, the desire for increased resolution 
and, hence, accuracy is seemingly insatiable. The curse
of dimensionality [I.3 §2] means that “business as
usual” in scaling up today’s simulations by riding the 
computer performance curve alone, without attention 
to algorithms that squeeze more scientific information 
out of each byte stored or flop processed, will not be 
cost-effective. 
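
As a minimal illustration of the domain decomposition described above, the following sketch assigns a one-dimensional array of mesh cells to ranks in contiguous blocks and lists the neighbors with which each rank would exchange boundary data. The function names are hypothetical, and no actual message passing is performed; in a real single-program/multiple-data code each rank would run this same logic with its own rank number.

```python
def block_partition(n_cells: int, n_ranks: int, rank: int) -> range:
    """Contiguous range of cell indices owned by `rank`.

    Block sizes differ by at most one cell, so the work is balanced.
    """
    base, extra = divmod(n_cells, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return range(start, stop)


def halo_neighbors(n_ranks: int, rank: int) -> list[int]:
    """Ranks that share a block boundary with `rank` in one dimension."""
    return [r for r in (rank - 1, rank + 1) if 0 <= r < n_ranks]


if __name__ == "__main__":
    # Every "node" executes the same program on its own portion of the data.
    for rank in range(4):
        owned = block_partition(10, 4, rank)
        print(f"rank {rank}: cells {list(owned)}, neighbors {halo_neighbors(4, rank)}")
```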

The “curse” of knowledge explosion— namely, that no 
one computational scientist can hope to track advances 
in all of the facets of the mathematical theories, com- 
putational models and algorithms, applications and 
computing systems software, and computing hardware 
(computers, data stores, and networks) that may be 
needed in a successful simulation effort— is another 
substantial hurdle to the progress of simulation. Effec- 
tive collaboration is a means by which it could be 
overcome. 

4 Cross-cutting Themes 

Computational science embraces diverse scientific dis- 
ciplines; however, themes that cut across different dis- 
ciplines, and even themes that appear to be universal, 
do emerge. We discuss some of these in this section. 

4.1 Balance of Errors

Computational science proceeds by stages of successive
transformation and approximation, including
continuous modeling, numerical discretization, digital 
solution, and analysis and interpretation of the results, 
at each of which errors are made. The overall error 
can be estimated by recursive application of the tri- 
angle inequality. The error estimates for individual 
stages should be made commensurate, since it adds 
no value to produce an exact algebraic solution to a 
discrete system that is only an approximate represen- 
tation, for instance, or to overly refine a discretization 
within which there are coefficients that are only imper- 
fectly known, or where the underlying physical “laws” 
are merely approximate correlations. One of the pri- 
mary and most difficult tasks in computational science 
is to estimate and balance the errors at each stage, to 
avoid overinvesting in work that does not ultimately 
contribute to the accuracy of the desired output. “Val- 
idation” and “verification” are two generalized stages 
of a comprehensive error analysis of a computational 
science campaign, the former measuring the degree to 
which the model represents the salient features under 
investigation and the latter measuring the quality of 
the solution to the model. The ultimate goal is to get 
the greatest scientific insight at the lowest “cost.” Error 
estimation in an individual execution is an essential 
aspect of insight, but it must be accompanied by judi- 
cious selection of cases to run and by performance 
tuning to run them efficiently. 
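
The recursive use of the triangle inequality mentioned above can be written out for three stages. The symbols are assumptions introduced here for illustration: u is the true physical quantity, u_m the exact solution of the continuous model, u_h the exact solution of the discretized model, and ũ_h the solution actually computed.

```latex
\[
  \| u - \tilde{u}_h \|
  \;\le\;
  \underbrace{\| u - u_m \|}_{\text{modeling error}}
  + \underbrace{\| u_m - u_h \|}_{\text{discretization error}}
  + \underbrace{\| u_h - \tilde{u}_h \|}_{\text{solver and rounding error}} .
\]
```

Driving any one term far below the others leaves the bound essentially unchanged, which is the sense in which the stage errors should be made commensurate.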

4.2 Uncertainty Quantification 

To be accepted as a means of scientific discovery, engi- 
neering design, and decision support, computational 
science must embrace standards of predictivity and 
reproducibility. Uncertainty can enter computations in 
a number of ways, but fundamentally, there are two 
types: epistemic and aleatoric. Epistemic uncertainties 
are those that could, in principle, be reduced to zero, 
leading to accurate deterministic simulations, but for 
which the modeler simply lacks sufficient knowledge. 
Aleatoric uncertainties are those that are regarded as 
unknowable, except in a statistical sense, because of 
fundamental randomness. Uncertainty in the output of 
a simulation can enter via the structure of the model 
itself, via parameters, or via approximations committed 
by the algorithm. Classical approaches to uncertainty 
quantification are Monte Carlo in nature: one executes 
the model for a cloud of inputs distributed according 
to various assumptions and then performs statistics 
on the output. This is direct, but it can be wasteful of 
simulation resources. A more progressive approach is
to derive models for the statistical properties of the 
distribution and propagate them along with the uncer- 
tain solution. Uncertainty quantification has justifiably 
become an interdisciplinary research area dependent 
upon mathematical, statistical, and domain knowledge. 
It is also a driver within computational science that jus- 
tifies investment in scaling the hardware-software envi- 
ronment, since it greatly increases complexity relative 
to that of the execution of an individual forward model. 
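
The classical Monte Carlo approach described above can be sketched in a few lines. The forward model here is a toy chosen purely for illustration (exponential decay with an uncertain rate), and the normal input distribution and all names are assumptions, not taken from the text.

```python
import math
import random
import statistics


def forward_model(decay_rate: float, t_final: float = 1.0) -> float:
    """Toy forward problem: u(t_final) for u' = -decay_rate * u, u(0) = 1."""
    return math.exp(-decay_rate * t_final)


def monte_carlo_uq(n_samples: int, mean: float, std: float, seed: int = 0):
    """Run the model on a cloud of sampled inputs and summarize the outputs.

    Returns the sample mean and sample standard deviation of the output.
    """
    rng = random.Random(seed)  # seeded so that the experiment is reproducible
    outputs = [forward_model(rng.gauss(mean, std)) for _ in range(n_samples)]
    return statistics.fmean(outputs), statistics.stdev(outputs)


if __name__ == "__main__":
    out_mean, out_std = monte_carlo_uq(10_000, mean=1.0, std=0.1)
    print(f"output mean ~ {out_mean:.4f}, output std dev ~ {out_std:.4f}")
```

Each sample costs one full forward solve, which is why the text calls the approach direct but potentially wasteful when the forward model is expensive.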

4.3 The Ideal Basis 

The quest for the ideal basis in which to represent a 
problem is a recurring theme in scientific computing, 
since solving the same problem in an ideal basis can 
bring a critical improvement in computational com- 
plexity, that is, in the number of operations required 
to reach a solution of given accuracy. Rank, sparsity, 
and boundary conditions are among the features that 
allow intelligent choice of basis. Problems arrive at the 
doors of computational scientists with a choice of basis 
implicit; they are described by objects and operators 
that are convenient and natural to the scientific com- 
munity. However, the best formulation for computa- 
tion may be quite different. Some algorithms derive 
their own discrete bases that adapt to the operator, 
such as harmonic interpolants. Reduced-order models 
are important at the application level, where physical 
insight may guide a low-dimensional representation of
an apparently high-dimensional system. They are also 
critical at the algorithmic kernel level, where a more 
automatically adaptive process may incrementally produce
a reduced basis, such as a Krylov subspace [II.23]
in linear algebra. Possessing multiple models for the 
same phenomena can increase algorithmic creativity. 
For instance, in linear problems, an inexpensive model 
may be used to construct a preconditioner for an expensive
one. In nonlinear problems, the solution to a crude
model may provide a more robust starting estimate 
than is otherwise available. 

4.4 A Canonical “Algorithmic Basis” 

Though the applications of computational science 
are very diverse, there are a number of algorithmic 
paradigms that recur throughout them. Seven of these 
floating-point algorithms were identified by Phillip 
Colella in 2004, and he called them the “seven dwarfs.” 
This set was expanded by six integer algorithms in 2006 
by the Berkeley scientific computing group, bringing us 
up to “13 dwarves”: dense direct solvers, sparse direct
solvers, fast transforms, N-body methods, structured
(grid) iterations, unstructured (grid) iterations, Monte 
Carlo, combinatorial logic, graph traversal, graphical 
models, finite-state machines, dynamic programming, 
and backtrack and branch-and-bound. These kernels 
have been characterized by their amenability to scal- 
ing on what are known as “hybrid” or “hierarchical” 
architectures. In such architectures, cores are proliferated
within a node of fixed memory and memory
bandwidth resources in a shared-memory manner. The 
resulting nodes are then proliferated in a distributed- 
memory manner, being connected by a network that 
scales with the number of nodes. The fact that so many 
problems in computational science can be reduced to 
a relatively small set of similar tasks has profound 
implications for the scientific software environment 
and for research and development investment. It is 
possible for the work of a relatively small number 
of experts in computational mathematics and com- 
puter science to serve a relatively large number of 
scientists and engineers who are expert in something 
else, namely their fields of application. Vendors can 
invest in software libraries with well-defined interfaces 
(such as ones based on fast Fourier transforms or the 
basic linear algebra subroutines) that can access their 
hardware in custom ways. Researchers in numerical 
analysis can be confident, for instance, that work to 
improve a dense symmetric eigensolver has the poten- 
tial to be adopted by thousands of chemists solving 
the Schrödinger equation or that work to improve a
sparse symmetric eigensolver could be used by thou- 
sands of mechanical engineers analyzing vibrational 
modes. The list of “dwarfs/dwarves” is binned into 
fairly broad categories above and many subcategoriza- 
tions are required in practice to arrive at a problem 
specification that leads directly to the selection of soft- 
ware. Many of the dwarfs do not yet have optimal 
implementations or even weakly scalable implemen- 
tations. These dwarfs make good targets for ongoing 
mathematical research. 

4.5 Polyalgorithms 

While it may be possible to describe many computa- 
tional science applications with reference to a rela- 
tively small number of kernels, each such specifica- 
tion can typically be approached by a large number of 
algorithms. A polyalgorithm is a set of algorithms that 
can accomplish the same task(s), together with a set 
of rules for selecting among the members of the set, 
based on the characteristics of the input, the requirements
of the output, and the properties of the execution
environment. Sometimes, one algorithm offers
superior performance in terms of the number of opera- 
tions, the pattern of memory references, running time, 
or other metrics over a wide range of inputs, output 
requirements, and environment, which renders the con- 
cept trivial. More typically, each algorithm has a niche 
in which it is superior to all others, and there is no 
transitivity with respect to performance metrics across 
all possible uses. Problem size (discrete dimension) 
is a fundamental input characteristic. A method that 
requires, asymptotically, $O(N^2)$ operations may beat a
method that requires only $O(N)$ or $O(N \log N)$ operations
if $N$ is small and the constant hidden by the
order symbol is small compared with the constant for 
an asymptotically superior method. Between two such 
algorithms there is therefore a crossover point in N that 
dictates selection between them. Other input character- 
istics that are often important include sparsity, symme- 
try, definiteness, heterogeneity, and isotropy. Output 
requirements may vary dramatically from one applica- 
tion to another. For instance, in an eigenanalysis sce- 
nario, are all eigenvalues required, or just the largest, 
the smallest, or those closest to an interior value? Are 
eigenvectors required? Environmental characteristics 
may include the number of processing elements, the 
availability of built-in floating-point precisions, the rel- 
ative cost of communication versus computation, and 
the size of various memories, caches, or buffers. It is 
therefore beneficial to have access to a variety of well- 
characterized approaches to accomplish each kernel 
task. 
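
A polyalgorithm with a size-based crossover can be sketched as follows. Sorting is used purely as a stand-in task, since it offers a familiar pair of O(N^2) and O(N log N) methods; the threshold `CROSSOVER_N` is an assumed value that would in practice be measured on the target machine, and all names are hypothetical.

```python
def insertion_sort(values: list) -> list:
    """O(N^2) comparisons, but with a very small constant."""
    result = list(values)
    for i in range(1, len(result)):
        item = result[i]
        j = i - 1
        while j >= 0 and result[j] > item:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = item
    return result


def merge_sort(values: list) -> list:
    """O(N log N) comparisons, with more overhead per element."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left, right = merge_sort(values[:mid]), merge_sort(values[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


CROSSOVER_N = 32  # assumed crossover point; tune by measurement


def poly_sort(values: list) -> list:
    """Selection rule: pick the member algorithm from the problem size N."""
    if len(values) < CROSSOVER_N:
        return insertion_sort(values)
    return merge_sort(values)
```

A fuller selection rule would also inspect other input characteristics, output requirements, and the execution environment, as the paragraph above describes.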

4.6 Space-Time Trade-Offs 

Often, scientific computing algorithms are parametriz- 
able in ways that allow one to make space-time com- 
plexity trade-offs. If memory is limited, one may store 
less and compute more. If extra memory is available, 
one may store more and compute less. This princi- 
ple has many manifestations in practical implementa- 
tions and algorithmic tunings. Classical examples are 
the size of windows of past vectors to retain in a 
Krylov-style iterative method or a quasi-Newton iter- 
ative method. Contemporary examples emerge from 
the different latencies that different levels of computer 
memory possess from fast registers (which are essen- 
tially one cycle away), to caches of various levels, to 
main memory (thousands of cycles away), to disk files
(millions of cycles away), to data distributed over the 
Internet. It may be cheaper to recompute some data 
than to store and retrieve them. On the other hand, as in 
modified Newton iterative methods, it may be cheaper 
and/or faster to reuse old data (in this case a Jaco- 
bian) than to constantly refresh them. In performance- 
oriented computing it may even be advantageous, on 
average, to assume values (on the basis of experience 
or rational assumption) for data that are not avail- 
able when they appear on the critical path as an input 
for one process and to “roll back” the computation 
when they arrive if they are too far from the assumed 
values. Under the heading of space-time trade-offs, 
one may also employ multiple precisions of floating- 
point arithmetic, as in classical iterative refinement 
[IV.10 §2] for linear systems. In this algorithm, one 
stores simultaneously and manipulates both high- and 
low-precision forms of some objects, doing the bulk of 
the arithmetic on the low-precision objects in order to 
refine the high-precision ones. 
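
The mixed-precision refinement loop just described can be sketched with NumPy. This is an illustrative sketch under stated assumptions, not a production routine: it takes float32 as the low precision and float64 as the high precision, and for brevity it calls `numpy.linalg.solve` afresh in each pass, whereas a practical code would factorize the low-precision matrix once and reuse the factors.

```python
import numpy as np


def iterative_refinement(A, b, n_iter=5):
    """Solve A x = b using float32 solves and float64 residuals."""
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    A32 = A64.astype(np.float32)  # low-precision copy of the matrix

    # Initial solve carried out entirely in low precision.
    x = np.linalg.solve(A32, b64.astype(np.float32)).astype(np.float64)
    for _ in range(n_iter):
        r = b64 - A64 @ x  # residual evaluated in high precision
        # Correction from a low-precision solve (refactorizes each time;
        # see the note above about reusing the factorization).
        d = np.linalg.solve(A32, r.astype(np.float32))
        x += d.astype(np.float64)  # update the high-precision solution
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 50)) + 50.0 * np.eye(50)  # well conditioned
    x_true = rng.standard_normal(50)
    x = iterative_refinement(A, A @ x_true)
    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

The scheme can only be expected to converge when the matrix is not too ill-conditioned relative to the low precision, which is why the demonstration uses a diagonally shifted matrix.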

4.7 Continuum-Discrete Duality 

In computational science it is frequently convenient to 
have multiple views of the same object. Nature is fun- 
damentally discrete but at a scale typically too fine for 
digital computers. We therefore work through idealized
continuous models, such as Navier-Stokes PDEs
to calculate momentum transfer in fluids with small 
mean free paths between molecular collisions. The con- 
tinuous fields are then integrated over control vol- 
umes, parametrized by finite elements, or represented 
by pointwise values in finite differences, returning the 
model to finite cardinality, typically a much smaller car- 
dinality than that of the original molecular model. How- 
ever, adaptive error analyses for refining the spatial 
resolution or approximation order of the discretization 
reemploy the underlying continuous model. The opti- 
mal complexity solver known as the multigrid method 
makes simultaneous use of several different discrete 
approximations to the underlying continuum through 
recursive coarsening. Ordering multidimensional and 
possibly irregularly spaced coordinate data into the 
linear address space of computer memory may take 
advantage of a discrete analogue to Hilbert’s contin- 
uous space-filling curves. Thus, computational mathe- 
matics moves fluidly back and forth between the con- 
tinuous and the discrete. However, continuous proper- 
ties such as conservation or zero divergence are not 
necessarily preserved in the dual discrete representa- 
tion. The computational scientist needs to be aware of 
operations, such as variational differentiation or application
of boundary conditions, that may not commute
with the switch of representations. Thus, for example, 
“discretize then optimize” may yield a different result 
from “optimize then discretize” (and famously often 
does). 

4.8 Numerical Conditioning 

Numerical conditioning is a primary concern of the 
computational scientist; it affects the accuracy that can 
be guaranteed by theory or expected in practice and 
the convergence rate of many algorithms. It is there- 
fore a key factor in the scientist’s choice of how pre- 
cise a floating-point arithmetic to employ. In computa- 
tional science applications, the condition number of a 
discrete operator is often related to the quality of the 
discretization in resolving multiscale features in the 
solution. The continuous operator may have an infinite 
condition number, so the greater the resolution, the 
worse the numerical condition number. The progres- 
sively worsening conditioning of elliptic operators, in 
particular, as they better resolve fine-wavelength com- 
ponents in the output, is a bane of iterative methods 
whose iteration count grows with ill-conditioning. The 
computational work required to solve the problem to a 
given accuracy grows superlinearly overall, linearly in 
the cost per iteration times a factor that depends on 
the condition number. Optimal algorithms must gener- 
ally possess a hierarchical structure that handles each 
component of the error on its own natural scale. 
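
The growth of the condition number with resolution can be observed directly on a model problem. The sketch below assumes the simplest elliptic operator, the second-order finite-difference Laplacian on the unit interval with homogeneous Dirichlet conditions; the helper names are hypothetical and NumPy is assumed to be available.

```python
import numpy as np


def laplacian_1d(n):
    """Finite-difference Laplacian on n interior points of (0, 1)."""
    h = 1.0 / (n + 1)
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    return (np.diag(main) + np.diag(off, 1) + np.diag(off, -1)) / h**2


def condition_numbers(sizes):
    """2-norm condition number of the discrete operator for each grid size."""
    return [float(np.linalg.cond(laplacian_1d(n))) for n in sizes]


if __name__ == "__main__":
    sizes = (15, 31, 63, 127)
    for n, kappa in zip(sizes, condition_numbers(sizes)):
        print(f"n = {n:4d}   condition number = {kappa:.1f}")
```

For this operator each halving of the mesh width multiplies the condition number by roughly four, so an iterative method whose iteration count depends on conditioning does more work per unknown as the grid is refined.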

4.9 The Nested-Loop Co-design Process 

The classic computational science development para- 
digm is reflected in figure 1, which is adapted from a 
figure in a report to the U.S. Department of Energy in 
2000. On the left is what could be called a “validation 
and verification” loop. On the right is a performance 
tuning loop. The left loop is the province of the com- 
putational mathematician, who may make convenient 
assumptions, such as that memory is flat, in deriving 
algorithms that deliver bounded error while minimiz- 
ing floating-point operations or memory capacity. The 
right loop is the province of computer scientists (and 
ultimately architects), who inherit a mathematical spec- 
ification of an algorithm and optimize its implementa- 
tion for power and runtime. At the time this scheme 
was introduced in aid of launching the unprecedent- 
edly interdisciplinary SciDAC program, these loops 
were primarily envisioned to be sequential. First the
validation and verification loop would converge and
then the performance tuning loop would, resulting in a 
computational tool for scientific discovery. “Codesign” 
is a classic concept from embedded systems that is 
increasingly being invoked in the general-purpose sci- 
entific computing context; it puts an outer loop around 
these algorithmic and performance loops based on 
the recognition that isolated design keeps significant 
performance and efficiency gains off the table. An 
unstructured grid PDE may not be a good discretiza- 
tion for exploitation of hybrid hardware with signif- 
icant SIMDization, for instance. Such feedback could 
lead to consideration of lattice Boltzmann methods in 
some circumstances, for example. 

4.10 From Computing to Understanding 

As hardware improves exponentially in performance, 
computational modeling has the luxury of becoming 
a science by means of which a simulated system is 
“poked” intelligently and repeatedly to reveal gener- 
alizable insight into behavior. Historically, computa- 
tional science research was driven by the gap between 
the capabilities of the hardware and the complexity 
of the systems intended for modeling. Effort was con- 
centrated on isolating a component of the overall sys- 
tem and improving computational capability thanks 
to improvements in algorithms, software, and hard- 
ware. This inevitably involves invoking assumptions for 
ignored coupling; for instance, an ocean model may 
be executed with simple assumptions for atmospheric 
forcing. Today, highly capable components are being 
reunified into complex systems that make fewer decou- 
pling compromises. For instance, ocean and atmo- 
spheric models drive each other across the interface 
between them by passing fluxes of mass, momentum, 
and energy in various forms. With increased computa- 
tional power and memory, these complex systems can 
be partially quantified with respect to uncertainty and 
then subjected to true experiments by being embedded 
into ensemble runs. “What if” questions can be posed 
and executed by controlling inputs. This progression is 
powered at every stage by new mathematics, and some- 
times, in turn, it generates new mathematical ques- 
tions. In building up capability, the focus is on reducing 
computational complexity. In coupling components,
the focus is on balancing error and stability analysis,
with a continued premium on complexity reduction.
Finally, the focus is on uncertainty quantification
and the formulation of hypotheses from observation,
hypotheses that may eventually be provable theoretically
or be subject to controlled experimental testing
in the real world.

Figure 1 Diagram that launched the SciDAC program in 2000, courtesy of Thom Dunning Jr.

4.11 The “Multis” 

The discussion of the cross-cutting themes of com- 
putational science can be conveniently and mnemon- 
ically encapsulated in the list of the “multis.” Scien- 
tific and engineering applications are typically multi- 
physics, multiscale, and multidimensional in nature. 
Algorithms to tackle them are typically hierarchical, 
using interacting multimodels, on multilevels of refine- 
ment, in multiprecisions. These then have to be imple- 
mented on multinode distributed systems, with mul- 
ticore nodes, and multiprotocol programming styles, 
all of which requires a multidisciplinary approach in 
which the mathematician plays a critical bridging role. 
We elaborate on multiphysics in the next section. 

4.12 Encapsulation in Software 

There have been many efforts to respond to the tar- 
gets presented by one or more of the “dwarfs” and 
the complexity of the applications and algorithms of 
computational science with well-engineered software 
packages. The Portable Extensible Toolkit for Scientific 
Computing (PETSc), a freely downloadable suite of data 
structures and routines from Argonne National Laboratory
that has been in continuous development since
1992, is an example that incorporates generic parallel
programming primitives, matrix and vector interfaces, 
and solvers for linear, nonlinear, and transient prob- 
lems. PETSc emphasizes ease of experimentation. Its 
strong encapsulation design allows users to hierarchi- 
cally compose methods at runtime for coupled prob- 
lems. Several relatively recent components have sub- 
stantially improved composability for multiphysics and 
multilevel methods. For example, DMComposite man- 
ages distributed-memory objects for coupled problems 
by handling the algebraic aspects of gluing together 
function spaces, decomposing them for residual eval- 
uation, and setting up linear operators to act on the 
coupled function spaces. A matrix assembly interface 
is available that makes it possible for individual physics 
modules to assemble parts of global matrices without 
needing global knowledge or committing in advance to 
matrix format. The FieldSplit preconditioner solves
linear block systems using either block relaxation or 
approximate block factorization (as in “physics-based” 
preconditioners for stiff waves or block precondi- 
tioners for incompressible flow). FieldSplit can be 
nested inside other preconditioners, including geomet- 
ric or algebraic multigrid ones, with construction of 
the hierarchy and other algorithmic choices exposed 
as runtime options. Recent enhancements to the TS 
component support implicit-explicit time integration 
schemes. Multiphysics applications that have employed
these capabilities include lithosphere dynamics, subduction
and mantle convection, ice sheet dynamics,
subsurface reactive flow, tokamak fusion, mesoscale 
materials modeling, and electrical power networks. 

5 Multiphysics Modeling 

A multiphysics system consists of multiple coupled 
components, each component governed by its own 
principle(s) for evolution or equilibrium, typically con- 
servation or constitutive laws. Multiphysics simulation 
is an area of computational science possessing mathe- 
matical richness. Coupling individual simulations may 
introduce stability, accuracy, or robustness limitations 
that are more severe than the limitations imposed by 
the individual components. A major classification in 
such systems is whether the coupling occurs in the 
bulk (e.g., through source terms or constitutive rela- 
tions that are active in the overlapping domains of the 
individual components) or whether it occurs over an 
idealized interface that is lower dimensional or over a 
narrow buffer zone (e.g., through boundary conditions 
that transmit fluxes, pressures, or displacements). Typ- 
ical examples of bulk-coupled multiphysics systems 
that have their own extensively developed literature 
include radiation with hydrodynamics in astrophysics 
(radiation-hydrodynamics, or “rad-hydro”), electricity 
and magnetism with hydrodynamics in plasma physics 
(magnetohydrodynamics), and chemical reaction with 
transport in combustion or subsurface flows (reac- 
tive transport). Typical examples of interface-coupled 
multiphysics systems are ocean-atmosphere dynam- 
ics in geophysics, fluid-structure dynamics in aero- 
elasticity, and core-edge coupling in tokamaks. Beyond 
these classic multiphysics systems are many others 
that share important structural features. 

The two simplest systems that exhibit the crux of 
a multiphysics problem are the coupled equilibrium 
problem 


$$F_1(u_1, u_2) = 0, \tag{1a}$$

$$F_2(u_1, u_2) = 0 \tag{1b}$$

and the coupled evolution problem 


$$\partial_t u_1 = f_1(u_1, u_2), \tag{2a}$$

$$\partial_t u_2 = f_2(u_1, u_2). \tag{2b}$$


When (2a)-(2b) is semidiscretized in time, the evolution
problem leads to a set of problems that take
the form (1a)-(1b) and that are solved sequentially to
obtain values of the solution $u(t_n)$ at a set of discrete
times. Here $u$ refers generically to a multiphysics solution,
which has multiple components indicated by subscripts,
$u = (u_1, \ldots, u_{N_c})$; the simplest case of $N_c = 2$
components is indicated here.

Initially, we assume for convenience that the Jacobian
$J = \partial(F_1, F_2)/\partial(u_1, u_2)$ is diagonally dominant in
some sense and that $\partial F_1/\partial u_1$ and $\partial F_2/\partial u_2$ are nonsingular.
These assumptions are natural in the case
where the system arises from the coupling of two individually
well-posed systems that have historically been
solved separately. In the equilibrium problem, we refer
to $F_1$ and $F_2$ as the component residuals; in the evolution
problem, we refer to $f_1$ and $f_2$ as the component
tendencies.

The choice of solution approach for these coupled 
systems relies on a number of considerations. From 
a practical standpoint, existing codes for component 
solutions often motivate operator splitting as an expe- 
ditious route to a first multiphysics simulation, mak- 
ing use of the separate components. This approach, 
however, may ignore strong couplings between com- 
ponents. Solution approaches ensuring a tight coupling 
between components require smoothness, or continuity,
of the nonlinear, problem-defining functions, $F_i$,
and their derivatives.

Classic multiphysics algorithms preserve the in- 
tegrity of the two uniphysics problems, namely, solv- 
ing the first equation for the first unknown, given the 
second unknown, and solving the second equation for 
the second unknown, given the first. Multiphysics cou- 
pling is taken into account by iteration over the pair of 
problems, typically in a Gauss-Seidel manner (see algo- 
rithm 1), linearly or nonlinearly, according to context. 
Here we employ superscripts to denote iterates. 

Algorithm 1 (Gauss-Seidel multiphysics coupling).

Given initial iterate $\{u_1^0, u_2^0\}$

for $k = 1, 2, \ldots$ until convergence do
    Solve for $v$ in $F_1(v, u_2^{k-1}) = 0$; set $u_1^k = v$
    Solve for $w$ in $F_2(u_1^k, w) = 0$; set $u_2^k = w$
end for
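
A minimal Python sketch of algorithm 1 on a toy problem may help fix ideas. Everything specific here is an assumption made for illustration: the two scalar residuals `F1` and `F2` are invented, and `solve_scalar` merely stands in for a legacy uniphysics solver that can solve one equation for its own unknown when the other field is frozen.

```python
import math


def F1(u1, u2):
    """Hypothetical residual of the first component."""
    return u1**3 + 2.0 * u1 - math.cos(u2)


def F2(u1, u2):
    """Hypothetical residual of the second component."""
    return 3.0 * u2 + math.sin(u2) - u1


def solve_scalar(g, x0, tol=1e-14, max_iter=100):
    """Stand-in for a legacy component solver: scalar Newton iteration
    with a finite-difference derivative."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            break
        h = 1e-7 * (1.0 + abs(x))
        x -= gx * h / (g(x + h) - gx)
    return x


def gauss_seidel_coupling(u1, u2, tol=1e-12, max_iter=100):
    """Alternate the two component solves until the iterates stop changing."""
    for k in range(1, max_iter + 1):
        v = solve_scalar(lambda x: F1(x, u2), u1)  # F1(v, u2^{k-1}) = 0
        w = solve_scalar(lambda x: F2(v, x), u2)   # F2(u1^k, w) = 0
        change = max(abs(v - u1), abs(w - u2))
        u1, u2 = v, w
        if change < tol:
            return u1, u2, k
    raise RuntimeError("coupling iteration did not converge")
```

Because the cross-coupling in this toy problem is weak relative to the diagonal terms, the outer loop contracts quickly; with stronger coupling the same loop can converge slowly or not at all, which is the motivation for the tightly coupled alternatives discussed later.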

Likewise, the simplest approach to the evolutionary
problem employs a field-by-field approach in a way
that leaves a first-order-in-time splitting error in the
solution. Algorithm 2 gives a high-level description
of this process that produces solution values at time
nodes $t_0 < t_1 < \cdots < t_N$. Here, we use the notation
$u(t_0), \ldots, u(t_N)$ to denote the solution at the discrete
time steps. An alternative that staggers solution values
in time is also possible.

Algorithm 2 (multiphysics operator splitting).

Given initial values $\{u_1(t_0), u_2(t_0)\}$

for $n = 1, 2, \ldots, N$ do
    Evolve one time step in $\partial_t u_1 = f_1(u_1, u_2(t_{n-1}))$
    to obtain $u_1(t_n)$
    Evolve one time step in $\partial_t u_2 = f_2(u_1(t_n), u_2)$
    to obtain $u_2(t_n)$
end for
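
The first-order splitting error can be observed numerically. In the sketch below, which is an illustration under stated assumptions rather than a prescribed implementation, the tendencies `f1` and `f2` define a hypothetical linear system with a known exact solution, written in the sign convention of (2a)-(2b). Each field is advanced with a classical fourth-order Runge-Kutta step while the other field is frozen, so that the error that remains is dominated by the splitting itself.

```python
import math


def f1(u1, u2):
    """Hypothetical tendency of the first field."""
    return -2.0 * u1 + u2


def f2(u1, u2):
    """Hypothetical tendency of the second field."""
    return u1 - 2.0 * u2


def rk4_step(rhs, y, dt):
    """One classical Runge-Kutta step for the scalar ODE y' = rhs(y)."""
    k1 = rhs(y)
    k2 = rhs(y + 0.5 * dt * k1)
    k3 = rhs(y + 0.5 * dt * k2)
    k4 = rhs(y + dt * k3)
    return y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def split_integrate(u1, u2, t_end, n_steps):
    """Field-by-field time stepping in the spirit of algorithm 2."""
    dt = t_end / n_steps
    for _ in range(n_steps):
        u2_old = u2
        u1 = rk4_step(lambda y: f1(y, u2_old), u1, dt)  # u2 frozen at t_{n-1}
        u1_new = u1
        u2 = rk4_step(lambda y: f2(u1_new, y), u2, dt)  # u1 frozen at t_n
    return u1, u2


def exact(t):
    """Exact solution for the initial values u1(0) = 1, u2(0) = 0."""
    return (0.5 * (math.exp(-t) + math.exp(-3.0 * t)),
            0.5 * (math.exp(-t) - math.exp(-3.0 * t)))


def split_error(n_steps, t_end=1.0):
    """Maximum componentwise error of the split scheme at t_end."""
    u1, u2 = split_integrate(1.0, 0.0, t_end, n_steps)
    e1, e2 = exact(t_end)
    return max(abs(u1 - e1), abs(u2 - e2))
```

Comparing `split_error(100)` with `split_error(200)` shows the error roughly halving when the time step is halved, even though each subproblem is integrated to fourth order.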


If the residuals or tendencies and their deriva- 
tives are sufficiently smooth and if one is willing 
to write a small amount of solver code that goes 
beyond the legacy component codes, a good algorithm 
for both the equilibrium problem and the implicitly 
time-discretized evolution problem is the Jacobian-free 
Newton-Krylov method (see below). Here, the problem 
is formulated in terms of a single residual that includes 
all components in the problems, 


$$F(u) = \begin{bmatrix} F_1(u_1, u_2) \\ F_2(u_1, u_2) \end{bmatrix} = 0, \tag{3}$$


where $u = (u_1, u_2)$. The basic form of Newton’s
method [II.28] to solve (3), for either equilibrium or
transient problems, is given by algorithm 3. Because of
the inclusion of the off-diagonal blocks in the Jacobian,
for example,

$$J = \begin{bmatrix} \dfrac{\partial F_1}{\partial u_1} & \dfrac{\partial F_1}{\partial u_2} \\[1ex] \dfrac{\partial F_2}{\partial u_1} & \dfrac{\partial F_2}{\partial u_2} \end{bmatrix},$$

Newton’s method is regarded as being “tightly coupled.”


Algorithm 3 (Newton’s method).

Given initial iterate $u^0$
for $k = 1, 2, \ldots$ until convergence do
    Solve $J(u^{k-1})\,\delta u = -F(u^{k-1})$
    Update $u^k = u^{k-1} + \delta u$
end for
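
Algorithm 3 can be sketched in a few lines of Python. The two-component residual `F` and its analytic Jacobian `J` below are hypothetical examples chosen for illustration, NumPy is assumed, and a dense direct solve is used for the linear system, which is reasonable only for very small problems.

```python
import numpy as np


def F(u):
    """Hypothetical coupled residual with components F1 and F2."""
    u1, u2 = u
    return np.array([u1**3 + 2.0 * u1 - np.cos(u2),
                     3.0 * u2 + np.sin(u2) - u1])


def J(u):
    """Full Jacobian of F, including the off-diagonal coupling entries."""
    u1, u2 = u
    return np.array([[3.0 * u1**2 + 2.0, np.sin(u2)],
                     [-1.0, 3.0 + np.cos(u2)]])


def newton(u0, tol=1e-12, max_iter=50):
    """Newton iteration; returns the solution and the residual-norm history."""
    u = np.asarray(u0, dtype=float)
    history = [float(np.linalg.norm(F(u)))]
    for _ in range(max_iter):
        du = np.linalg.solve(J(u), -F(u))  # Solve J(u^{k-1}) du = -F(u^{k-1})
        u = u + du                         # Update u^k = u^{k-1} + du
        history.append(float(np.linalg.norm(F(u))))
        if history[-1] < tol:
            return u, history
    raise RuntimeError("Newton iteration did not converge")
```

Inspecting `history` for a run started at the origin shows the residual norm dropping very rapidly once the iterate is close to the solution, in contrast with the steady linear decrease of a fixed-point or Gauss-Seidel loop.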


The operator and algebraic framework described 
here is relevant to many divide and conquer strategies 
in that it does not “care” (except in the critical matter 
of devising preconditioners and nonlinear component 
solvers for good convergence) whether the coupled 
subproblems are from different equations defined over 
a common domain, the same equations over different 
subdomains, or different equations over different sub- 
domains. The general approach involves iterative cor- 
rections within subspaces of the global problem. All the 
methods have in common an amenability to exploiting
a “black-box” solver philosophy that amortizes existing
software for individual physics components. The
differences are primarily in the nesting and ordering of 
loops and the introduction of certain low-cost auxiliary 
operations that transcend the subspaces. 

Not all multiphysics problems can be easily or reli- 
ably cast into these equilibrium or evolution frame- 
works, which are primarily useful for deterministic 
problems with smooth operators for linearization. 
In formulating multiphysics problems, modelers first 
apply asymptotics to triangularize or even diagonal- 
ize the underlying Jacobian as much as possible, prun- 
ing provably insignificant dependences but bearing 
in mind the conservative rule: “coupled until proven 
uncoupled.” One then applies multiscale analyses to 
simplify further, eliminating stiffness from mechanisms
that are dynamically irrelevant to the goals of
the simulation. 

Perhaps the simplest approach for solving systems 
of nonlinear equations (3) is the fixed-point iteration, 
also known as the Picard or nonlinear Richardson iteration.
The root-finding problem (3) is reformulated into a
fixed-point problem $u = G(u)$ by defining a fixed-point
iteration function, for example,

$$G(u) := u - \alpha F(u),$$

where $\alpha > 0$ is a fixed-point damping parameter that is
typically chosen to be less than 1. Fixed-point methods
then proceed through the iteration

$$u^{k+1} = G(u^k),$$

with the goal that $\|u^{k+1} - u^k\| < \epsilon$. If the iteration
function is a contraction, that is, if there exists some
$\gamma \in (0, 1)$ such that

$$\|G(u) - G(v)\| \le \gamma \|u - v\|$$

for all vectors $u$ and $v$ in a closed set containing the
fixed-point solution $u^*$, then the fixed-point iteration
is guaranteed to converge. This convergence is typically
linear, however, and can be slow even from a good
initial guess.
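
The damped fixed-point iteration is short enough to write out in full. In the sketch below the residual `F` is a hypothetical two-component example and the damping value 0.3 is an assumption chosen so that the iteration function is a contraction near the solution of this particular problem; other problems need other values.

```python
import math


def F(u):
    """Hypothetical two-component residual."""
    u1, u2 = u
    return (u1**3 + 2.0 * u1 - math.cos(u2),
            3.0 * u2 + math.sin(u2) - u1)


def fixed_point(u0, alpha=0.3, eps=1e-12, max_iter=500):
    """Iterate u <- G(u) = u - alpha * F(u) until successive iterates agree."""
    u = tuple(u0)
    for k in range(1, max_iter + 1):
        residual = F(u)
        u_new = tuple(ui - alpha * ri for ui, ri in zip(u, residual))
        step = max(abs(a - b) for a, b in zip(u_new, u))
        u = u_new
        if step < eps:
            return u, k
    raise RuntimeError("fixed-point iteration did not converge")
```

The iteration needs only residual evaluations, which is its main attraction, but the number of iterations grows as the contraction factor approaches 1.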

Newton’s method (algorithm 3) offers faster convergence,
up to quadratic. However, direct computation
of $\delta u$ in algorithm 3 may be expensive for large-scale
problems. Inexact Newton methods generalize
algorithm 3 by allowing computation of $\delta u$ with an
iterative method, requiring only that
$\|J(u^{k-1})\,\delta u + F(u^{k-1})\| < \epsilon_k$
for some set of tolerances $\epsilon_k$. Newton-Krylov
methods are variants of inexact Newton methods
in which $\delta u$ is computed with a Krylov subspace
method. This choice is advantageous because the only
information required by the Krylov subspace method
is a method for computing Jacobian-vector products
$J(u^k)v$. A consequence of this reliance on only matrix-vector
products is that these directional derivatives
may be approximated and the Jacobian matrix $J(u^k)$ is
never itself needed. The Jacobian-free Newton-Krylov
method exploits this approach, using a finite-difference
approximation to these products:

$$J(u^k)v \approx \frac{F(u^k + \sigma v) - F(u^k)}{\sigma},$$

where $\sigma$ is a carefully chosen differencing parameter
and $F$ is sufficiently smooth. This specialization facilitates
use of an inexact Newton method by eliminating
the need to identify and implement the Jacobian.
Of course, the efficiency of the Jacobian-free Newton- 
Krylov method depends on preconditioning the inner 
Krylov subspace method, and the changes in the Jaco- 
bian as nonlinear iterations progress place a premium 
on preconditioners with low setup cost. Implemen- 
tations of inexact Newton methods are available in 
several high-performance software libraries, including 
PETSc, which was mentioned above. 
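
The structure of a Jacobian-free Newton-Krylov iteration can be sketched compactly. The code below is an unpreconditioned illustration under several assumptions: the small GMRES routine is written out by hand rather than taken from a library, the choice of differencing parameter is one common heuristic among several, the forcing term `eta` is fixed, and `coupled_residual` is a hypothetical two-field test problem. Production codes would rely on a library such as PETSc and on a preconditioner.

```python
import numpy as np


def gmres(matvec, b, tol, max_dim):
    """Minimal unrestarted GMRES with zero initial guess (illustration only)."""
    beta = np.linalg.norm(b)
    Q = np.zeros((b.size, max_dim + 1))    # orthonormal Krylov basis
    H = np.zeros((max_dim + 1, max_dim))   # upper Hessenberg matrix
    Q[:, 0] = b / beta
    for j in range(max_dim):
        w = matvec(Q[:, j])
        for i in range(j + 1):             # modified Gram-Schmidt
            H[i, j] = Q[:, i] @ w
            w = w - H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        rhs = np.zeros(j + 2)
        rhs[0] = beta
        Hj = H[: j + 2, : j + 1]
        y = np.linalg.lstsq(Hj, rhs, rcond=None)[0]
        residual = np.linalg.norm(Hj @ y - rhs)
        finished = (j + 1 == max_dim) or (H[j + 1, j] < 1e-14)
        if residual <= tol or finished:
            return Q[:, : j + 1] @ y
        Q[:, j + 1] = w / H[j + 1, j]


def jfnk(F, u0, tol=1e-10, eta=1e-2, max_newton=30):
    """Inexact Newton iteration using only residual evaluations of F."""
    u = np.asarray(u0, dtype=float).copy()
    for k in range(max_newton):
        Fu = F(u)
        norm_F = np.linalg.norm(Fu)
        if norm_F < tol:
            return u, k

        def jac_vec(v, u=u, Fu=Fu):
            # Finite-difference approximation of J(u) v.
            sigma = (np.sqrt(np.finfo(float).eps)
                     * (1.0 + np.linalg.norm(u)) / np.linalg.norm(v))
            return (F(u + sigma * v) - Fu) / sigma

        du = gmres(jac_vec, -Fu, tol=eta * norm_F, max_dim=u.size)
        u = u + du
    raise RuntimeError("JFNK did not converge")


def coupled_residual(u):
    """Hypothetical residual: two diffusing fields with pointwise coupling."""
    m = u.size // 2
    u1, u2 = u[:m], u[m:]
    h = 1.0 / (m + 1)

    def neg_laplacian(w):
        p = np.concatenate(([0.0], w, [0.0]))
        return (2.0 * p[1:-1] - p[:-2] - p[2:]) / h**2

    F1 = neg_laplacian(u1) + u1**3 - u2
    F2 = neg_laplacian(u2) + u2 - 0.5 * u1 - 1.0
    return np.concatenate((F1, F2))
```

A call such as `jfnk(coupled_residual, np.zeros(16))` returns the converged state and the number of Newton steps taken; no Jacobian matrix is ever formed.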

Fixed-point and inexact Newton methods can be used 
to solve multiphysics problems in a fully coupled man- 
ner, but they can also be used to implement coupling 
strategies such as algorithms 1 and 2. With an implicit 
method for one of the components, one can directly 
eliminate it in a linear or nonlinear Schur complement 
formulation. The direct elimination process for $u_1$ in
the first equation of the equilibrium system (1a), given
$u_2$, can be symbolically denoted

$$u_1 = G(u_2),$$

with which the second equation (1b) is well defined in
the form

$$F_2(G(u_2), u_2) = 0.$$

Each iteration thereof requires subiterations to solve (in 
principle to a high tolerance) the first equation. Unless 
the first system is much smaller or easier than the sec- 
ond, this is not likely to be an efficient algorithm, but 
it may have robustness advantages. 

If the problem is linear,

$$\left.\begin{aligned} F_1(u_1, u_2) &= f_1 - A_{11}u_1 - A_{12}u_2, \\ F_2(u_1, u_2) &= f_2 - A_{21}u_1 - A_{22}u_2, \end{aligned}\right\} \tag{4}$$

then $F_2(G(u_2), u_2) = 0$ involves the traditional Schur
complement

$$S = A_{22} - A_{21}A_{11}^{-1}A_{12}.$$
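
For the linear case the elimination can be carried out explicitly. The following sketch assumes dense NumPy arrays and an invertible first diagonal block; the function name `schur_solve` is hypothetical. Substituting $u_1 = A_{11}^{-1}(f_1 - A_{12}u_2)$ into the second block row leaves a system for $u_2$ alone whose matrix is the Schur complement.

```python
import numpy as np


def schur_solve(A11, A12, A21, A22, f1, f2):
    """Solve the linear coupled system (4) by eliminating u1.

    Returns u1, u2 and the Schur complement S.
    """
    A11_inv_A12 = np.linalg.solve(A11, A12)
    A11_inv_f1 = np.linalg.solve(A11, f1)
    S = A22 - A21 @ A11_inv_A12                      # Schur complement
    u2 = np.linalg.solve(S, f2 - A21 @ A11_inv_f1)   # reduced problem for u2
    u1 = A11_inv_f1 - A11_inv_A12 @ u2               # back-substitution
    return u1, u2, S
```

In a large application the products with the inverse of the first block would be applied by an existing solver for the first physics component rather than formed as dense matrices.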


If the problem is nonlinear and if Newton’s method is
used in the outer iteration, the Jacobian

$$\frac{dF_2}{du_2} = \frac{\partial F_2}{\partial u_1}\frac{\partial G}{\partial u_2} + \frac{\partial F_2}{\partial u_2}$$

is, to within a sign, the same Schur complement.

Similar procedures can be defined for the evolution 
problem, which, when each phase is implicit, becomes 
a modified root-finding problem on each time step 
with the Jacobian augmented with an identity or mass 
matrix. 

If the types of nonlinearities in the two compo- 
nents are different, a better method may be the non- 
linear Schwarz method or the additive Schwarz pre- 
conditioned inexact Newton (ASPIN) method. In non- 
linear Schwarz, one solves component subproblems 
(by Newton or any other means) for componentwise 
corrections, 

$$F_1(u_1^{k-1} + \delta u_1, u_2^{k-1}) = 0,$$

$$F_2(u_1^{k-1}, u_2^{k-1} + \delta u_2) = 0,$$

and uses these legacy componentwise procedures im- 
plicitly to define modified residual functions of the two 
fields: 

$$G_1(u_1, u_2) = \delta u_1,$$

$$G_2(u_1, u_2) = \delta u_2.$$

One then solves the modified root-finding problem 

$$G_1(u_1, u_2) = 0,$$

$$G_2(u_1, u_2) = 0.$$

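If one uses algorithm 3 to solve the modified problem,
the Jacobian is

$$\begin{bmatrix} \dfrac{\partial G_1}{\partial u_1} & \dfrac{\partial G_1}{\partial u_2} \\[1ex] \dfrac{\partial G_2}{\partial u_1} & \dfrac{\partial G_2}{\partial u_2} \end{bmatrix} \approx \begin{bmatrix} I & \left(\dfrac{\partial F_1}{\partial u_1}\right)^{-1} \dfrac{\partial F_1}{\partial u_2} \\[1ex] \left(\dfrac{\partial F_2}{\partial u_2}\right)^{-1} \dfrac{\partial F_2}{\partial u_1} & I \end{bmatrix},$$
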
which clearly shows the impact of the cross-coupling 
in the off-diagonals. In practice, the outer Newton 
method must converge in a few steps for ASPIN to be 
worthwhile, since the inner iterations can be expensive.
In nonlinear Schwarz, the partitioning of the global
unknowns into $u_1$ and $u_2$ need not be along purely
physical lines, as it would be if they came from a pair
of legacy codes. The global variables can be partitioned
into overlapping subsets, and the decomposition can
be applied recursively to obtain a large number of small
Newton problems. The key to the convenience of nonlinear
Schwarz is that the outer iteration is Jacobian-free
and the inner iterations may involve legacy solvers.
(This is also true for the Jacobian-free Newton-Krylov 
method, where the inner preconditioner may involve 
legacy solvers.) 
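
A toy version of the nonlinear Schwarz construction is sketched below for two scalar fields. All specifics are assumptions for illustration: the residuals `F1` and `F2` are invented, `solve_scalar` stands in for a legacy component solver, and the outer problem is solved by Newton's method with a finite-difference Jacobian because there are only two unknowns. With corrections defined by $F_1(u_1 + \delta u_1, u_2) = 0$, differentiating at a solution gives $\partial G_1/\partial u_1 = -I$, so under this convention the Jacobian of the modified residual agrees with the block matrix displayed above only up to an overall sign.

```python
import math


def F1(u1, u2):
    """Hypothetical residual of the first component."""
    return u1**3 + 2.0 * u1 - math.cos(u2)


def F2(u1, u2):
    """Hypothetical residual of the second component."""
    return 3.0 * u2 + math.sin(u2) - u1


def solve_scalar(g, x0, tol=1e-14, max_iter=100):
    """Stand-in for a legacy component solver (scalar quasi-Newton)."""
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:
            break
        h = 1e-7 * (1.0 + abs(x))
        x -= gx * h / (g(x + h) - gx)
    return x


def G(u1, u2):
    """Modified residual: the componentwise corrections (du1, du2)."""
    du1 = solve_scalar(lambda d: F1(u1 + d, u2), 0.0)
    du2 = solve_scalar(lambda d: F2(u1, u2 + d), 0.0)
    return du1, du2


def fd_jacobian(u1, u2, h=1e-6):
    """Finite-difference Jacobian of G, built only from evaluations of G."""
    g0 = G(u1, u2)
    ga = G(u1 + h, u2)
    gb = G(u1, u2 + h)
    return [[(ga[0] - g0[0]) / h, (gb[0] - g0[0]) / h],
            [(ga[1] - g0[1]) / h, (gb[1] - g0[1]) / h]]


def outer_newton(u1, u2, tol=1e-10, max_iter=20):
    """Solve G(u1, u2) = 0 by Newton's method on the modified residual."""
    for k in range(max_iter):
        g1, g2 = G(u1, u2)
        if max(abs(g1), abs(g2)) < tol:
            return u1, u2, k
        (a, b), (c, d) = fd_jacobian(u1, u2)
        det = a * d - b * c
        u1 -= (d * g1 - b * g2) / det   # 2-by-2 solve by Cramer's rule
        u2 -= (a * g2 - c * g1) / det
    raise RuntimeError("outer iteration did not converge")
```

Because the diagonal blocks of the modified Jacobian are (minus) identities, the outer problem is much closer to linear than the original one, and few outer steps are needed here.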

Error bounds on linear and nonlinear block Gauss- 
Seidel solutions of coupled multiphysics problems 
show how the off-diagonal blocks of the Jacobian enter 
the analysis. Consider the linear case (4) for which the
Jacobian of the physics-blocked preconditioned system
is

$$\begin{bmatrix} I & A_{11}^{-1}A_{12} \\ A_{22}^{-1}A_{21} & I \end{bmatrix}.$$

Let the product of the coupling blocks be defined by the
square matrix $C = A_{11}^{-1}A_{12}A_{22}^{-1}A_{21}$, the norm of which
may be bounded as $\|C\| \le \|A_{11}^{-1}\|\,\|A_{12}\|\,\|A_{22}^{-1}\|\,\|A_{21}\|$.
Provided $\|C\| \le 1$, any linear functional of the solution
$(u_1, u_2)$ of (4) solved by the physics-blocked Gauss-Seidel
method (algorithm 1) satisfies conveniently computable
bounds in terms of residuals of the individual
physics blocks of (4). The cost of evaluating the bound
involves the action of the inverse uniphysics operators
$A_{11}^{-1}$ and $A_{22}^{-1}$ on vectors coming from the uniphysics
residuals and the dual vectors defining the linear func- 
tionals of interest. The bounds provide confidence 
that the block Gauss-Seidel method has been iter- 
ated enough to produce sufficiently accurate outputs 
of interest, be they point values, averages, fluxes, or 
similar. The required actions of the inverse uniphysics 
operators are a by-product of an implicit method for 
each phase. The nonlinear case is similar, except that 
the Jacobian matrices from each physics phase that 
comprise C may change on each block Gauss-Seidel 
iteration. 
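
The role of the coupling product $C$ can be checked numerically in the linear case. The sketch below does not reproduce the functional error bounds referred to above; it only illustrates, under the assumption of dense NumPy blocks and with hypothetical function names, that the error in the first field is multiplied by $C$ on each block Gauss-Seidel sweep, which is why the size of $C$ governs convergence.

```python
import numpy as np


def coupling_matrix(A11, A12, A21, A22):
    """C = A11^{-1} A12 A22^{-1} A21, the product of the coupling blocks."""
    return np.linalg.solve(A11, A12) @ np.linalg.solve(A22, A21)


def block_gauss_seidel(A11, A12, A21, A22, f1, f2, n_iter):
    """Physics-blocked Gauss-Seidel sweeps for the linear system (4),
    started from zero; returns the list of iterates (u1, u2)."""
    u1 = np.zeros_like(f1)
    u2 = np.zeros_like(f2)
    iterates = []
    for _ in range(n_iter):
        u1 = np.linalg.solve(A11, f1 - A12 @ u2)  # first physics, u2 frozen
        u2 = np.linalg.solve(A22, f2 - A21 @ u1)  # second physics, new u1
        iterates.append((u1.copy(), u2.copy()))
    return iterates
```

Subtracting the exact solution from successive iterates shows that the error in $u_1$ at sweep $k$ equals $C$ applied to the error at sweep $k-1$.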

Preconditioning is essential for efficiency whenever a 
Krylov subspace method is used. While there has been 
considerable success in developing black-box algebraic 
strategies for preconditioning linear systems, precon- 
ditioners for multiphysics problems generally need to 
be designed by hand. 

A widely used approach is to use a block-diagonal 
approximation of the Jacobian of the system. Improved 
performance can generally be achieved by captur- 
ing the strong couplings in the preconditioner and 
leaving the weak couplings to be resolved by the 
outer Krylov/Newton iterations. Such approaches gen- 
erally lead to identification of a Schur complement 
that embodies the important coupling, and their success
relies on judicious approximation of the Schur
complement.
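
The difference between a block-diagonal preconditioner and one that captures the coupling through a Schur complement can be seen on a small dense example. The sketch below is illustrative only: it forms the preconditioned operators explicitly, which no practical code would do, and it uses the exact Schur complement, whereas in practice an approximation is the whole point. The function name is hypothetical.

```python
import numpy as np


def preconditioned_operators(A11, A12, A21, A22):
    """Return P_diag^{-1} J and P_tri^{-1} J for the 2-by-2 block Jacobian J.

    P_diag keeps only the diagonal blocks; P_tri is block lower triangular
    with the exact Schur complement in the (2, 2) position.
    """
    n1, n2 = A11.shape[0], A22.shape[0]
    J = np.block([[A11, A12], [A21, A22]])
    P_diag = np.block([[A11, np.zeros((n1, n2))],
                       [np.zeros((n2, n1)), A22]])
    S = A22 - A21 @ np.linalg.solve(A11, A12)
    P_tri = np.block([[A11, np.zeros((n1, n2))],
                      [A21, S]])
    return np.linalg.solve(P_diag, J), np.linalg.solve(P_tri, J)
```

With the block-diagonal choice the off-diagonal coupling remains in the preconditioned operator and must be resolved by the Krylov iteration. With the triangular choice the preconditioned operator is the identity plus a matrix whose square is zero, so a Krylov method such as GMRES would converge in at most two iterations in exact arithmetic.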

6 Anatomy of a Large-Scale Simulation 

We illustrate many issues in computational science by 
discussing the multiphysics application of turbulent 
reacting flows. Fittingly, the first ACM-SIAM Prize in 
Computational Science and Engineering was awarded 
in 2003 to two scientists at Lawrence Berkeley National 
Laboratory who addressed this difficult problem by 
integrating analytical and numerical techniques. Apart 
from the scientific richness of the combination of fluid 
turbulence and chemical reaction, turbulent flames are 
at the heart of the design of equipment such as recip- 
rocating engines, turbines, furnaces, and incinerators 
for efficiency and minimal environmental impact. Effec- 
tive properties of turbulent flame dynamics are also 
required as model inputs in a broad range of larger- 
scale simulation challenges, including fire spread in 
buildings or wildfires, stellar dynamics, and chemical 
processing. 

The first step in such simulations is to specify the 
models for the reacting flows. The essential feature of 
reacting flows is the set of chemical reactions at a pri- 
ori unknown locations in the fluid. As well as chemical 
products, these reactions produce both temperature 
and pressure changes, which couple to the dynamics 
of the flow. Thus, an accurate description of the reac- 
tions is critical to predicting the shape and properties 
of the flame. Simultaneously, it is the fluid flow that 
transports the reacting chemical species to the reac- 
tion zone and transports the products of the reaction 
and the released energy away from the reaction zone. 
The location and shape of the reaction zone are deter- 
mined by a delicate balance of species, energy, and 
momentum fluxes and are highly sensitive to how these 
fluxes are specified at the boundaries of the compu- 
tational domain. Turbulence can wrinkle the reaction 
zone, giving it a much greater area than it would have 
in its laminar state, without turbulence. Hence, incor- 
rect prediction of turbulence intensity may under- or 
over-represent the extent of reaction. 

From first principles, the reactions of molecules are 
described by the Schrödinger equation [III.26] and
fluid flow is described by the Navier-Stokes equations
[III.23]. However, each of these equation sets
is too difficult to solve directly, so we must rely on 
approximations. These approximations define the computational
models used to describe the flame. For
the combustion of methane, a selected set of eighty-four
reactions involves twenty-one chemical species
(methane, oxygen, water, carbon dioxide, and many 
trace products and reaction intermediates). A particu- 
larly useful approximation of the compressible Navier- 
Stokes model allows detailed consideration of the con- 
vection, diffusion, and expansion effects that shape the 
reaction, but it is filtered to remove sound waves, which 
pose complications that do not measurably affect com- 
bustion in the laboratory regime of the flame. The selec- 
tion of models such as these requires context-specific 
expert judgment. For instance, in a jet turbine simula- 
tion, sound waves might represent a nonnegligible por- 
tion of the energy budget of the problem and would 
need to be modeled. However, in a jet, it might be valid 
(and very cost-effective) to use a less intricate chemi- 
cal reaction mechanism to answer the questions con- 
cerning flame stability, thrust, noise, and the like. If 
atmospheric pollution were the main subject of the 
investigation, on the other hand, an even more detailed 
chemical model involving more photosensitive trace 
species might be required. (Development of the chem- 
istry model and validation of the predictions of the 
simulation make an interesting scientific story, but one 
that is too long to be told here.) 

The computations that led to the first ACM-SIAM 
Prize in Computational Science and Engineering were 
performed on an IBM SP supercomputer named “Sea- 
borg” at the National Energy Research Scientific Com- 
puting Center, using as many as 2048 processors. 
At the time, they were among the most demanding 
combustion simulations ever performed, though they 
have since been eclipsed in terms of both complex- 
ity and computational resources by orders of magni- 
tude. The ability to perform them was not, however, 
merely (or even mainly) a result of improvements in 
computer technology. Improvements in algorithm tech- 
nology were even more instrumental in making the 
computations feasible. As mentioned above, mathe- 
matical analysis was used to reformulate the equations 
describing the fluid flow so that high-speed acoustic 
transients were removed analytically while compress- 
ibility effects due to chemical reactions were retained. 
The resulting model of the fluid flow was discretized 
using high-resolution finite-difference methods, com- 
bined with local adaptive mesh refinement by which 
regions of the finite-difference grid were automatically 
refined or coarsened in a structured manner to maximize
overall computational efficiency. The implementation
used an object-oriented, message-passing software
framework that handled the complex data distribution
and dynamic load balancing needed to effectively
exploit thousandfold parallelism. The data analysis
framework used to explore the results of the simulation
and create visual images that lead to understanding is
based on recent developments in “scripting languages”
from computer science. The combination of these algorithmic
innovations reduced the computational cost
by a factor of 10 000 for the same effective resolution
compared with a standard uniform-grid approach.
(See figure 2 for a side-by-side comparison of a slotted
V-flame experiment and a realization of a three-dimensional
time-dependent simulation prepared for
the same conditions.)

Figure 2 A comparison of (a) the experimental particle
image velocimetry crossview of the turbulent V flame and
(b) a simulation. Reproduced from “Numerical simulation
of a laboratory-scale turbulent V-flame” by J. B. Bell et al.
(Proceedings of the National Academy of Sciences of the USA
102:10,009 (2005)).

As the reaction zone shifts, the refinement auto- 
matically tracks it, adding and removing resolution 
dynamically. Unfortunately, the adaptivity and mathe- 
matical filtering employed to save memory and reduce 
the number of operations complicate the software and 
throw the execution of the code into a regime of com- 
putation that processes fewer operations per second 
and uses thousands of processors less uniformly than 
a “business as usual” algorithm would. As a result, the 
scientifically effective simulation runs at a small per- 
centage of the theoretical peak rate of the hardware. In 
this case, a simplistic efficiency metric like “percentage 
of theoretical peak” is misleading. 

Software of this complexity and versatility could 
never be assembled in the traditional mode of devel- 
opment whereby individual researchers with sepa- 
rate concerns asynchronously toss software written to 
a priori specifications “over the transom.” Behind any
simulation as complex as this stands a tightly coordinated,
vertically integrated multidisciplinary team.

7 The Grand Challenge of Computational Science

Many “grand challenges” of computational science, in 
simulating complex multiscale multiphysics systems, 
await the increasing power of high-performance com- 
puting and insight and innovation in mathematics. It 
can be argued that any system that can deterministi- 
cally and reproducibly simulate another system must 
be at least as “complex” as the system being simulated. 
Therefore, the high-performance computational envi- 
ronment itself is the most complex system to model. 
The grandest challenge of computational science is 
making this complex simulation system sufficiently 
manageable to use that scientists who are expert in 
something else (e.g., combustion, reservoir modeling, 
magnetically confined fusion) can employ it effectively. 
In over six decades of computational science, humans 
have so far bridged the gap between the complexity 
of the system being modeled and machine complex- 
ity, with von Neumann as an exemplar of an indi- 
vidual who made contributions at all levels, from the 
modeling of the physics to the construction of com- 
puter hardware. After all of the components of the 
chain are delivered, the remaining challenge is the 
vertical integration of models and technology, lead- 
ing to a hardware-software instrument that can be 
manipulated to perform the theoretical experiments of 
computational science. 

Glossary 

Computational science (and engineering): the verti- 
cally integrated multidisciplinary process of exploring 
scientific hypotheses using computers. 

Computational “X” (where “X” is a particular natural or 
engineering science, such as physics, chemistry, biol- 
ogy, geophysics, fluid dynamics, structural mechanics, 
or electromagnetodynamics): a specialized subset of 
computational science concentrating on models, prob- 
lems, techniques, and practices particular to problems 
from “X.” 

Scientific computing: an intersection of tools and tech- 
niques in the pursuit of different types of computa- 
tional “X.” 

Strong/weak coupling of physical models: strong
(respectively, weak) coupling refers to strong (respectively,
weak) interactions between different physics
models that are intrinsic in a natural process. Mathematically,
the off-diagonal blocks of the Jacobian
matrix of a strongly coupled multiphysics model may
be full or sparse but contain relatively large entries.
In contrast, the Jacobian of a weakly coupled multiphysics
model contains relatively small off-diagonal entries.

Tight/loose coupling of numerical models: tight (re- 
spectively, loose) coupling refers to a high (respec- 
tively, low) degree of synchronization of the state vari- 
ables across different physical models. A tightly cou- 
pled scheme (sometimes referred to as a strongly cou- 
pled scheme) keeps all the state variables as synchro- 
nized as possible across different models at all times, 
whereas a loosely coupled scheme might allow the state 
variables to be shifted by one time step or be staggered 
by a fraction of the time step. 

IV.17 Data Mining and Analysis

Chandrika Kamath 


1 The Need for Data Analysis 

Technological advances are enabling us to acquire ever- 
increasing amounts of data. The amount of data, now 
routinely measured in petabytes, is matched only by 
its complexity, with the data available in the form of 
images, sequences of images, multivariate time series, 
unstructured text documents, graphs, sensor streams, 
and mesh data, sometimes all in the context of a sin- 
gle problem. Mathematical techniques from the fields 
of machine learning, optimization, pattern recognition, 
and statistics play an important role as we analyze
these data to gain insight and exploit the information
present in them.

Figure 1 A schematic showing the overall process of data analysis.

The types of analysis that can be done are often as 
varied as the data themselves. One of the best-known 
examples occurs in the task of searching, or informa- 
tion retrieval, where we want to identify documents 
containing a given phrase, retrieve images similar to 
a query image, or find the answer to a question from a 
repository of facts. In other analysis tasks, the focus is 
prediction, where, given historical data that relate input 
variables to output variables, we want to determine the 
outputs that result from a given set of inputs. For exam- 
ple, given examples of malignant and benign tumors in 
medical images, we may want to build a model that will 
identify the type of tumor in an image. Or we could 
analyze credit card transactions to build models that 
identify fraudulent use of a card. In other problems 
the focus is on building descriptive models. For exam- 
ple, we want to group chemicals with similar behavior 
to understand what might cause that behavior, or we 
may cluster users who like similar movies so we can 
recommend other movies to them. 

Data mining is the semiautomatic process of dis- 
covering associations, anomalies, patterns, and statis- 
tically significant structures in data. At the risk of over- 
simplification, we can consider data mining, or data 
analysis, to have three phases, as shown in figure 1: the 
representation of the data, dimension reduction, and 
the identification of patterns. 

First, the raw data are processed to bring them into a 
form more suitable for analysis. This is especially true 
when the data are in the form of images, text docu- 
ments, or other formats that require us to identify the 
objects in the raw data and find appropriate represen- 
tations for them. A common way of representing data 
is in the form of a table (see table 1) in which each 
row represents a data item (which could be a docu- 
ment, an image, a galaxy, or a chemical compound) and 
in which the columns are the features, or descriptors, 
that characterize the data item. These features are cho- 
sen to reflect the analysis task and could include the 
frequency of different words in a document, the texture
of an image, the shape of the objects in an image,
or the properties of the atomic species that constitute
a chemical compound. Sometimes a data item may be
associated with an output variable, which could be discrete
(e.g., indicating if a transaction is valid or fraudulent)
or continuous (such as the formation enthalpy of
a chemical compound). This table, viewed as a matrix,
could be sparse if not all features are available for a
given data item, for example, words that do not occur
in a document.

Table 1 Table data, with each data item $I_i$ characterized
by $d$ features, and potentially an output $O_i$.

Each data item in the table can be considered as a 
point in a space spanned by the features; if the num- 
ber of features is high, that is, the point is in a high- 
dimensional space, it can make the analysis challeng- 
ing. This requires dimension reduction techniques to 
identify the key features in the data. 

The final step of pattern recognition is often seen as 
the focus of analysis. However, its success is strongly 
dependent on how well the previous steps have been 
performed. Not identifying the data items correctly 
or representing them using inappropriate features can 
result in inaccurate insights into the data. 

The process of data analysis is iterative: the represen- 
tation of the data is refined and the dimension reduc- 
tion and pattern recognition steps repeated until the 
desired accuracy is obtained. Data analysis is also inter- 
active, with the domain experts providing input at each 
step to ensure the relevance of the process to the prob- 
lem being addressed. Depending on the type of data 
being analyzed and the problem being addressed, these 
three phases of data representation, dimension reduc- 
tion, and pattern recognition can be implemented using 
a host of mathematical techniques. In the remainder of 
this article we review some of these techniques, iden- 
tify their relationship with methods in other domains, 
and discuss future trends. 


2 History 

Data analysis has a long history. We could say that it 
started with the ancient civilizations, which, observing 
the motions of celestial objects, identified patterns that 
led to the laws of celestial mechanics. Some of the early
ideas in statistics arose when governments collected
demographic and economic data to help them manage
their populations better, while probability played an
important role in the study of games of chance and the
analysis of random phenomena. The field of statistics,
as we know it today, goes back at least a couple of hundred
years. By modern standards, the data sets being
analyzed then were tiny, composed of a few hundred 
numbers. However, given that this analysis was done 
before the advent of computers, it was a remarkable 
achievement. The field has evolved considerably since 
then, prompted by new sources of data, a broader range 
of questions being addressed through the analysis, and 
different fields contributing different insights into the 
data. 

The recent surge in both the size and the complexity 
of data started in the 1970s as technology enabled us 
to generate, store, and process vast amounts of data. 
Sensors provided some of the richest sources of data, 
ranging from those used in medicine, such as positron
emission tomography [VII.9] (PET) and magnetic
resonance imaging [VII.10 §4.1] (MRI) scanning, to
sensors monitoring equipment, such as cars and sci- 
entific experiments, and sensors observing the world 
around us through telescopes and satellites. Comput- 
ers not only enabled the analysis of these data, they 
contributed to this data deluge as well. Numerical sim- 
ulations are increasingly being used to model complex 
phenomena in problems where experiments are infea- 
sible or expensive. The resulting spatiotemporal data, 
often in the form of variables at grid points, are ana- 
lyzed to gain insight into the phenomena. The Internet, 
connecting computers worldwide, is one of the newest 
sources of data, providing insight into how the world 
is connected. These data are generated by sensors that 
observe network traffic, by web services that monitor 
the online browsing patterns of users, and by the net- 
work itself, which changes dynamically as connections 
are made or broken. 

The tasks in data analysis have also evolved to handle 
the new types of data, the new questions being asked of 
the data, and the new ways in which users interact with 
the data. For example, network data, images, mesh data 
from simulations, and text documents are not in a form 
that can be directly used for pattern identification; they 
all have to be converted to an appropriate representa- 
tion first. The high dimensionality of many data sets 
has led to the development of dimension reduction 
techniques, which shed light on the important features
of the objects in a data set. The desire to improve
the accuracy of predictions by analyzing many differ- 
ent modalities of data simultaneously has led to data 
fusion techniques. More recently, a recognition that the 
analysis of the data should be closely coupled with their 
generation has resulted in the reemergence of ideas 
from the field of design of experiments. Further, as data 
analysis gains acceptance and the results are used in 
making decisions, techniques to address missing and 
uncertain features have become important, and tools 
to reason under uncertainty are playing a greater role, 
especially when the risks and rewards associated with 
the decisions can be so high. 

Along with these new sources of data and the novel
questions that are being asked of the data, different
fields have started to contribute to the broad area
of data analysis. In addition to statistics, advances in
the field of pattern recognition have focused on patterns
in signals and images, while work in artificial
intelligence and machine learning has tried to mimic
human reasoning, and developments in rule-based systems
have been used to extract meaningful rules from
data. The more recent field of data mining was originally
motivated by the need to make better use of
databases, though it too is evolving into specialized 
topics such as web mining, scientific data mining, text 
mining, and graph mining. Domain scientists have also 
contributed their ideas for analyzing data. As a result, 
data analysis is replete with examples where the same 
algorithm is known by different names in different 
domains, and a technique proposed in one domain is 
found to be a special case of a technique from a dif- 
ferent domain. The process of inference from data has 
benefited immensely from the contributions of these 
different fields, with each providing its own perspective 
on the data and, in the process, making data analysis 
a field rich in the diversity of mathematical techniques
employed. 

3 Data Representation 

Many pattern recognition algorithms require the data 
to be in a tabular form, as shown in table 1. However, 
this is rarely the form in which the data are given to 
a data analyst, especially when the raw data are avail- 
able as images, time series, documents, or network 
data. Converting the data into a representation suitable
for analysis is a crucial first step in any analysis
endeavor. However, this task is difficult as the representation
is very dependent on the problem, the type of
data, and the quality of the data. Domain- and problem-dependent
methods are often used, and the representation
is iteratively refined until the analysis results are
acceptable.

Consider the problem of improving the quality of 
images, which are often corrupted by noise, making it 
difficult to identify the objects in the data. A simple 
solution is to apply a mean filter. In the case of a 3 × 3
filter, this would imply convolving an image with

$$\frac{1}{9}\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix},$$

thus replacing each image pixel by the weighted average
of the pixel and its eight neighbors. While a mean
filter assigns equal weight to all pixels within its support,
a two-dimensional Gaussian filter with standard
deviation $\sigma_d$,

$$w_d(p_i, p_j) = \frac{1}{2\pi\sigma_d^2} \exp\left(-\frac{|p_i - p_j|^2}{2\sigma_d^2}\right),$$

assigns a lower weight when the pixel $p_j$ is farther
from the pixel $p_i$. A more recent development, the bilateral
filter, applies this idea in both the domain and
the range of the neighborhood around each pixel. Each
weight in the filter is the normalized product of the
corresponding weights $w_d$ and $w_r$:

$$w_r(p_i, p_j) = \frac{1}{2\pi\sigma_r^2} \exp\left(-\frac{\delta(I(p_i) - I(p_j))^2}{2\sigma_r^2}\right),$$

where $\sigma_r$ is the standard deviation of the range filter
and $\delta(I(p_i) - I(p_j))$ is a measure of the distance
between the intensities at pixels $p_i$ and $p_j$. When the
intensity difference is high (near an edge of an object,
for example), less smoothing is done, making it easier
to identify the edge in the next step of the analysis. 
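
The bilateral weighting just described can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names, the fixed window radius, the handling of border pixels (neighbors outside the image are simply skipped), and the use of the absolute intensity difference as the distance measure are all assumptions made here. The Gaussian normalization constants are omitted because they cancel when the weights are normalized.

```python
import math

def gaussian(x, sigma):
    """Unnormalized Gaussian weight; constants cancel after normalization."""
    return math.exp(-(x * x) / (2.0 * sigma * sigma))

def bilateral_filter(image, sigma_d, sigma_r, radius=1):
    """Smooth a 2-D list of intensities with a bilateral filter.

    Each output pixel is a weighted average of its neighbors. The weight is
    the product of a domain (spatial) Gaussian and a range (intensity)
    Gaussian, divided by the sum of all such products in the window.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, weight_sum = 0.0, 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < rows and 0 <= cc < cols):
                        continue  # skip neighbors outside the image
                    w_d = gaussian(math.hypot(dr, dc), sigma_d)
                    w_r = gaussian(image[rr][cc] - image[r][c], sigma_r)
                    total += w_d * w_r * image[rr][cc]
                    weight_sum += w_d * w_r
            out[r][c] = total / weight_sum
    return out

# A vertical step edge: dark on the left, bright on the right.
step = [[0.0, 0.0, 100.0, 100.0] for _ in range(4)]
edge_preserved = bilateral_filter(step, sigma_d=1.0, sigma_r=10.0)
# A very large sigma_r makes the range weight nearly constant, so the
# filter then behaves like an ordinary Gaussian filter and blurs the edge.
edge_blurred = bilateral_filter(step, sigma_d=1.0, sigma_r=1e9)
```

With a small range parameter the pixels on either side of the step keep essentially their original intensities, whereas the large range parameter mixes the two sides.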

In 1987 Perona and Malik noted that convolving an
image with the Gaussian filter is equivalent to solving
the diffusion equation

$$\frac{\partial I}{\partial t} = \nabla^2 I = \frac{\partial^2 I}{\partial x^2} + \frac{\partial^2 I}{\partial y^2},$$

where $I(x, y, t)$ is the two-dimensional image $I(x, y)$
at time $t = 0.5\sigma_d^2$ and $I(x, y, 0)$ is the original image.
This allowed the adoption of ideas from partial differ- 
ential equations (PDEs) to address the ill-posedness of 
the original formulation and to enhance de-noising by 
making suitable choices for the coefficients of the PDEs 
in order to restrict smoothing near the edges of objects 
in an image. 

There has been similar evolution in the realm of segmentation
techniques for identifying objects in images.
Starting with the idea that there is a sharp gradient in
the intensity at the boundary of an object, we can use a
simple filter, such as the Sobel operators in the x- and
y-dimensions,

$$\begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \quad\text{and}\quad \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix},$$

to obtain the magnitude and orientation of the gradient.
This basic idea is used in the Canny edge detector,
which, despite its simplicity, performs quite well.
It starts by smoothing the image using a Gaussian filter 
and then applies the Sobel operators. The edges found 
are thinned to a single pixel by suppressing values that 
are not maximal in the gradient direction. Finally, a two- 
parameter hysteresis thresholding retains both strong 
edge pixels and moderate-intensity edge pixels that are 
connected to the strong edge pixels. This results in a 
robust and well-localized detection of edges. However,
in regions of low contrast, the gradient detectors may 
fail to identify any edge pixels, leading to edges that are 
not closed contours. To address this drawback, we need 
to use techniques such as snakes and implicit active 
contours, which again blend ideas from image process- 
ing and PDEs. In turn, this allows us to identify objects 
in mesh data from simulations by using the PDE version 
of a segmentation algorithm. 
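
As a small illustration of the gradient step on which these edge detectors build, the following Python sketch applies a pair of Sobel kernels to an image stored as a list of rows. The kernel signs follow one common convention and are an assumption here, as are the function name and the choice to leave border pixels at zero. The kernels are applied by correlation (no kernel flip), which changes only the sign of the result relative to a true convolution.

```python
import math

# One common sign convention for the Sobel kernels (assumed here).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_gradient(image):
    """Return (magnitude, orientation) grids; border pixels stay at zero."""
    rows, cols = len(image), len(image[0])
    magnitude = [[0.0] * cols for _ in range(rows)]
    orientation = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for i in range(3):
                for j in range(3):
                    value = image[r + i - 1][c + j - 1]
                    gx += SOBEL_X[i][j] * value
                    gy += SOBEL_Y[i][j] * value
            magnitude[r][c] = math.hypot(gx, gy)
            orientation[r][c] = math.atan2(gy, gx)  # radians
    return magnitude, orientation

# A vertical step edge: the gradient points in the x-direction.
step = [[0, 0, 10, 10] for _ in range(4)]
magnitude, orientation = sobel_gradient(step)
```

The later stages of the Canny detector (non-maximum suppression and hysteresis thresholding) operate on these two grids and are not shown.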

Other segmentation techniques, referred to as re- 
gion-growing methods, identify objects in images by 
exploiting the fact that the pixels in the interior of 
an object are similar to each other. Starting with the 
highest intensity pixel, a similarity metric is used to 
grow a region around it. The process is repeated with 
the highest intensity pixel among the remaining ones, 
until all pixels are assigned to a region. Cleanup is 
often required to merge similar regions that are adja- 
cent to each other or to remove very small regions. This 
idea of grouping data items based on their similarity 
is referred to as clustering and is described further in 
section 5.3. These clustering techniques can be applied 
to images and mesh data by defining a similarity met- 
ric that takes into account both the spatial locations of 
and the values at the pixels or mesh points. 

Once we have identified objects in images, or se- 
quences of images, we can represent them using 
features such as shape, texture, size, and various 
moments. There are different ways in which many of 
these features can be defined. For example, for objects 
in images, the shape feature can be represented as a 
linear combination of two-dimensional basis functions 
such as shapelets or the angular radial transform
described in the MPEG-7 multimedia standard. The 
coefficients of the linear combination would be the 
features representing the object. Alternatively, we can 
focus on the boundary of an object and define the shape 
features as the coefficients of a Fourier series expansion 
of the curve that describes the boundary. 
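
The boundary-based shape features can be sketched as follows, treating each boundary point (x, y) as the complex number x + iy and taking a discrete Fourier transform along the closed curve. The function name and the particular normalization are assumptions made for this sketch: the zeroth coefficient is dropped, only magnitudes are kept, and everything is divided by the magnitude of the first coefficient, which is assumed to be nonzero.

```python
import cmath
import math

def fourier_descriptors(boundary, n_coeffs):
    """Shape features from a closed boundary given as ordered (x, y) points.

    Returns the magnitudes of Fourier coefficients 1..n_coeffs divided by
    the magnitude of coefficient 1. Dropping coefficient 0 removes the
    dependence on position, taking magnitudes removes the dependence on
    rotation and starting point, and the division removes the scale.
    """
    n = len(boundary)
    z = [complex(x, y) for x, y in boundary]

    def coefficient(k):
        return sum(z[m] * cmath.exp(-2j * math.pi * k * m / n)
                   for m in range(n)) / n

    scale = abs(coefficient(1))  # assumed nonzero for the shapes considered
    return [abs(coefficient(k)) / scale for k in range(1, n_coeffs + 1)]

# A circle of radius 2 centered at (3, 4), sampled at 16 points.
circle = [(3 + 2 * math.cos(2 * math.pi * m / 16),
           4 + 2 * math.sin(2 * math.pi * m / 16)) for m in range(16)]
descriptors = fourier_descriptors(circle, 3)
```

For the circle only the first coefficient is nonzero, so the descriptor vector is (1, 0, 0) up to rounding error.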

The features used to represent the objects in the data 
are very dependent on the problem and the type of data. 
It is often challenging to identify and extract features 
for some types of data, such as images or mesh data. 
Not only must the features characterize the objects 
in these data, they must also be invariant to rotation, 
translation, and scaling, as the patterns in the objects 
are often invariant to such transformations of the data. 
Further, if we are analyzing many images of varying 
quality, each with a variety of objects, it can be difficult 
to find a single set of algorithms, with a single set of 
parameters, that can be used to extract the objects and 
their features. 

In contrast, the task of feature extraction is relatively 
easy for other types of data. For example, in docu- 
ments it is clear that the features are related to the 
words or terms in the collection of documents being 
analyzed. As discussed in text mining [VII.24], a doc- 
ument can be represented using the term frequency of 
each term in the document, suitably weighted to reflect 
its importance in the document or collection of docu- 
ments. However, representing documents is not with- 
out its own challenges, such as the need for word-sense 
disambiguation for words with multiple meanings, the 
representation of tables or diagrams in documents, and 
analysis across documents in different languages. 

4 Dimension Reduction 

The number of features used to represent the data 
items in a data set can vary widely, ranging from a 
handful to thousands. These features can be numeric 
or categorical. Often, not all the features are relevant to 
the task at hand, or some features may be correlated, 
resulting in a duplication of information. Since extract- 
ing and storing features can be expensive, and irrele- 
vant features can adversely affect the accuracy of any 
models built from the data, it is often useful to identify 
the important features in a data set. This task of dimension
reduction maps a data item described in the high-dimensional
feature space into a lower-dimensional
feature space.

There are two broad categories of dimension reduction
methods. In the first category, the features are
transformed, linearly or nonlinearly, into a lower-dimensional
space. A popular method for linear transformation
is principal component analysis (PCA), which
has been rediscovered in many domains, ranging from 
fluid dynamics, climate, signal processing, and linear 
algebra to text mining, with each contributing differ- 
ent insights into what such techniques reveal about 
the data. PCA is an orthogonal transform that converts 
the data into a set of uncorrelated variables called the 
principal components. The first principal component 
captures the largest variability in the data; each of the 
remaining principal components has the highest vari- 
ance subject to the constraint that it is orthogonal to all 
the previous principal components. PCA can be calcu- 
lated by an eigendecomposition of the data covariance 
or correlation matrix or by the application of the singular
value decomposition [II.32] to the data matrix.
Usually, the data are first centered by subtracting the 
mean. 

A challenging issue in the practical application of 
transform-based methods is the selection of the dimension
of the lower-dimensional space. Often, a decision
on the number of principal components to keep is made 
using a threshold on the percentage variance explained 
by the principal components that are retained. This 
essentially implies that the components that are dis- 
carded are considered as noise in the data. Thus, PCA 
can also be used to reduce the noise in data. 
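
The following sketch shows one way to carry out PCA with a variance threshold, using the singular value decomposition of the centered data matrix. It assumes NumPy is available. The function name, the default threshold of 95%, and the small synthetic data set are illustrative choices made here, not recommendations.

```python
import numpy as np

def pca(data, variance_threshold=0.95):
    """Project data (n_items x n_features) onto its leading principal components.

    Keeps the smallest number of components whose cumulative share of the
    variance reaches variance_threshold.
    """
    centered = data - data.mean(axis=0)          # subtract the mean first
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2 / (data.shape[0] - 1)
    explained = variances / variances.sum()      # fraction of variance each
    k = int(np.searchsorted(np.cumsum(explained), variance_threshold)) + 1
    k = min(k, explained.size)
    components = vt[:k]                          # rows are principal directions
    return components, explained[:k], centered @ components.T

# Two correlated features: points near the line through the direction (2, 1),
# with a small perturbation in the perpendicular direction (-1, 2).
t = np.linspace(-1.0, 1.0, 21)
wiggle = 0.01 * np.cos(5.0 * t)
data = np.outer(t, [2.0, 1.0]) + np.outer(wiggle, [-1.0, 2.0])
components, explained, projected = pca(data)
```

Here a single component captures almost all of the variance, so the two original features are replaced by one coordinate along the direction (2, 1), and the small perpendicular perturbation is discarded as noise.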

The second category of dimension reduction tech- 
niques selects a subset of the original features. These 
feature selection techniques are applicable to prediction 
tasks, such as classification and regression, where each 
data item is also associated with an output variable. By 
identifying important features, such techniques enable 
us to make judicious use of limited resources available 
for measuring the features. They can also be invalu- 
able in problems where focusing on just the important 
variables makes it easier to gain insight into a complex 
phenomenon. 

There are two types of feature selection algorithms: 
filters, which are not coupled to the prediction task, and 
wrappers, which are coupled to the prediction task. 

As an example of a filter method, consider a simple 
problem where we have data items of two categories, 
or classes, and we need to find the features that are 
most useful in discriminating between these classes. 
For each feature we can build a histogram of the val- 
ues that the feature takes for each of the two classes. 
A large distance between the histograms for the two 
classes indicates that the feature is likely to be important,
but if the histograms overlap, then the feature
is unlikely to be helpful in differentiating between the
classes (figure 2). The features can be ordered based on
the distances between the histograms and the reduced
dimension can be determined by placing an appropriate
threshold on the distance. Any suitable measure for
distances between histograms can be used. For example,
we can first convert the histogram for each class
into a probability distribution by normalizing it so that
the area under the curve is 1. Then, we can either use
the Kullback-Leibler divergence

$$d_{\mathrm{KL}}(P, Q) = \sum_{i=1}^{b} P_i \log \frac{P_i}{Q_i}$$

as a measure of the difference between two distributions
$P$ and $Q$ defined over $b$ bins, or we can create a
symmetric version

$$d(P, Q) = d_{\mathrm{KL}}(P, Q) + d_{\mathrm{KL}}(Q, P),$$

which, unlike the Kullback-Leibler divergence, is symmetric,
as $d(P, Q) = d(Q, P)$. Other distances can
also be used, such as the Wasserstein metric, or, as it is
referred to in computer science, the earth mover's distance.
This intuitively named metric considers the two
distributions as two ways of piling dirt on a region, with
the distance being defined as the minimum cost of turning
one pile into the other. The cost is the amount of
dirt moved times the distance over which it is moved.

Figure 2 The feature in (a) (with a larger distance between
the histograms of feature values for two different classes
(solid and dotted lines)) is considered more important than
the feature in (b). The large overlap in (b) indicates that there
is a larger range of values of the feature for which an object
could belong to either of the two classes, making the feature
less discriminating.

The wrapper approach to feature selection evaluates
each subset of features based on the accuracy of prediction
using that subset. Suppose we use a decision
tree classifier (section 5.2) to build a predictive model
for the two-class problem. The forward selection wrapper
method starts by selecting the single feature that
gives the highest accuracy with the decision tree. It then
chooses the next feature to add such that the subset
of two features has the highest accuracy. Additional
features are included to create larger subsets until
the addition of any new feature does not result in an
improvement in prediction accuracy. A backward selection
approach starts with all the features and progressively
removes features. The wrapper approach is more
computationally expensive than the filter approach as
it involves evaluating the accuracy of a classifier. However,
it may identify subsets that are more appropriate
when used in conjunction with a specific classifier.

5 Pattern Recognition


In pattern recognition tasks, we use the features that 
describe each data item to identify patterns among the 
data items. For example, if we process astronomical 
images to identify the galaxies in them and extract suit- 
able features for each galaxy, we are then able to use the 
features in several different ways, as discussed next. 

5.1 Information Retrieval 

In information retrieval, given a query data item, 
described by its set of features, the task is to retrieve 
other items that are similar in some sense to the query. 
Usually, the retrieved items are returned in an order 
based on the similarity to the query. The similarity met- 
ric chosen depends on the problem and the representa- 
tion of the data item. While Euclidean distance between 
feature vectors is a common metric, the cosine of the 
angle between the two vectors is used for document 
similarity, and more complex distances could be used
if the data items are represented as graphs. 

Data sets used in information retrieval are often mas- 
sive, and a brute-force search by comparing each item 
with the query item is expensive and will not result in 
the real-time turnaround often required in such analy- 
sis. Ideas from computer science, especially the use 
of sophisticated data structures that allow for fast 
nearest-neighbor searches, provide solutions to this 
problem. As these data structures perform better than 
a brute-force search only when the data items have a 
limited number of features, dimension reduction tech- 
niques are often applied to the data before storing them 
in these efficient data structures. 
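
To make the retrieval task concrete, here is a minimal brute-force sketch for documents, using raw term frequencies as features and the cosine of the angle between feature vectors as the similarity. The function names and the three toy documents are invented for illustration, and splitting on whitespace is a deliberately crude assumption about how terms are extracted. A practical system would weight the terms and replace the linear scan with the kind of data structure mentioned above.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Represent a document by the count of each lowercase term."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, documents):
    """Return (similarity, document) pairs, most similar first (brute force)."""
    q = term_frequencies(query)
    scored = [(cosine_similarity(q, term_frequencies(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

documents = [
    "sparse matrix factorization",
    "galaxy image classification",
    "matrix factorization for recommendation",
]
ranked = retrieve("matrix factorization", documents)
```

The shorter document ranks first because the query terms make up a larger share of it, which is the length normalization that the cosine measure provides.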

5.2 Classification and Regression 

In some pattern recognition tasks we are given exam- 
ples of patterns, referred to as a training set, and the 
goal is to identify the pattern associated with a new 
data item. This is done by using the training set to cre- 
ate a predictive model that is then used to assign the 
pattern to the new data item. The items in the training 
set all have an output variable associated with them. If 
this variable is discrete, the problem is one of classification;
if it is continuous, the problem is one of regression,
where, instead of predicting a pattern, we need to
predict the value of the continuous variable for the new 
data item. 

Techniques for regression and classification tend to 
be quite similar. The simplest such technique is the 
nearest-neighbor method, where, given the data item 
to which we need to assign an output, we identify the 
nearest neighbors to this item in the feature space and 
use the outputs of these neighbors to calculate the out- 
put for the query. In classification problems we can use 
the majority class, while in regression problems we can 
use a weighted average of the outputs of the neighbors. 
The neighbors are usually identified either by specifying
a fixed number of neighbors k, which leads to the
k-nearest-neighbor method, or by specifying the neighbors
within a radius ε, which leads to the ε-nearest-neighbor
method. While nearest-neighbor techniques
are intuitively appealing, it is a challenge to set a value
for k or ε, and the techniques may not work well if
the query data item does not have neighbors in close 
proximity. 
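
A minimal sketch of the k-nearest-neighbor method for both tasks follows. It assumes Euclidean distance in feature space and, for regression, weights each neighbor by the inverse of its distance to the query. That weighting, the function names, and the toy training sets are assumptions of this sketch, since many other choices are possible.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k (features, output) pairs closest to query."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k=3):
    """Assign the majority class among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Inverse-distance-weighted average of the neighbors' outputs."""
    numerator = denominator = 0.0
    for features, value in k_nearest(train, query, k):
        distance = math.dist(features, query)
        if distance == 0.0:
            return value  # the query coincides with a training item
        numerator += value / distance
        denominator += 1.0 / distance
    return numerator / denominator

labeled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
           ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
measured = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0)]

predicted_class = knn_classify(labeled, (0.5, 0.5), k=3)
predicted_value = knn_regress(measured, (0.5,), k=2)
```

Sorting the whole training set for every query is the brute-force approach; it is adequate for small data sets such as this one.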

More complex predictive models can be built from a 
training set using other classification algorithms. Decision
trees divide the data set into regions using hyperplanes
parallel to the axes, such that each region has
data items with the same output value (or similar ones).
The tree is a data structure that is either a leaf, which
indicates the output, or a decision node that specifies
some test to be carried out on a feature, with a branch
and subtree for each possible outcome of the test. For
example, in figure 3 the open and closed circles are separated
by first making the decision F1 < A1; data items
that satisfy this constraint are next split using F2 < B2,
while data items that do not satisfy the constraint are
split using F2 < B1. Then, given a data item with certain
values of features (F1, F2), we can follow its path down
the decision tree and assign it the majority class of the
leaf node at which the path ends.

Figure 3 A schematic of a decision tree for a data set with
two features: (a) the decision boundaries between open and
closed circles; (b) the corresponding decision tree.

A key task in building a decision tree is the choice 
of decision at each node of the tree. This is obtained 
by considering each feature in turn, sorting the values 
of the feature, and evaluating a quality metric at the 
midpoints between consecutive values of the feature. 
This metric is an indication of the suitability of a split 
using that feature and midpoint value. The feature-midpoint
value combination that optimizes the quality
metric across all features is chosen as the decision at
the node. For example, in a two-class problem, a split 
that divides the data items at the node into two groups, 
each of which has a majority of items of one class, is 
preferable to a split where each of the two groups has 
an equal mix of the two classes. A commonly used metric
is the Gini index, which finds the split that most
reduces the node impurity, where the impurity for a c-class
problem with n data items is defined as follows:

$$L_{\mathrm{Gini}} = 1 - \sum_{i=1}^{c} \left(\frac{|L_i|}{|T_L|}\right)^2, \qquad R_{\mathrm{Gini}} = 1 - \sum_{i=1}^{c} \left(\frac{|R_i|}{|T_R|}\right)^2,$$

$$\mathrm{Impurity} = (|T_L| L_{\mathrm{Gini}} + |T_R| R_{\mathrm{Gini}})/n,$$

where $|T_L|$ and $|T_R|$ are the number of data items, $|L_i|$
and $|R_i|$ are the number of data items of class $i$, and
$L_{\mathrm{Gini}}$ and $R_{\mathrm{Gini}}$ are the Gini indices on the left- and right-hand
side of the split, respectively.

This simple idea of dividing the data items in feature 
space using decisions parallel to the axes appeared at 
nearly the same time in the literature relating to both 
machine learning and statistics, where it was referred 
to as the decision tree method and the classification 
tree method, respectively. Both communities have con- 
tributed to the advancement of the method. Ideas from 
machine learning have been used to create more com- 
plex decision boundaries by considering a linear combi- 
nation of the features in the decision at each node. The 
process of making the decisions at each node of the tree 
has been related to probabilistic techniques, providing 
insight into the behavior of decision trees and indicat- 
ing ways in which they can be improved. In particular, 
a statistical approach to decision tree modeling takes 
advantage of the trade-off between bias, which arises 
when the classifier underfits the data and cannot rep- 
resent the true function being predicted, and variance, 
which arises when the classifier overfits the data. 

Decision trees are not only simple to create and 
apply, they are also easy to interpret, providing use- 
ful insight into how a pattern is assigned to a data 
item. Further, the process used in making the decision 
at each node of the tree indicates which features are 
important in the data set. This makes the decision tree 
method a popular first choice for use in classification 
and regression problems. 

Other, more complex, algorithms in common use 
are neural networks and support vector machines. The 
simplest neural network, called a perceptron, takes a
weighted combination of the input features and outputs
a 1 if the combination is greater than a threshold
and 0 otherwise, thus acting as a two-class classifier.
The task is to use the training set to learn the weights, 
an adaptive process that starts with random weights, 
applies the classifier, and modifies the weights whenever
the classification is in error. When the two classes
are not linearly separable, that is, they cannot be
separated using a single line (or, in more than two
dimensions, a single hyperplane), we need to find the
weights using optimization techniques [IV.11] such as the
gradient descent, conjugate gradient, quasi-Newton, or
Levenberg-Marquardt methods.
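
A minimal sketch of the adaptive weight update for a perceptron is given below. It makes two simplifying assumptions that are not in the text: the threshold is folded into a bias term b, and the weights start at zero rather than at random values so that the example is deterministic. The names `train_perceptron` and `predict` are illustrative.

```python
def predict(w, b, x):
    """Output 1 if the weighted combination exceeds the threshold (here -b), else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, labels, epochs=100, rate=1.0):
    """Adjust the weights whenever a training item is misclassified."""
    w = [0.0] * len(samples[0])  # assumption: zero start instead of random
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(samples, labels):
            out = predict(w, b, x)
            if out != target:
                # Move the decision boundary toward the misclassified item.
                delta = rate * (target - out)
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                b += delta
                errors += 1
        if errors == 0:
            break  # every training item is now classified correctly
    return w, b
```

For linearly separable classes, such as the logical AND of two binary inputs, the loop stops after a finite number of passes; for classes that are not linearly separable it simply runs out of epochs, which is where the optimization methods mentioned above come in.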

More complex neural networks include multiple lay- 
ers (in addition to the first layer, which represents the 
features) and, instead of a simple thresholding, use 
a continuously differentiable sigmoid function. This 
allows neural networks to model complex functions 
to differentiate among classes. However, they can be 
difficult to interpret, and designing the architecture of
the network (which includes the number of layers, the
number of nodes in a layer, and the connectivity among
the nodes) can be a challenge.

Another classification algorithm that uses optimiza- 
tion techniques is the support vector machine. This 
uses a nonlinear function to map the input into a
higher-dimensional space in which the data are linearly
separable. There are many hyperplanes that linearly
separate the data. The optimal one, as indicated
by statistical learning theory, is the maximum mar- 
gin solution, where the margin is the distance between 
the hyperplane and the closest example of each class. 
These examples are referred to as the support vec- 
tors. This solution is obtained by solving a quadratic 
programming problem [IV.11] subject to a set of linear
inequality constraints. Support vector machines
provide some insight into the process of classifica- 
tion through the identification of the support vec- 
tors, though it is a challenge to find the nonlinear 
transformation that linearly separates the data. 
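
For reference, the maximum margin problem is commonly written in the following hard-margin form. This is the standard textbook formulation rather than one given in this article; the symbols are introduced here for illustration: φ is the nonlinear map, y_i ∈ {-1, +1} is the class of training example x_i, and w and b define the hyperplane.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\|w\|^{2}
\quad\text{subject to}\quad
y_i\,\bigl(w \cdot \phi(x_i) + b\bigr) \ge 1, \qquad i = 1,\dots,n.
```

With this scaling the margin equals 1/||w||, so minimizing ||w||^2 maximizes the margin, and the support vectors are the examples for which the constraint holds with equality.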

A probabilistic approach to inference is provided by 
Bayesian reasoning, which assumes that quantities of 
interest are governed by probability distributions. This 
enables us to make decisions by reasoning about these 
probabilities in the presence of observed data. Bayes’s 
theorem, 

P(h | D) = P(D | h)P(h)/P(D),

is a way of calculating the posterior probability P(h | D),
or the probability of hypothesis h given data D,
from the prior probability of the hypothesis before we
have seen the data, P(h); the probability P(D | h) of
seeing the data D in a world where the hypothesis h
holds; and the prior probability P(D) of seeing the data
D without any knowledge of which hypothesis holds.
Bayes’s theorem is essentially derived from the two dif- 
ferent ways of representing the probability of observ- 
ing both h and D together by writing it conditional 
either on D or h. 
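
Written out, the two ways of expressing the joint probability and the resulting theorem are as follows (assuming P(D) > 0):

```latex
P(h, D) = P(h \mid D)\,P(D) = P(D \mid h)\,P(h)
\quad\Longrightarrow\quad
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}.
```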

Given the probabilistic insight provided by Bayes’s 
theorem, it is invaluable in tasks requiring reason- 
ing under uncertainty. The theorem also plays a role 
in the naive Bayes classifier. The new data item is 
assigned the most probable output value given the val- 
ues of its features. This most probable output value is 
obtained by applying Bayes’s theorem using probabili- 
ties derived from the training data, combined with the 
added simplifying assumption that the feature values 
are conditionally independent, given the output. 
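
The following Python sketch shows one way the naive Bayes rule can be applied to categorical features. It estimates the probabilities by simple counting over the training data and, to stay short, uses no smoothing, so a feature value never seen with a given output yields a zero probability. The names `train_naive_bayes` and `classify` are illustrative.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, outputs):
    """Estimate P(output) and P(feature_j = value | output) by counting."""
    class_counts = Counter(outputs)
    # feature_counts[(j, value, output)] = number of matching training items
    feature_counts = defaultdict(int)
    for row, y in zip(rows, outputs):
        for j, value in enumerate(row):
            feature_counts[(j, value, y)] += 1
    return class_counts, feature_counts

def classify(model, row):
    """Return the output y that maximizes P(y) * prod_j P(row[j] | y)."""
    class_counts, feature_counts = model
    n = sum(class_counts.values())
    best_y, best_score = None, -1.0
    for y, count in class_counts.items():
        score = count / n  # prior P(y)
        for j, value in enumerate(row):
            # Conditional independence: multiply the per-feature likelihoods.
            score *= feature_counts.get((j, value, y), 0) / count
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

The denominator P(D) of Bayes's theorem is the same for every candidate output, so it can be dropped when only the most probable output is needed.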

A recent development in classification techniques is 
that of ensemble learning, where more than one clas- 
sifier is created from the training data. This is done 
by introducing randomization into the process, by cre- 
ating new training sets using sampling with replace- 
ment from the original training set, for example, or 
by selecting a random subset of features or samples 
to use in making the decision at each node. The class 
label assigned to a new data item is obtained by using 
a voting scheme on the labels assigned by each clas- 
sifier in the ensemble. While ensembles are compu- 
tationally more expensive as multiple classifiers have 
to be created, the ensemble prediction is often more 
accurate. 
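
As an illustration of the first form of randomization, the sketch below builds an ensemble from bootstrap samples (sampling with replacement) and combines the members by majority vote. The base learner `fit_nearest_neighbor` is a hypothetical stand-in chosen for brevity; any function that maps a training sample to a classifier could be used in its place.

```python
import random
from collections import Counter

def bagging_fit(train, fit, n_models=11, seed=0):
    """Fit n_models classifiers, each on a bootstrap sample of `train`
    (sampling with replacement, same size as the original set)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        models.append(fit(sample))
    return models

def bagging_predict(models, x):
    """Majority vote over the labels assigned by the ensemble members."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def fit_nearest_neighbor(sample):
    """Hypothetical base learner: 1-nearest-neighbor on (value, label) pairs."""
    def model(x):
        return min(sample, key=lambda item: abs(item[0] - x))[1]
    return model
```

An odd number of ensemble members avoids ties in a two-class vote.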

5.3 Clustering 

Clustering is a descriptive technique used when the 
data items are not associated with an output value. 
It can be seen as complementary to classification. 
Instead of using the output value to identify bound- 
aries between items of different classes near each other 
in feature space, we use the features to identify groups 
of items in feature space that are similar to, or close to, 
each other (figure 4). Intuitively, if the features are cho- 
sen carefully, two data items with similar features are 
likely to be similar. Thus, if we know that some chem- 
ical compounds, with desirable properties, occur in a 
certain part of an appropriately defined feature space, 
other compounds near them might also have the same 
property. Or, if a customer is interested in a specific 
movie or book, others that are nearby in feature space 
could be recommended to them.

Figure 4 A schematic showing three clusters in a
two-dimensional data set.


Clustering techniques broadly fall into one of two cat- 
egories. In hierarchical methods we have two options: 
start from the bottom, with each item belonging to a 
singleton cluster, and merge the clusters, two at a time, 
based on a similarity metric; or start at the top, with all 
items forming a single cluster, and split that into two, 
followed by a split of each of the two clusters, and so 
on. In both methods, we need to determine how many 
clusters to select in the hierarchy. 
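
The bottom-up option can be sketched as follows for one-dimensional data. The similarity metric used here (the smallest distance between members of two clusters, often called single linkage) is one common choice made for this example, and the function name `agglomerate` is illustrative.

```python
def agglomerate(points, n_clusters):
    """Start with singleton clusters and repeatedly merge the two closest
    clusters (single linkage) until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into a
        del clusters[b]
    return clusters
```

Recording the sequence of merges, rather than stopping at a fixed count, gives the full hierarchy from which a number of clusters can be selected afterward.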

In contrast, partitional methods start with a prede- 
termined number of clusters and divide the data items 
into these clusters such that the items within a cluster 
are more similar to each other than they are to the items 
in other clusters. The best-known partitional algorithm 
is the k-means algorithm, which divides the data items 
into k clusters by first randomly selecting k cluster cen- 
ters. It then assigns each data item to the nearest clus- 
ter center. The items in each cluster are then used to 
update the cluster centers and the process continues 
until convergence. The k-means algorithm is very pop- 
ular and, despite its simplicity, works well in practice. 
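The main challenge is to determine the number of clusters;
this can be done by evaluating a clustering criterion
such as the sum of squared errors of each data
item from its cluster center for a range of values of k
and then selecting the k beyond which this error stops
decreasing appreciably (the smallest achievable error
never increases as k grows, so its minimum alone is
not informative).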
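
The k-means iteration and the sum-of-squared-errors criterion can be sketched as follows. This is a plain illustration on tuples of numbers with a fixed random seed; the function names and the handling of empty clusters (the old center is kept) are choices made for this example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial cluster centers
    assignments = None
    for _ in range(iterations):
        # Assignment step: each item goes to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no item changed cluster
        assignments = new_assignments
        # Update step: each center becomes the mean of its items.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignments

def sse(points, centers, assignments):
    """Sum of squared errors of each item from its cluster center."""
    return sum(sq_dist(p, centers[a]) for p, a in zip(points, assignments))
```

Calling `kmeans` for several values of k and comparing the resulting `sse` values is one way to examine how the criterion changes with the number of clusters.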

The k-means algorithm is related to algorithms in 
other domains; examples include Voronoi diagrams 
from computational geometry, the expectation-maxi- 
mization algorithm from statistics, and vector quanti- 
zation in signal processing. 

There is a third category of clustering algorithms that 
shares properties of both hierarchical and partitional
methods. These graph-based algorithms are related to
the reordering methods used in direct solvers and the 
domain decomposition techniques used in PDEs. Essen- 
tially, they represent the data items as nodes in a graph, 
with the edges weighted according to the similarity 
between the data items. Clustering is then done either 
by using graph partitioning algorithms, where clusters 
are identified by minimizing the weights of the edges 
that are cut by the partitioning, or by using spectral 
graph theory and calculating the second eigenvector of 
the graph Laplacian [V.12 §5]. The nodes of the graph
are split into two using the eigenvector components, 
and the process is then repeated on the two parts. 
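
A rough sketch of the spectral approach is given below. It assumes the graph Laplacian L = D - W, where W is the symmetric matrix of edge weights and D is the diagonal matrix of weighted degrees, and it takes "second eigenvector" to mean the eigenvector belonging to the second-smallest eigenvalue of L. To avoid external libraries the eigenvector is approximated by power iteration; a production code would use a sparse eigensolver instead.

```python
def fiedler_vector(weights, iterations=2000):
    """Approximate the eigenvector of L = D - W for the second-smallest
    eigenvalue, by power iteration on (c*I - L) with the constant vector
    projected out at every step."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    c = 2.0 * max(degree)  # upper bound on the eigenvalues of L
    x = [float(i) - (n - 1) / 2.0 for i in range(n)]  # orthogonal to constants
    for _ in range(iterations):
        # y = (c*I - L) x, where (L x)_i = degree_i * x_i - sum_j w_ij * x_j
        y = [
            c * x[i] - degree[i] * x[i] + sum(w * xj for w, xj in zip(weights[i], x))
            for i in range(n)
        ]
        mean = sum(y) / n
        y = [v - mean for v in y]  # remove the constant component
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x

def spectral_bisection(weights):
    """Split the nodes in two by the sign of the eigenvector components."""
    x = fiedler_vector(weights)
    left = [i for i, v in enumerate(x) if v < 0]
    right = [i for i, v in enumerate(x) if v >= 0]
    return left, right
```

Applying `spectral_bisection` again to the weight matrices of the two parts gives the recursive procedure described above.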

6 The Future 

As data-mining algorithms have gained a foothold in 
application domains such as astronomy and remote 
sensing, other domains such as medical imaging, with 
similar types of data and problems, have started con- 
sidering data mining as a potential solution to their 
analysis problems. These new domains have in turn 
introduced new challenges to the analysis process and 
posed new questions, prompting the development of 
new algorithms. And then, when these algorithms are 
adopted by other communities, the cycle is continued. 

There are several key developments from the last 
decade that will drive the future of data mining and 
analysis. These developments have been motivated 
both by the types of data being analyzed and by the 
requirements of the analysis. 

There has been an explosion in the types of data 
being analyzed. While raw data in a tabular form is 
still the norm in some problem domains, other types 
of data (ranging from images and sequences of images
to text documents, web pages, links between web
pages, chemical compounds, deoxyribonucleic acid
(DNA) sequences, mesh data from simulations, and
graphs representing social networks) have prompted
a new look at algorithms for finding accurate and 
appropriate representations of such data. This has also 
prompted greater interest in data fusion, where we ana- 
lyze multiple modalities of data. For example, we may 
consider PET scans and MRI scans, as well as clinical 
data, to treat a patient. Or we may consider not just 
the links between web pages but also the text, figures, 
and images on those pages. 

Adding to this complexity is the distributed nature 
of some of the data sets. For example, the use of sensor
networks is becoming common in the surveillance
and monitoring of experiments and complex systems.
The networks may be autonomous and may reorganize, 
changing their positions in response to changes in their 
environment. In some problems, such as when moni- 
toring climate, the sensors may be stationary but geo- 
graphically scattered over the Earth. In all these cases, 
the data set to be analyzed is in several pieces that may 
not be collocated and the size of the data may make it 
difficult to collocate all the pieces in one place. There 
is a need for algorithms that analyze distributed data 
and build a model, or models, to represent the whole 
data set. In autonomous sensor networks there is an 
additional requirement that the amount of information 
exchanged between sensors be small, implying that it 
is the models and not the data that are exchanged. 
Such ideas are also relevant in telemedicine, as well as 
space exploration, where the distances and limited con- 
nectivity imply caution in determining what data and 
information are transferred. 

This brings up the need for effective compression 
techniques: both lossy, for problems which can toler- 
ate the loss of some information, and lossless, where it 
is important that the data be preserved in their orig- 
inal state. Compression will also play a role as we 
move to exascale computation, where it is becoming 
clear that the amount of data generated by simulations 
run on massively parallel machines will outpace the 
technology of the input-output system. 

The ever-increasing volume of data will increase the 
need for algorithms that exploit parallel computers; 
otherwise we run the risk of the data not being analyzed 
at all. In addition, such algorithms will be invaluable in 
problems where a real-time or near-real-time response
is required. 

This short response time is required in the analy- 
sis of streaming data, where we analyze data as they 
are collected to identify untoward incidents, interesting 
events, or concept drift, where the statistical properties 
of the data change from one normal state to another. 
This is prompting the development of new algorithms 
that yield approximate results, as well as the creation 
of incremental versions of existing algorithms, where 
the models being built are constantly being updated to 
incorporate new data and discard old data. A particu- 
lar challenge in such analysis is the need to minimize 
false positives while ensuring that we do not miss any 
positives, especially in problems where the analysis is 
used in decision support. 

As data analysis comes to be viewed as one step in 
a closed system in which the data are analyzed and
decisions are made based on the analysis, without a
human in the loop, there is an increasing need to under- 
stand the uncertainty in the analysis results. Uncer- 
tainty can arise as a result of several factors, includ- 
ing the quality of the raw data, the suitability of the 
chosen analysis algorithms for the data being studied, 
the sensitivity of the results to the parameters used 
in the algorithms, a lack of complete understanding 
of the process or system being analyzed, and so on. 
Thus, ideas from uncertainty quantification and rea- 
soning under uncertainty become important, especially 
when the risks associated with the decisions are high. 

This brings up the interesting issue of using data 
analysis techniques to influence the data we collect so 
that they better meet our needs. So far we have viewed 
data analysis as a process that starts once the data have 
been collected, but sometimes we have control over 
what data to generate, such as which data items to label 
to create a training set or which input parameters to use 
for an experiment or a simulation. We can borrow ideas 
from the field of design of experiments, both physical 
and computational, to closely couple the generation of 
the data with their analysis, hopefully improving the 
quality of that analysis. 

This borrowing of ideas from other domains will 
continue to increase as data analysis evolves to meet 
new demands. We have already seen how ideas from 
the traditional data analysis disciplines such as statis- 
tics, pattern recognition, and machine learning are 
being combined with ideas from fields such as image 
and video processing, mathematical optimization, nat- 
ural language processing, linear algebra, and PDEs to 
address challenging problems in data analysis. This 
cycle involving the development of novel algorithms 
followed by their application to different problem 
domains, which in turn generates new analysis require- 
ments, will continue, ensuring that data mining and 
analysis remain very exciting areas for the foreseeable 
future. Applied mathematics will remain a cornerstone 
of the field, not just by contributing to the extraction of 
insight into the data but also by playing a critical role 
in convincing the application experts that the insight 
obtained is based on sound mathematical principles. 

Further Reading 

Duda, R. O., P. E. Hart, and D. G. Stork. 2000. Pattern
Classification, 2nd edn. New York: Wiley-Interscience.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements
of Statistical Learning: Data Mining, Inference, and
Prediction, 2nd edn. New York: Springer.

Kamath, C. 2009. Scientific Data Mining: A Practical Perspective.
Philadelphia, PA: SIAM.

Mitchell, T. 1997. Machine Learning. Columbus, OH:
McGraw-Hill.

Sonka, M., V. Hlavac, and R. Boyle. 2007. Image Processing,
Analysis, and Machine Vision, 3rd edn. Toronto: Thomson
Engineering.

Umbaugh, S. E. 2011. Digital Image Processing and Analysis,
2nd edn. Boca Raton, FL: CRC Press.

Witten, I., E. Frank, and M. Hall. 2011. Data Mining: Practical
Machine Learning Tools and Techniques. Burlington, MA:
Morgan Kaufmann.


IV.18 Network Analysis

Esteban Moro 


1 Introduction 

Almost 300 years ago Euler posed the problem of finding
a walk through the seven bridges of Königsberg and
laid the foundations of graph theory. Euler’s approach 
was probably one of the first examples of how to use 
network analysis to solve a real-world problem. Since 
then, network analysis has been used in many con- 
texts, from biology to economics and the social sci- 
ences. The general approach is to map the constituent 
units of the system and their interdependencies onto a
network and analyze that network in order to under- 
stand and predict a given process. For example, buyers 
and items form a network of purchases that is used 
in recommendation engines; protein-protein interac- 
tion networks are used to unveil functionally coherent 
families of proteins; social relationships might reveal 
potential adoption of products and services by social 
contagion. 

The analysis of networks is an old subject in math- 
ematics, and it has its roots in many other disciplines, 
such as engineering, the social sciences, and computer 
science. However, in recent times the digital revolu- 
tion has brought with it easier access to detailed infor- 
mation about phenomena such as biological reactions, 
economic transactions, social interactions, and human 
movements. This has allowed us to study networks with 
an unprecedented level of detail. While this data revo- 
lution has produced an enormous boost in the models 
and applications of network theory, reaching unusual 
areas such as politics, crime, cooking, and so on, it 
has also challenged the available analytical methods 
because of the large size of real networks, which are 
typically made up of millions or even billions of nodes.
As a consequence, network analysis is a rapidly changing
field that attracts many researchers from diverse
disciplines. This article discusses the mathematical 
concepts behind network analysis and also some of the 
main applications to real-world problems. 

2 Definitions

The main mathematical object in network analysis is 
the network itself. A network (or a graph) is a pair G =
(V, E) of a set V of vertices (nodes) together with a set
E ⊆ V × V of edges (links). The numbers of nodes and
edges are denoted by n = |V| and m = |E|. Any edge
is a relation between two vertices of V. Mathematically,
a network can be represented by the adjacency matrix
A, where a_ij has the value 1 if there is an edge between
vertices i and j and 0 otherwise (see plate 5). If the edge
is undirected then a_ij = a_ji, while if the direction of the
edge matters then the network is directed and A is in
general nonsymmetric. Another variant is a weighted
network, in which the nonzero elements of A represent
the strength of the relationship: a_ij = w_ij ∈ R.

Although a complete description of the network is 
given by the adjacency matrix, we can obtain valu- 
able insights by measuring local (node-centric) and/or 
global properties of the graph. For example, the degree 
or connectivity, k_i, of node i in an undirected network is
the number of connections or edges that the node has. 
The graph density, D = 2m/(n(n - 1)), measures how 
sparse or dense a graph is by the ratio of the number 
of edges to the maximum possible number of edges 
for an n-node graph. In a complete graph every node 
is connected to every other node, so D = 1. A path 
in the network is a sequence of edges that connects a 
distinct sequence of nodes. The graph is connected if 
one can get from any node to any other node by following
a path. The distance δ(i, j) between two nodes
i and j is the length of the shortest path (also known
as a geodesic) between them; if no such path exists,
we set δ(i, j) = ∞. Distance can also be defined taking
into account the weights of the edges. Finally, the
diameter ℓ of a graph G is the maximum value of δ(i, j)
over all pairs i, j ∈ V. Table 1 summarizes some key
definitions and notation for easy reference. 
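
These basic measures are straightforward to compute. The sketch below assumes an undirected, unweighted graph stored as a dictionary that maps each node to the set of its neighbors (a representation chosen for this example), and uses breadth-first search for the distances.

```python
from collections import deque

def degrees(adj):
    """k_i for every node of an undirected graph {node: set(neighbors)}."""
    return {i: len(nbrs) for i, nbrs in adj.items()}

def density(adj):
    """D = 2m / (n(n - 1)); each undirected edge appears twice in adj."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2.0 * m / (n * (n - 1))

def distances_from(adj, source):
    """Breadth-first search: shortest-path length to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist

def diameter(adj):
    """Maximum distance over all pairs; infinite if the graph is disconnected."""
    best = 0
    for i in adj:
        dist = distances_from(adj, i)
        if len(dist) < len(adj):
            return float("inf")  # some node is unreachable from i
        best = max(best, max(dist.values()))
    return best
```

Running a breadth-first search from every node costs on the order of n(n + m) operations, which is one reason exact diameters are hard to obtain for networks with millions of nodes.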

One of the most common interests in network analy- 
sis is in the “substructures” that may be present in 
the network. The neighborhood of a node (its ego network),
which comprises itself and its k_i neighbors, can
be thought of as a substructure. More generally, any
subset of the graph is called a subgraph G' = (V', E'),
where E' ⊆ E and V' ⊆ V. The connected components
of a graph are the subgraphs in which any two vertices 
are connected to each other by paths. We say that there 
is a giant component if there is a component that com- 
prises a large fraction of the nodes. A clique in an undi- 
rected graph G is a subgraph G' that is complete. A 
maximal clique in G is a clique of the largest possible 
size in G. Another measure of the “core” of the net- 
work is the k-core of G, which is defined as the maxi- 
mal connected subgraph of G in which all the vertices 
have degree at least k. Equivalently, it is one of the connected
components of the subgraph of G formed by
repeatedly deleting all vertices of degree less than k. 
Finally, network motifs are recurrent and statistically 
significant subgraphs or patterns within a graph. More 
informal definitions of clusters in the network are used 
in the community-finding problem, in which a graph 
G is divided into a partition of dense subgraphs (see 
section 3.5). 
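
The repeated-deletion characterization of the k-core translates directly into code. The sketch below again assumes a graph stored as a dictionary from nodes to neighbor sets; it returns the whole surviving subgraph, whose connected components are the k-cores in the sense defined above.

```python
def k_core(adj, k):
    """Repeatedly delete all nodes of degree less than k and return the
    surviving subgraph as {node: set(neighbors)}."""
    core = {i: set(nbrs) for i, nbrs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for i in [i for i, nbrs in core.items() if len(nbrs) < k]:
            for j in core[i]:
                core[j].discard(i)  # remove i from its neighbors' lists
            del core[i]
            changed = True
    return core
```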

In some applications the static structure of networks 
or graphs is not enough to incorporate the dynamical 
nature of nodes and the interactions between them. 
For example, if we want to study the phone calls made 
between the customers of a mobile phone company, we 
will need to incorporate the fact that calls take place 
at different times. To this end, we can define temporal
graphs, G_t = (V_t, E_t), in which we have a different
set of nodes and edges for each t ≥ 0. Time-varying
graphs are affected by the temporal aspects of interac- 
tions, like causality, which add a new perspective to net- 
work analysis. Also, when multiple types of edges are 
present (multiplexity), we must consider a description 
in which nodes have a different set of neighbors in each 
layer (type). For example, one can consider each layer 
as the different types of social ties among the same set 
of individuals. Or one could picture each layer as rep- 
resenting a particular time in a temporal graph, or the 
layers could be the mathematical setup for the bipar- 
tite network of users and items in recommendation 
algorithms. 

3 Properties 

The German mathematician Dietrich Braess noted that 
adding extra roads to a traffic network can lead to 
greater congestion. This paradox shows one of the char- 
acteristics of network analysis: namely, that networks 
have emergent properties that cannot be explained simply
by the sum of their components. Here, we review
some of the key properties of networks that are found
in real examples and incorporated in the most representative
models. The properties cover most scales of
network, from local features like connectivity or clustering
to the study of a network’s modular structure
(motifs, communities).

Table 1 The key definitions and notation for a graph G = (V, E) with n nodes and m edges.

Quantity                           Symbol    Definition
Degree or connectivity             k_i       Number of connections or edges that node i has
Diameter                           ℓ         max_{i,j ∈ V} δ(i, j)
Degree distribution                p_k       Fraction of nodes in G that have degree k
Average degree                     k̄         Σ_{i=1}^{n} k_i / n or Σ_k k p_k
Average degree of neighbors        k_nn      Average degree of (next-nearest) neighbors of a node in G
Graph density                      D         2m/(n(n − 1))
Distance                           δ(i, j)   Length of the shortest path between nodes i and j
Clustering coefficient             C         Relative frequency of triangles to triplets in G
Individual clustering coefficient  C_i       Relative frequency of triangles to triplets involving node i

3.1 Heterogeneity 

Many unique properties of networks are due to their 
heterogeneity. In the simplest approximation, networks 
are heterogeneous in the connectivity sense: a homogeneous
network is one in which each node i has the same
degree k_i ≈ k̄, where k̄ is the average degree in the network.
However, in real-world networks the distribution
of connectivities is highly heterogeneous. In fact, one is
very likely to find nodes with k_i ≫ k̄. A simple way to
characterize heterogeneity is by using the degree distribution
of the network, p_k, the fraction of nodes in
the network that have degree k. In a homogeneous network,
p_k peaks around the average value k̄. However,
the degree distributions of real-world networks are usually
heavily skewed, with a long tail of nodes having
k_i ≫ k̄. These high-degree
nodes or hubs have an important role in many proper- 
ties of the network. Conversely, the network contains 
many nodes that are poorly connected. The connec- 
tivity description of the network is thus that of “few 
giants, many dwarves.” 
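
The degree distribution and the average degree can be computed as in the following sketch, which assumes the same dictionary-of-neighbor-sets representation of an undirected graph used for illustration elsewhere; the function names are hypothetical.

```python
from collections import Counter

def degree_distribution(adj):
    """p_k: fraction of nodes with degree k, for adj = {node: set(neighbors)}."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

def average_degree(p):
    """Average degree computed from the distribution as sum_k k * p_k."""
    return sum(k * pk for k, pk in p.items())
```

A star graph, with one hub linked to every other node, is a small example of a "few giants, many dwarves" structure: nearly all of the probability mass sits at degree 1, with a single node far above the average.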

Although measuring the tail of heavily skewed dis- 
tributions is statistically tricky, recent work has found 
that some real-world networks have power-law degree 
distributions p_k ~ k^(−α), where the scaling exponent typically
lies in the range 2 < α < 3. Some instances of
this observation are the network structure of the Inter- 
net, the network of links between Web pages, the net- 
work of citations between papers, phone communica- 
tion networks, metabolic networks, and financial net- 
works. Networks with power-law degree distributions 
are usually referred to as scale-free networks because 
p_k lacks a characteristic degree. They have attracted
wide attention in the literature due to their ubiquity in
many complex systems and also because of their possi- 
ble modeling by simple growth models. In other situa- 
tions, degree distributions seem to be better described 
by exponentials or power-law distributions with expo- 
nential cutoffs. Regardless of which statistical model is 
best for describing p_k, the large heterogeneity found in
networks implies that there is no such thing as typical 
connectivity in the network. Node degree is not the only 
network property that shows heterogeneity. For exam- 
ple, the weights (intensity) of edges, the frequency of 
motifs, and the distribution of community sizes are all 
described by broad distributions, showing that hetero- 
geneity appears at different scales of the network and 
in various network descriptions. 

3.2 Clustering 

As well as being unequally shared among the nodes, 
edges tend to be clustered in a network. For example, 
the neighbors of a given node are very likely to them- 
selves be linked by an edge. In the language of social 
networks, the probability that a friend of your friend is 
also your friend is very high. A way of measuring network
clustering is to calculate the transitivity or clustering
coefficient, 0 ≤ C ≤ 1, which measures the relative
frequency of triangles (cliques of size 3) in the
network with respect to the total number of triplets
(three nodes connected by at least two edges). In a fully
connected graph, C = 1. Clustering can also be measured
locally: C_i measures the fraction of pairs of neighbors of
a node i that are also neighbors of each other. Social
and biological networks display large clustering coef- 
ficients when compared with random network models, 
while technological networks like the World Wide Web, 
the Internet, or power grids have much less cluster- 
ing. The origin of this difference lies in the potential 
mechanisms behind clustering or its absence: in social 
networks people who spend time with a common third
person are likely to encounter each other, and thus triadic
closure is favored by social interactions. In fact,
triadic closure is exploited by many online social networks
for friend-recommendation algorithms. In biological
networks large clustering might be due to concurrent
interaction of proteins in biological processes.
However, efficiency in technological networks discourages
the formation of redundant edges between nodes
that are already close in the network, and C is therefore
very small (see figure 1).

Figure 1 (a) Yeast protein-protein interactions (data from
von Mering, C., et al., 2002, Nature). (b) The power grid of
the western states of the United States (data from Watts,
D. J., and S. H. Strogatz, 1998, Nature). Links that belong
to a triangle are shown in black, with the rest in gray. For
reasons of clarity, nodes are removed in both cases. The
clustering coefficients are (a) C = 0.44 and (b) C = 0.09.
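
Both clustering measures can be computed by counting, as in the sketch below. It assumes the common convention in which a triplet is a path of length two centered on a node, so that each triangle closes three triplets; the graph is again a dictionary from nodes to neighbor sets, and the function names are illustrative.

```python
from itertools import combinations

def local_clustering(adj, i):
    """C_i: fraction of pairs of neighbors of i that are themselves linked."""
    nbrs = adj[i]
    if len(nbrs) < 2:
        return 0.0  # convention used here for nodes with fewer than two neighbors
    linked = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return linked / (len(nbrs) * (len(nbrs) - 1) / 2)

def transitivity(adj):
    """C: closed triplets divided by all triplets (paths of length two)."""
    closed = 0
    triplets = 0
    for i, nbrs in adj.items():
        for u, v in combinations(nbrs, 2):
            triplets += 1  # u - i - v is a triplet centered on i
            if v in adj[u]:
                closed += 1  # the triplet is closed by the edge u - v
    return closed / triplets if triplets else 0.0
```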

In some networks C_i decreases with the degree
k_i. This means that low-degree nodes are densely
clustered while hubs’ neighborhoods are sparsely connected,
a result which, together with the existence of
densely connected groups of nodes (communities; see 
section 3.5), reflects the hierarchical organization of 
some networks: low-degree nodes are situated in dense 
communities, while hubs link different communities. 

Triangles are not the only cliques or motifs that 
are over-represented in networks. For example, some 
three- or four-node motifs occur in large numbers even 
in small networks. However, most represented motifs 
in food webs, for example, are distinct from those 
found in transcriptional regulatory networks and from 
those in the World Wide Web, a finding that some 
authors ascribe to the function of the network. Net- 
works may therefore be classified into distinct func- 
tional families based on their typical motifs. At a higher 
level, the relative size of the maximal clique or the k- 
core of the graph are also measures of the degree of 
clustering in the network. 

3.3 Small World 

Many naturally occurring networks exhibit the small-world
phenomenon; that is, they have a small graph
diameter. This was famously illustrated by the psychologist
Stanley Milgram in 1967, who found
“six degrees of separation” (on average)
between two persons in the worldwide social network.
The experiment was later reproduced using email and 
measuring the distance in massive social graphs. The 
fact that networks are densely connected creates a 
wealth of short paths between nodes, and the typical 
distance between nodes is therefore very small. Illustrations
of the small-world phenomenon are actors’
“Bacon numbers” and mathematicians’ Erdős numbers:
the distance from Paul Erdős in the graph whose edges
represent coauthorship of papers. The average Erdős
number is only around 5.

Mathematically speaking, a small-world network re- 
fers to a network model in which the diameter increases 
sufficiently slowly with the number of nodes n in the 
network, typically as log n or slower. However, this
property is trivially satisfied in a tree-like network if we
assume that the number of nodes at distance d from a
given node scales as k̄^d, where k̄ is the average number
of neighbors. The small-world property is therefore
sometimes accompanied by the condition that the
clustering coefficient is bounded away from zero as n 
increases. In real-world networks, where n is fixed, a 
small-world network is defined as one in which ℓ is
smaller and C is much bigger than the values found
in statistically equivalent random network models. 

3.4 Centrality 

The concept of the centrality of a network tries to 
answer the question of which is its most important 
node. Depending on the nature of the relationships 
and the process under study, centrality means different 
things. For example, degree centrality approximates 
the centrality of a node by its degree k_i in the network.
In most situations, this is a highly effective measure of 
centrality. These highly connected nodes, usually called 
hubs, might have a major impact on the robustness of 
the network if they are removed or damaged. 

However, hubs might be located in the periphery of 
the network. More global and advanced measures of 
centrality are eigenvector, betweenness, and closeness 
centrality, which take into account the relative position 
of one node with respect to the rest. 

Eigenvector centrality relies on the idea of centrality
propagation: the centrality $x_i$ of a node is a linear
function of the centrality of its neighbors,
$x_i = (1/\lambda) \sum_j a_{ij} x_j$. Writing $x = (x_1, \dots, x_n)$, we obtain an
eigenvector equation $Ax = \lambda x$. Among all the possible
solutions of this equation, the one with the greatest
eigenvalue has all entries positive (by the Perron–Frobenius
theorem [IV.10 §11.1]). The components of
the related eigenvector are taken to be the centrality
of each node. Several generalizations of this method
are possible; for example, a variant of eigenvector centrality
is Google’s PageRank algorithm [VI.9], used
to rank the importance of Web pages, which in turn is
also used in network analysis to measure the centrality
of nodes. Some other centrality measures, such as Katz
centrality, the Bonacich power, and the Estrada index,
can be obtained as solutions of eigenvalue problems of
functions of the adjacency matrix [II.14].

Another well-known centrality measure is betweenness,
introduced by Freeman in 1979, which is defined
as the number of times node $i$ appears in the shortest
paths between any pair of nodes in the network.
Specifically, if $g_{jk}$ is the number of geodesic paths from
$j$ to $k$, and if $g_{jik}$ is the number of these geodesics
that pass through node $i$, then the betweenness centrality
of node $i$ is given by $b_i = \sum_{j \neq k \neq i} g_{jik}/g_{jk}$.
The idea behind betweenness is that it measures the
volume of flow through a node of a process that is
happening in the network. Finally, closeness centrality
(also introduced by Freeman) is defined as the average
distance from one node to the rest of the network, i.e.,
$C_i = (1/n) \sum_j \delta(i, j)$.
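
Three of these measures are short enough to sketch directly. The Python below is an illustrative sketch rather than a reference implementation: the function names are assumptions, the graph is assumed to be connected and undirected, and the eigenvector is computed by power iteration on $A + I$ instead of $A$ (a choice made here so that the iteration also settles on bipartite graphs; the shift leaves the eigenvectors unchanged). Closeness follows the definition in the text, so a smaller value means a more central node.

```python
from collections import deque


def degree_centrality(adj):
    """Degree k_i of every node."""
    return {i: len(adj[i]) for i in adj}


def closeness(adj):
    """Average distance (1/n) * sum_j dist(i, j); smaller is more central."""
    n = len(adj)
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        result[source] = sum(dist.values()) / n
    return result


def eigenvector_centrality(adj, iterations=200):
    """Leading eigenvector of A via power iteration on A + I,
    normalized so that the largest entry equals 1."""
    x = {i: 1.0 for i in adj}
    for _ in range(iterations):
        y = {i: x[i] + sum(x[j] for j in adj[i]) for i in adj}
        norm = max(y.values())
        x = {i: y[i] / norm for i in y}
    return x


# Star graph: node 0 is the hub, nodes 1..4 are leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}

if __name__ == "__main__":
    print(degree_centrality(star))
    print(closeness(star))
    print(eigenvector_centrality(star))
```

On the star graph all three measures single out the hub, which is the situation in which degree alone is already a good proxy for centrality.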

Centrality is widely used to determine the key nodes 
in the network. For example, it is used to find influential 
individuals in online social networks, to identify lead- 
ers in organizations, as a method for selecting individu- 
als to target in viral marketing campaigns, or to identify 
central airports in the air transportation network. 

3.5 Communities 

In most networks (see plate 5 and figure 2) there are 
groups of nodes that are more densely connected inter- 
nally than with the rest of the network. This feature of 
a network is called its community structure. Commu- 
nity structure is common in many networks, and its 
determination yields a mesoscopic description of the 
network. But primarily it is interesting because it can 
reveal groups of vertices that share common proper- 
ties or subgraphs that play different functional roles in 
the system. In fact, communities in social networks are 
found to be related to social, economic, or geographi- 
cal groups; communities in metabolic networks might 
reflect functional groups; and communities in citation 
networks are related to research topics. 

The hypothesis underneath these findings is that 
the network itself contains the information needed 
to reveal the groups and that the communities can 
be obtained using a graph-partitioning technique that 
assigns vertices to each group (see figure 2). Mathemat- 
ically speaking, the problem of identifying graph com- 
munities is not well defined. To start with, there is no 
clear definition of what a community is in a graph. This 
ambiguity is the reason behind the wealth of algorithms 
in the literature, each of which implicitly assumes its 
own mathematical and/or statistical definition of com- 
munities and thus produces different partitions in the 
graph. On top of that, graph-partitioning problems are 
typically NP-hard [I.4 §4.1] and their solutions are generally
achieved using heuristics and/or approximation
algorithms that might not deliver the exact solution 
or even the same approximate solution. Most of the 
time, therefore, we need further external information 
to validate the partition obtained. 

Many different methods have been developed and 
employed for finding communities in graphs. Some of 
them rely on graph-partitioning techniques (such as 
the minimum-cut method) or data-clustering analysis 
(such as hierarchical clustering) borrowed from com- 
puter science. Some problems with these methods are 




Plate 1 (II.16). Force-directed graph visualizations. A sample of forty-nine graphs from the University of
Florida Sparse Matrix Collection. The color is determined by the length of the edges: short ones are red,
medium-length edges are green, and long edges are blue.

Plate 2 (II.16). Looking like Dr. Seuss’s “red fish, blue fish,”
the top image is the graph from a constraint matrix for a
linear programming problem, while the bottom image is the
graph from a frequency-domain circuit simulation.

Plate 3 (II.16). The graph of a Hessian matrix from a
convex quadratic programming problem.

Plate 4 (II.16). A close-up of a graph from a
financial portfolio optimization problem.





Plate 5 (IV.18). Network of email exchanges between academic
staff at Universidad Carlos III de Madrid. The graph
has 1178 nodes and 3830 links. Each link indicates that
at least two emails were exchanged between those nodes,
and the node colors correspond to the different departments
within the institution. Node size and link width are
log-proportional to their degree and weight (the number of
emails exchanged), respectively. Square white nodes correspond
to the mathematics department, which forms a dense
community within the network. The two-dimensional layout
was obtained using a force-directed graph-drawing algorithm.
The network has average degree $\bar{k} = 6.5$, clustering
coefficient $C = 0.21$, and diameter $\ell = 12$. (b) The distribution
of connectivity $p_k$ for the network. The red vertical line
corresponds to the value $k = \bar{k}$.
Plate 6 (V.17). An X-ray computed tomography volume rendering of brine layers within a lab-grown sea ice
single crystal with $S = 9.3$ ppt. The (noncollocated) 8 mm × 8 mm × 2 mm subvolumes (a)-(c) illustrate a
pronounced change in the microscale morphology and connectivity of the brine inclusions during warming
((a) $T = -15$ °C, $\phi = 0.033$; (b) $T = -6$ °C, $\phi = 0.075$; (c) $T = -3$ °C, $\phi = 0.143$). (d) Data for the vertical fluid
permeability $k$ taken in situ on Arctic sea ice, displayed on a linear scale. (e) Divergence of the brine correlation
length in the vertical direction as the percolation threshold is approached from below. (f) Comparison of Arctic
permeability data in the critical regime (twenty-five data points) with percolation theory in (7). In logarithmic
variables, the predicted line has the equation $y = 2x - 7.5$, while a best fit of the data yields $y = 2.07x - 7.45$,
assuming $\phi_c = 0.05$. (Parts (a)-(d) are adapted from Golden, K. M., H. Eicken, A. L. Heaton, J. Miner, D. Pringle,
and J. Zhu. 2007. Thermal evolution of permeability and microstructure in sea ice. Geophysical Research
Letters 34:L16501. Copyright 2005 American Geophysical Union. Reprinted by permission of John Wiley &
Sons, Inc.)



Plate 7 (V.17). Ocean swells propagating through a vast 
field of pancake ice in the Southern Ocean off the coast of 
East Antarctica (photo by K. M. Golden). These long waves 
do not “see” the individual floes, whose diameters are on 
the order of tens of centimeters. The bulk wave propaga- 
tion characteristics are largely determined by the homog- 
enized or effective rheological properties of the pancake/ 
frazil conglomerate on the surface. 



Plate 8 (V.17). (a) Shading shows the solution to Laplace’s equation within the Antarctic MIZ ($\psi$) on August 26,
2010, and the black curves show MIZ width measurements following the gradient of $\psi$ (only a subset is shown
for the sake of clarity) (courtesy of Courtenay Strong). (b) Same as (a) but for the Arctic MIZ on August 29, 2010.
(c) Width of the July-September MIZ for 1979-2011 (red curve). Percentiles of daily MIZ widths are shaded 
dark gray (25th to 75th) and light gray (10th to 90th). Results are based on analysis of satellite-derived sea 
ice concentrations from the National Snow and Ice Data Center. (Parts (b) and (c) are adapted from Strong, C., 
and I. G. Rigor. 2013. Arctic marginal ice zone trending wider in summer and narrower in winter. Geophysical 
Research Letters 40(18):4864-68.) 




Plate 9 (V.17). The evolution of melt pond connectivity and color-coded connected components: (a) discon- 
nected ponds, (b) transitional ponds, (c) fully connected melt ponds. The bottom row of figures shows the 
color-coded connected components for the corresponding image above: (d) no single color spans the image, 
(e) the red phase just spans the image, (f) the connected red phase dominates the image. The scale bars represent
200 m for (a) and (b), and 35 m for (c). (Adapted from Hohenegger, C., B. Alali, K. R. Steffen, D. K. Perovich,
and K. M. Golden. 2012. Transition in the fractal geometry of Arctic melt ponds. The Cryosphere 6:1157-62
(doi:10.5194/tc-6-1157-2012).)






Plate 10 (VI.5). The flow generated by a two-dimensional
flapping wing mimicking dragonfly wing motion. The colors
indicate the vorticity field, with red and blue representing
positive and negative vorticity, respectively. The wing
motion creates a downward jet composed of counterrotat- 
ing vortices. Each vortex pair can be viewed as the cross 
section of a donut-shaped vortex ring in three dimensions. 
From Z. J. Wang (2010), Two dimensional mechanism for 
insect hovering, Physical Review Letters 85(10):2216-19. 



Plate 11 (VI.5). A fruit fly making a sharp yaw turn of 120°
in about 20 wing beats, or 80 ms. The wing hinge acts as 
if it is a torsional spring. To adjust its wing motion, the 
wing hinge shifts the equilibrium position of the effective 
torsional spring, and this leads to a slight shift of the angle 
of attack of that wing. The asymmetry in the left and right 
wings creates a drag imbalance that causes the insect to 
turn. To turn 120°, the asymmetry in the wing angle of 
attack is only about 5° or so. From A. J. Bergou, L. Ristroph, 
J. Guckenheimer, I. Cohen, and Z. J. Wang (2010), Fruit flies 
modulate passive wing pitching to generate in-flight turns, 
Physical Review Letters 104:148101. 





Plate 12 (VII.7). CIE 1931 color space chromaticity dia- 
gram, with the gamut of sRGB shown. (File adapted from 
an original on Wikimedia Commons.) 



Plate 13 (VII.7). (a) Original image. (b) Image converted to
LAB space and A channel negated ($(L, A, B) \to (L, -A, B)$).




Plate 14 (VII.8). An inpainted image. (Courtesy of 
Bugeau, Bertalmio, Caselles, and Sapiro.) 



Plate 15 (VII.8). A contour (in green) evolving from the initial
position in part (a) to the segmentation in part (d).
The deformation (the two stages shown in parts (b) and 
(c)) is governed by a geometric partial differential equation. 
(Courtesy of Michailovich, Rathi, and Tannenbaum.) 











Plate 16 (VII.8). For each of the two examples, the subfigures 
in the top row correspond to three original frames, while 
those in the bottom row are (from left to right) two cor- 
responding white-background composites and a (different) 
special effect. The special effects are (a) a delayed-fading 
foreground and (b) an inverted background. (Original videos 
courtesy of Artbeats (www.artbeats.com) and Getty Images 
(www.gettyimages.com).) 



Plate 17 (VII.8). The first column shows two frames from 
a mimicking video, while the tracking/segmentation masks 
are displayed in different colors in the remaining columns. 
The two dancers are correctly detected as performing dif- 
ferent actions. (Courtesy of Tang, Castrodad, Tepper, and 
Sapiro.) 





Plate 18 (VII.9). (a) A PET “heat map” image; (b) the image in (a) fused with the CT scan of the same section 
shown in (c). From the fused image it is apparent that the increased uptake of fluorodeoxyglucose, indicated 
by the yellow arrow, is in the gall bladder and is not the result of bowel activity. (Images courtesy of Dr. Joel 
Karp, Hospital of the University of Pennsylvania.) 







Plate 19 (VII.13). (a) Direct volume rendering of the instability
of an interface between two fluids of different densities,
termed the Rayleigh-Taylor instability. (b) Isosurfacing
used to visualize the visible human data set
(www.imagevis3D.org).




Plate 20 (VII.13). (a) Vector visualization of a stellar mag- 
netic field using streamlines (Schott et al. 2012). (b) Glyph- 
based tensor visualization of anatomic covariance tensor 
fields (Kindlmann et al. 2004). 




Plate 22 (VII.13). Ensemble Vis: a framework designed for
exploring short-term weather forecasts (Potter et al. 2009). 




Plate 23 (VII.13). Visualizations of bioelectric fields in
the heart and brain (Tricoche and Scheuermann 2003).
(a) Stream surfaces show the bioelectric field in the direct
vicinity of epicardium, or outer layer of the heart. (b) Textures
applied across a cutting plane reveal details of a
source of electric current in the brain and the interaction
of the current with the surrounding tissue.
Figure 2 A social network of friendships between thirty-four
members of a karate club at a U.S. university in the
1970s. The club was led by president John A. and karate
instructor Mr. Hi (pseudonyms). Edge width is proportional
to the number of common activities the club members took
part in. In 1977 Zachary studied how the club was split
into two separate clubs (white and gray symbols) after long
disputes between two factions within the club. A community-finding
algorithm (label propagation) applied to the
network before the split finds three communities (dashed
lines) that have a large overlap with the factions after the
split.

that the number of communities has to be given to
the algorithm and/or there is no way to discriminate
between the many possible partitions obtained by the 
procedure. To circumvent these issues a quality func- 
tion of the partition has to be given. In a seminal paper, 
Newman and Girvan introduced the concept of the mod- 
ularity of a partition, which measures the difference 
between the actual density of edges within communi- 
ties and the fraction that would be expected if edges 
were distributed at random: 

$$Q = \frac{1}{2m} \sum_{ij} (a_{ij} - p_{ij}) \gamma(C_i, C_j),$$

where the sum runs over all pairs of vertices, $\gamma(C_i, C_j) = 1$
if $i$ and $j$ are in the same community (zero otherwise),
and $p_{ij}$ represents the expected number of edges
between $i$ and $j$ in a null model, i.e., a model of a
random graph with the same number of nodes and
edges (see section 4). Given the large heterogeneity of
degrees in the network, the most used model is the configuration
model in which $p_{ij} = k_i k_j/(2m)$. Despite
its limitations, modularity has had a great impact on
community-finding algorithms. It gives a quantitative


measure of the partition found, and it can therefore be 
employed to get the best (maximum-modularity) par- 
tition in divisive algorithms. In addition, modularity 
optimization is itself a popular method for community 
detection. 
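
As an illustration, modularity with the configuration-model null term $k_i k_j/(2m)$ can be evaluated by a direct double sum over vertices. This is a minimal sketch under stated assumptions: the function name `modularity`, the dictionary-of-sets graph representation, and the small two-triangle example are all hypothetical, and the quadratic loop is only sensible for small graphs.

```python
def modularity(adj, community):
    """Q = (1/2m) * sum_ij (a_ij - k_i k_j / 2m) over pairs in the same community.

    adj       -- dict mapping each node to the set of its neighbors (undirected)
    community -- dict mapping each node to a community label
    """
    two_m = sum(len(adj[i]) for i in adj)  # sum of degrees equals 2m
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] != community[j]:
                continue  # the same-community indicator is zero
            a_ij = 1.0 if j in adj[i] else 0.0
            q += a_ij - len(adj[i]) * len(adj[j]) / two_m
    return q / two_m


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the single edge 2-3.
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
by_triangle = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
single_group = {i: "a" for i in two_triangles}

if __name__ == "__main__":
    print(modularity(two_triangles, by_triangle))   # 5/14
    print(modularity(two_triangles, single_group))  # 0
```

Splitting the graph along the bridge gives $Q = 5/14$, whereas putting every vertex in one community gives $Q = 0$, since the observed and expected edge counts then cancel exactly.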

Other popular methods include clique-based meth- 
ods such as the clique-percolation algorithm (which 
defines a community as percolation clusters of k- 
cliques), random-walk methods (where a community 
is a region of the graph in which a random walker 
spends a long time), and consensus algorithms (where 
a community is defined as a group of nodes that 
share common outcomes in a dynamical coordina- 
tion process). Examples of consensus algorithms are 
label-propagation algorithms, in which nodes are given 
unique labels that are then updated by majority voting 
in the neighborhood of nodes. Labels reached asymp- 
totically by this consensus process are taken as com- 
munities in the graph. 
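
The label-propagation idea can be sketched in a few lines. The version below is one possible variant, written as an assumption-laden illustration rather than as the specific algorithm used for figure 2: updates are asynchronous, nodes are visited in random order, ties are broken at random, and the procedure stops when a full sweep changes nothing (or after `max_sweeps`, a hypothetical safety limit).

```python
import random
from collections import Counter


def label_propagation(adj, seed=0, max_sweeps=100):
    """Return a dict node -> label after majority-vote label updates."""
    rng = random.Random(seed)
    labels = {i: i for i in adj}  # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)
        changed = False
        for i in nodes:
            if not adj[i]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[j] for j in adj[i])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[i] in candidates:
                continue  # already holds one of the majority labels
            labels[i] = rng.choice(candidates)
            changed = True
        if not changed:
            break
    return labels


def clique(nodes):
    """Complete graph on the given nodes, as a dict of neighbor sets."""
    return {i: {j for j in nodes if j != i} for i in nodes}


if __name__ == "__main__":
    graph = {**clique([0, 1, 2, 3]), **clique([4, 5, 6, 7])}
    print(label_propagation(graph))
```

Because ties are broken at random, different seeds can yield different partitions on less clear-cut graphs, which is one concrete instance of the non-uniqueness of community assignments.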

Community finding in networks is a computationally 
complex task. Typically, algorithm times scale with the 
number of nodes n and edges m, and some methods 
are therefore not suitable for finding communities in 
very large networks. In recent years, however, much 
progress has been made in accelerating the algorithms, 
and it is possible to efficiently apply algorithms such as 
the Louvain method, Infomap, or the fast greedy algo- 
rithm of Clauset, Newman, and Moore to networks with 
millions of nodes. 

4 Models 

Building on the observations of real-world networks 
and their common properties found in many systems, 
one can create mathematical models of networks. Mod- 
els are important for two reasons: the first is that sim- 
ple null network models can be used to test the sta- 
tistical significance of the results found in real-world 
applications. Of course, the very definition of null net- 
work models depends on the context and the process to 
be considered. The second is that modeling networks
allows us to obtain good mathematical representations 
of the observed systems, which can then be used for 
further testing or even to make predictions about the 
future behavior of a process. 

4.1 The Erdős–Rényi Model

Random graph models are the most widely used type of
model. In these models, graphs are generated according
to a probability distribution or a random process.
Here, we review some of the most popular examples in
mathematical studies and applications.

Figure 3 Examples of networks generated by (a) the
Erdős–Rényi model, (b) the preferential-attachment model,
and (c) the small-world model. All networks have the same
number of nodes $n = 50$ and edges $m = 100$. The size of a
node is proportional to its connectivity $k_i$.

In 1959 Paul Erdős, Alfréd Rényi, and (independently)
Edgar Gilbert proposed a simple binomial random
graph model, $G(n, p)$, in which the graph of $n$ nodes
is constructed by connecting each pair of nodes with
probability $p$ (see figure 3). The distribution of degrees
in $G(n, p)$ is given by the binomial distribution, which
becomes a Poisson distribution in the limit of large $n$:

$$p_k = \binom{n-1}{k} p^k (1 - p)^{n-1-k} \simeq \frac{\bar{k}^k e^{-\bar{k}}}{k!},$$


where $\bar{k} = (n - 1)p$ is the mean degree. Thus, $p_k$ does
not have heavy tails. Despite this difference from real
networks, many properties of $G(n, p)$ are exactly solvable
in the limit of large $n$, and the random graph
model has therefore attracted a lot of attention in the
mathematics community. For example, the diameter
is approximately given by $\ell \simeq \log n/\log \bar{k}$, and thus
$G(n, p)$ has the small-world property. On the other
hand, the clustering coefficient is $C = p = \bar{k}/(n - 1)$,
which tends to zero in the limit of a large system (for
finite $\bar{k}$), unlike real-world networks where $C$ is finite
even for large $n$. Obviously, $G(n, p)$ does not have
any community structure either. Nonetheless, $G(n, p)$
can be used as a zero-information model, providing a
benchmark for comparison with other models and data.
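
The two statements about the mean degree and the clustering coefficient are easy to check by sampling. The sketch below is illustrative only: `gnp`, `mean_degree`, and `clustering` are hypothetical helper names, and the clustering coefficient is computed here as the fraction of connected triples that close into triangles, which is an assumption about the definition being used.

```python
import random
from itertools import combinations


def gnp(n, p, seed=None):
    """Sample G(n, p): each of the n(n-1)/2 pairs is linked with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def mean_degree(adj):
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)


def clustering(adj):
    """Fraction of connected triples (j - i - k) for which j and k are also linked."""
    closed = triples = 0
    for i in adj:
        for j, k in combinations(adj[i], 2):
            triples += 1
            if k in adj[j]:
                closed += 1
    return closed / triples if triples else 0.0


if __name__ == "__main__":
    g = gnp(400, 0.05, seed=1)
    # Both printed values fluctuate from sample to sample around
    # (n - 1) p = 19.95 and p = 0.05, respectively.
    print(mean_degree(g), clustering(g))
```

Keeping the mean degree fixed while increasing `n` (that is, taking `p` proportional to `1/n`) makes the measured clustering shrink, in line with the limit described above.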


4.2 The Configuration Model 

To allow for non-Poisson degree distributions, we can
generalize the Erdős–Rényi model to the configuration
model, which was first given in its simple explicit form
by Bollobás. In this model the degree distribution $p_k$ or
degree sequence $k_1, \dots, k_n$ is given and a random graph
is formed with that particular distribution or degree
sequence. A simple algorithm to generate such random
graphs is to give each vertex a number of “stubs”
of edges, either according to the distribution or from
the degree sequence, and then pick stubs in pairs uniformly
at random and connect them to form edges. The
ensemble of graphs produced in this way is called the
configuration model. Many properties of the configuration
model are known. Molloy and Reed showed that
there is almost surely a giant component if $\overline{k^2} - 2\bar{k} > 0$
(where $\overline{k^2} = \sum_k k^2 p_k$), and that the probability of finding
a loop on the graph decays like $n^{-1}$, i.e., the graph
has a local tree-like structure for large $n$. This was used
by Newman to show in a simple way that the clustering
coefficient decays asymptotically like $C \sim n^{-1}$.
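
The stub-matching procedure translates almost literally into code. In this minimal sketch (the function name and the edge-list return type are assumptions), self-loops and repeated edges are kept as they arise, so the result is in general a multigraph; discarding or resampling them would be a further design decision not covered here.

```python
import random


def configuration_model(degrees, seed=None):
    """Pair edge stubs uniformly at random.

    degrees -- list where degrees[i] is the prescribed degree of node i
    Returns a list of edges (u, v); self-loops and multi-edges are possible.
    """
    if sum(degrees) % 2:
        raise ValueError("the degree sequence must have an even sum")
    rng = random.Random(seed)
    # Node i contributes degrees[i] stubs.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # After shuffling, consecutive stubs form a uniformly random pairing.
    return list(zip(stubs[0::2], stubs[1::2]))


if __name__ == "__main__":
    print(configuration_model([3, 2, 2, 1], seed=1))
```

Shuffling the stub list and pairing consecutive entries is equivalent to repeatedly drawing two unused stubs at random, and it guarantees that every node ends up with exactly its prescribed number of edge ends.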

One of the most important properties of the configuration
model is that the distribution $q_k$ of degrees of
nodes obtained following a randomly chosen edge is
not given by $p_k$. Instead, since in choosing a random
edge there are $k$ ways to get to a node with connectivity
$k$, we obtain that $q_k \sim k p_k$. The average degree of
(next-nearest) neighbors of a node is therefore

$$k_{nn} = \sum_k k q_k = \frac{\overline{k^2}}{\bar{k}}.$$

Since real-world networks are very heterogeneous, we
usually find that $\overline{k^2} > \bar{k}^2$, and the neighbors of a randomly
chosen node therefore have more connectivity
than the node itself. This fact has a key role in many
processes that happen in networks (see section 6). In
social networks it is known as the friendship paradox,
posed by the sociologist Feld in 1991: on average,
your friends have more friends than you do. It
also has other direct consequences; for example, in
scale-free networks with $2 < \alpha < 3$, we find that
the diameter $\ell \sim \log \log N$, and, thus, scale-free networks
are ultrasmall. For $\alpha > 3$ we recover the small-world
behavior $\ell \sim \log N$. Finally, since edges are distributed
randomly among nodes, the probability that
two nodes with degrees $k_i$ and $k_j$ are connected is
given by $k_i k_j/(2m)$, a result used in the definition of
modularity $Q$ as a null model approximation for $p_{ij}$.
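
A small numerical example makes the friendship paradox concrete. The sketch below assumes only a list of node degrees; the function names are illustrative. It compares the mean degree with the mean degree reached by following a random edge end, which is the ratio of the second to the first moment of the degree sequence.

```python
def mean_degree(degrees):
    """Average degree of a randomly chosen node."""
    return sum(degrees) / len(degrees)


def mean_neighbor_degree(degrees):
    """Average degree at the end of a randomly chosen edge:
    sum_k k * q_k with q_k proportional to k * p_k, i.e. <k^2> / <k>."""
    return sum(k * k for k in degrees) / sum(degrees)


if __name__ == "__main__":
    star = [4, 1, 1, 1, 1]      # one hub and four leaves
    regular = [3, 3, 3, 3]      # every node has the same degree
    print(mean_degree(star), mean_neighbor_degree(star))        # 1.6 2.5
    print(mean_degree(regular), mean_neighbor_degree(regular))  # 3.0 3.0
```

In the star, half of all edge ends lead to the hub, so a typical neighbor has degree 2.5 although a typical node has degree 1.6; in a regular graph there is no heterogeneity and the two averages coincide.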

5 Other Random Models 

Configuration models can be generalized to exponential
random graphs, also called $p^*$ models. These are
ensembles of graphs in which the probability of observing
a graph $G$ with a given set of properties $\{x_i\}$ is
$P(G) = e^{H(G)}/Z$, where $Z = \sum_G e^{H(G)}$ and the network
Hamiltonian is given by $H(G) = \sum_i \theta_i x_i(G)$, with $\{\theta_i\}$
the ensemble parameters. For example, we could take
$x_1$ to be the number of edges, $x_2$ to be the number
of vertices with a given degree, $x_3$ to be the number
of triangles in the network, and so on. Much of the
progress made in this field has come through the use
of Monte Carlo simulations of the ensemble and/or by
using real network data to estimate the parameters of
the model.

Another random graph model is the family of stochastic
block models, in which nodes are assigned to
$S$ different blocks and edges are placed randomly
between nodes with a probability
that depends only on the blocks of the nodes. Specifically,
if $z_i$ denotes the block that the vertex $i$ belongs to,
then we can define an $S \times S$ stochastic block matrix $M$,
whose entry $m_{z_i z_j}$ gives the probability that a vertex of type
$z_i$ is connected to a node of group $z_j$. Blocks can be
groups of nodes that have similar structural equivalence,
have similar demographics, or belong to the same
community.
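
Sampling from a stochastic block model only requires looking up the block pair of each candidate edge. The following sketch assumes an undirected graph without self-loops and a symmetric matrix; the function name and argument layout are hypothetical.

```python
import random
from itertools import combinations


def stochastic_block_model(blocks, M, seed=None):
    """Sample an undirected graph from a stochastic block model.

    blocks -- list where blocks[i] is the block index of vertex i
    M      -- square matrix (list of lists); M[r][s] is the probability of
              an edge between a vertex in block r and a vertex in block s
    """
    rng = random.Random(seed)
    n = len(blocks)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if rng.random() < M[blocks[i]][blocks[j]]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


if __name__ == "__main__":
    # Two blocks of three vertices: dense inside each block, sparse between them.
    blocks = [0, 0, 0, 1, 1, 1]
    M = [[0.9, 0.1],
         [0.1, 0.9]]
    print(stochastic_block_model(blocks, M, seed=2))
```

Choosing large diagonal and small off-diagonal entries plants a community structure, while a constant matrix reduces the model to the binomial random graph.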

5.1 The Small-World Model 

The small-world model was introduced by Watts and
Strogatz as a random graph that has two independent
structural properties of real networks: a finite clustering
coefficient, and the small-world property $\ell \sim \log n$
when $n$ increases. The basic idea of this model is to
build a graph embedded in a one-dimensional lattice
in which nodes are connected to a local neighborhood
of size $d$ and then the edges are “rewired” with probability
$p$ (see figure 3). When $p = 0$ we recover a
one-dimensional regular graph, while when $p = 1$ we
recover a random graph. The local tight neighborhood
provides the finite clustering property, while the long-range
rewired links are responsible for the low diameter
of the network. Specifically, as $n \to \infty$ the clustering
coefficient is $C \simeq 3(d - 1)/[2(2d - 1)] \, (1 - p)^3$
and the diameter scales as $n$ and $\log n$ for $p = 0$
and $p = 1$, respectively. One interesting feature of the
Watts-Strogatz model is that it interpolates between
a regular graph and a completely random graph. A
major limitation of the model is that it produces unrealistic
degree distributions ($p_k$ decays exponentially).
Many variations of the small-world model have been
proposed and studied.
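
A rewiring construction of this kind can be sketched as follows. Several details are assumptions of this sketch rather than statements about the original model: each node is linked to its `d` nearest neighbors on each side of a ring, only the far end of each lattice edge is rewired, the new endpoint is chosen uniformly among nodes not already adjacent, and `n > 2 * d` is required. The clustering coefficient is measured as the fraction of connected triples that are closed.

```python
import random
from itertools import combinations


def watts_strogatz(n, d, p, seed=None):
    """Ring lattice with d neighbors on each side, each edge rewired with probability p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in range(1, d + 1):
            j = (i + step) % n
            adj[i].add(j)
            adj[j].add(i)
    for step in range(1, d + 1):
        for i in range(n):
            j = (i + step) % n
            if rng.random() < p:
                # Avoid self-loops and duplicate edges.
                choices = [v for v in range(n) if v != i and v not in adj[i]]
                if choices:
                    new = rng.choice(choices)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


def clustering(adj):
    """Fraction of connected triples that close into triangles."""
    closed = triples = 0
    for i in adj:
        for j, k in combinations(adj[i], 2):
            triples += 1
            if k in adj[j]:
                closed += 1
    return closed / triples if triples else 0.0


if __name__ == "__main__":
    for p in (0.0, 0.1, 1.0):
        print(p, clustering(watts_strogatz(200, 2, p, seed=1)))
```

For `p = 0` and `d = 2` the lattice value is $3(d - 1)/[2(2d - 1)] = 0.5$; increasing `p` lowers the measured clustering, while the number of edges stays fixed at `n * d`.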

5.2 The Preferential-Attachment Model 

While the models above incorporate the observed 
macrofeatures of networks, generative models try to 
explain those features as an emergent property that is 
due to microscopic mechanisms. A particularly popular 
class of models is the network growth model, in which 
the dynamics of node and edge creation create the 
observed network properties. Probably the best-known 
example is the so-called preferential-attachment model 
of Barabasi and Albert, which aims to explain the scale- 
free property of the degree distribution. This is done 
using the rich-get-richer mechanism in edge creation: 
nodes are added to the network with a certain num- 
ber m of edges emerging from them, and those edges 
are connected to preexisting nodes with connectivity
$k$ with probability $\pi_k$. This preferential-attachment
mechanism has been observed in many growing networks,
from citation and collaboration networks to the
Internet and online social networks. Simple rate equations
can be written for the evolution of the system,
and we find that the degree of each node grows according
to a power law $k_i(t) \sim t^{\beta}$ and that for all $t$ the
degree distribution is scale free, $p_k \sim k^{-\alpha}$, where the
exponents $\alpha$, $\beta$ depend on the preferential-attachment
probability. In the linear case, where $\pi_k \propto k$, we obtain
that $\alpha = 3$, a value that is similar to those found in
real-world networks such as the World Wide Web.

The preferential-attachment model has the small-world
property with a double logarithmic correction
$\ell \sim \log n/\log \log n$, and numerical results suggest
that the clustering coefficient $C$ behaves as $C \sim n^{-0.75}$.
Although $C$ vanishes in the limit $n \to \infty$, it can have
large values for finite $n$. One of the main criticisms
of the preferential-attachment mechanism is that it 
requires global information about the network, which 
is clearly an unrealistic assumption in many situations. 
This and other criticisms have been addressed by mod- 
ifying the original model using local network formation 
rules, triadic closure, or finite memory of nodes. 
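
The linear rich-get-richer rule can be simulated with a list that contains every node once per unit of degree, so that a uniform draw from the list selects a node with probability proportional to its degree. This is a minimal sketch under assumptions not fixed by the text: the growth starts from a complete graph on `m + 1` nodes, and each new node attaches to `m` distinct targets.

```python
import random


def preferential_attachment(n, m, seed=None):
    """Grow a network to n nodes; each new node brings m edges attached
    to distinct existing nodes chosen proportionally to their degree.
    Returns the edge list."""
    rng = random.Random(seed)
    # Seed graph (an assumption of this sketch): complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Every node appears in `ends` once per edge end, i.e. `degree` times.
    ends = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))  # uniform over ends = proportional to degree
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges


if __name__ == "__main__":
    edges = preferential_attachment(2000, 2, seed=3)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # The oldest nodes typically accumulate the largest degrees.
    print(sorted(degree.values(), reverse=True)[:5])
```

Drawing from the list of edge ends is exactly the step that needs global information about the network, which is the criticism mentioned above.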

6 Processes 

Beyond the metrics and models presented in the previ- 
ous sections, much of our knowledge about networks 
comes from our ability to explain processes happening 
on the network by analyzing its structure. For example, 
we would like to understand whether the node with the 
largest centrality is also the fastest to become aware of 
information spreading in the network. Or perhaps we 
want to know what the relationship is between an Inter- 
net network’s degree of heterogeneity and its resilience 
to targeted attacks. Progress in this direction comes 
from simulating simple models of these processes on 
real or synthetic models of networks and from the 
availability of data that is empowering (and sometimes 
challenging) our understanding of how networks work. 

6.1 Spreading

Probably the most-studied process in networks is how 
things spread around them. In particular, a wealth 
of work has been done on understanding how net- 
work structure (and dynamics) impacts the spread- 
ing of information, viruses, diseases, rumors, etc. For 
example, people have studied how the structure of net- 
works of sexual contacts (or the network of flows of 
passengers between airports and cities) influences the 
spread of diseases; others have studied the best way 
of choosing the people who are initially targeted in a 
viral marketing campaign on a social network in order 
to optimize the reach and velocity of the campaign. 


Spreading models are typically borrowed from epidemiology:
diseases spread to susceptible (noninfected)
(S) nodes when they are exposed to infected (I)
nodes, and then they can decay into the recovered (R)
state. This is the so-called SIR model [V.16], which has
a long history in mathematical epidemiology. Although
the model is dynamical in nature, Grassberger found
that it could be mapped exactly onto bond percolation
in the network: outbreaks in the spreading process correspond
to clusters in the percolation problem. Percolation
is a well-known problem in mathematics; in its
bond version it describes the behavior of connected
clusters in a random graph in which edges are occupied
with probability $\lambda$. The question is then whether
there is a cluster that “percolates” the whole network,
i.e., a connected cluster of a fraction of the network,
the so-called giant component. This would correspond
to a large disease outbreak in the spreading problem. In
most systems there exists a critical value $\lambda_c$ such that the
percolation transition, or epidemic breakout in epidemiology,
happens for $\lambda > \lambda_c$. In viral marketing we would
like our campaigns to operate above $\lambda_c$, so the spreading
“goes viral,” while in disease spreading, vaccination
and health policies are designed to maintain $\lambda < \lambda_c$.
Although $\lambda_c$ depends on the virulence of the disease,
it is also affected by the structure of the network in
which disease propagates. For example, in the configuration
model with degree distribution $p_k$, starting
from one infected initial seed we have, on average, that
$R_0 = \lambda \bar{k}$ of its neighbors are infected. Each of these
$R_0$ infected neighbors goes on to infect an average of
$R_1 = \lambda (k_{nn} - 1)$ new next-nearest neighbors. The same
happens in the following steps. The size of the outbreak
is therefore given by

$$s = 1 + R_0 + R_0 R_1 + R_0 R_1^2 + \cdots = 1 + \frac{R_0}{1 - R_1}. \tag{1}$$

Thus, outbreak size diverges as $R_1 \to 1$, that is, when
$\lambda \to \lambda_c = 1/(k_{nn} - 1) = \bar{k}/(\overline{k^2} - \bar{k})$. This is actually the
Molloy and Reed criterion for the existence of a giant
component applied to the corresponding configuration
model in which edges are occupied with probability $\lambda$.
Since heterogeneity in networks usually implies that
$\overline{k^2} \gg \bar{k}$, the epidemic threshold is typically very small.
Moreover, in networks that are scale free with exponent
$\alpha \le 3$, we have that $\overline{k^2}$ diverges. This implies that
in highly heterogeneous networks, the critical point
vanishes and information or disease spreads all over
the network, an interesting result of Pastor-Satorras
and Vespignani that has been suggested as an explanation
for the prevalence of computer viruses. Although
this result was obtained for a very particular epidemic
model in the configuration model, the general understanding
is that network heterogeneity favors spreading.
Indeed, $R_0$ and $R_1$ are widely used in epidemiology
to assess epidemic outbreak, where they are known as
the basic reproduction numbers. Away from the configuration
model, epidemic spreading in real networks is
controlled by the first eigenvalue $\sigma_1$ of the adjacency
matrix, so that $\lambda_c = 1/\sigma_1$. Furthermore, network clustering
and community structure tend to reduce the
spread of infection, since spreading gets trapped in
densely connected areas of the network.
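
The configuration-model threshold and the outbreak size below it depend only on the first two moments of the degree sequence, so they can be evaluated directly. The sketch below uses hypothetical function names and assumes a degree sequence whose second moment exceeds its mean (otherwise the threshold formula has no finite positive value).

```python
def degree_moments(degrees):
    """Return (<k>, <k^2>) for a list of node degrees."""
    n = len(degrees)
    return sum(degrees) / n, sum(k * k for k in degrees) / n


def epidemic_threshold(degrees):
    """Critical occupation probability <k> / (<k^2> - <k>)."""
    k1, k2 = degree_moments(degrees)
    return k1 / (k2 - k1)


def outbreak_size(degrees, lam):
    """Mean outbreak size 1 + R0 / (1 - R1), valid only below the threshold."""
    k1, k2 = degree_moments(degrees)
    r0 = lam * k1
    r1 = lam * (k2 / k1 - 1.0)  # lam * (k_nn - 1)
    if r1 >= 1.0:
        raise ValueError("at or above the threshold the series diverges")
    return 1.0 + r0 / (1.0 - r1)


if __name__ == "__main__":
    homogeneous = [3] * 10          # every node has degree 3
    heterogeneous = [1] * 9 + [9]   # one hub among low-degree nodes
    print(epidemic_threshold(homogeneous))    # 0.5
    print(epidemic_threshold(heterogeneous))  # about 0.25
    print(outbreak_size(homogeneous, 0.25))   # 2.5
```

Although the heterogeneous sequence has the smaller mean degree (1.8 versus 3), its threshold is half that of the homogeneous one, which illustrates how degree heterogeneity favors spreading.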

Two related questions are finding the nodes in the 
network that spread most efficiently and, conversely, 
finding the ones that are affected by the spreading pro- 
cess earliest. Both of these questions are related to cen- 
trality in the network: the more central a node is, the 
larger the outbreak cluster that it can generate,
and also the faster it becomes aware of the spread- 
ing process. This result allows us to strategically target 
initial spreaders in information spreading or to vacci- 
nate more central people in social networks to stop epi- 
demics. But it also provides a way of choosing a set of 
central nodes (sensors) in the network that can help us 
detect epidemic outbreaks or information spreading as 
quickly as possible. For example, Fowler and Christakis 
and collaborators have found that a set of sensors is 
highly effective in detecting outbreaks in information 
and disease spreading, even in massive networks. 

6.2 Contagion 

Closely related to information diffusion, yet mechanis- 
tically different, is behavior contagion, or how expo- 
sure to certain individual behavioral characteristics can 
drive the propagation of those characteristics from 
individual to individual in a network. The importance 
of social-network structure in promoting positive or 
arresting damaging behavior has recently been stud- 
ied in many contexts, ranging from spreading of health 
behaviors, to product/service adoption, to political 
opinion, to participation in time-critical events, to the 
diffusion of innovations, and so on. Taken together, 
these studies hint at some generalities regarding behav- 
ior contagion: it critically depends on the structural 
diversity of the network (i.e., from how many differ- 
ent social communities the behavior is exposed) and 
on social reinforcement (i.e., how additional exposures 
change the probability of adoption and how cluster- 
ing in networks promotes it). However, recent work 


has highlighted the problem that most of the observed 
causal influence in social networks could be chiefly due 
to latent homophily or other assortative confounder 
variables. For example, an individual might buy a par- 
ticular product not because she is influenced by the 
network surrounding her but because she belongs to 
a group to which that product is appealing. A more 
refined statistical network would be needed to establish 
causality in behavior contagion. 

6.3 Robustness 

In many applications, networks should be robust to 
small topological or dynamical perturbations. For ex- 
ample, communication networks, power grids, or or- 
ganizational business processes should be resilient 
to intentional attacks, random failures in stations, 
or organizational changes, respectively. In the simplest
approximation, network robustness is studied
under a random or intentional removal of nodes and/or 
edges. The question is at which point in this removal 
process does the network stop operating as intended, 
e.g., when a significant fraction of nodes in communi- 
cation networks can no longer communicate with each 
other. This obviously happens when there is no path 
between those nodes, and in that case the network 
becomes disconnected with a number of small compo- 
nents. In this form, the robustness problem is then very 
similar to percolation, where nodes or edges are unoc- 
cupied when they are attacked or they fail. Depending 
on the removal strategy there will be a critical fraction
$q_c$ such that when a fraction $q > q_c$ of the nodes is removed
the network becomes disconnected, while the removal
of a fraction $q < q_c$ leaves a large fraction of nodes in a connected
component (the giant component). For example,
if a fraction $q$ of randomly chosen nodes is removed
from the configuration model with degree distribution
$p_k$, then each node has $(1 - q)\bar{k}$ neighbors on average
and each of the neighbors will itself have $(1 - q)\bar{k}_{nn}$
neighbors. Thus, starting from a node that has not been
removed we can form a connected component that has
average size given by equation (1), where $R_0 = (1 - q)\bar{k}$
and $R_1 = (1 - q)(\bar{k}_{nn} - 1)$, recovering the Molloy-Reed
criterion that $q_c \simeq 1 - \bar{k}/\overline{k^2}$. This implies that
highly heterogeneous networks (where $\overline{k^2} \gg \bar{k}$) are very
robust against random attacks. Conversely, given the 
structural importance (centrality) of hubs or bridges, 
an intentional attack to remove those nodes could dis- 
connect the network very easily. Thus, although the het- 
erogeneous and modular structure of networks makes 
them resilient against random attacks, the central role
of hubs and bridges between communities makes net- 
works very fragile when it comes to targeted attacks 
against them. 
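
The critical fraction for random removal can be evaluated directly from a degree sequence. The sketch below sets $R_1 = 1$ in the expressions above to obtain the critical fraction, and also returns the approximation that holds when the mean squared degree is much larger than the mean degree. It assumes that the mean neighbor degree is the mean squared degree divided by the mean degree; the function name is hypothetical.

```python
def critical_removal_fraction(degrees):
    """Return (exact, approx) critical fractions for random node removal.

    exact  solves (1 - q) * (k_nn - 1) = 1, i.e. q_c = 1 - 1 / (k_nn - 1);
    approx is 1 - <k>/<k^2>, valid when <k^2> is much larger than <k>.
    Assumes k_nn = <k^2>/<k> (configuration model).
    """
    n = len(degrees)
    k1 = sum(degrees) / n                   # mean degree
    k2 = sum(k * k for k in degrees) / n    # mean squared degree
    k_nn = k2 / k1
    if k_nn <= 2.0:
        # No giant component exists even before any node is removed.
        return 0.0, 0.0
    exact = 1.0 - 1.0 / (k_nn - 1.0)
    approx = 1.0 - k1 / k2
    return exact, approx


print(critical_removal_fraction([3] * 10))        # homogeneous network
print(critical_removal_fraction([1] * 9 + [9]))   # network with one hub
```

The network with a hub tolerates the removal of a larger fraction of randomly chosen nodes, in line with the statement that heterogeneity protects against random failures.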

In some situations failures can propagate in the net- 
work like avalanches. This is the case in power trans- 
mission grids or the Internet, where the removal or fail- 
ure of nodes changes the balances of flows and could 
lead to high loads in other nodes and their possible 
failure. Another example is cascades of bank failures 
in financial networks. In this case, the robustness of 
the network depends both on its topological structure 
and on how the capacities/tolerance of nodes are dis- 
tributed over it. Finally, in some technological problems 
networks are interconnected; for example, in Italy’s 
major 2003 blackout it was found that the coupling 
between the power grid network and the communica- 
tion network caused further breakdown of power sta- 
tions. The effect of heterogeneity in random attacks 
is the opposite in interconnected networks from the 
effect in simple networks: a broader distribution of 
degree increases the vulnerability of interconnected 
networks. 

6.4 Consensus 

Repeated information sharing and behavior contagion 
can lead to the formation of consensus or synchroniza- 
tion in the network. Examples of consensus formation 
in networks are the problems of opinion formation in 
society, synchronization of biological neural networks, 
or protocols in distributed computing. Simple mod- 
els of consensus formation are the well-known math- 
ematical problems of the voter model or the Kuramoto 
model of coupled oscillators. Important questions in 
consensus processes are whether the network struc- 
ture favors the appearance of consensus and its impact 
on the time to reach it. The answer to the first question 
depends on the way in which models or experiments 
are engineered. For example, in the voter model, con- 
sensus is not reached in general in heterogeneous net- 
works, although finite-size fluctuations induce consen- 
sus after a time t ~ n. Since most consensus problems 
can be written in a diffusion-like framework, the time 
to consensus can be obtained from the spectral prop- 
erties of the graph. For example, consider the simple 
local average consensus problem 

$$\dot{x}_i = \sum_{j \in N_i} \bigl(x_j(t) - x_i(t)\bigr),$$

where the $N_i$ are the neighbors of node $i$, and $x_i(t)$
is the state of node $i$. The collective dynamics of the
agents can be written as $\dot{x} = -Lx$, where $L$ is the Laplacian
matrix $[L]_{ij} = k_i \delta_{ij} - a_{ij}$. Since $L$ always has a
zero eigenvalue $\theta_1 = 0$, the timescale in the consensus
problem is given by the second smallest eigenvalue $\theta_2$,
which is usually called the algebraic connectivity. The
time to achieve consensus is then given by $t \sim 1/\theta_2$.
In general, heterogeneity has a large influence on t: 
while in the voter model larger heterogeneity (more 
hubs) favors consensus, synchronization of oscillators 
is more difficult in highly heterogeneous networks. On 
the other hand, it appears that community structure 
has a large influence in reaching consensus. In fact, this 
is the main idea behind some community-finding algo- 
rithms that are based on consensus formation, such as 
the label-propagation algorithm (see section 3.5).
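
The local average consensus dynamics can be simulated in a few lines. The sketch below integrates the equation with a forward-Euler step on a small path graph; the step size, the number of steps, the example graph, and the function name are all illustrative assumptions, not part of the original analysis.

```python
def simulate_consensus(neighbors, x0, dt=0.01, steps=5000):
    """Forward-Euler integration of dx_i/dt = sum over j in N_i of (x_j - x_i).

    neighbors: dict mapping node index -> list of neighbor indices (undirected).
    x0: initial states, indexed by node.
    """
    x = list(x0)
    for _ in range(steps):
        dx = [sum(x[j] - x[i] for j in neighbors[i]) for i in range(len(x))]
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x


# Path graph 0 - 1 - 2 - 3 with arbitrary initial opinions.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
start = [1.0, 0.0, 0.0, 3.0]
final = simulate_consensus(path, start)
print(final)   # every entry should be close to the initial average
```

On an undirected graph the sum of the states is conserved, so the common value reached is the average of the initial states; how quickly it is reached is governed by the second smallest Laplacian eigenvalue of the graph.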

6.5 Network Prediction/Inference 

Networks contain explicit information about the nodes 
they contain and how they interact, but they might also 
contain implicit information about interactions that 
might happen in the future or about the possibility 
that a given node, edge, or subgraph disappears from 
the network after some time. For example, in a social 
context, the fact that two people share a large num- 
ber of friends indicates that those two people might be 
friends as well. This kind of analysis is not only use- 
ful in predicting how a network might evolve, but it 
also helps us to make inferences about links that are 
unobserved or are missing from the data. Most of the 
processes of link formation and decay can be predicted 
by looking at the local neighborhood or community of 
the link: nonconnected nodes that have structural sim- 
ilarity tend to become connected. The simplest similar- 
ity measure is the number of shared neighbors in the 
network (or the embeddedness of an edge), but other 
dyadic or neighborhood measures have been proposed. 
Conversely, studying four years’ worth of banker rela- 
tionships in a large organization, Burt found that edges 
between nodes that are very similar do not decay eas- 
ily, and therefore network evolution happens mostly 
at the bridge positions between communities, where 
nodes are structurally different. 
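
The simplest similarity measure mentioned above, the number of shared neighbors, can be sketched as follows. The adjacency structure (a dictionary of neighbor sets), the example graph, and the function name are assumptions made for illustration.

```python
from itertools import combinations


def common_neighbor_scores(adj):
    """Rank nonconnected node pairs by their number of shared neighbors.

    adj: dict mapping each node to the set of its neighbors (undirected graph).
    Returns a list of ((u, v), score) sorted by decreasing score.
    """
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue                      # already connected: nothing to predict
        shared = len(adj[u] & adj[v])     # number of common neighbors
        if shared > 0:
            scores[(u, v)] = shared
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


friends = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
print(common_neighbor_scores(friends))
```

Here the pair ("a", "d") shares two neighbors and is therefore ranked as the most likely missing or future link.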

7 Applications 

It would be impossible to cover all the applications of 
network analysis here. Although it has long been used 
in the social sciences, the last few decades have seen 
many technological, engineering, economic, and biological
problems being studied using network analysis.
Here we review some of those applications. 

7.1 Social Networks 

The analysis of social networks has a long tradition, and 
in fact many of the techniques and ideas in network 
analysis are derived from sociology. Social networks 
are defined by a set of actors (persons, organizations, 
etc.) and edges that represent relationships between 
them (friendships, business relationships, etc.). Tradi- 
tional social network analysis was devoted to explain- 
ing the role of network structures, dyadic properties, 
or network positions in social processes. At the level 
of the network, for example, the connectedness or 
cohesion of a group or of the network itself could 
be an explanatory variable in consensus formation, 
shared norms, etc. Dyadic properties of social networks 
refer to concepts like distance between nodes, struc- 
tural equivalence (sharing the same relationships to 
all other nodes), reciprocity (the tendency of nodes to 
form mutual connections), etc., that can be used to 
find structural equivalence classes, that is, types of 
actors in the network. Regarding network positions, 
the most studied concept is that of the use of central- 
ity to determine the most important actor in a social 
network with respect to information sharing, economic 
opportunities, prestige, and so on. 

Different social positions also lead to different net- 
work roles; for example, bridges are nodes that appear 
between communities and thus have a large central- 
ity (see figure 4). Bridges are surrounded by structural 
holes (i.e., the lack of a connection between two nodes 
in communities connected by a bridge), and thus only 
bridges can access information from different sources 
and communities and benefit from their position in 
the network. The potential benefits that their posi- 
tion in a network could yield to a group or individ- 
ual are known as social capital theory (introduced by 
Burt in the 1990s). Intimately related to the concept 
of bridges and structural holes is Granovetter’s theory 
of weak ties: if strong ties are associated with inti- 
mate and intense relationships, Granovetter’s theory is 
that weak ties are associated with bridges, that is, that 
our strongest edges happen within communities (see 
figure 4). Weak ties therefore enable people to reach 
information that is not accessible via strong ties. Granovetter
used a survey of job seekers to prove this; he
asked people who had found a job through contacts
how often they saw the person who had helped them
get their job. The majority were acquaintances rather
than close friends. In more recent work, Onnela and
collaborators measured these social theories quantitatively;
using a large graph of mobile phone communications,
they found that people with weak ties (those
with a smaller number of calls) tend to have larger
betweenness than those with strong ties.

Figure 4 The network of email communication between two
departments (white and gray symbols) from plate 5. Most
communication happens within departments, and some
individuals have special roles as bridges between them.
Also, there are a large number of structural holes in the
network; weak ties happen between the departments while
strong ties occur between people in the same department
(the Granovetter hypothesis).

Much work has been devoted to understanding 
when and why information propagates in social net- 
works. Understanding how fast and how far informa- 
tion spreads is important in viral marketing, social 
mobilizations, innovation spreading, and computer 
viruses, for example. Although information spreading 
is affected by many factors, including the value of 
the content transmitted and exogenous campaigns, the 
very network structure shapes the speed and reach 
of information spreading. As we show in section 6, 
it is possible to study how the structural proper- 
ties of the network — such as heterogeneity, cluster- 
ing, and communities — influence the spreading, and 
to study what structural properties (centrality, degree, 
k-coreness) of an initial set of nodes are needed to 
maximize the spreading in the network. 

Although behavior contagion is difficult to assess in
many contexts, network analysis is widely used in mar- 
keting and social media to detect and predict changes 
of behavior. For example, it is used in the telecommu- 
nication industry to predict churn or product/service 
adoption using the social network of a company’s 
clients obtained through the calls placed between them. 
The basic idea is that behavior like churning or adop- 
tion has a viral component and propagates on the net- 
work. Thus, the probability to churn or buy a particular 
product or service depends on the number of neigh- 
bors in the social network that have already done so. 
The analysis of the social network of clients permits 
one to identify and target those individuals most likely 
to make a purchase. 

Disease spreading is also influenced by an infected 
person’s social network. Here, the approach is dif- 
ferent depending on the scale at which spreading is 
being studied. At small scale, the study of networks 
of sexual contacts can help to understand and con- 
trol the spread of sexually transmitted diseases. These 
networks show high heterogeneity, modular structure, 
and small-world phenomena— properties that promote 
the spread of the disease across the network. In turn, 
since highly heterogeneous networks are susceptible 
to intentional attacks, targeted vaccination of super- 
spreaders might be enough to prevent an epidemic, 
a result that reinforces standard public-health guide- 
lines. At a large scale, we can consider metapopula- 
tion network models, in which nodes are populations 
(groups, patches, cities) and edges account for the prob- 
ability of transmission between populations. In these 
examples, human mobility networks play a major role. 
For example, the worldwide airport transportation net- 
work is highly connected; it is a small world, and its 
structure therefore favors epidemics on a global scale 
(see figure 5). In 2003 the SARS outbreak took just a 
few weeks to spread from Hong Kong to thirty-seven 
countries. Analysis of transportation networks allows 
us to predict which airports or hubs will be most likely 
to promote aggressive spatial spreading. 

Figure 5 A network of passenger flows among U.S. airports.
The network has 489 airports and 5609 different routes
between them. The diameter of the network is only 8. The
black line shows one of the longest geodesics in the network:
that between the airports of Grand Canyon West, AZ, and
Fort Pierce, FL.

7.2 Economics and Finance

The complexity of interdependencies between different
agents, financial instruments, traders, etc., can
also be studied using network analysis. In these studies,
the main aim is to understand how the network
structure impacts the performance of institutions or
economies and also the role it plays in economic risks
and their possible mitigation. For example, the 2008
financial crisis has shown that systemic risk can prop- 
agate rapidly between interconnected financial struc- 
tures. A potentially vulnerable market for contagion of 
financial shocks is the interbank loan market, where 
banks exchange large amounts of capital for short dura- 
tions to accommodate temporal fluctuations. Network 
analysis has shown the important role of the topology 
of that market in the systemic risk of the system. Specif- 
ically, it has been found that contagion of bank failures 
can be promoted by the heterogeneity and density of 
the network, and the fragility of the system to exter- 
nal shocks has also been demonstrated. The impact of 
the structure of financial networks and global markets 
on their stability has attracted the interest of regula- 
tors and central bankers in using network analysis to 
evaluate systemic risk. 

Network analysis is also used to map how organi- 
zational environments affect an organization’s perfor- 
mance, that is, how market transactions, contracts, 
mergers, interlocking board directorates, or strategic 
alliances shape organizations, create innovation, or 
define the future performance of companies in a par- 
ticular sector. For example, in 1994 Saxenian hypoth- 
esized that the dramatic growth of Silicon Valley in 
the previous decades could be explained, in part, by 
the cooperative and informal exchange of information 
among the organizations in the area. At the world level, 
the study of the network of international trade allows 
us to analyze indirect trade interactions between countries,
the effect of globalization on the world economy,
or the cascade effect of financial crises in different 
countries. 

7.3 Biology 

Many biological problems are intrinsically complex; 
they include the interaction of a large number of 
molecules, cells, individuals, or species. Mapping all 
this information into a network allows us to analyze the 
common structure of those problems and to develop 
tools that may prove useful for applications in a wide 
variety of biological problems. 

For example, new noninvasive techniques for measuring
brain structure and activity, such as neuroimaging
(e.g., magnetic resonance imaging [VII.10 §4.1]
(MRI), functional MRI, and diffusion tensor imaging)
and neurophysiological recordings (e.g., electroencephalography,
magnetoencephalography), have allowed us
to collect large amounts of spatiotemporal data about
brain structure and activity. By analyzing the anatomical
and/or functional connectivity of different areas of
the brain, researchers have constructed network models
of the brain. Brain networks (the connectome) seem
to be organized in communities or modules, have the
small-world property, and also contain highly influential
nodes (hubs); they even seem to be scale free. Those
communities appear to coincide with known cognitive
networks or functional subdivisions of the human brain,
while the small diameter of the network seems to allow
for efficient information processing.

The recent development of high-throughput tech- 
niques in molecular biology has led to an unprece- 
dented amount of data about the molecular inter- 
actions that occur in biological organisms, e.g., the 
metabolic networks of biochemical reactions between 
metabolic substrates, the interaction networks between 
proteins (the interactome), and the regulatory networks
that represent the interactions between genes. A know- 
ledge of the topologies of complex biological networks 
and their impact on biological processes is needed not 
only to understand those processes but also to develop 
more effective treatment strategies for complex dis- 
eases. Most of the available analyses are concerned 
with the application of the concept of centrality, which 
allows us to determine the essential protein or gene (or 
groups of them) in the organism and then apply the 
results to drug target identification. 


7.4 Other Applications 

Network analysis has expanded into many other areas. 
For example, it is used in sports to understand the style 
of play and the performance of teams. In football, by 
analyzing the network of passes between players it was 
found that a large clustering coefficient, or the diversity 
of the distribution of passes (entropy), correlates with 
the performance of the Spanish team that won the 2010 
FIFA World Cup. Similar analysis has been done for bas- 
ketball in the 2010 NBA Playoffs. Another interesting 
and recent application has been to map and under- 
stand the networks of cuisines, recipes, and ingredients 
in cooking. The availability of online recipe reposito- 
ries has allowed researchers to find similarities among 
different regional cuisines in China, and to unveil the 
flavor network in culinary ingredients. 

Since networks shape the information we receive and 
influence our behavior, a major concern is how much 
of our private life is encoded in the network structures 
and dynamics around us. For example, the information 
we leave behind in social networks can be used for iden- 
tity disclosure or to unveil private personal traits. Net- 
work analysis can be used in this context to understand 
how to perform privacy-preserving network analysis, 
typically by graph-modification algorithms in which 
some edges or nodes are changed and/or removed. The 
idea is that these modifications conserve enough of the 
structure to perform analysis globally while hindering 
the identification of individuals and/or personal traits. 

7.5 Software Tools 

Much of the progress in network analysis in recent 
years is down to the availability of not only data but 
also the software tools needed to analyze it and visu- 
alize it. The most well-known tools are those that 
include a graphical user interface, such as UCINet, 
Pajek, Cytoscape, NetMiner, and Gephi. More power- 
ful analyses of large networks can be done using pack- 
ages such as the igraph package (ported to R, Python, 
and C) or the NetworkX library for Python, tools that 
are also widely used in producing high-quality graphics 
(the figures in this article were produced using igraph). 

In the era of large data sets, efficient storage of net- 
works can be achieved using the graph structure of 
nodes, edges, and the relationships between them. A 
number of graph databases have been developed in 
which the network structure is used internally to store 
and query network data. Typically, graph databases 
scale more naturally to large graphs and are more powerful
for graph-like queries; for example, computing the
neighborhood of a node or the shortest path between 
two nodes in the graph is much easier and faster than 
in relational databases. Prominent examples of these 
databases are Neo4j and Sparksee. 
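
To make the kind of graph-like query mentioned above concrete, here is a minimal sketch of a shortest-path computation by breadth-first search on an in-memory adjacency structure. It is written in plain Python and does not use the API of any of the tools or databases named in this section; the function name and the example graph are illustrative assumptions.

```python
from collections import deque


def shortest_path(adj, source, target):
    """Return one shortest path from source to target, or None if unreachable.

    adj: dict mapping each node to a list of its neighbors (unweighted graph).
    """
    parent = {source: None}          # also serves as the set of visited nodes
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in adj[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


graph = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4], 6: []}
print(shortest_path(graph, 1, 5))
print(graph[4])                      # the neighborhood of node 4
```

With the network stored as adjacency lists, both queries only touch the nodes and edges near the nodes of interest, which is the intuition behind storing network data in graph form.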

8 Outlook 

In this article we have introduced the main mathemati- 
cal tools needed to analyze networks, and we have also 
illustrated how a large variety of complex systems can 
be studied by mapping the interdependencies between 
their constituent units into a network. But network 
analysis is not just a powerful methodology for analyz- 
ing those graphs; it is also a different way of conceiving 
those systems as collective structures with emergent 
behaviors that cannot be explained by the sum of their 
individual parts. For example, the world economy, the 
biological processes that occur in cells or ecosystems, 
mass mobilization, and the performance of organiza- 
tions all depend on the structure and dynamics of the 
whole network rather than on the sum of individual 
behaviors. Recent and future technologies will allow us 
to collect more data more quickly from those systems, 
allowing us to detect and promote more interdepen- 
dencies between the units of the system. For example, 
in the near future it will be possible to completely map 
the connectome in the brain. Or perhaps we will be 
able to monitor the activity within cities on an unprece- 
dented scale or reveal the interdependencies between 
financial and economic activities to help us prevent 
future economic and societal crises. We might therefore 
expect that the observation of systems will deliver more 
and more networked data. Network analysis will be the 
required tool for validating, modeling, and predicting 
the behavior of those systems. 

IV.19 Classical Mechanics

David Tong 


Classical mechanics is an ambitious subject. Its pur- 
pose is to predict the future and reconstruct the past, to 
determine the history of every particle in the universe. 

The fundamental principles of classical mechanics 
were laid down by Galileo and Newton in the sixteenth 
and seventeenth centuries. They provided a framework 
to explain vast swathes of the natural world, from plan- 
ets to tides to falling cats. In later centuries the frame- 
work of classical mechanics was reformulated, most 
notably by Lagrange and Hamilton. While this new way 
of viewing classical mechanics often makes it simpler 
to solve problems, its main advantage lies in the new 
mathematical perspective it offers on the subject. In 
particular, it reveals the door to the quantum world that 
lies beyond the classical. 

This article begins by reviewing the Newtonian frame- 
work, providing examples of important physical sys- 
tems that can be solved. It then goes on to describe 
the Lagrangian and Hamiltonian frameworks of classi- 
cal mechanics. Throughout, the emphasis of the arti- 
cle is more on the role that classical mechanics plays 
in the fundamental laws of physics than on practical 
engineering applications of the theory. 

1 Newtonian Mechanics 

Newtonian mechanics describes the motion of particles, 
which are defined to be objects of insignificant size. 
This means that if we want to say what a particle looks 
like at a given time t, the only information that we have 
to give is its position in $\mathbb{R}^3$, specified by a 3-vector
$\mathbf{r} = (x, y, z)$. The goal of classical dynamics is to determine
the vector function $\mathbf{r}(t)$ in any given situation. This, in
turn, tells us the velocity $\mathbf{v} = d\mathbf{r}/dt = \dot{\mathbf{r}}$ of the particle.

1.1 Newton’s Laws of Motion

The motion of a particle of mass $m$ at the position
$\mathbf{r}$ is governed by a second-order differential equation
known as Newton’s second law:

$$\frac{d\mathbf{p}}{dt} = \mathbf{F}(\mathbf{r}, \dot{\mathbf{r}}; t), \tag{1}$$

where $\mathbf{p} = m\dot{\mathbf{r}}$ is the momentum of the particle and $\mathbf{F}$
is the force, an input that must be specified. Examples
will be given below. In general, the force can depend
on both the position $\mathbf{r}$ and the velocity $\dot{\mathbf{r}}$. It can also
depend explicitly on time.

In most situations the mass is independent of time. In 
this case, Newton’s second law reduces to the familiar 
form $\mathbf{F} = m\mathbf{a}$, where $\mathbf{a} = \ddot{\mathbf{r}}$ is the acceleration. We will
not consider phenomena with time-dependent mass in 
the following. 

Because (1) is a second-order differential equation, 
a unique solution exists only if we specify two initial 
conditions. This has a consequence: if we are given a 
snapshot of some situation and asked what happens 
next, then there is no way of knowing the answer. It is 
not enough to be told only the positions of the particles 
at some point in time; we need to know their velocities 
too. 

1.1.1 Principle of Relativity 

Equation (1) is not quite correct as stated; we must add 
the caveat that it holds in an inertial frame. This is a 
choice of coordinates appropriate for an observer who 
sees a free particle travel in a straight line, also known 
as uniform motion. 

The statement that, in the absence of a force, a particle
travels in a straight line is sometimes called Newton’s
first law. However, setting $\mathbf{F} = 0$ in (1) already tells
us that free particles travel in straight lines. So is the 
first law nothing more than a special case of the sec- 
ond? In fact, a better formulation of the first law is the 
statement that inertial frames exist. This then sets the 
stage for the second law. 

Inertial frames are not unique. Indeed, there are 
infinitely many of them. Let S be an inertial frame in 
which the position of a particle is measured as r. There 
are then $10 = 3 + 3 + 3 + 1$ independent transformations
$S \to S'$ such that $S'$ is also an inertial frame. These
transformations are the following.

Spatial translation: $\mathbf{r}' = \mathbf{r} + \mathbf{c}$ for any constant $\mathbf{c}$.
Rotations: $\mathbf{r}' = O\mathbf{r}$, where $O$ is a $3 \times 3$ orthogonal matrix with $O^{T} O = I$.
Boosts: $\mathbf{r}' = \mathbf{r} + \mathbf{u}t$ for constant velocity $\mathbf{u}$.
Time translation: $t' = t + d$ for constant $d$.

If the motion of a particle is uniform in S, then it 
will also be uniform in S'. These transformations make 
up the Galilean group under which Newton's laws are 
invariant. The physical meaning of these transforma- 
tions is that position, direction, and velocity are rela- 
tive. But acceleration is not. One does not have to accel- 
erate relative to something else. It makes perfect sense 
to simply say that one is, or is not, accelerating. 
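
The claim that velocity is relative while acceleration is not can be checked in one line for a boost. The short calculation below is a sketch that differentiates the boost transformation twice with respect to time, using the constancy of the boost velocity.

```latex
\mathbf{r}' = \mathbf{r} + \mathbf{u}t
\quad\Longrightarrow\quad
\dot{\mathbf{r}}' = \dot{\mathbf{r}} + \mathbf{u},
\qquad
\ddot{\mathbf{r}}' = \ddot{\mathbf{r}} .
```

The two frames disagree about the velocity by the constant boost velocity, but they agree about the acceleration, so uniform motion in one frame is uniform motion in the other.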

1.1.2 Systems of Particles 

The discussion above is restricted to the motion of a 
single particle. It is simple to generalize to many particles;
we just add an index to everything in sight. Let particle
$i$ have mass $m_i$ and position $\mathbf{r}_i$, where $i = 1, \dots, N$,
with $N$ the number of particles. Newton’s law now reads

$$\dot{\mathbf{p}}_i = \mathbf{F}_i,$$

where $\mathbf{F}_i$ is the force on the $i$th particle. The novelty
is that forces can now act between particles. In
general, we can decompose the force as

$$\mathbf{F}_i = \sum_{j \neq i} \mathbf{F}_{ij} + \mathbf{F}_i^{\mathrm{ext}},$$

where $\mathbf{F}_{ij}$ is the force acting on the $i$th particle due to
the $j$th particle, while $\mathbf{F}_i^{\mathrm{ext}}$ is the external force on the
$i$th particle.

The total mass of the system is $M = \sum_i m_i$. We define
the center of mass as $\mathbf{R} = \sum_i m_i \mathbf{r}_i / M$ and the total
momentum as $\mathbf{P} = \sum_i \mathbf{p}_i = M\dot{\mathbf{R}}$.

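The center of mass motion is particularly simple.
From (1),

$$\begin{aligned}
\dot{\mathbf{P}} = \sum_i \dot{\mathbf{p}}_i &= \sum_i \sum_{j \neq i} \mathbf{F}_{ij} + \sum_i \mathbf{F}_i^{\mathrm{ext}} \\
&= \sum_{i<j} (\mathbf{F}_{ij} + \mathbf{F}_{ji}) + \sum_i \mathbf{F}_i^{\mathrm{ext}},
\end{aligned}$$

where, in the second line, we have rewritten the sum
to be over all pairs $i < j$. At this stage we invoke Newton’s
third law of motion: every action has an equal and
opposite reaction. Or, in equation form, $\mathbf{F}_{ij} = -\mathbf{F}_{ji}$. We
see that the first term vanishes and we are left with

$$M\ddot{\mathbf{R}} = \sum_i \mathbf{F}_i^{\mathrm{ext}}.$$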

This is identical in form to Newton’s second law (1) for 
a single particle. This is an important formula. It says 
that the center of mass of a system of particles acts as 
if all the mass were concentrated there. In other words, 
it does not matter if you throw a tennis ball or a very
lively cat; the center of mass of each traces the same 
path. 
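
The center-of-mass result can be checked numerically. The sketch below follows two particles moving in one dimension, joined by a spring (an internal force obeying the third law) and subject to uniform gravity (the external force), and returns the final center of mass. The integrator (velocity Verlet), the parameter values, and the function name are assumptions chosen for illustration.

```python
def center_of_mass_after(m1, m2, k, rest, z1, z2, v1, v2, g, dt, steps):
    """Integrate two spring-coupled particles under gravity; return the final
    center-of-mass height. Uses velocity-Verlet time stepping."""

    def forces(a, b):
        spring = k * ((b - a) - rest)    # force on particle 1 due to particle 2
        return spring - m1 * g, -spring - m2 * g   # third law: equal and opposite

    f1, f2 = forces(z1, z2)
    for _ in range(steps):
        z1 += v1 * dt + 0.5 * (f1 / m1) * dt * dt
        z2 += v2 * dt + 0.5 * (f2 / m2) * dt * dt
        n1, n2 = forces(z1, z2)
        v1 += 0.5 * (f1 + n1) / m1 * dt
        v2 += 0.5 * (f2 + n2) / m2 * dt
        f1, f2 = n1, n2
    return (m1 * z1 + m2 * z2) / (m1 + m2)


# The particles oscillate against each other, yet the center of mass should
# follow the free-fall parabola Z0 + V0*t - g*t^2/2 of a single particle.
m1, m2, g, dt, steps = 1.0, 3.0, 9.81, 0.001, 500
z1, z2, v1, v2 = 0.0, 1.5, 5.0, 1.0
t = dt * steps
Z0 = (m1 * z1 + m2 * z2) / (m1 + m2)
V0 = (m1 * v1 + m2 * v2) / (m1 + m2)
simulated = center_of_mass_after(m1, m2, 50.0, 1.0, z1, z2, v1, v2, g, dt, steps)
print(simulated, Z0 + V0 * t - 0.5 * g * t * t)
```

Because the spring forces cancel in the sum over particles, the two printed numbers agree up to rounding error, whatever the spring constant.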

1.2 Forces 

Newton’s second law is not useful until someone else 
tells us what the force F is in any given situation. Here 
we provide several examples of forces. 

1.2.1 Conservative Forces and Energy 

We start by considering time-independent forces that
are a function only of the position of the particle,
$\mathbf{F} = \mathbf{F}(\mathbf{r})$. Within this class there is a special subclass known as
conservative forces. These can be expressed as

$$\mathbf{F} = -\nabla V(\mathbf{r})$$

for some scalar function $V(\mathbf{r})$. Systems that admit a
potential of this form include gravitational, electrostatic,
and interatomic forces. The importance of conservative
forces lies in the fact that there is a conserved
quantity called energy:

$$E = \tfrac{1}{2} m \dot{\mathbf{r}} \cdot \dot{\mathbf{r}} + V(\mathbf{r}) = T + V.$$

(Recall that the scalar dot product is defined as
$\dot{\mathbf{r}} \cdot \dot{\mathbf{r}} = \dot{x}^2 + \dot{y}^2 + \dot{z}^2$.) The function $T$ is known as the kinetic
energy; $V$ is known as the potential energy. It is simple
to show that if the equation of motion (1) is obeyed,
then $E$ does not change over time:

$$\frac{dE}{dt} = (m\ddot{\mathbf{r}} + \nabla V) \cdot \dot{\mathbf{r}} = 0.$$

In section 2.3 we prove a result called Noether’s the- 
orem, which offers a deep explanation of why a con- 
served quantity called energy exists. 

Example (harmonic oscillator). Perhaps the simplest 
example of a conservative force is provided by the har- 
monic oscillator, which describes a particle attached to 
a spring. The particle moves in one dimension, with 
position $x(t)$ and has potential energy $V = \tfrac{1}{2} k x^2$, 
where $k > 0$ is known as the spring constant. The 
resulting force, $F = -kx$, is known as Hooke's law 
[III.15]. The equation of motion $m\ddot{x} = -kx$ has the 
general solution $x = A \cos \omega_0 t + B \sin \omega_0 t$, where $A$ 
and $B$ are integration constants. This describes a particle 
oscillating around the origin with angular frequency 
$\omega_0 = \sqrt{k/m}$. 
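
As a quick numerical check of this example, the sketch below integrates $m\ddot{x} = -kx$ and compares the result with the closed-form solution, fixing $A = x(0)$ and $B = \dot{x}(0)/\omega_0$ from the initial data. The velocity-Verlet integrator and all names are illustrative choices, not part of the article.

```python
import math


def simulate_oscillator(m, k, x0, v0, dt, steps):
    """Integrate m x'' = -k x with the velocity-Verlet scheme."""
    x, v = x0, v0
    a = -k * x / m
    for _ in range(steps):
        x += v * dt + 0.5 * a * dt * dt
        a_new = -k * x / m
        v += 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


def exact_oscillator(m, k, x0, v0, t):
    """x = A cos(w0 t) + B sin(w0 t) with A = x0 and B = v0 / w0."""
    omega0 = math.sqrt(k / m)
    return x0 * math.cos(omega0 * t) + (v0 / omega0) * math.sin(omega0 * t)


def energy(m, k, x, v):
    """E = T + V for the harmonic oscillator."""
    return 0.5 * m * v * v + 0.5 * k * x * x
```

Calling `energy` before and after `simulate_oscillator` gives a direct view of the conservation law: the two values should agree up to the small discretization error of the integrator.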

The harmonic oscillator is by far the most impor- 
tant system in all of theoretical physics. For any sys- 
tem described by a potential energy V, the stable equi- 
librium points are the minima of V. This means that 


if the particle is placed at an equilibrium point then, 
by construction, dV/dx = 0, ensuring that it remains 
at the equilibrium point for all time. Moreover, Tay- 
lor expanding the potential tells us that small devia- 
tions from equilibrium are generically governed by the 
harmonic oscillator. 
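
The Taylor-expansion step can be spelled out as follows. This is a short sketch in one dimension, where $x_0$ denotes an equilibrium point and $\xi = x - x_0$ a small displacement (both symbols are introduced here for illustration):

\[
V(x) = V(x_0) + V'(x_0)\,\xi + \tfrac{1}{2} V''(x_0)\,\xi^2 + O(\xi^3),
\qquad V'(x_0) = 0 ,
\]

so that, to leading order, the equation of motion becomes

\[
m\ddot{\xi} = -V''(x_0)\,\xi .
\]

At a minimum of $V$ with $V''(x_0) > 0$, this is the harmonic oscillator with effective spring constant $k = V''(x_0)$ and angular frequency $\omega_0 = \sqrt{V''(x_0)/m}$. The word "generically" covers the exceptional case $V''(x_0) = 0$, where higher-order terms dominate.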

1.2.2 Central Forces 

Central forces form a subclass of conservative forces 
in which the potential depends only on the distance to 
the origin: 

\[ V(\mathbf{r}) = V(r), \]

where $r = |\mathbf{r}|$. The resulting force always points along 
the line through the origin, either toward it or away from it: 

\[ \mathbf{F}(\mathbf{r}) = -\nabla V = -\frac{dV}{dr}\,\hat{\mathbf{r}}, \tag{3} \]

where $\hat{\mathbf{r}} = \mathbf{r}/r$ is the unit radial vector. In addition 
to energy, central forces enjoy another conserved 
quantity, known as angular momentum: 

\[ \mathbf{L} = m\mathbf{r} \times \dot{\mathbf{r}}, \]

where $\times$ denotes the cross product [I.2 §24]. Notice 
that, in contrast to the linear momentum $\mathbf{p} = m\dot{\mathbf{r}}$, the 
angular momentum $\mathbf{L}$ depends on the choice of origin. 
It is perpendicular to both the position and the 
momentum. 

When we take the time derivative of $\mathbf{L}$ we get two 
terms. But one of these contains $\dot{\mathbf{r}} \times \dot{\mathbf{r}} = 0$, and we are 
left with 

\[ \frac{d\mathbf{L}}{dt} = m\mathbf{r} \times \ddot{\mathbf{r}} = \mathbf{r} \times \mathbf{F}. \]

The quantity $\boldsymbol{\tau} = \mathbf{r} \times \mathbf{F}$ is called the torque. For a general 
force $\mathbf{F}$, we find an equation that is very similar to Newton’s 
law: $d\mathbf{L}/dt = \boldsymbol{\tau}$. However, for central forces (3), $\mathbf{F}$ 
is parallel to the position $\mathbf{r}$ and the torque vanishes. We 
find that angular momentum is conserved: 

\[ \frac{d\mathbf{L}}{dt} = 0. \]

As with energy, we will gain a better understanding of 
why $\mathbf{L}$ is conserved when we prove Noether’s theorem 
in section 2.3. For now, note that $\mathbf{L}$ is a constant vector 
and, by construction, $\mathbf{L} \cdot \mathbf{r} = 0$, which means that motion 
governed by a central potential takes place in a plane 
perpendicular to the vector $\mathbf{L}$. 

1.2.3 Gravity 

To the best of our knowledge, there are four fundamen- 
tal forces in nature. They are gravity, electromagnetism, 
the strong nuclear force, and the weak nuclear force. 
The two nuclear forces operate only on small scales, 
comparable with the size of the nucleus of an atom, 
and it makes little physical sense to discuss them with- 
out quantum mechanics. We will discuss the remaining 
two, starting with Newtonian gravity. 

Gravity is a conservative force. Consider a particle of 
mass M fixed at the origin. A particle of mass m moving 
in its presence experiences a central potential energy 


\[ V(r) = -\frac{GMm}{r}. \tag{4} \]

Here $G$ is Newton’s constant; it determines the strength 
of the gravitational force and is given by 
$G \approx 6.67 \times 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$. The force on the particle, $\mathbf{F} = -\nabla V$, 
points toward the origin and is proportional to $1/r^2$. 
For this reason it is called the inverse-square law. 

Motion governed by the potential (4) describes the 
orbits of planets around the sun and is known as 
the Kepler problem. It is not difficult to solve New- 
ton’s equation for this potential, and solutions can be 
found in all of the references in the further reading 
section below. Instead, here we present a very slick, 
but indirect, method to find the orbits of the planets. 
We have already seen that the conservation of angu- 
lar momentum ensures that the motion takes place in 
a plane. However, the potential (4) is special since it 
admits yet another conserved quantity known as the 
Laplace-Runge-Lenz vector: 


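\[ \mathbf{e} = \frac{\dot{\mathbf{r}} \times \mathbf{L}}{GMm} - \frac{\mathbf{r}}{r}. \]

With a little algebra, one can show that $d\mathbf{e}/dt = 0$. Its 
magnitude $e = |\mathbf{e}|$ satisfies 

\[ e r \cos\theta = \frac{L^2}{GMm^2} - r, \]

where $\theta$ is the angle between $\mathbf{e}$ and $\mathbf{r}$. This is the 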
equation for a conic section. The solution with e < 1 
describes an ellipse, with the origin at one of the foci. 
For the special case of e = 0, the orbit is a circle. Note 
that these orbits are closed: the particle periodically 
returns to the same position. This is not generally true 
for orbits in central potentials other than (4). The solu- 
tions with e > 1 are not closed orbits; these describe 
hyperbolas. 
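
The conservation laws used in this argument can be tested numerically. The sketch below is an illustration under stated assumptions rather than the article's method: it sets the orbiting mass $m$ aside by working with $\mathbf{L}/m = \mathbf{r} \times \dot{\mathbf{r}}$, in which case the Laplace-Runge-Lenz vector becomes $\mathbf{e} = \dot{\mathbf{r}} \times (\mathbf{r} \times \dot{\mathbf{r}})/GM - \mathbf{r}/r$, and it integrates the orbit with velocity Verlet. Function names are hypothetical.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def gravity_acceleration(r, GM):
    """Acceleration -GM r / |r|^3 from the potential V = -GMm/r."""
    d = norm(r)
    return tuple(-GM * x / d**3 for x in r)


def integrate_orbit(r, v, GM, dt, steps):
    """Velocity-Verlet integration of the Kepler problem."""
    a = gravity_acceleration(r, GM)
    for _ in range(steps):
        r = tuple(x + vx * dt + 0.5 * ax * dt * dt for x, vx, ax in zip(r, v, a))
        a_new = gravity_acceleration(r, GM)
        v = tuple(vx + 0.5 * (ax + bx) * dt for vx, ax, bx in zip(v, a, a_new))
        a = a_new
    return r, v


def eccentricity_vector(r, v, GM):
    """e = (v x L) / (G M m) - r / |r|, written with L / m = r x v."""
    h = cross(r, v)
    vxh = cross(v, h)
    d = norm(r)
    return tuple(c / GM - x / d for c, x in zip(vxh, r))
```

Starting from $\mathbf{r} = (1, 0, 0)$ and $\dot{\mathbf{r}} = (0, 1.2, 0)$ with $GM = 1$, the eccentricity vector is $(0.44, 0, 0)$, so the orbit is an ellipse; after integration, `cross(r, v)` and `eccentricity_vector(r, v, GM)` should be unchanged up to integration error.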

The elliptical solutions with e < 1 describe the plan- 
etary orbits, with the sun sitting at the focus. Nearly 
all planets in the solar system have e < 0.1, which 
means that their orbits are approximately circular. The 
only exception is Mercury, the closest planet to the sun, 
which has e ~ 0.2. In contrast, some comets have very 
eccentric orbits. The most famous, Halley’s Comet, has 
e ~ 0.97. 


We could try to extend our analysis to the problem 
of three objects all moving under their mutual grav- 
ity, but here things are dramatically harder. No gen- 
eral solution to the three-body problem is known, and 
to answer any practical questions one must resort to 
numerical methods. Historically, though, the study of 
the three-body problem has led to a number of new 
mathematical developments, including chaos. 


1.2.4 Electromagnetism 


Throughout the universe, at each point in space there 
exist two vectors, E and B. These are known as the 
electric and magnetic fields. The laws governing E and 
B are called Maxwell’s equations [III.22]. An important 
application of these equations is described in 
magnetohydrodynamics [IV.29]. 

For the purposes of this article, the role of E and B 
is to guide any particle that carries electric charge. The 
force experienced by a particle with electric charge q is 
called the Lorentz force: 

\[ \mathbf{F} = q\big(\mathbf{E}(\mathbf{r}) + \dot{\mathbf{r}} \times \mathbf{B}(\mathbf{r})\big). \tag{5} \]

Here, the notation $\mathbf{E}(\mathbf{r})$ and $\mathbf{B}(\mathbf{r})$ emphasizes that the 
electric and magnetic fields are functions of position. 
The term $\dot{\mathbf{r}} \times \mathbf{B}$ involves the vector cross product. 

In principle, both $\mathbf{E}$ and $\mathbf{B}$ can change in time. However, 
here we will consider only situations in which they 
are static. In this case, the electric field is always of the 
form 

\[ \mathbf{E} = -\nabla\phi \]

for some function $\phi$ called the electric potential or 
the scalar potential. This means that a static electric 
field gives rise to a conservative force. The electric 
potential is related to the potential energy by $V = q\phi$. 

As an example, consider the electric field due to a 
particle of charge $Q$ fixed at the origin. This is given by 

\[ \mathbf{E} = -\nabla\left(\frac{Q}{4\pi\epsilon_0 r}\right) = \frac{Q}{4\pi\epsilon_0}\,\frac{\hat{\mathbf{r}}}{r^2}. \tag{6} \]

The quantity $\epsilon_0$ has the grand name the permittivity 
of free space and is a constant given by 
$\epsilon_0 \approx 8.85 \times 10^{-12}\ \mathrm{m^{-3}\,kg^{-1}\,s^{2}\,C^{2}}$, where C stands for coulomb, 
the unit of electric charge. The quantity $\epsilon_0$ should be 
thought of as characterizing the strength of the electric 
interaction. 


The force between two particles with charges Q and 
q is F = qE, with E given by (6). This is known as 
the Coulomb force. It is a remarkable fact that, mathe- 
matically, the force is identical to the Newtonian grav- 
itational force arising from (4): both forces have the 
characteristic inverse-square form. This means that 
the classical solutions describing an electron orbiting 
a proton, say, are identical to those describing the 
planets orbiting the sun. 

We turn now to magnetic fields. These give rise to 
a velocity-dependent force (5) with magnitude proportional 
to the speed of the particle, but with direction 
perpendicular to the velocity of the particle. Because the 
force is always perpendicular to the velocity, it does no work: 
the magnetic force does not contribute to the energy of the particle. 

For a constant magnetic field B = (0,0, B), the 
equations of motion read 

\[ m\ddot{x} = qB\dot{y}, \qquad m\ddot{y} = -qB\dot{x}, \]

together with $\ddot{z} = 0$. This last equation tells us that the 
particle travels at constant velocity in the z-direction 
parallel to $\mathbf{B}$. The equations in the xy-plane, perpendicular 
to $\mathbf{B}$, are also easily solved to reveal that a magnetic 
field causes particles to move in circles: 

\[ x = \frac{v}{\omega}(\cos\omega t - 1) \quad\text{and}\quad y = -\frac{v}{\omega}\sin\omega t, \]

where $v$ is the speed and $\omega = qB/m$ is the cyclotron 
frequency. The time to undergo a full circle is fixed: 
$T = 2\pi/\omega$, independent of $v$. The bigger the circle, the 
faster the particle moves. 
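
The circular orbit can be verified against a direct numerical integration. In the sketch below (illustrative names; the classical fourth-order Runge-Kutta scheme is an assumed choice), the particle starts at the origin with velocity $(0, -v)$, which are the initial conditions matching the closed-form solution quoted above.

```python
import math


def cyclotron_exact(v, omega, t):
    """Closed-form orbit with x(0) = y(0) = 0 and initial velocity (0, -v)."""
    x = (v / omega) * (math.cos(omega * t) - 1.0)
    y = -(v / omega) * math.sin(omega * t)
    return x, y


def cyclotron_rk4(v, omega, dt, steps):
    """Integrate x'' = omega y', y'' = -omega x' with classical RK4.

    Returns the final state (x, y, vx, vy).
    """
    def deriv(s):
        x, y, vx, vy = s
        return (vx, vy, omega * vy, -omega * vx)

    s = (0.0, 0.0, 0.0, -v)
    for _ in range(steps):
        k1 = deriv(s)
        k2 = deriv(tuple(a + 0.5 * dt * b for a, b in zip(s, k1)))
        k3 = deriv(tuple(a + 0.5 * dt * b for a, b in zip(s, k2)))
        k4 = deriv(tuple(a + dt * b for a, b in zip(s, k3)))
        s = tuple(a + dt * (b + 2 * c + 2 * d + e) / 6.0
                  for a, b, c, d, e in zip(s, k1, k2, k3, k4))
    return s
```

The speed $\sqrt{v_x^2 + v_y^2}$ returned by the integrator should stay equal to $v$, reflecting the fact that the magnetic force does no work.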

1.2.5 Friction 

At a fundamental level, energy is always conserved. 
However, in many everyday processes this does not 
appear to be the case. At a microscopic level, the kinetic 
energy of an object is transferred to the motion of 
many atoms, where it manifests itself as an increase 
in temperature. If we do not care about what all these 
atoms are doing, it is useful to package our igno- 
rance into a single macroscopic force that we call fric- 
tion. In practice, friction forces are important in nearly 
all applications, not just in questions in fundamental 
physics. 

There are a number of different kinds of friction 
forces. When two solid objects are in contact they expe- 
rience dry friction. Experimentally, one finds that the 
complicated dynamics involved in friction can often be 
summarized by a single force that opposes the motion. 
This force has magnitude $F = \mu R$, where $R$ is the component 
of the reaction force normal to the floor and $\mu$ 
is a constant called the coefficient of friction. For steel 
rubbing against steel, $\mu \approx 0.6$. With a layer of grease 
added between the metals, it drops to $\mu \approx 0.1$. For steel 
rubbing against ice, it is as low as $\mu \approx 0.02$. 

A somewhat different form of friction, known as 
drag, occurs when an object moves through a fluid, 


either a liquid or a gas. The resistive force is opposite 
to the direction of the velocity and, typically, falls into 
one of two categories. 

Linear drag is described by $\mathbf{F} = -\gamma\mathbf{v}$, where the coefficient 
of friction, $\gamma$, is a constant. This form of drag 
holds for objects moving slowly through very viscous 
fluids. For a spherical object of radius $L$, there is a formula 
due to Stokes that gives $\gamma = 6\pi\eta L$, where $\eta$ is the 
dynamic viscosity of the fluid. 

In contrast, quadratic drag is described by 
$\mathbf{F} = -\gamma|\mathbf{v}|\mathbf{v}$, where, again, $\gamma$ is a constant. For quadratic 
friction, $\gamma$ is usually proportional to the surface area 
of the object, i.e., $\gamma \propto L^2$. Quadratic drag holds for 
fast-moving objects in less viscous fluids. This includes 
objects falling in air. 

The kind of drag experienced by an object is determined 
by the Reynolds number $R = \rho v L/\eta$, where $\rho$ 
is the density of the fluid and $\eta$ is the viscosity. For 
$R \ll 1$, linear drag dominates; for $R \gg 1$, quadratic 
friction dominates. 
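
A rough calculation shows how this criterion is used in practice. The sketch below takes the standard dimensionless form $R = \rho v L/\eta$; the cutoffs `low` and `high` that stand in for "$R \ll 1$" and "$R \gg 1$" are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def reynolds_number(rho, v, L, eta):
    """Dimensionless Reynolds number R = rho * v * L / eta (SI units)."""
    return rho * v * L / eta


def drag_regime(R, low=0.1, high=10.0):
    """Classify the drag law; the thresholds are illustrative assumptions."""
    if R < low:
        return "linear"
    if R > high:
        return "quadratic"
    return "intermediate"


def stokes_coefficient(eta, L):
    """Linear drag coefficient gamma = 6 pi eta L for a sphere of radius L."""
    return 6.0 * math.pi * eta * L
```

For example, with approximate values for air ($\rho \approx 1.2\ \mathrm{kg\,m^{-3}}$, $\eta \approx 1.8 \times 10^{-5}\ \mathrm{Pa\,s}$), a ball of size 3 cm moving at 10 m/s has $R$ of order $10^4$ and so experiences quadratic drag, consistent with the statement about objects falling in air.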

Systems that suffer any kind of friction are dissipa- 
tive. They do not have a conserved energy. 

To illustrate the effect of friction we return to the 
harmonic oscillator that we met in section 1.2.1. Adding 
a linear drag term to the equation of motion gives the 
damped harmonic oscillator: 

\[ m\ddot{x} = -\gamma\dot{x} - kx. \]

We can look for solutions of the form $x = e^{i\beta t}$. The 
results fall into one of the following three categories 
depending on the relative values of the natural frequency 
$\omega_0^2 = k/m$ and the magnitude of friction 
$\alpha = \gamma/2m$. 

Underdamped: $\omega_0^2 > \alpha^2$. The solution takes the form 
$x = e^{-\alpha t}(Ae^{i\Omega t} + Be^{-i\Omega t})$, where $\Omega = \sqrt{\omega_0^2 - \alpha^2}$. 
Here, the system oscillates with a frequency $\Omega < \omega_0$, 
while the amplitude of the oscillations decays 
exponentially. 

Overdamped: $\omega_0^2 < \alpha^2$. The general solution is now 
$x = e^{-\alpha t}(Ae^{\Omega t} + Be^{-\Omega t})$, where $\Omega = \sqrt{\alpha^2 - \omega_0^2}$. There 
are no oscillations. 

Critical damping: $\omega_0^2 = \alpha^2$. For this special case, the 
general solution is $x = (A + Bt)e^{-\alpha t}$. Again, there 
are no oscillations, but the system does achieve some 
mild linear growth for times $t < 1/\alpha$, after which it 
decays away. 

In each of these cases, the solutions tend asymptoti- 
cally to x = 0 at large times. Energy is not conserved. 
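
The three cases can be collected into a small helper. The sketch below is an illustration, not the article's notation: it rewrites each general solution in real form, fixing the constants from initial data $x(0) = x_0$ and $\dot{x}(0) = v_0$, and provides a finite-difference residual to confirm that the returned function solves the equation of motion. All names are hypothetical.

```python
import math


def damped_oscillator(m, gamma, k, x0, v0):
    """Return (regime, x) where x(t) solves m x'' = -gamma x' - k x."""
    alpha = gamma / (2.0 * m)
    omega0_sq = k / m
    c = v0 + alpha * x0  # combination fixed by the initial velocity
    if math.isclose(omega0_sq, alpha**2, rel_tol=1e-12):
        return "critical", lambda t: (x0 + c * t) * math.exp(-alpha * t)
    if omega0_sq > alpha**2:
        Omega = math.sqrt(omega0_sq - alpha**2)
        return "underdamped", lambda t: math.exp(-alpha * t) * (
            x0 * math.cos(Omega * t) + (c / Omega) * math.sin(Omega * t))
    Omega = math.sqrt(alpha**2 - omega0_sq)
    return "overdamped", lambda t: math.exp(-alpha * t) * (
        x0 * math.cosh(Omega * t) + (c / Omega) * math.sinh(Omega * t))


def residual(m, gamma, k, x, t, h=1e-3):
    """Central-difference estimate of m x'' + gamma x' + k x at time t."""
    xpp = (x(t + h) - 2.0 * x(t) + x(t - h)) / h**2
    xp = (x(t + h) - x(t - h)) / (2.0 * h)
    return m * xpp + gamma * xp + k * x(t)
```

With $m = k = 1$, the choices $\gamma = 0.2$, $\gamma = 2$, and $\gamma = 4$ land in the underdamped, critical, and overdamped regimes, respectively.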
2 The Lagrangian Formulation 

There are two reformulations of Newtonian mechanics: 
the first is due to Lagrange and the second to Hamil- 
ton. These new approaches are not particularly useful 
in modeling phenomena in which friction plays a dom- 
inant role. However, when energy is conserved these 
reformulations provide a new perspective on classical 
mechanics, one that is more elegant, is more power- 
ful, and, ultimately, provides the link to more sophisti- 
cated theories of physics, such as quantum mechanics. 
The starting point is one of the most profound ideas in 
theoretical physics: the principle of least action. 


2.1 The Principle of Least Action 


First, let us get our notation right. Part of the power 
of the Lagrangian formulation over the Newtonian 
approach is that it does away with vectors in favor 
of more general coordinates. We start by doing this 
trivially. Let us rewrite the positions of $N$ particles 
with coordinates $\mathbf{r}_i$ as $x^a$, where $a = 1, \dots, 3N = n$. 
Newton’s equations (2) then read 

\[ \dot{p}_a = -\frac{\partial V}{\partial x^a}, \tag{7} \]

where $p_a = m_a\dot{x}^a$. The number of degrees of freedom 
of the system is said to be n. These parametrize an n- 
dimensional space known as the configuration space C. 
Each point in C specifies a configuration of the system 
(i.e., the positions of all N particles). Time evolution 
gives rise to a curve in C. 

Define the Lagrangian to be a function of the positions 
$x^a$ and the velocities $\dot{x}^a$ of all the particles. It is 
formulated as follows: 

\[ L(x^a, \dot{x}^a) = T(\dot{x}^a) - V(x^a), \tag{8} \]

where $T = \tfrac{1}{2}\sum_a m_a(\dot{x}^a)^2$ is the kinetic energy and 
$V(x^a)$ is the potential energy. Note the minus sign 
between $T$ and $V$. To describe the principle of least 
action, we consider all smooth paths $x^a(t)$ in $C$ with 
fixed endpoints such that 

\[ x^a(t_i) = x^a_{\rm initial} \quad\text{and}\quad x^a(t_f) = x^a_{\rm final}. \]

Of all these possible paths, only one is the true path 
taken by the system. Which one? 

To each path let us assign a number called the action 
$S$, defined as 

\[ S[x^a(t)] = \int_{t_i}^{t_f} L(x^a(t), \dot{x}^a(t))\,dt. \]

The action is a functional (i.e., a function of the path 
that is itself a function). The principle of least action is 
the following result. 


Theorem (principle of least action). The actual path 
taken by the system is an extremum of S. 


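Proof. Consider varying a given path slightly: 

\[ x^a(t) \to x^a(t) + \delta x^a(t), \]

where we fix the endpoints of the path by demanding 
that $\delta x^a(t_i) = \delta x^a(t_f) = 0$. Then the change in the 
action is 

\[ \delta S = \delta\int_{t_i}^{t_f} L\,dt = \int_{t_i}^{t_f}\left(\frac{\partial L}{\partial x^a}\,\delta x^a + \frac{\partial L}{\partial\dot{x}^a}\,\delta\dot{x}^a\right)dt. \]

In this last equation we are using the summation convention, 
according to which any term with a repeated 
$a$ or $b$ index is summed. This means that this term, 
and similar terms in subsequent equations, should be 
thought of as including an implicit $\sum_{a=1}^{n}$. The second 
term above includes $\delta\dot{x}^a = d(\delta x^a)/dt$ and can be 
integrated by parts to get 

\[ \delta S = \int_{t_i}^{t_f}\left(\frac{\partial L}{\partial x^a} - \frac{d}{dt}\frac{\partial L}{\partial\dot{x}^a}\right)\delta x^a\,dt + \left[\frac{\partial L}{\partial\dot{x}^a}\,\delta x^a\right]_{t_i}^{t_f}. \]

But the final term vanishes since we have fixed the endpoints 
of the path so that $\delta x^a(t_i) = \delta x^a(t_f) = 0$. The 
requirement that the action is an extremum says that 
$\delta S = 0$ for all changes in the path $\delta x^a(t)$. We see that 
this holds if and only if 

\[ \frac{\partial L}{\partial x^a} - \frac{d}{dt}\left(\frac{\partial L}{\partial\dot{x}^a}\right) = 0. \tag{9} \]

These are known as the Euler-Lagrange equations 
[III.12]. To finish the proof we need only show that these 
equations are equivalent to Newton’s. From the definition 
of the Lagrangian (8), we have $\partial L/\partial x^a = -\partial V/\partial x^a$, 
while $\partial L/\partial\dot{x}^a = p_a$. It is then easy to see that equations 
(9) are indeed equivalent to (7). □ 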
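
The stationarity of the action can be seen in a small numerical experiment. The sketch below is an illustration under stated assumptions, not part of the proof: it takes a single particle in uniform gravity, $V = mgx$, whose Newtonian path between $x(0) = 0$ and $x(T) = 0$ is $x = \tfrac{1}{2}gt(T - t)$, discretizes the action, and perturbs the path by $\epsilon\sin(\pi t/T)$, which vanishes at both endpoints. All names are hypothetical.

```python
import math


def discrete_action(path, m, g, dt):
    """Discretized S = integral of (T - V) dt with V = m g x.

    The kinetic term uses the finite-difference velocity on each interval
    and the potential term uses the average height on that interval.
    """
    S = 0.0
    for x0, x1 in zip(path[:-1], path[1:]):
        kinetic = 0.5 * m * ((x1 - x0) / dt) ** 2
        potential = m * g * 0.5 * (x0 + x1)
        S += (kinetic - potential) * dt
    return S


def perturbed_path(eps, g, T, n):
    """Newtonian path g t (T - t) / 2 plus eps * sin(pi t / T) on n intervals."""
    dt = T / n
    return [0.5 * g * (k * dt) * (T - k * dt)
            + eps * math.sin(math.pi * k * dt / T)
            for k in range(n + 1)]
```

Evaluating the action for $\epsilon = +0.1$ and $\epsilon = -0.1$ should give the same value, showing that the first-order variation vanishes, and both values should exceed the action at $\epsilon = 0$ by about $\epsilon^2 m\pi^2/4T$. In this particular example the true path is a minimum.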


The principle of least action is an example of a variational 
principle of the type discussed further in calculus 
of variations [IV.6]. The path of the particle is 
viewed globally, as a whole rather than at any instant 
in time. At first this may appear somewhat teleological. 
Yet, as we have shown above, this perspective is entirely 
equivalent to the more local Newtonian methodology. 

The principle of least action is a slight misnomer. The 
proof requires only that $\delta S = 0$; it does not specify 
whether it is a maximum or a minimum of $S$. Since 
$L = T - V$, we can always increase $S$ by taking a very fast, 
wiggly path with $T \gg 0$, so the true path is never a 
maximum. However, it may be either a minimum or a 
saddle point. “Principle of stationary action” would be 
a more accurate, but less catchy, name. It is sometimes 
called “Hamilton's principle.” 

Somewhat astonishingly, all the fundamental laws of 
physics can be written in terms of an action principle. 
This includes electromagnetism, general relativity, the 
standard model of particle physics, and attempts to 
go beyond the known laws of physics such as string 
theory. There is also a beautiful generalization of the 
action principle to quantum mechanics that is due to 
Richard Feynman. It is known as the theory of path inte- 
grals, and in it the particle takes all paths with some 
weight determined by S. 

Returning to classical mechanics, there are two very 
important reasons for working with the Lagrangian for- 
mulation. The first is that the Euler-Lagrange equations 
hold in any coordinate system, while Newton’s laws are 
restricted to an inertial frame. The second is the ease 
with which we can deal with certain types of constraints 
in the Lagrangian system. We discuss the former of 
these reasons below, but first let us look at an example. 

Example (the Lorentz force). A particle with charge 
$q$ moving in a background electric and magnetic field 
experiences the Lorentz force law (5). A static electric 
field $\mathbf{E} = -\nabla\phi$ gives rise to a conservative force and fits 
naturally into the Lagrangian formulation. But in the 
presence of a magnetic field $\mathbf{B}$, it is less obvious that the 
equation of motion can be written using a Lagrangian. 
To do so, we first need to introduce the vector potential 
$\mathbf{A}$ and write (possibly time-dependent) magnetic and 
electric fields as 

\[ \mathbf{B} = \nabla \times \mathbf{A}, \qquad \mathbf{E} = -\nabla\phi - \frac{1}{c}\frac{\partial\mathbf{A}}{\partial t}, \]

where $c$ is the speed of light. One can then check 
that the Euler-Lagrange equations arising from the 
Lagrangian 

\[ L = \tfrac{1}{2}m\dot{\mathbf{r}} \cdot \dot{\mathbf{r}} - q\left(\phi - \frac{1}{c}\,\dot{\mathbf{r}} \cdot \mathbf{A}\right) \tag{10} \]

coincide with Newton's equations for the Lorentz force 
law (5). 

2.2 Changing Coordinate Systems 

We stressed in section 1 that Newton's equation of 
motion (2) holds only in inertial frames. In contrast, 
Lagrange’s equations hold in any coordinate system. 
This means that we could choose to work with any 
coordinates 

\[ q^a = q^a(x^1, \dots, x^{3N}; t), \]

where we have included the possibility of using a coordinate 
system that changes with time $t$. The Lagrangian 
can then be written as a function $L = L(q^a, \dot{q}^a, t)$, and 
the equations of motion (9) are equivalent to 

\[ \frac{d}{dt}\left(\frac{\partial L}{\partial\dot{q}^a}\right) - \frac{\partial L}{\partial q^a} = 0. \tag{11} \]


One can prove the equivalence of (9) and (11) through 
application of the chain rule. Alternatively, one can note 
that the principle of least action is a statement about 
the path taken and makes no mention of the coordi- 
nate used; it must therefore be true in all coordinate 
systems. 

The general variables $q^a$ are called generalized coordinates; 
the variables $p_a = \partial L/\partial\dot{q}^a$ are called generalized 
momenta. These coincide with what we usually call 
“momenta” only in Cartesian coordinates. 


2.2.1 Rotating Coordinate Systems 

We can illustrate the flexibility of the Lagrangian 
approach by deriving the fictitious forces at play in a 
noninertial, rotating coordinate system. Consider a free 
particle with Lagrangian 

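\[ L = \tfrac{1}{2}m\dot{\mathbf{r}} \cdot \dot{\mathbf{r}}. \]

Now measure the motion of the particle with respect to 
a coordinate system that is rotating with constant angular 
velocity $\boldsymbol{\omega} = (0, 0, \omega)$ about the z-axis. Denote the 
coordinates in the rotating frame as $\mathbf{r}' = (x', y', z')$. 
We have the relationships $z' = z$ and 

\[ x' = x\cos\omega t + y\sin\omega t, \qquad y' = y\cos\omega t - x\sin\omega t. \]

These expressions can be substituted directly into the 
Lagrangian to find $L$ in terms of the rotating coordinates: 

\[
\begin{aligned}
L &= \tfrac{1}{2}m\big[(\dot{x}' - \omega y')^2 + (\dot{y}' + \omega x')^2 + \dot{z}^2\big] \\
&= \tfrac{1}{2}m(\dot{\mathbf{r}}' + \boldsymbol{\omega} \times \mathbf{r}') \cdot (\dot{\mathbf{r}}' + \boldsymbol{\omega} \times \mathbf{r}').
\end{aligned}
\]

We can now derive the Euler-Lagrange equations in the 
rotating frame, differentiating $L$ with respect to $\mathbf{r}'$ and 
$\dot{\mathbf{r}}'$. We find that 

\[ m\big(\ddot{\mathbf{r}}' + \boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r}') + 2\boldsymbol{\omega} \times \dot{\mathbf{r}}'\big) = 0. \]

We learn that in the rotating frame the particle does 
not follow a straight line with $\ddot{\mathbf{r}}' = 0$. Instead, we find 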
the appearance of two extra terms in the equation of 
motion. 
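
The rotating-frame equation of motion can be checked without any algebra. The sketch below (illustrative names; derivatives are taken by central finite differences, an assumed shortcut) follows a free particle moving on a straight line in the inertial frame, transforms it to the rotating frame, and evaluates the planar components of $\ddot{\mathbf{r}}' + \boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r}') + 2\boldsymbol{\omega} \times \dot{\mathbf{r}}'$, which should vanish.

```python
import math


def rotating_coords(x0, y0, ux, uy, omega, t):
    """Rotating-frame coordinates of a free particle that moves on the
    straight line (x0 + ux t, y0 + uy t) in the inertial frame."""
    x, y = x0 + ux * t, y0 + uy * t
    c, s = math.cos(omega * t), math.sin(omega * t)
    return x * c + y * s, y * c - x * s


def fictitious_force_residual(x0, y0, ux, uy, omega, t, h=1e-3):
    """Planar components of r'' + w x (w x r') + 2 w x r' at time t."""
    xm, ym = rotating_coords(x0, y0, ux, uy, omega, t - h)
    xc, yc = rotating_coords(x0, y0, ux, uy, omega, t)
    xp, yp = rotating_coords(x0, y0, ux, uy, omega, t + h)
    vx, vy = (xp - xm) / (2 * h), (yp - ym) / (2 * h)
    ax, ay = (xp - 2 * xc + xm) / h**2, (yp - 2 * yc + ym) / h**2
    # For w = (0, 0, omega): w x (w x r') = -omega^2 (x', y') and
    # w x v' = (-omega vy, omega vx).
    rx = ax - omega**2 * xc - 2 * omega * vy
    ry = ay - omega**2 * yc + 2 * omega * vx
    return rx, ry
```

Both returned components should be zero up to the finite-difference error, for any choice of straight-line motion and any rotation rate.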

The term $-m\boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r}')$ is called the centrifugal force. 
It points outward in the plane perpendicular to $\boldsymbol{\omega}$ with 
magnitude $m\omega^2|\mathbf{r}'_\perp| = m|\mathbf{v}_\perp|^2/|\mathbf{r}'_\perp|$, where the subscript 
$\perp$ denotes the projection perpendicular to $\boldsymbol{\omega}$. 
This is the force you feel pinning you to the side of 
a car when you take a corner too fast. 

The term $-2m\boldsymbol{\omega} \times \dot{\mathbf{r}}'$ is called the Coriolis force. Notice 
that it is mathematically identical to the Lorentz force 
(5) for a particle moving in a constant magnetic field. 
In myth the Coriolis force determines the direction in 
which the water draining from a sink rotates, but in 
practice it is too small on domestic scales to have a 
noticeable effect. It is, however, significant on large 
length scales, where it dictates the circulation of oceans 
and the atmosphere. 

The centrifugal and Coriolis forces are referred to 
as fictitious forces because they are a result of the 
reference frame rather than any interaction. However, 
we should not underestimate their importance just 
because they are “fictitious.” According to Einstein’s 
theory of general relativity, gravity is also a fictitious 
force, on the same footing as the Coriolis and centrifu- 
gal forces. 

2.3 Noether’s Theorem 

A symmetry of a physical system is an invariance under 
a transformation. We have already met a number of 
symmetries in our discussion of inertial frames; the 
laws of (nonrelativistic) physics are invariant under the 
Galilean group, composed of rotations and translations 
in space and time. A beautiful theorem due to Emmy 
Noether relates these symmetries to conservation laws. 

Consider a one-parameter family of maps between 
generalized coordinates 

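\[ q^a(t) \to Q^a(s, t), \qquad s \in \mathbb{R}, \]

such that $Q^a(0, t) = q^a(t)$. This transformation is said 
to be a continuous symmetry of the Lagrangian $L$ if 

\[ \frac{\partial}{\partial s}L(Q^a(s,t), \dot{Q}^a(s,t), t) = 0. \]

Noether’s theorem states that for each such symmetry 
there exists a conserved quantity. The proof is 
straightforward. We compute 

\[ \frac{\partial L}{\partial s} = \frac{\partial L}{\partial Q^a}\frac{\partial Q^a}{\partial s} + \frac{\partial L}{\partial\dot{Q}^a}\frac{\partial\dot{Q}^a}{\partial s}. \]

Evaluated at $s = 0$, we have 

\[
\begin{aligned}
\frac{\partial L}{\partial s}\bigg|_{s=0} &= \frac{\partial L}{\partial q^a}\frac{\partial Q^a}{\partial s}\bigg|_{s=0} + \frac{\partial L}{\partial\dot{q}^a}\frac{\partial\dot{Q}^a}{\partial s}\bigg|_{s=0} \\
&= \frac{d}{dt}\left(\frac{\partial L}{\partial\dot{q}^a}\right)\frac{\partial Q^a}{\partial s}\bigg|_{s=0} + \frac{\partial L}{\partial\dot{q}^a}\frac{\partial\dot{Q}^a}{\partial s}\bigg|_{s=0},
\end{aligned}
\]

where we have used the Euler-Lagrange equations. The 
result is a total derivative: 

\[ \frac{\partial L}{\partial s}\bigg|_{s=0} = \frac{d}{dt}\left(\frac{\partial L}{\partial\dot{q}^a}\frac{\partial Q^a}{\partial s}\bigg|_{s=0}\right) = 0. \]

We learn that the quantity $\sum_a (\partial L/\partial\dot{q}^a)(\partial Q^a/\partial s)$, evaluated 
at $s = 0$, is constant for all time whenever the 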
equations of motion are obeyed. Notice that the proof 
of Noether's theorem is constructive: it does not just 
tell us about the existence of a conserved quantity, it 
also tells us what that quantity is. 


Homogeneity of Space 

Consider a closed system of N particles interacting 
through a potential $V(|\mathbf{r}_i - \mathbf{r}_j|)$ that, as the notation 
suggests, depends only on the relative distances 
between the particles $i, j = 1, \dots, N$. The Lagrangian 

\[ L = \frac{1}{2}\sum_i m_i\dot{\mathbf{r}}_i \cdot \dot{\mathbf{r}}_i - V(|\mathbf{r}_i - \mathbf{r}_j|) \tag{12} \]

has the symmetry of translation: $\mathbf{r}_i \to \mathbf{r}_i + s\mathbf{n}$ for any 
vector $\mathbf{n}$ and for any real number $s$, so 
$L(\mathbf{r}_i, \dot{\mathbf{r}}_i, t) = L(\mathbf{r}_i + s\mathbf{n}, \dot{\mathbf{r}}_i, t)$. This is the statement that space is 
homogeneous. From Noether’s theorem we can compute 
the conserved quantity associated with translations. 
It is $\sum_i \mathbf{p}_i \cdot \mathbf{n}$, which we recognize as the total linear 
momentum in the direction $\mathbf{n}$. Since this holds for 
all $\mathbf{n}$, we conclude that $\sum_i \mathbf{p}_i$ is conserved. The familiar 
fact that the total linear momentum is conserved is due 
to the homogeneity of space. 


Homogeneity of Time 

The laws of physics are the same today as they were 
yesterday. This invariance under time translations also 
gives rise to a conserved quantity. Mathematically, this 
means that $L$ is invariant under $t \to t + s$, or, in other 
words, $\partial L/\partial t = 0$. It is straightforward to check that 
this condition ensures that 

\[ H = \sum_a \dot{q}^a\frac{\partial L}{\partial\dot{q}^a} - L \]

is conserved. This is the energy of the system. We learn 
that time is to energy what space is to momentum, a les- 
son that resonates into the relativistic world of Einstein. 
We will meet the quantity H again in the next section, 
where, viewed from a slightly different perspective, it 
will be rebranded the Hamiltonian. 

One can also show that the isotropy of space, meaning 
invariance under rotations, gives rise to the conservation 
of angular momentum. In fact, suitably generalized, 
it turns out that all conservation laws in nature are 
related to symmetries through Noether’s theorem. This 
includes the conservation of electric charge and the 
conservation of particles such as protons and neutrons 
(known as baryons). 
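
The rotational case can be sketched in a few lines. As a worked illustration (not taken from the article), consider a single particle in the plane with $L = \tfrac{1}{2}m(\dot{x}^2 + \dot{y}^2) - V(\sqrt{x^2 + y^2})$ and the one-parameter family of rotations through angle $s$:

\[
X(s) = x\cos s - y\sin s, \qquad Y(s) = x\sin s + y\cos s,
\qquad
\frac{\partial X}{\partial s}\bigg|_{s=0} = -y, \quad \frac{\partial Y}{\partial s}\bigg|_{s=0} = x .
\]

Both the kinetic term and $\sqrt{x^2 + y^2}$ are unchanged by the rotation, so this is a continuous symmetry of $L$. The conserved quantity given by Noether's theorem is then

\[
\frac{\partial L}{\partial\dot{x}}\frac{\partial X}{\partial s}\bigg|_{s=0} + \frac{\partial L}{\partial\dot{y}}\frac{\partial Y}{\partial s}\bigg|_{s=0}
= m\dot{x}(-y) + m\dot{y}\,x = m(x\dot{y} - y\dot{x}),
\]

which is the $z$-component of the angular momentum $\mathbf{L} = m\mathbf{r} \times \dot{\mathbf{r}}$.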

3 The Hamiltonian Formulation 

The next step in the formulation of classical mechanics 
is due to Hamilton. The basic idea is to place the generalized 
coordinates $q^a$ and the generalized momenta 
$p_a = \partial L/\partial\dot{q}^a$ on a more symmetric footing. 

We can start by thinking pictorially. Recall that the 
coordinates $\{q^a\}$ define a point in $n$-dimensional configuration 
space $C$. Time evolution is a path in $C$. However, 
the state of the system is defined by $\{q^a\}$ and 
$\{p_a\}$ in the sense that this information will allow us to 
determine the state at all times in the future. The pair 
$\{q^a, p_a\}$ defines a point in $2n$-dimensional phase space. 
Since a point in phase space is sufficient to determine 
the future evolution of the system, paths in phase space 
can never cross. We say that evolution is governed by a 
flow in phase space. 


3.1 Hamilton’s Equations 

The Lagrangian $L(q^a, \dot{q}^a, t)$ is a function of the coordinates 
$q^a$, their time derivatives $\dot{q}^a$, and (possibly) time. 
We define the Hamiltonian to be the Legendre transform 
of the Lagrangian with respect to the $\dot{q}^a$ variables: 

\[ H(q^a, p_a, t) = \sum_{a=1}^{n} p_a\dot{q}^a - L(q^a, \dot{q}^a, t), \]

where $\dot{q}^a$ is eliminated from the right-hand side in 
favor of $p_a$ by using $p_a = \partial L/\partial\dot{q}^a = p_a(q^a, \dot{q}^a, t)$ and 
inverting to get $\dot{q}^a = \dot{q}^a(q^a, p_a, t)$. Now we look at the 
variation of $H$. Once again employing the summation 
convention, we have 

\[
\begin{aligned}
\delta H &= (\delta p_a\,\dot{q}^a + p_a\,\delta\dot{q}^a) - \left(\frac{\partial L}{\partial q^a}\,\delta q^a + \frac{\partial L}{\partial\dot{q}^a}\,\delta\dot{q}^a + \frac{\partial L}{\partial t}\,\delta t\right) \\
&= \delta p_a\,\dot{q}^a - \frac{\partial L}{\partial q^a}\,\delta q^a - \frac{\partial L}{\partial t}\,\delta t.
\end{aligned}
\]

But we know that this can be rewritten as 

\[ \delta H = \frac{\partial H}{\partial q^a}\,\delta q^a + \frac{\partial H}{\partial p_a}\,\delta p_a + \frac{\partial H}{\partial t}\,\delta t. \]

We now equate terms. We also make use of the Euler-Lagrange 
equations, which can be written as 
$\dot{p}_a = \partial L/\partial q^a$. The end result is 

\[ \dot{p}_a = -\frac{\partial H}{\partial q^a}, \qquad \dot{q}^a = \frac{\partial H}{\partial p_a}. \]

For Lagrangians with explicit time dependence these 
are supplemented with $-\partial L/\partial t = \partial H/\partial t$. These are 
Hamilton’s equations. We have replaced $n$ second-order 
differential equations by $2n$ first-order differential 
equations for $q^a$ and $p_a$. Recast in this manner, 
Hamilton's equations are ideally suited for dealing 
with initial-value problems rather than the boundary-value 
problems that are more natural in the Lagrangian 
formulation. 

A Particle in a Potential 

The simplest example is a single particle moving in a 
potential. The Lagrangian is $L = \tfrac{1}{2}m\dot{\mathbf{r}} \cdot \dot{\mathbf{r}} - V(\mathbf{r})$ and 
the momentum is $\mathbf{p} = m\dot{\mathbf{r}}$. The steps above give us the 
Hamiltonian 

\[ H = \mathbf{p} \cdot \dot{\mathbf{r}} - L = \frac{1}{2m}\,\mathbf{p} \cdot \mathbf{p} + V(\mathbf{r}). \]

Hamilton’s equations are simply $\dot{\mathbf{r}} = \mathbf{p}/m$ and 
$\dot{\mathbf{p}} = -\nabla V$, both of which are familiar: the first is the definition 
of momentum in terms of velocity; the second is 
Newton’s equation for this system. 

The Lorentz Force 

A charged particle moving in an electric and magnetic 
field is described by the Lagrangian (10). From this we 
can compute the momentum, $\mathbf{p} = m\dot{\mathbf{r}} + (q/c)\mathbf{A}$, which 
now differs from what we usually call momentum by 
the addition of the vector potential $\mathbf{A}$. The Hamiltonian 
is 

\[ H(\mathbf{p}, \mathbf{r}) = \frac{1}{2m}\left(\mathbf{p} - \frac{q}{c}\mathbf{A}\right) \cdot \left(\mathbf{p} - \frac{q}{c}\mathbf{A}\right) + q\phi. \]

One can check that Hamilton's equations reduce to 
Newton’s equations with the Lorentz force law (5). 
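
Because Hamilton's equations are first order, they lend themselves to direct numerical solution of initial-value problems. The sketch below is a minimal illustration for one degree of freedom with $H = p^2/2m + V(q)$; the symplectic Euler update is an assumed choice of integrator rather than anything prescribed by the article, and the function names are hypothetical.

```python
import math


def hamiltonian(q, p, m, V):
    """H = p^2 / (2m) + V(q) for a single degree of freedom."""
    return p * p / (2.0 * m) + V(q)


def symplectic_euler(q, p, m, dV, dt, steps):
    """Integrate dq/dt = dH/dp = p/m and dp/dt = -dH/dq = -V'(q)."""
    for _ in range(steps):
        p -= dV(q) * dt      # momentum update from dp/dt = -dH/dq
        q += (p / m) * dt    # position update from dq/dt = dH/dp
    return q, p
```

For example, with the pendulum-like potential `V = lambda q: -math.cos(q)` and `dV = math.sin`, the value of `hamiltonian` along the computed trajectory stays close to its initial value, with a deviation that shrinks in proportion to the step size `dt`.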

3.2 Looking Forward 

The advantage of the Hamiltonian formulation over 
the Lagrangian is not really a practical one. Instead, 
the true value of the formulation lies in what it tells 
us about the structure of classical mechanics. It is, at 
heart, a geometric formulation of classical mechanics 
and can be expressed in more abstract form using the 
language of symplectic geometry. Moreover, the Hamil- 
tonian framework provides a springboard for later 
developments, including chaos theory and integrabil- 
ity. (See the article on chaos [II.3] in this book.) Perhaps 
most importantly, the Hamiltonian offers the most 
direct link to more fundamental theories of physics and 
particularly to quantum mechanics. 
Further Reading 

This article is based on two lecture courses given by 
the author to undergraduates at the University of Cambridge. 
Full lecture notes for both courses can be downloaded 
from the author's Web page: 
www.damtp.cam.ac.uk/user/tong/teaching.html. 

Lev Landau and Evgeny Lifshitz’s Mechanics (Butter- 
worth-Heinemann, Oxford, 3rd edn, 1982) is one of 
the most concise, elegant, and conceptually gorgeous 
textbooks ever written. 

A more modern, pedagogical approach to Newtonian 
mechanics can be found in Mary Lunn’s A First Course in 
Mechanics (Oxford Science Publications, Oxford, 1991). 

A good introduction to the Lagrangian and Hamil- 
tonian formulation is Louis Hand and Janet Finch’s 
Analytic Mechanics (Cambridge University Press, Cam- 
bridge, 1998). 


IV.20 Dynamical Systems 

Philip Holmes 


1 Introduction 

The theory of dynamical systems describes the con- 
struction and analysis of models for things that move 
or evolve over time and space. It combines analytical, 
geometrical, topological, and numerical methods for 
studying differential equations and iterated mappings. 
These methods stem from the work of Newton and his 
successors, the great natural philosophers of the eigh- 
teenth and nineteenth centuries, and in particular from 
Henri Poincaré. As such, the study of dynamical systems 
is normal mathematics rather than the paradigm 
shift that some popular accounts have claimed for 
chaos theory [II.3]. Nonetheless, problems from the 
applied sciences have continued to strongly influence 
and motivate it, especially over the past half century 
(see Aubin and Dahan Dalmedico (2002) for a sociohistorical 
discussion of developments in the turbulent 
decade around 1970). 

It is a capacious field, implying different things to 
different people, including deterministic and stochas- 
tic systems of finite or infinite dimensions, ergodic 
theory [II.3], and holomorphic dynamics (the study of 
iterated functions on the complex plane). I shall focus 
on ordinary differential equations (ODEs) and iterated 
maps defined on Euclidean space $\mathbb{R}^n$, but note that 
the theory generalizes to manifolds, much of it gen- 
eralizes to infinite dimensions, and some of it general- 
izes to stochastic systems. In contrast to classical ODE 


theory, which focuses on specific initial or boundary- 
value problems, dynamical systems theory brings a 
qualitative and geometrical approach to the analysis 
of nonlinear ODEs, addressing the existence, stability, 
and global behavior of sets of solutions, rather than 
seeking exact or approximate expressions for individ- 
ual solutions (see ordinary differential equations 
[IV.2]). 

We consider systems of ODEs (1) and discrete mappings (2):

$$\dot{x}_j = f_j(x_1, x_2, \ldots, x_n; \mu), \tag{1}$$

$$x_j(l+1) = F_j(x_1(l), \ldots, x_n(l); \mu), \tag{2}$$

where $j = 1, \ldots, n$; $\dot{x}_j$ denotes the time derivative; $f_j$
and $F_j$ are smooth, real-valued functions; the $x_j$ are
state variables; and $\mu$ is a control parameter. In solving
(1) or (2) with given initial conditions $x(0)$ to obtain
orbits $x(t) = (x_1(t), \ldots, x_n(t))$ or $\{x(l)\}_{l=0}^{\infty}$, $\mu$ is kept
fixed.

In studying the ODE (1), one seeks to describe the
behavior of the flow map

$$x(t, x(0)) = \phi_t(x(0)) \quad \text{or} \quad \phi_t : U \to \mathbb{R}^n, \tag{3}$$

generated by the vector field $f(x) = (f_1(x), \ldots, f_n(x))$,
which transports initial points $x(0) \in U \subseteq \mathbb{R}^n$
to their images at time $t$: $\phi_t(x(0))$. If $\phi_t$ can be found,
then, fixing a time interval $t = T$, (1) reduces to (2),
but explicit formulas can be derived only in exceptional
cases, and in any case the study of iterated maps is no
less complicated than that of ODEs. Of course, numerical
algorithms for ODEs can provide excellent approximations
of $\phi_t$ (see numerical solution of ordinary
differential equations [IV.12]).
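
The passage from the ODE (1) to a mapping of type (2) through a time-$T$ flow map can be sketched numerically. The following is a minimal illustration rather than a method taken from the text: it assumes a classical fourth-order Runge-Kutta integrator, and the names `rk4_step`, `flow_map`, and the harmonic-oscillator test field `oscillator` are hypothetical choices made for the example.

```python
import math

def rk4_step(f, x, h):
    """Advance the state x by one classical Runge-Kutta step of size h."""
    k1 = f(x)
    k2 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h * (a + 2 * b + 2 * c + d) / 6
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def flow_map(f, x0, T, steps=1000):
    """Approximate the time-T flow map applied to x0 for x' = f(x)."""
    x = list(x0)
    h = T / steps
    for _ in range(steps):
        x = rk4_step(f, x, h)
    return x

def oscillator(x):
    """Harmonic oscillator x1' = x2, x2' = -x1 (illustrative vector field)."""
    return [x[1], -x[0]]

# Iterating the time-T map turns the ODE into an iterated mapping.
T = math.pi / 2                       # a quarter period of the oscillator
orbit = [[1.0, 0.0]]
for _ in range(4):
    orbit.append(flow_map(oscillator, orbit[-1], T))
```

For this linear field the exact flow is a rotation, so four applications of the quarter-period map should return the initial point up to integration error; for a nonlinear field the same loop yields the orbit of the induced map.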

Before describing the theory, I sketch some historical
threads through the work of A. A. Andronov, V. I.
Arnold, G. D. Birkhoff, A. N. Kolmogorov, S. Smale, and 
other key figures. Two motivating examples are then 
introduced: one physical and one mathematical. 

2 A Brief History 

Dynamical systems theory began with Poincaré’s work
on differential equations and celestial mechanics from
1879 to 1912. In addition to special methods for
two-dimensional ODEs, many other central concepts first
appeared in Poincaré’s work, including invariant manifolds
(smooth hypersurfaces composed of families of
orbits), first-return (Poincaré) maps for the study of
periodic motions, bifurcations, coordinate changes to
normal forms that simplify analyses, and perturbation
methods. Notably, Poincaré realized that, due to the
presence of “doubly asymptotic” points or homoclinic
and heteroclinic orbits, certain differential equations
describing mechanical systems with $N \geq 2$ degrees
of freedom were not integrable. More precisely, they
do not possess enough independent, analytic functions 
of x that remain constant on solutions, which would 
imply that the geometry of invariant manifolds is rel- 
atively simple. During this period A. M. Lyapunov also 
made important contributions to stability theory. 

Birkhoff extended Poincaré’s work, in particular proving
that maps from an annulus to itself having two periodic
orbits with different periods also contain complicated
limit sets that separate the domains of attraction
of those orbits. This prepared the way for Cartwright 
and Littlewood’s proof that the van der Pol equation— 
a periodically forced nonlinear oscillator— possesses 
infinitely many periodic orbits and a set of nonperiodic 
orbits “of the power of the continuum.” N. Levinson 
subsequently drew this to Smale’s attention, prompting
his construction of the horseshoe map: a prototypical
example with an unstable chaotic set, which is
nonetheless structurally stable, implying that the flows 
of the original and perturbed systems are topologically 
equivalent (homeomorphic). The qualitative behavior 
of structurally stable vector fields and maps survives 
small perturbations. 

Structural stability had been introduced by Andronov 
and L. S. Pontryagin in 1937 under the name “systèmes
grossiers” (coarse systems). From this perspective,
a bifurcation occurs when a system becomes structurally
unstable as a parameter varies; its behavior
changes as one passes through the bifurcation point. 
Andronov’s group in Gorky (now Nizhni Novgorod) 
also did important work in bifurcation theory. M. M. 
Peixoto’s characterization of structurally stable flows 
on two-dimensional manifolds required that they pos- 
sess only finitely many fixed points and periodic orbits, 
leading to Smale’s conjecture that the same should hold 
in higher dimensions. The horseshoe map, with its infi- 
nite set of periodic points, can be seen as a return 
map for a three-dimensional flow and thus provided a 
counterexample to this conjecture. Moreover, it showed 
that the chaos glimpsed by Poincaré was prevalent in
ODEs of dimensions $n \geq 3$ as well as in maps of dimensions
$n \geq 2$. Smale’s influential work in the 1960s
introduced mathematicians to the field, but these ideas 
did not reach the applied mathematical mainstream for 
some time. 

As was common in the Soviet Union, Andronov’s 
group maintained strong connections between abstract
theory and applications, in its case, nonlinear oscillators
and waves, electronic circuits, and control theory.
In Moscow, Kolmogorov and his students (including 
D. V. Anosov, Ya. G. Sinai, and Arnold) did foundational 
work on ergodic theory, billiards, and geodesic flows, 
with links to mathematical physics. Smale’s visits in 
1961 and during the International Congress of Math- 
ematicians in 1966 helped introduce their work to the 
wider mathematical world. 

Lorenz's paper on a three-dimensional ODE modeling
Rayleigh–Bénard convection was written almost independently
of the work described above, although in presenting
his discovery of sensitive dependence on initial
conditions in 1963, Lorenz appealed to Birkhoff's work. 
An earlier, extra-mathematical, discovery had taken 
place in 1961 when Ueda, a graduate student in elec- 
trical engineering at Kyoto University, observed irreg- 
ular motions in analogue computer simulations of a 
periodically forced van der Pol-Duffing equation. 

3 Two Dynamical Systems 

To motivate the theory described below, we first intro- 
duce a problem from classical mechanics and then a 
mathematical toy that is, perhaps surprisingly, related 
to it. 


3.1 The Double Pendulum 

Consider a pendulum comprising two rigid links, the 
first rotating about a fixed pivot, the second pivoting 
about the end of the first (see figure 1(a)). Under Newtonian
mechanics, the four angles and angular velocities
$\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2$ describe the pendulum’s state space,
and neglecting air resistance and friction, conservation 
of energy implies that all motions started with given 
potential and kinetic energy lie on a three-dimensional 
subset of state space (typically, a smooth manifold). 
However, unlike the one-link pendulum, its motions 
are not generally periodic, and small changes in initial 
conditions yield orbits that rapidly diverge. Part (b) of 
figure 1 illustrates this sensitive dependence. 


3.2 The Doubling Machine 


Next we describe a piecewise-linear mapping defined
on the interval [0,1] by the rule

$$h(x) = \begin{cases} 2x & \text{if } 0 \leq x < \tfrac{1}{2}, \\ 2x - 1 & \text{if } \tfrac{1}{2} \leq x \leq 1 \end{cases} \tag{4}$$

(see figure 2). An orbit of $h$ is the sequence $\{x_n\}_{n=0}^{\infty}$
obtained by repeatedly doubling the initial value $x_0$ and
subtracting the integer part at each step:

$$0.2753 \mapsto 0.5506 \mapsto 1.1012 = 0.1012 \mapsto 0.2024 \mapsto \cdots.$$

Figure 1 (a) The double pendulum. (b) The numbers of
full turns, $\theta_2$, executed by three different orbits released
from rest with $\theta_1(0) = 90^\circ$, $\theta_2(0) = -90^\circ$, and
$\theta_2(0) = -(90 \pm 10^{-6})^\circ$.


To understand the sensitive dependence on initial
conditions and its consequences, we represent the
numbers between 0 and 1 in binary form:

$$x_0 = \frac{a_1}{2} + \frac{a_2}{2^2} + \frac{a_3}{2^3} + \cdots = \sum_{j=1}^{\infty} \frac{a_j}{2^j},$$

where each coefficient $a_j$ is either 0 or 1. Thus,

$$x_1 = h(x_0) = 2 \times \sum_{j=1}^{\infty} \frac{a_j}{2^j} = a_1 + \frac{a_2}{2} + \frac{a_3}{2^2} + \cdots,$$

but since the integer part is removed on every iteration,
we have

$$x_1 = \frac{a_2}{2} + \frac{a_3}{2^2} + \frac{a_4}{2^3} + \cdots,$$

and in general,

$$x_k = \frac{a_{k+1}}{2} + \frac{a_{k+2}}{2^2} + \cdots = \sum_{j=1}^{\infty} \frac{a_{j+k}}{2^j}.$$

Applying $h$ is equivalent to shifting the “binary point”
and dropping the leading coefficient,

$$(a_1 a_2 a_3 a_4 \cdots) \mapsto (a_2 a_3 a_4 \cdots),$$

just as multiplication by 10 shifts a decimal point. Were
$x_0$ known to infinite accuracy, with all the $a_k$ specified,
the current state, $x_k$, at each step would also
be known exactly. But given only the first $N$ coefficients
$(a_1, a_2, \ldots, a_N)$, after $N$ iterations, one cannot
even determine whether $x_{N+1}$ lies above or below $\tfrac{1}{2}$.
Moreover, if two points first differ at the $N$th binary
place and thus lie within $(\tfrac{1}{2})^{N-1}$ of each other, after
$N - 1$ iterations they lie on opposite sides of $\tfrac{1}{2}$ and
thereafter behave essentially independently. Repeated
doubling amplifies small differences.

The binary representation exemplifies symbolic dynamics.
To every infinite sequence of zeros and ones
there corresponds a point in [0,1], and vice versa.
Hence, any random sequence corresponds to a state
$x_0 \in [0,1]$ whose orbit $h^k(x_0)$ realizes that sequence;
the map $h$ has infinitely many orbits whose itineraries
are indistinguishable from random sequences. It also
has infinitely many periodic orbits, corresponding to
periodic sequences. These can be enumerated by listing
all distinct sequences of lengths 1, 2, 3, ... that contain
no subsequences of smaller period, showing that there
are approximately $2^N/N$ orbits of period $N$ and a countable
infinity in all. Nonetheless, since numbers picked
at random are almost always irrational (they form a set
of full measure), almost all orbits are nonperiodic (see
section 4.5).
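
The shift interpretation of the doubling map and its sensitivity to initial conditions can be checked with exact rational arithmetic. The sketch below is an illustration added for concreteness, not part of the original construction; the helper names `h`, `orbit`, and `binary_digits`, the perturbation size $10^{-6}$, and the use of Python's `fractions.Fraction` are assumptions of the example.

```python
from fractions import Fraction

def h(x):
    """Doubling map (4): double x and drop the integer part."""
    y = 2 * x
    return y - 1 if y >= 1 else y

def orbit(x0, n):
    """Return the list [x0, h(x0), ..., h^n(x0)]."""
    xs = [x0]
    for _ in range(n):
        xs.append(h(xs[-1]))
    return xs

def binary_digits(x, n):
    """First n binary coefficients a_1, ..., a_n of x in [0, 1)."""
    digits = []
    for _ in range(n):
        x *= 2
        bit = 1 if x >= 1 else 0
        digits.append(bit)
        x -= bit
    return digits

x0 = Fraction(2753, 10000)             # the example orbit from the text
xs = orbit(x0, 25)

# Applying h drops the leading binary coefficient of x0.
shifted = binary_digits(h(x0), 12) == binary_digits(x0, 13)[1:]

# Two orbits that start 10**-6 apart: the gap doubles at each step
# while both points stay on the same branch of h.
ys = orbit(x0 + Fraction(1, 10**6), 25)
gaps = [abs(a - b) for a, b in zip(xs, ys)]
```

Exact fractions are used here because floating-point doubling discards one bit of the initial condition per iteration, so a float orbit of this map collapses to zero after about fifty steps.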


4 Dynamical Systems Theory 

As noted above, dynamical systems theory emphasizes 
the study of the global orbit structure or phase portrait, 
its dependence on parameters, and the description of
qualitative properties such as the existence of closed 
orbits (periodic solutions). We start by describing some 
important tools and concepts. 

4.1 Judicious Linearization

While few nonlinear systems can be “solved” completely,
much can be deduced from linear analysis. The
linearization of the ODE (1) near a fixed point or equilibrium
$x_e$, where $f(x_e) = 0$, is obtained by substituting
$x = x_e + \xi$ into (1), expanding in a Taylor series, and
neglecting quadratic and higher-order terms to obtain

$$\dot{\xi} = Df(x_e)\,\xi \quad \text{for } |\xi| \ll 1. \tag{5}$$

Here, $Df(x_e)$ is the $n \times n$ Jacobian matrix of first
partial derivatives $\partial f_i/\partial x_j$ evaluated at $x_e$.

Linear ODEs with constant coefficients, like (5), are
completely soluble in terms of elementary functions.
One assumes the exponential form

$$\xi = \sum_{j=1}^{n} v_j \exp(\lambda_j t)$$

and computes the eigenvalues $\lambda_j$ and eigenvectors $v_j$
of $Df(x_e)$. If $Df(x_e)$ has $n$ linearly independent eigenvectors,
then any solution may be expressed as a linear
combination $\xi(t) = \sum_{j=1}^{n} c_j v_j \exp(\lambda_j t)$, and the
$c_j$ can be uniquely determined by initial conditions.
Eigenvalues and eigenvectors can, of course, be complex
numbers and vectors, but using Euler’s formula
$\exp(\pm it) = \cos t \pm i \sin t$, and allowing complex $c_j$,
solutions to real-valued ODEs can be written as combinations
of exponential and trigonometrical functions. (If
$Df(x_e)$ has fewer than $n$ linearly independent eigenvectors,
generalized eigenvectors are required and
terms of the form $t^k \exp(\lambda_j t)$ appear.)

If every eigenvalue of $Df(x_e)$ has nonzero real part,
$x_e$ is called a hyperbolic or nondegenerate fixed point.
Excepting special cases, like the energy-conserving pendulum,
fixed points are typically hyperbolic, and their
stability in the original nonlinear system (1) can be
deduced from the linearized system.

A fixed point $x_e$ of (1) is Lyapunov or neutrally stable
if for every neighborhood $U \ni x_e$ there is a neighborhood
$V \subseteq U$, also containing $x_e$, such that every solution
$x(t)$ of (1) starting in $V$ remains in $U$ for all $t > 0$.
If $x(t) \to x_e$ as $t \to \infty$ for all $x(0) \in V$, then $x_e$ is
asymptotically stable. If $x_e$ is not stable, it is unstable.
More descriptively, if all nearby orbits approach $x_e$, it
is called a sink; if all recede from it, it is a source; and
if some approach and some recede, it is a saddle point.
Sinks are the simplest attractors: see section 4.5.

Figure 3 Stable and unstable manifolds. (a) Near $x_e$ the local
manifolds can be expressed as graphs, as in (7). (b) Globally,
stable and unstable manifolds may intersect, forming
a homoclinic orbit.

Choosing the neighborhood $V$ small enough that the
linear part $Df(x_e)\xi$ dominates the higher-order terms
that were ignored in (5), it follows that, if all the eigenvalues
of $Df(x_e)$ have strictly negative real parts, then
$x_e$ is asymptotically stable but, if at least one eigenvalue
has strictly positive real part, then $x_e$ is unstable.
When one or more eigenvalues have zero real part, the
local behavior is determined by the leading nonlinear
terms, as described in section 4.3, but if $x_e$ is hyperbolic,
then the entire orbit structure of the linearized
system (5) near $x_e$ is topologically equivalent to that of
the nonlinear ODE (1). Similar
results hold for the discrete mapping (2), and systems
can also be linearized near periodic and other orbits.
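
The eigenvalue test for hyperbolic fixed points can be sketched for planar systems as follows. This is an illustrative example rather than anything prescribed by the text: the damped pendulum $\dot{x}_1 = x_2$, $\dot{x}_2 = -\sin x_1 - c\,x_2$ is an assumed test case, the Jacobian is approximated by central differences, and the names `jacobian`, `eigenvalues_2x2`, `classify`, and `pendulum` are hypothetical.

```python
import cmath
import math

def jacobian(f, x, eps=1e-6):
    """Central-difference approximation of Df(x) for f: R^n -> R^n."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        xp, xm = list(x), list(x)
        xp[j] += eps
        xm[j] -= eps
        fp, fm = f(xp), f(xm)
        for i in range(n):
            J[i][j] = (fp[i] - fm[i]) / (2 * eps)
    return J

def eigenvalues_2x2(J):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

def classify(J, tol=1e-6):
    """Label a planar fixed point by the real parts of its eigenvalues."""
    re = [lam.real for lam in eigenvalues_2x2(J)]
    if any(abs(r) <= tol for r in re):
        return "nonhyperbolic"        # linearization alone is inconclusive
    if all(r < 0 for r in re):
        return "sink"
    if all(r > 0 for r in re):
        return "source"
    return "saddle"

def pendulum(c):
    """Pendulum with damping coefficient c (illustrative vector field)."""
    return lambda x: [x[1], -math.sin(x[0]) - c * x[1]]

damped = pendulum(0.5)
kinds = {
    "down": classify(jacobian(damped, [0.0, 0.0])),
    "up": classify(jacobian(damped, [math.pi, 0.0])),
    "undamped": classify(jacobian(pendulum(0.0), [0.0, 0.0])),
}
```

The undamped case is the energy-conserving pendulum mentioned above: its hanging equilibrium has purely imaginary eigenvalues, so the sketch reports it as nonhyperbolic instead of drawing a stability conclusion from the linear part.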

4.2 Invariant Manifolds 

Remarkably, the decomposition of the state space
into invariant subspaces for the linearized system (5)
also holds for the nonlinear system near $x_e$. Suppose
that $Df(x_e)$ has $s \leq n$ and $u = n - s$ eigenvalues
$\lambda_1, \ldots, \lambda_s$ and $\lambda_{s+1}, \ldots, \lambda_{s+u}$ with strictly negative
and strictly positive real parts, respectively, and
define the linear subspaces $E^s = \mathrm{span}\{v_1, \ldots, v_s\}$ and
$E^u = \mathrm{span}\{v_{s+1}, \ldots, v_{s+u}\}$ spanned by their (generalized)
eigenvectors. As $t$ increases, orbits of (5) in the
stable subspace $E^s$ decay exponentially, while those in
the unstable subspace $E^u$ grow.

The stable manifold theorem states that, near a
hyperbolic fixed point $x_e$, equations (1) and (2) possess
local stable and unstable manifolds $W^s_{\mathrm{loc}}(x_e)$, $W^u_{\mathrm{loc}}(x_e)$
of dimensions $s$ and $u$, respectively, tangent at $x_e$ to
$E^s$ and $E^u$. The former consists of all orbits that start
and remain near $x_e$ for all future time and approach it
as $t \to \infty$:

$$W^s_{\mathrm{loc}}(x_e) = \{x \in V \mid \phi_t(x) \to x_e \text{ as } t \to \infty \text{ and } \phi_t(x) \in V \text{ for all } t \geq 0\}; \tag{6}$$

the unstable manifold $W^u_{\mathrm{loc}}(x_e)$ is defined similarly,
with the substitutions “past time” and “$t \to -\infty$.” These
smooth, curved subspaces locally resemble their linear
counterparts $E^s$ and $E^u$ (see figure 3(a)).

Near $x_e$ the local stable and unstable manifolds can
be expressed as graphs over $E^s$ and $E^u$, respectively.
Letting $E^{s\perp}$ denote the $(n - s)$-dimensional orthogonal
complement to $E^s$ and letting $y \in E^s$ and $z \in E^{s\perp}$ be
local coordinates, we can write

$$W^s_{\mathrm{loc}}(x_e) = \{(y, z) \mid (y, z) \in B(0),\ z = g(y)\} \tag{7}$$

for a smooth function $g : E^s \to E^{s\perp}$. We cannot generally
compute $g$, but it can be approximated as described in
section 4.3.

The global stable and unstable manifolds are defined
as the unions of backward and forward images of the
local manifolds under the flow map: $W^s(x_e)$ is the set
of all points whose orbits approach $x_e$ as $t \to +\infty$,
even if they leave $B(x_e)$ for a while, and $W^u(x_e)$ is
defined analogously for $t \to -\infty$. Stable manifolds can
intersect neither themselves nor the stable manifolds 
of other fixed points, since this would violate unique- 
ness of solutions (the intersection point would lead 
to more than one future). The same is true of unsta- 
ble manifolds, but intersections of stable and unsta- 
ble manifolds can and do occur; they lie on solutions 
that lead from one fixed point to another. Intersection 
points of manifolds that belong to the same fixed point 
are called homoclinic, while those for manifolds that 
belong to different fixed points are called heteroclinic 
(see figure 3(b)). 



Figure 4 The stable, center, and unstable manifolds. 

4.3 Center Manifolds, Local Bifurcations, and 
Normal Forms 

As parameters change, so do phase portraits and the 
resulting dynamics. New fixed points can appear, the 
stability of existing ones can change, and homoclinic 
and heteroclinic orbits can form and vanish. Bifurcation
theory [IV.21] addresses such questions, and
it relies on three further ideas: structural instability,
dimension reduction, and nonlinear coordinate
changes. 

As noted in section 2, a structurally stable system 
survives small perturbations of its defining vector field 
$f$ or map $F$ in the following sense. The phase portraits
of the original and perturbed systems are topologically
equivalent; they can be transformed into each other
by a continuous coordinate change that preserves the 
sense of time, so that sinks remain sinks and sources, 
sources. Since eigenvalues of the Jacobian matrix deter- 
mine stability, both the values and the derivatives of the 
perturbing functions must be small. 

There are many ways in which structural stability can 
be lost, but the simplest is when a single (simple) eigen- 
value passes through zero or when the real parts of a 
complex conjugate pair do the same. More generally, 
suppose that, in addition to $s$ and $u$ eigenvalues with
negative and positive real parts, the Jacobian $Df(x_e)$
also has $c$ eigenvalues with zero real part ($s + c + u = n$).

The center manifold theorem asserts that, along with
the $s$- and $u$-dimensional stable and unstable manifolds,
a smooth $c$-dimensional local center manifold
$W^c_{\mathrm{loc}}$ exists, tangent to the subspace $E^c$ spanned by the
eigenvectors belonging to the eigenvalues with zero
real part. As figure 4 suggests, this allows one to
separate the locally stable and unstable directions from
the structurally unstable ones and thus to reduce the
analysis to that of a $c$-dimensional system restricted to
the center manifold, a considerable simplification when
$c \ll n$.

To describe the reduction process we assume that
coordinates have been chosen with the degenerate
equilibrium $x_e$ at the origin and such that the matrix
$Df(x_e)$ is block diagonalized. Then (1) can be written
in the form

$$\dot{x} = Ax + f(x, y), \qquad \dot{y} = By + g(x, y), \tag{8}$$

where $x \in E^c$ and $y \in E^s \oplus E^u$, the span of the stable and
unstable subspaces (explicit reference to the parameter
$\mu$ is dropped). All eigenvalues of the $c \times c$ matrix $A$ have
zero real parts and the $(s + u) \times (s + u)$ matrix $B$ has
only eigenvalues with nonzero real parts. The center
manifold can be expressed as a graph $y = h(x)$ over
$E^c$, and hence, as long as solutions remain on $W^c_{\mathrm{loc}}$, the
state $(x, y)$ of the system is specified by the $x$ variables
alone. The reduced system is the projection of the
vector field on $W^c_{\mathrm{loc}}$ onto the linear subspace $E^c$:

$$\dot{x} = Ax + f(x, h(x)). \tag{9}$$

The graph $h$ is found by substituting $y = h(x)$ into
the second component of (8) and using the chain rule
and the first component of (8) to obtain a (partial)
differential equation:

$$Dh(x)[Ax + f(x, h(x))] = Bh(x) + g(x, h(x)) \quad \text{with } h(0) = 0,\ Dh(0) = 0. \tag{10}$$

The latter conditions are due to the tangency of $W^c_{\mathrm{loc}}$
to $E^c$ at $x_e = 0$. Solutions of (10) can be approximated
as a Taylor series in $x$, and stability and bifurcation
behavior near the nonhyperbolic fixed point can
be deduced from the resulting approximation to the
reduced system (9).
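
A small worked case may make the reduction concrete. The planar system $\dot{x} = xy$, $\dot{y} = -y + a x^2$ is an assumed textbook-style example, not one taken from this article; in the notation of (8) it has $A = 0$, $B = -1$, $f = xy$, and $g = a x^2$. Substituting the ansatz $h(x) = c\,x^2 + O(x^4)$ into (10) gives $2c^2 x^4 + \cdots = (a - c)x^2 + \cdots$, so $c = a$ and the reduced system (9) is $\dot{x} = a x^3 + O(x^5)$. The sketch below checks numerically that an orbit starting off the manifold collapses onto the graph $y \approx a x^2$; the names `rk4_step`, `field`, and `A_COEFF` are hypothetical.

```python
def rk4_step(f, x, h):
    """One classical Runge-Kutta step of size h for x' = f(x)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * h * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h * (p + 2 * q + 2 * r + s) / 6
            for xi, p, q, r, s in zip(x, k1, k2, k3, k4)]

A_COEFF = -1.0    # the coefficient a; a < 0 makes the reduced flow decay

def field(state):
    """Assumed example: x' = x*y (center part), y' = -y + a*x**2 (stable part)."""
    x, y = state
    return [x * y, -y + A_COEFF * x * x]

state = [0.1, 0.05]           # starts away from the center manifold
dt = 0.01
for _ in range(2000):         # integrate to t = 20
    state = rk4_step(field, state, dt)

x, y = state
gap = abs(y - A_COEFF * x * x)   # distance from the quadratic approximation
```

The stable direction contracts like $e^{-t}$, while $x$ decays only algebraically under the reduced cubic equation, which is why the orbit is still at a visible distance from the origin when `gap` has already become small.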

The third idea— to choose a coordinate system that 
simplifies the nonlinear terms of the reduced system— 
effectively extends the use of similarity transforma- 
tions to diagonalize and decouple components in lin- 
ear systems. Normal form theory simplifies the Taylor 
series by iteratively performing nonlinear coordinate 
changes that successively remove, at each order, non- 
resonant terms that do not influence the qualitative 
behavior. Lie algebra provides bookkeeping methods, 
and the computations can be (semi-) automated using 
computer algebra. 


To illustrate the resulting simplification, consider a
two-dimensional ODE of the form

$$\dot{x} = Ax + f(x), \qquad A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \tag{11}$$

the linear part of which is a harmonic oscillator whose
phase plane is filled with periodic orbits surrounding a
neutrally stable fixed point (a center). Asymptotic stability
or instability of $x_e = 0$ depends on the nonlinear
function $f(x)$, and in particular on the leading
terms in its Taylor series. There are six quadratic terms,
eight cubic terms, and in general $2(k + 1)$ terms of
order $k$, but normal form transformations can successively
remove all the even terms and all but two terms
of each odd order, at the expense of modifying those
terms. After such a transformation and written in polar
coordinates ($x_1 = r\cos\theta$, $x_2 = r\sin\theta$), equation (11)
becomes

$$\dot{r} = \alpha_3 r^3 + \alpha_5 r^5 + O(r^7), \qquad \dot{\theta} = 1 + \beta_3 r^2 + \beta_5 r^4 + O(r^6). \tag{12}$$

Not only is the number of coefficients greatly reduced,
but the circular symmetry implicit in the localized linear
system is extended to higher order, uncoupling the
azimuthal $\theta$ dynamics from the radial dynamics. Since
the latter alone govern decay or growth of orbits, stability
is determined by the first nonzero $\alpha_k$, explicit
formulas for which emerge from the transformation.

Normal forms not only simplify the functions defin- 
ing degenerate vector fields, they also allow the system- 
atic introduction of parameters that unfold the bifur- 
cation point to reveal the variety of structurally stable 
systems in its neighborhood, much as one can perturb 
a matrix to split a real eigenvalue of multiplicity 2 into 
a pair of distinct ones by adding a parameter. Equation (12),
for example, is unfolded by the addition of
a linear term $\mu r$ to the first component; as the parameter $\mu$ passes
through zero, a Hopf bifurcation occurs, giving rise to
a limit cycle (see bifurcation theory [IV.21]). 
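
The unfolding can be illustrated on the radial equation alone. The sketch below is a simplified example under stated assumptions: the radial normal form is truncated at cubic order, the cubic coefficient is set to $-1$ (the supercritical case), and the names `radial_rhs` and `settle` are hypothetical. Under these assumptions the nonzero equilibrium radius is $\sqrt{\mu}$ for $\mu > 0$, while for $\mu < 0$ orbits decay to the fixed point.

```python
def radial_rhs(r, mu, alpha3=-1.0):
    """Truncated, unfolded radial equation: r' = mu*r + alpha3*r**3."""
    return mu * r + alpha3 * r ** 3

def settle(r0, mu, t_end=200.0, dt=0.01):
    """Integrate the radial equation with RK4 and return the final radius."""
    r = r0
    for _ in range(round(t_end / dt)):
        k1 = radial_rhs(r, mu)
        k2 = radial_rhs(r + 0.5 * dt * k1, mu)
        k3 = radial_rhs(r + 0.5 * dt * k2, mu)
        k4 = radial_rhs(r + dt * k3, mu)
        r += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return r

# Past the bifurcation (mu > 0), orbits from inside and outside approach
# the same radius; before it (mu < 0), they decay to the fixed point.
r_from_inside = settle(0.01, 0.25)
r_from_outside = settle(2.0, 0.25)
r_before = settle(0.5, -0.25)
```

Because the angular equation decouples, a constant radius corresponds to a periodic orbit of the planar system, which is the limit cycle created in the bifurcation.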



4.4 Limit Cycles and Poincaré Maps

The van der Pol equation, mentioned in section 2 as a
motivating example in the study of chaotic orbits, also
illustrates a simpler and more pervasive tool: the first
return or Poincaré map. Without external forcing, this
self-excited oscillator possesses a stable limit cycle, an
isolated periodic orbit that attracts all nearby orbits.
The ODE is

$$\dot{x}_1 = x_2 + \mu x_1 - \tfrac{1}{3} x_1^3, \qquad \dot{x}_2 = -x_1, \tag{13}$$

and linearization reveals that, for $\mu > 0$, this planar system
has a source at $(x_1, x_2) = (0, 0)$. It can, moreover,
be shown that all orbits eventually enter and remain
within an annular trapping region $\mathcal{B}$ surrounding the
source. Since no fixed points lie in $\mathcal{B}$, the Poincaré–Bendixson
theorem implies that it contains at least one
periodic orbit (see figure 5(a)). As indicated at the end
of section 4.3, one can also show that a stable periodic
orbit appears in a Hopf bifurcation as $\mu$ passes through
zero.

Figure 5 (a) An annular trapping region $\mathcal{B}$ and the limit cycle
of the van der Pol equation (13) with $\mu = 1$. (b) The Poincaré
map for equation (13) expands lengths for $x_1$ small and
contracts them for $x_1$ large, implying the existence of at
least one fixed point corresponding to a periodic orbit.

More generally, if $\gamma$ is a periodic orbit in an ODE of
dimension $n \geq 2$, we may construct a cross section $\Sigma$, a
subset of an $(n - 1)$-dimensional manifold pierced by
$\gamma$ at a point $p$ and transverse to the flow in that all orbits
cross $\Sigma$ with nonzero speed. Thus, by continuity, orbits
starting at $q \in \Sigma$ sufficiently close to $p$ remain near $\gamma$
and next intersect $\Sigma$ again at a point $q' \in \Sigma$, defining a
Poincaré map

$$P : \Sigma \to \Sigma \quad \text{or} \quad q \mapsto q' = P(q). \tag{14}$$

Evidently, $p$ is a fixed point for the map $P$.

The positive $x_1$-axis is a suitable cross section for
equation (13), and the instability of the fixed point
(0, 0) and the attractivity of the trapping region $\mathcal{B}$ from
outside imply that $P(x_1)$ takes the form sketched in
figure 5(b). Since $P$ is continuous, it must intersect the
diagonal at least once in a fixed point $p > 0$, corresponding
to the limit cycle. In fact, $p$ is unique and the
linearized map satisfies $0 < (dP/dx_1)|_{x_1 = p} < 1$, implying
asymptotic stability of both $p$ and the limit cycle.
In general, if all the eigenvalues of the linearized map 
DP(p) have moduli strictly less than 1, then p and its 
associated periodic orbit are asymptotically stable; if at 
least one eigenvalue has modulus greater than 1, they 
are unstable. 

This parallels the eigenvalue criteria for flows described
in section 4.1, and the Poincaré map provides
a second connection between ODEs and iterated maps
(cf. the time $T$ flow map of section 1). Conversely,
a mapping $F : U \subset \mathbb{R}^n \to \mathbb{R}^n$ can be suspended
to produce a $T$-periodic vector field on the $(n + 1)$-dimensional
space $\mathbb{R}^n \times S^1$. Analyses of the ODE (1)
and the mapping (2) are therefore closely related, and 
analogs of the stable, unstable, and center manifold 
theorems hold for iterated maps. 
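
The Poincaré map of the unforced van der Pol equation on the positive $x_1$-axis can be approximated numerically. The following is a rough sketch under stated assumptions, not the construction used for figure 5: it takes the parameter in (13) equal to 1 and the cubic coefficient equal to $\tfrac{1}{3}$, integrates with a fixed-step Runge-Kutta scheme, and locates each return by linear interpolation. The names `vdp`, `rk4_step`, `poincare_map`, and `iterate` are hypothetical.

```python
def vdp(state, mu=1.0):
    """Van der Pol vector field in the form (13), with assumed mu = 1."""
    x1, x2 = state
    return (x2 + mu * x1 - x1 ** 3 / 3.0, -x1)

def rk4_step(f, s, h):
    """One classical Runge-Kutta step for a planar state stored as a tuple."""
    k1 = f(s)
    k2 = f((s[0] + 0.5 * h * k1[0], s[1] + 0.5 * h * k1[1]))
    k3 = f((s[0] + 0.5 * h * k2[0], s[1] + 0.5 * h * k2[1]))
    k4 = f((s[0] + h * k3[0], s[1] + h * k3[1]))
    return (s[0] + h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6,
            s[1] + h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6)

def poincare_map(q, dt=0.002, max_steps=100000):
    """First return to the positive x1-axis of the orbit through (q, 0)."""
    s = (q, 0.0)
    for _ in range(max_steps):
        nxt = rk4_step(vdp, s, dt)
        # A return is a downward crossing of x2 = 0 with x1 > 0.
        if s[1] > 0.0 and nxt[1] <= 0.0 and nxt[0] > 0.0:
            w = s[1] / (s[1] - nxt[1])       # linear interpolation weight
            return s[0] + w * (nxt[0] - s[0])
        s = nxt
    raise RuntimeError("no return to the cross section found")

def iterate(q, n=8):
    """Apply the Poincare map n times."""
    for _ in range(n):
        q = poincare_map(q)
    return q

p_inside = iterate(0.5)      # starts between the source and the limit cycle
p_outside = iterate(3.0)     # starts outside the limit cycle
slope = (poincare_map(p_inside + 0.05) - poincare_map(p_inside - 0.05)) / 0.1
```

Iterates started on either side of the limit cycle should approach the same fixed point of the map, and a finite-difference slope of modulus below 1 at that point is consistent with the stability criterion stated above.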

4.5 Chaos and Strange Attractors 

Several definitions of chaos have been advanced, but 
the following captures its key properties. A set of differ- 
ential equations or an iterated map is called chaotic, or 
is said to have chaotic solutions, if it possesses a set S of 
orbits such that (1) almost all pairs of orbits in S display 
sensitive dependence on initial conditions; (2) there is 
an infinite set of periodic orbits that is dense in S; and 
(3) there is a dense orbit in S (see chaos [II.3]). 

Sensitive dependence means that, for any preassigned
number $\beta < |S|$, any point $x_0$ in $S$, and any neighborhood
$U$ of $x_0$, no matter how small, there exist a point
$y_0 \in U$ and a time $T$ such that the orbits $x(T)$ and
$y(T)$ starting at $x_0$ and $y_0$ are separated by at least
$\beta$; almost all solutions diverge locally. An orbit $x(t)$ is
dense in $S$ if, given any point $z$ in $S$ and any neighborhood
$U$ of $z$, no matter how small, there exists a time
$t_z$ such that $x(t_z)$ lies inside $U$; $x(t)$ passes arbitrarily
close to all points in $S$.

For the doubling map, a binary sequence $a^*$ corresponding
to a dense orbit $x^*$ can be built by concatenating
all subsequences of lengths 1, 2, 3, etc.:

$$a^* = 0\ 1\ 00\ 01\ 10\ 11\ 000\ 001\ \cdots.$$

As one iterates and
drops leading symbols, every subsequence appears at
its head, implying that the orbit of $x^*$ contains points
that lie arbitrarily close to every point in [0,1]. Hence
$\{h^k(x^*)\}_{k=0}^{\infty}$ is dense and $h$ satisfies the definition,
where $S = [0,1]$ is the entire state space.
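
The construction of the dense orbit can be sketched with a finite truncation of the sequence. The example below is illustrative only; the names `dense_sequence`, `point`, and the target block `"101"` are assumptions. It builds the concatenation of all binary blocks up to a given length and finds how many shifts bring a chosen block to the head, which places the corresponding iterate within $2^{-m}$ of any point whose expansion begins with that block of length $m$.

```python
from itertools import product

def dense_sequence(max_len):
    """Concatenate all binary blocks of lengths 1, 2, ..., max_len."""
    bits = []
    for n in range(1, max_len + 1):
        for block in product("01", repeat=n):
            bits.extend(block)
    return "".join(bits)

def point(bits):
    """Point of [0, 1] whose binary expansion is the given digit string."""
    return sum(int(b) / 2 ** (j + 1) for j, b in enumerate(bits))

a_star = dense_sequence(3)

# Every block of length at most 3 occurs somewhere in the sequence.
all_present = all("".join(block) in a_star
                  for n in range(1, 4)
                  for block in product("01", repeat=n))

# Shifting k times (that is, applying h k times) brings the block to the head.
target = "101"
k = a_star.find(target)
distance = abs(point(a_star[k:]) - point(target))
```

Extending `max_len` refines the approximation: each additional block length halves the guaranteed distance between some iterate and any prescribed target point.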

Positive topological entropy also implies that typical 
orbits explore S, and there are examples of maps with 
this property that have no periodic orbits, so this def- 
inition is sometimes preferred. However, topological 
entropy (roughly speaking, the growth rate of distinct 
orbits as one progressively refines a discretization of S) 
is technically complicated, and the above definition will 
suffice here. 

4.5.1 Smale’s Horseshoe 


The doubling map may seem to be a purely mathematical
construct, but as promised in sections 2-3, it
is related to the double pendulum and to forced nonlinear
oscillators. To understand this, we now describe
Smale’s construction. Consider a piecewise-linear mapping
$G$ defined on the unit square $Q = [0,1] \times [0,1]$
by means of its action on two horizontal strips
$H_0 = [0,1] \times [0, 1/\mu]$ and $H_1 = [0,1] \times [1 - 1/\mu, 1]$ with
images $V_0 = G(H_0) = [0, \lambda] \times [0,1]$ and
$V_1 = G(H_1) = [1 - \lambda, 1] \times [0,1]$, and having Jacobians

$$DG|_{H_0} = \begin{pmatrix} \lambda & 0 \\ 0 & \mu \end{pmatrix}, \qquad DG|_{H_1} = \begin{pmatrix} -\lambda & 0 \\ 0 & -\mu \end{pmatrix}$$

for $0 < \lambda < \tfrac{1}{2}$ and $\mu > 2$ (so that the images fit within
$Q$ as defined). To make $G$ continuous, the image of the
central strip between $H_0$ and $H_1$ is taken as a semicircular
arch joining $V_0$ to $V_1$. $G$ can be thought of as the
Poincaré map of a flow that compresses $Q$ horizontally,
then stretches it vertically, and finally bends its middle
to form the eponymous horseshoe (see figure 6).

The invariant set $\Lambda$ of $G$ consists of points that
remain in $Q$ under all forward and backward iterates:
$\Lambda = \bigcap_{j=-\infty}^{\infty} G^j(Q)$. We construct $\Lambda$ step by step. Points
that remain in $Q$ for one backward iteration occupy
two vertical strips $V_0$ and $V_1$, the images of $H_0$ and
$H_1$. $G^2(Q) \cap Q$ is obtained by considering the second
iterates $G(V_0)$ and $G(V_1)$ of $H_0$ and $H_1$, which form
nested arches standing on pillars, each of width $\lambda^2$,
lying within $V_0$ and $V_1$. At each step, vertical distances
are expanded by $\mu$ and horizontal distances shrunk
by $\lambda$, and the middle $(1 - 2\lambda)$ fraction of the strips is
removed. Continuing, $\Lambda_{-n} = \bigcap_{j=-n}^{0} G^{-j}(Q)$ is $2^n$ vertical
strips, each of width $\lambda^n$, and passing to the limit,
$\Lambda_{-\infty} = \bigcap_{j=-\infty}^{0} G^{-j}(Q)$ is a Cantor set of vertical line
intervals. Similarly, $\Lambda_{+\infty} = \bigcap_{j=0}^{\infty} G^{-j}(Q)$ is a Cantor set
of horizontal intervals, and since any vertical and any
horizontal segment intersect at a point, $\Lambda = \Lambda_{-\infty} \cap \Lambda_{+\infty}$
is a Cantor set of points.

A Cantor set is uncountable, closed, and contains no
interior or isolated points; every point is an accumulation
point. Cantor sets are examples of fractals, sets
having fractional dimension.

Orbits of $G$ can be described by symbolic dynamics,
which codes points $x \in \Lambda$ as binary sequences based
on their visits to $H_0$ and $H_1$. A mapping $a : \Lambda \to \{0,1\}^{\mathbb{Z}}$,
where $a = \{a_j\}_{j=-\infty}^{\infty}$ and $\{0,1\}^{\mathbb{Z}}$ denotes the space of
bi-infinite sequences with entries 0 or 1, is defined as
follows:

$$a_j(x) = \begin{cases} 0 & \text{if } G^j(x) \in H_0, \\ 1 & \text{if } G^j(x) \in H_1. \end{cases} \tag{16}$$

Given a suitable metric on $\{0,1\}^{\mathbb{Z}}$, it can be shown that
$a$ is a one-to-one, continuous, invertible map, a homeomorphism.
Every point in $\Lambda$ is faithfully coded by a
sequence in $\{0,1\}^{\mathbb{Z}}$, and, via equation (16), the action
of $G$ on $\Lambda$ becomes the shift map

$$\sigma : \{0,1\}^{\mathbb{Z}} \to \{0,1\}^{\mathbb{Z}}, \quad \text{with } (\sigma(a))_j = a_{j+1}. \tag{17}$$

This generalizes the binary representation of sec- 
tion 3.2 because G is invertible and points have past 
orbits as well as future orbits. The countably infinite 
set of periodic orbits is coded as before (for such an 
orbit, future and past are the same), but homoclinic 
and heteroclinic orbits to any periodic orbit or pair of 
orbits can now be formed by connecting semi-infinite 
periodic tails with an arbitrary central sequence. A 
dense orbit can be built by growing the sequence 
0 1 00 01 10 11 000 001 . . . forward and backward, and 
since all possible finite sequences appear in the set of 
all periodic orbits, this latter set is also dense in $\Lambda$.

$G$ is linear on $H_0 \cup H_1$ with eigenvalues $\pm\lambda$ and $\pm\mu$
and $\lambda < 1 < \mu$, so $\Lambda$ is a hyperbolic set in which
almost all pairs of orbits separate exponentially quickly 
and G is chaotic in the above sense. Moreover, it is 
structurally stable, so the chaos survives small per- 
turbations, and Smale also proved that horseshoes 
appear in any smooth map that possesses a transverse 
homoclinic point. 
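
The step-by-step construction of the vertical strips can be sketched with exact arithmetic. The example is a simplified illustration under stated assumptions: only the horizontal coordinate is tracked, the two linear branches are taken to act on it as $x \mapsto \lambda x$ and $x \mapsto 1 - \lambda x$ (matching horizontal Jacobian entries $\lambda$ and $-\lambda$), and the contraction rate is set to $\tfrac{1}{3}$. The names `next_strips` and `vertical_strips` are hypothetical.

```python
from fractions import Fraction

def next_strips(strips, lam):
    """Horizontal extents of the images of vertical strips under G.

    The branch on H_0 sends x to lam*x; the branch on H_1 sends x to
    1 - lam*x, which reverses orientation.
    """
    left = [(lam * a, lam * b) for a, b in strips]
    right = [(1 - lam * b, 1 - lam * a) for a, b in strips]
    return sorted(left + right)

def vertical_strips(n, lam):
    """Horizontal extents of the 2**n vertical strips after n steps."""
    strips = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        strips = next_strips(strips, lam)
    return strips

lam = Fraction(1, 3)          # assumed contraction rate, below 1/2
strips = vertical_strips(4, lam)
widths = {b - a for a, b in strips}
total_width = sum(b - a for a, b in strips)
```

With this choice of contraction rate the horizontal cross section is the classical middle-thirds Cantor construction; the total width $(2\lambda)^n$ tends to zero, while the number of strips doubles at every step.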
Figure 6 Smale’s horseshoe, showing the square $Q$ (left) and its images $G(Q)$ (center right)
and $G^2(Q)$ (far right), with the strips $H_j$ and their images $V_j = G(H_j)$ shaded.


There are computable conditions that detect homo- 
clinic points for rather general classes of weakly per- 
turbed systems. Applying these to the ODEs describ- 
ing a double pendulum, like that of figure 1 but with a 
heavy upper link and a light lower one, allows one to 
prove that, given any alternating sequence of positive
and negative integers $s_1, -s_2, s_3, \ldots$, an orbit exists on
which the lower link first turns $s_1$ times clockwise, then
$s_2$ times counterclockwise, then $s_3$ times clockwise, and
so on. Since any sequence chosen from a suitable prob- 
ability distribution corresponds to an orbit, one cannot 
tell whether the motion is deterministic or stochastic. 

The horseshoe $\Lambda$ is a set of saddle type; almost all
orbits that approach it eventually leave, so its presence 
does not necessarily imply physically observable chaos. 
However, its stable manifold can form a fractal bound- 
ary separating orbits that have different fates, as in 
the forced van der Pol equation. To observe persistent 
chaos, one requires strange attractors. 

4.5.2 Strange Attractors 

Again there are subtle issues and competing definitions,
but to convey the main ideas we define an attracting
set $\mathcal{A}$ as the intersection of all forward images of a
trapping region $B$,

$$\mathcal{A} = \bigcap_{t \ge 0} \phi_t(B), \tag{18}$$

and an attractor as an attracting set that contains 
a dense orbit. Sinks and asymptotically stable limit 
cycles provide simple examples, as do stable invariant 
tori that carry quasiperiodic, and hence densely winding,
motions. A strange attractor additionally exhibits
sensitive dependence and chaos. The following example
further emphasizes the geometrical viewpoint of
dynamical systems theory.

The Lorenz equations [III.20] are a (very) low-dimensional
projection of the coupled Navier-Stokes
and heat equations modeling convection in a fluid layer:

$$\begin{aligned}
\dot{x}_1 &= \sigma(x_2 - x_1), \\
\dot{x}_2 &= \rho x_1 - x_2 - x_1 x_3, \\
\dot{x}_3 &= -\beta x_3 + x_1 x_2.
\end{aligned} \tag{19}$$


The parameters $\sigma$ and $\beta$ are fixed and $\rho$ is proportional
to the temperature difference across the layer. The origin
is always a fixed point, representing stationary fluid
and heat transport by conduction. For $\rho > 1$, $x = 0$ is
a saddle point with one positive and two negative real
eigenvalues; two further fixed points $q^\pm$, corresponding
to steadily rotating convection cells, also exist. For
$\sigma = 10$, $\rho = 28$, $\beta = 8/3$ (the values used by Lorenz),
these are also saddles, but a trapping region exists and
(19) has an attracting set $\mathcal{A}$ containing the unstable
fixed points.
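
As a rough numerical illustration of equations (19) at Lorenz's parameter values, the following sketch integrates two nearby initial conditions with a hand-written fourth-order Runge-Kutta step and records their separation. The integrator, step size, initial condition, and function names are all assumptions made for this example and do not come from the text; the sketch only illustrates boundedness and sensitive dependence and proves nothing about the attractor.

```python
def lorenz(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz equations (19)."""
    x1, x2, x3 = state
    return (sigma * (x2 - x1), rho * x1 - x2 - x1 * x3, -beta * x3 + x1 * x2)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step (an illustrative choice)."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def separation_history(eps=1e-8, dt=0.01, steps=4000):
    """Integrate two orbits that start eps apart.

    Returns the list of separations after each step and the largest
    coordinate magnitude reached by the first orbit.
    """
    a = (1.0, 1.0, 1.0)          # hypothetical initial condition
    b = (1.0 + eps, 1.0, 1.0)    # perturbed copy
    seps, max_norm = [], 0.0
    for _ in range(steps):
        a = rk4_step(lorenz, a, dt)
        b = rk4_step(lorenz, b, dt)
        seps.append(sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5)
        max_norm = max(max_norm, max(abs(c) for c in a))
    return seps, max_norm


if __name__ == "__main__":
    seps, max_norm = separation_history()
    print("initial separation:", seps[0])
    print("largest separation:", max(seps))
    print("largest coordinate magnitude:", max_norm)
```

The orbit stays in a bounded region, consistent with the existence of a trapping region, while the separation grows by many orders of magnitude before saturating at the size of the attracting set.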

Lorenz, who had studied with Birkhoff, realized that
$\mathcal{A}$ has infinitely many “sheets.” He also generated a
one-dimensional return map related to the doubling
map of section 3.2 and gave a symbolic description of
its orbits. We adopt the geometric model of Guckenheimer
and Williams here, defining a cross section $\Sigma$
lying above the origin. In suitable nonlinear coordinates,
$\Sigma$ is a square $[-1,1] \times [-1,1]$ whose boundaries
$\pm 1 \times [-1,1]$ contain $q^\pm$ and their local stable manifolds,
and whose centerline $0 \times [-1,1]$ lies in the stable manifold
of $0$. Assuming that all orbits leaving $\Sigma$ circulate
around $\pm 1 \times [-1,1]$ (except those in $0 \times [-1,1]$, which
flow into $x = 0$), a Poincaré map $F$ can be defined (see
figure 7(a)).

Figure 7 The Poincaré map of the geometric Lorenz attractor.
(a) A cross section $\Sigma$ showing $\Sigma^\pm$ and their images
$F(\Sigma^\pm)$, and the subset $V$ (light gray) and its image $F(V)$
(dark gray). (b) The one-dimensional map $f(u)$.


More precisely, Guckenheimer and Williams assert
that coordinates $(u,v)$ can be chosen in which the
horizontal ($u$) dynamics decouples and $F$ takes the
form

$$F(u,v) = (f(u), g(u,v)), \tag{20}$$

where $g$ contracts in the $v$-direction, $f$ expands by a
factor greater than or equal to $\sqrt{2}$ in the $u$-direction,
and $F(-u,-v) = -F(u,v)$ respects the symmetry
$(x_1,x_2,x_3) \mapsto (-x_1,-x_2,x_3)$ of (19). Thus, $F$ maps
the open rectangles $\Sigma^- = (-1,0) \times (-1,1)$ and $\Sigma^+ =
(0,1) \times (-1,1)$ into the interior of $\Sigma$, and since orbits
starting near $u = 0$ pass close to $0$, the strong stable
eigenvalue of $0$ pinches the images of $\Sigma^\pm$ at their endpoints
$(r^\pm, s^\pm) = \lim_{u \to 0^\pm} F(u,v)$. Continuing to iterate,
a complicated attracting set appears: at the second
step, $F^2(\Sigma^\pm)$ comprises 4 thinner strips, 2 each inside
$F(\Sigma^-)$ and $F(\Sigma^+)$, then 8, 16, etc., as in the Cantor set of
Smale’s horseshoe, but pinched together at their ends.
Defining a subset $V \subset \Sigma$ (shaded in figure 7), it can be
shown that $\mathcal{A} = \bigcap_{n \ge 0} F^n(V)$ contains a dense orbit
and so $\mathcal{A}$ is an attractor. Due to the expansive nature
of $f$ (figure 7(b)), which resembles the doubling map
of section 3.2, $F$ has sensitive dependence and $\mathcal{A}$ is
therefore strange.

The geometric picture has been verified by W. Tucker
for equation (19) using a computer-assisted proof
showing that a stable foliation $\mathcal{F}$ exists, a continuous family
of curves such that, if $C \in \mathcal{F}$, then $F(C) \in \mathcal{F}$ (here,
$\mathcal{F}$ is composed of vertical line segments $u = \text{const.}$).

The key properties necessary for strange attractors 
are (1) stretching in some state-space directions and 
(2) contraction in others, so that volumes decrease, cou- 
pled with (3) bending (the horseshoe) or discontinuous 
cutting (the Lorenz example) to place forward images 
under the flow map $\phi_t$ into a trapping region. There
are now many examples, and smooth maps like that of
the horseshoe, which often arise in applications, can
produce very complicated dynamics, including infinite
sequences of homoclinic bifurcations.

5 Conclusion 

I have sketched some central ideas and themes in 
dynamical systems theory but have necessarily omit- 
ted much. In closing, here are some important topics 
and connections to other concepts, areas, and problems 
described in this volume. 

Chaotic orbits admit statistical descriptions, and 
there is a flourishing ergodic theory of dynamical sys- 
tems, describing, for example, invariant measures sup- 
ported on strange attractors and decay of correlations 
along orbits. Stochastic ODEs are also treated probabilistically
(see APPLICATIONS OF STOCHASTIC ANALYSIS
[IV.14]). Interest in hybrid systems [II.18], with
nonsmooth vector fields and discontinuous jumps, is
growing, and the classification of their bifurcations proceeds
(see SLIPPING, SLIDING, RATTLING, AND IMPACT:
NONSMOOTH DYNAMICS AND ITS APPLICATIONS [VI.15]).

Many equations inherit symmetries from the phe- 
nomena they model, and equivariant dynamical sys- 
tems is an important area. Symmetries can both con- 
strain behaviors and stabilize objects that typically 
lack structural stability in nonsymmetric systems (e.g., 
heteroclinic cycles). Symmetries also profoundly affect 
normal forms and their unfoldings (see symmetry in 
APPLIED MATHEMATICS [IV.22]). 

The Lorenz and other examples show that steady 
three-dimensional velocity fields of fluids or other con- 
tinuums can exhibit chaotic mixing, which dramatically 
enhances transport of heat, pollutants, and reactants in 
such flows (see continuum mechanics [IV.26]). 

IV.21 Bifurcation Theory

Paul Glendinning 


1 Introduction 

Bifurcation theory describes the qualitative changes in 
the dynamics of systems caused by small changes of 
parameters in the model. It uses dynamical systems 
theory [IV.20] to determine the different conditions 
that lead to changes, and to describe the dynamics as 
a function of the parameter. This leads to a theoret- 
ical list of typical changes that can be used to under- 
stand the existence and stability of different solutions 
in a problem. Parameters are present in most applica- 
tions. They are quantities that are constant in any real- 
ization of the system but that may take different values 
in different realizations depending on the details of the 
situation being modeled. Examples include the Rayleigh
and Reynolds numbers in fluid dynamics, virulence in 
disease models, and reaction rates in chemistry. 


The “tipping points” that generate so much interest 
in economics, say, or climate change are good exam- 
ples of the idea of a bifurcation: if the parameters sat- 
isfy some condition then the associated dynamics is of 
one type, but if they stray above a threshold, even by a 
small amount, then the resultant dynamics can change 
drastically. 

There are two types of bifurcation, though the two 
can be mixed in more complicated problems. One 
involves changes in the spectrum of the linearization 
of a system about some solution; this type is an exten- 
sion of the classical linear stability analysis that dom- 
inated so much of applied mathematics in the 1950s 
and 1960s. This leads to local bifurcation theory. The 
second type of bifurcation builds on Poincaré’s qualitative
analysis of dynamical systems to look at global features
of the system, and in particular homoclinic and
heteroclinic orbits (a homoclinic orbit approaches the 
same solution in forward and backward time, while a 
heteroclinic orbit is a solution that tends to one solu- 
tion in forward time and another in backward time). 
The analysis of perturbations of these solutions is the 
basis of global bifurcation theory. 

Although bifurcation theory can be applied to more 
general systems, we will restrict ourselves to the cases 
of autonomous ordinary differential equations in con- 
tinuous time, which generate flows, and maps in dis- 
crete time, which generate sequences, i.e., 

$$\begin{aligned}
\dot{x} &= f(x,\mu) && \text{(flow)}, \\
x_{n+1} &= f(x_n,\mu) && \text{(map)},
\end{aligned} \tag{1}$$

where $x \in \mathbb{R}^p$ are the dependent variables and $\mu \in \mathbb{R}^m$
are the parameters. The simplest solutions are those
that are constant. For flows, this implies that $\dot{x} = 0$,
so $f(x,\mu) = 0$. These solutions are called stationary
points. For maps, a constant solution is called a fixed
point. Fixed points satisfy $x_{n+1} = x_n$, or $x = f(x,\mu)$.
Since periodic orbits of flows can be analyzed using a 
return map [IV.20], the discrete-time case can describe 
behavior near periodic orbits of flows. 

In some sense, then, bifurcation theory for the sim- 
plest dynamical objects is about the variation in the 
number of solutions of equations such as $f(x,\mu) = 0$
as the parameters vary. This leads to an approach
through singularity theory. The linear stability of solutions
is determined by the eigenvalues of the linearization
(the Jacobian matrix) of the flow or map. In continuous
time, simple eigenvalues $\lambda$ of the Jacobian matrix
lead to solutions that are proportional to $e^{\lambda t}$, and
eigenvalues with negative real parts are therefore stable
while those with positive real parts are unstable. In
the discrete-time case, solutions are proportional to $\lambda^n$,
so the role of the imaginary axis is replaced by the unit
circle since $\lambda^n \to 0$ as $n \to \infty$ if $|\lambda| < 1$. The power of
local bifurcation theory lies in its ability to go beyond 
this linear approach. 

Linearization determines the local behavior near sta- 
tionary points and fixed points provided the eigen- 
values of the Jacobian matrix are not on the imaginary 
axis (for continuous-time flows) or the unit circle (for 
discrete-time maps). Such solutions are called hyper- 
bolic, and no local bifurcations can occur at hyperbolic 
solutions. Local bifurcation theory therefore considers 
behavior near nonhyperbolic solutions; local in both 
phase space and parameter space. 
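
The hyperbolicity and linear stability criteria just described reduce to simple tests on the eigenvalues of the Jacobian matrix. The sketch below is one hypothetical way to write them down; the function names and the numerical tolerance `tol` are assumptions for illustration, and the eigenvalues are taken as given rather than computed.

```python
def is_hyperbolic(eigenvalues, kind, tol=1e-12):
    """True if no eigenvalue lies on the critical set.

    kind == "flow": the critical set is the imaginary axis (Re = 0).
    kind == "map":  the critical set is the unit circle (modulus = 1).
    """
    if kind == "flow":
        return all(abs(complex(ev).real) > tol for ev in eigenvalues)
    if kind == "map":
        return all(abs(abs(complex(ev)) - 1.0) > tol for ev in eigenvalues)
    raise ValueError("kind must be 'flow' or 'map'")


def is_linearly_stable(eigenvalues, kind):
    """Linear stability: Re < 0 for flows, modulus < 1 for maps."""
    if kind == "flow":
        return all(complex(ev).real < 0.0 for ev in eigenvalues)
    if kind == "map":
        return all(abs(complex(ev)) < 1.0 for ev in eigenvalues)
    raise ValueError("kind must be 'flow' or 'map'")


if __name__ == "__main__":
    # A saddle of a flow is hyperbolic but not stable.
    print(is_hyperbolic([-1.0, 2.0], "flow"), is_linearly_stable([-1.0, 2.0], "flow"))
    # A purely imaginary pair (the Hopf situation) is nonhyperbolic for a flow.
    print(is_hyperbolic([1j, -1j], "flow"))
    # An eigenvalue -1 (the period-doubling situation) is nonhyperbolic for a map.
    print(is_hyperbolic([-1.0], "map"))
```

Local bifurcation theory concerns exactly the cases in which `is_hyperbolic` would return `False`.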

One further term that is used in bifurcation theory 
needs explanation before we move on to examples. The 
codimension of a bifurcation is essentially the num- 
ber of parameters required to be able to observe the 
bifurcation in typical systems. Codimension-one bifur- 
cations are therefore observed in one-parameter fami- 
lies of systems, while higher-codimension bifurcations 
are observed by varying more parameters. These can 
often act as organizing centers for bifurcations with 
lower codimension. 

2 Five Canonical Examples 

The “typical” local bifurcations can be classified using 
their Taylor series expansions. In section 3 we dis- 
cuss why the changes in dynamics described here are 
actually more general, but first we simply describe the 
different possibilities using examples. 

The first is the saddle-node or tangent bifurcation.
Consider the following scalar equations, with $p = m = 1$
in (1):

$$\begin{aligned}
\dot{x} &= \mu - x^2 && \text{(flow)}, \\
x_{n+1} &= x_n + \mu - x_n^2 && \text{(map)}.
\end{aligned}$$

Looking for stationary points of the continuous-time
system ($\dot{x} = 0$) and fixed points of the discrete-time
system ($x_{n+1} = x_n$) gives $x^2 = \mu$, so if $\mu < 0$ there are
no such points, while if $\mu > 0$ there are two, $x_\pm = \pm\sqrt{\mu}$.
Stability is determined by the $1 \times 1$ Jacobian matrix of
derivatives. For the flow this is $-2x_\pm$, so $x_+$ is stable
(negative eigenvalue) and $x_-$ is unstable. For the map
the Jacobian is $1 - 2x_\pm$, which has modulus less than
one at $x_+$ (for small $\mu$), indicating stability, while $x_-$
is unstable.

Figure 1 The simple bifurcations.

The existence and stability of stationary
points or fixed points are represented in the bifurcation
diagram of figure 1(a), which shows the locus of stationary
points and periodic orbits on the vertical axis
as a function of the parameter (on the horizontal axis).
By convention, the stability of solutions is indicated by
plotting stable solutions with solid lines and unstable
solutions with dotted lines.
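
The saddle-node calculation above can be checked directly. The following sketch lists the stationary points of the flow $\dot{x} = \mu - x^2$ together with their eigenvalues, and evaluates the multiplier $1 - 2x$ of the corresponding map; the function names are hypothetical and chosen only for this illustration.

```python
import math


def saddle_node_branches(mu):
    """Stationary points of dx/dt = mu - x**2, each paired with its eigenvalue -2x.

    Returns an empty list for mu < 0, the single degenerate point for mu == 0,
    and the pair x_+ = sqrt(mu), x_- = -sqrt(mu) for mu > 0.
    """
    if mu < 0:
        return []
    if mu == 0:
        return [(0.0, 0.0)]
    root = math.sqrt(mu)
    return [(root, -2.0 * root), (-root, 2.0 * root)]


def map_multiplier(x):
    """Derivative of the map x -> x + mu - x**2 at a fixed point x."""
    return 1.0 - 2.0 * x


if __name__ == "__main__":
    for mu in (-0.1, 0.0, 0.04):
        print(mu, saddle_node_branches(mu))
    # For the map, x_+ is stable when |1 - 2 x_+| < 1, i.e. for small mu > 0.
    x_plus = math.sqrt(0.04)
    print(abs(map_multiplier(x_plus)) < 1.0, abs(map_multiplier(-x_plus)) < 1.0)
```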

If the origin is constrained to be a solution for all $\mu$, or
if the linear term in the parameter of the Taylor series
of the function $f(x,\mu)$ of (1) vanishes at the bifurcation
point, then a simple scalar model is

$$\begin{aligned}
\dot{x} &= \mu x - x^2 && \text{(flow)}, \\
x_{n+1} &= (1 + \mu)x_n - x_n^2 && \text{(map)}.
\end{aligned}$$

In these cases the stationary points are at $x = 0$ and
$x = \mu$, and for small $|\mu|$ the origin is stable if $\mu < 0$
and unstable if $\mu > 0$, while the nontrivial solution
has the opposite stability properties. This is called a
transcritical bifurcation or exchange of stability, since
the two branches of solutions cross and exchange their
stability properties, as shown in figure 1(b).

If there is symmetry in the problem or if additional
coefficients in the Taylor expansion of the map are zero,
then the quadratic part may vanish identically, leaving

$$\begin{aligned}
\dot{x} &= \mu x - x^3 && \text{(flow)}, \\
x_{n+1} &= (1 + \mu)x_n - x_n^3 && \text{(map)}.
\end{aligned}$$

This leads to a pitchfork bifurcation. If $|\mu|$ is sufficiently
small, then the origin is stable if $\mu < 0$ and unstable if
$\mu > 0$. If $\mu > 0$, a pair of new stable stationary solutions
with $x^2 = \mu$ also exists. This is a supercritical pitchfork
bifurcation: the pair of bifurcating solutions are stable
(see figure 1(c)). If these solutions are unstable, then the
bifurcation is said to be subcritical.

For maps there is another way in which a fixed
point can lose stability in one-dimensional systems: the
eigenvalue of the linearization can pass through $-1$. In
this case the canonical example is

$$x_{n+1} = -(1 + \mu)x_n + x_n^3,$$

which has a unique local fixed point at $x = 0$ that is
stable if $\mu < 0$ and unstable if $\mu > 0$ (with $|\mu|$ sufficiently
small). Due to the symmetry of the equations it
is easy to see that there are two nontrivial solutions to
the equation $x_{n+1} = -x_n$ if $\mu > 0$. These lie on a stable
orbit of period two ($x \to -x \to x$), and this is therefore
called a period-doubling bifurcation (see figure 1(d)). To
indicate that the bifurcating period-two orbit is stable,
it is called a supercritical period-doubling bifurcation;
if the new orbit is unstable, coexisting with the stable
fixed point, it is called subcritical. These bifurcations
do not have a direct analogue for stationary points of
flows, but they can occur as bifurcations of periodic
orbits of flows, where the map represents a Poincaré
map of the flow.
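
To see the period-two orbit concretely, one can iterate the canonical map $x_{n+1} = -(1+\mu)x_n + x_n^3$ for a small positive $\mu$. Setting $x_{n+1} = -x_n$ gives $x^2 = \mu$, so the orbit should alternate between $\pm\sqrt{\mu}$. The sketch below checks this numerically; the parameter value, the starting point, and the function names are illustrative assumptions.

```python
import math


def pd_map(x, mu):
    """Canonical period-doubling map x -> -(1 + mu) x + x**3."""
    return -(1.0 + mu) * x + x ** 3


def iterate(x0, mu, n):
    """Apply pd_map n times starting from x0."""
    x = x0
    for _ in range(n):
        x = pd_map(x, mu)
    return x


if __name__ == "__main__":
    mu = 0.05                      # small positive parameter (assumed value)
    x = iterate(0.1, mu, 1000)     # start near, but not at, the unstable origin
    print("orbit point:      ", x)
    print("next orbit point: ", pd_map(x, mu))
    print("sqrt(mu):         ", math.sqrt(mu))
```

For $\mu < 0$ the same iteration converges to the fixed point at the origin instead.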

Another local bifurcation, shown in figure 1(e), occurs
only in maps or flows with a phase space of dimension
greater than one: a pair of complex conjugate
eigenvalues pass through the imaginary axis for flows,
or the unit circle for maps. Using polar coordinates
$(r, \theta)$ for the flow and complex coordinates $z = x + \mathrm{i}y$
for the map, the canonical examples are

$$\begin{aligned}
\dot{r} &= \mu r - r^3, \quad \dot{\theta} = \omega && \text{(flow)}, \\
z_{n+1} &= (1 + \mu)e^{2\pi\mathrm{i}\omega} z_n - |z_n|^2 z_n && \text{(map)}.
\end{aligned}$$

The origin is stable once again if $\mu < 0$ and $|\mu|$ is sufficiently
small, but if $\mu > 0$ there is an attracting circle
with radius $\sqrt{\mu}$. In the continuous case, this is a stable
periodic orbit, and the bifurcation is called a supercritical
Hopf bifurcation. In the discrete case, the dynamics
on the invariant circle is more complicated: solutions
are periodic if $\omega$ is rational, and nonperiodic and dense
on the circle if $\omega$ is irrational. With more general nonlinear
terms, this Hopf (or Neimark-Sacker) bifurcation of
maps has more cases. The rational case usually breaks
up into a finite collection of stable and unstable periodic
orbits, and these exist over intervals of parameters
in a phenomenon called mode locking.

3 Dimension Reduction 

The examples above are relevant more generally be- 
cause there is a (local) dimension reduction possible 
near nonhyperbolic fixed points. Intuitively, this can be 
described by noting that, in eigenspaces correspond- 
ing to eigenvalues that are not on the imaginary axis 
(for flows) or the unit circle (for maps), the dynamics 
is determined by the linearization and decays to zero 
in forward time for stable directions or backward time 
for unstable directions. The only interesting directions 
in which changes can occur are, therefore, the “cen- 
tral” directions associated with zero or purely imagi- 
nary eigenvalues (flows) or eigenvalues with modulus 
one (maps). 

It turns out that this observation can be made rigor- 
ous in at least two ways. These are the center man- 
ifold theorem [IV.20 §4.3] and Lyapunov-Schmidt 
reduction. These describe how to construct projections 
onto the nonhyperbolic directions (valid for parameters 
close to the values at which the nonhyperbolic solution 
exists) such that any change in dynamics occurs in this 
projection. In particular, if there is a simple nonhyper- 
bolic eigenvalue (the “typical” case), then the analysis 
reduces to a system with the same dimension as the 
corresponding eigenspace, i.e., dimension one for a real 
eigenvalue and dimension two for a complex conjugate 
pair of eigenvalues. 

On these projections the leading terms of the Taylor 
series expansion of the projected equations are essen- 
tially the examples of the previous section. In the Hopf 
bifurcation for maps there are some extra subtleties 
that are admirably described in Arrowsmith and Place’s 
book An Introduction to Dynamical Systems. 

The mathematical analysis of the bifurcations de- 
scribed above then follows using the dimension reduc- 
tion and further analysis, e.g., by using the implicit 
function theorem to describe all possible local fixed 
points or periodic orbits. This produces a set of genericity
(or nondegeneracy) conditions in terms of higher-order
derivatives of the function $f$ of (1) at the bifurcation
point that need to be satisfied in addition to the
existence of a neutral direction if the corresponding 
bifurcation is to be observed. 

Although the center manifold theorem and Lyapunov-Schmidt
reduction lead to the same conclusions
about dimension reduction, they use quite different
methods. The center manifold theorem is established 
using invariant manifold theory for dynamical systems, 
e.g., by following the invariant manifold from the triv- 
ial linear case into the nonlinear regime. Lyapunov- 
Schmidt reduction relies on projection techniques onto 
linear eigendirections. 

4 Global Bifurcations 

Many global bifurcations involve homoclinic orbits. A 
homoclinic orbit to a stationary point of a flow is 
an orbit that approaches the stationary point in for- 
ward and backward time. This is a codimension-one 
phenomenon in typical families of differential equa- 
tions, and so there is a straightforward bifurcation in 
which a given homoclinic orbit exists at an isolated 
value of the parameter. For periodic orbits the situation 
is rather different because homoclinic orbits typically 
persist over a range of parameter values, and the inter- 
est is therefore in how homoclinic orbits are created or 
destroyed as the parameter varies. 

4.1 Homoclinic Bifurcations to Stationary Points:
Planar Flows

Suppose an autonomous flow in the plane is defined
by a differential equation that can be written (after a
change of coordinates) as

$$\dot{x} = \lambda_1 x + f_1(x,y;\mu), \qquad \dot{y} = -\lambda_2 y + f_2(x,y;\mu), \tag{2}$$

where the functions $f_i$ vanish at the origin for all $\mu$ and
denote nonlinear terms. The direction of time has been
chosen such that $\lambda_2 > \lambda_1 > 0$, so locally the stationary
point at the origin has a saddle structure as shown in
figure 2. There is a one-dimensional stable manifold of
solutions that tend to the origin as $t \to \infty$ that is tangential
to the $y$-axis at the origin, and there is a one-dimensional
unstable manifold of solutions that tend
to the origin as $t \to -\infty$ that is tangential to the $x$-axis
at the origin. Can these stable and unstable manifolds
intersect?

Typically, one-dimensional sets can intersect trans- 
versely in two dimensions, so it might be expected that 
it is fairly easy for such intersections to occur and that 
these would be persistent under small changes of the 
parameters. However, if the two manifolds intersect at 
a point, then the trajectory through the intersection 
point must also be in the intersection of the manifolds.
In other words, the intersection cannot be transverse,
and a fairly straightforward argument shows that
such intersections typically occur on codimension-one
manifolds in parameter space.

Figure 2 The unstable manifold of a planar homoclinic
bifurcation: (a) $\mu < 0$; (b) $\mu = 0$, showing the return planes
used in the construction of the return map; and (c) $\mu > 0$.

Now suppose that if $\mu = 0$ there is a homoclinic orbit,
as shown in figure 2(b). Then as $\mu$ passes through zero
we can expect that the arrangements of the stable and
unstable manifolds vary as shown in parts (a) and (c)
of figure 2. The local change in behavior can be determined
using a return map approach. Close to the stationary
point, the flow is approximated by the linear
part of (2), as nonlinear terms are much smaller, so
approximate solutions are

$$x = x_0 e^{\lambda_1 t}, \qquad y = y_0 e^{-\lambda_2 t}.$$

Thus if $h > 0$ is a small constant, solutions that start at
$(x_0, h)$ with $x_0 > 0$ intersect $x = h$ after time $T$, where
$h \approx x_0 e^{\lambda_1 T}$ or $T \approx (1/\lambda_1)\log(h/x_0)$. At this intersection,
$y = y_1 \approx h e^{-\lambda_2 T} \approx h(x_0/h)^{\lambda_2/\lambda_1}$. In other words,
a solution starting at $(x_0, h)$ with $x_0 > 0$ intersects
$x = h$ at $(h, y_1)$ and then evolves close to the unstable
manifold away from the stationary point until it returns
to $y = h$. If we choose the parametrization so that the
unstable manifold of the stationary point first strikes
$y = h$ at $(\mu, h)$, then the solution starting at $(h, y_1)$
with $y_1 > 0$ small enough will intersect $y = h$ at $(x_1, h)$
close to $(\mu, h)$. Since this flow is bounded away from
stationary points, the return map is smooth and can be
approximated by a Taylor series: $x_1 = \mu + A y_1 + \cdots$.
Composing these two maps, the linear approximation
near the origin and the return map close to the homoclinic
orbit outside a neighborhood of the origin, we
find that the full flow close to the homoclinic orbit for
$|\mu|$ sufficiently small is modeled by the approximate
return map

$$x_{n+1} \approx \mu + a x_n^{\delta}, \qquad x_n > 0, \quad \delta = \lambda_2/\lambda_1,$$

with $a$ constant, and the map is undefined if $x_n \le 0$
(if $x_n = 0$ the solution returns on the stable manifold
and tends to the stationary point, while if $x_n < 0$ we
have no information about the behavior of the unstable
manifold).

Figure 3 Parameter ($\mu$) versus period ($T$) plots of the simple
periodic orbit as it approaches the homoclinic orbit at $\mu = 0$
with $T \to \infty$. (a) Real eigenvalues, or a saddle-focus, with
$\rho/\lambda > 1$. (b) A saddle-focus with $\rho/\lambda < 1$ (the Shilnikov
case).

Since $\delta > 1$, there is a simple fixed point if
$\mu > 0$ at $x^* \approx \mu$ that is stable since the slope of the
map is small, of order $x^{\delta - 1}$. Most of the time is spent
in the neighborhood of the stationary point, and so as
$\mu$ tends to zero from above, the period scales like

$$T \sim -\frac{1}{\lambda_1}\log\mu,$$

i.e., the period tends to infinity as the periodic orbit
approaches the homoclinic orbit itself. If $\mu < 0$ then
there is no periodic solution
locally. The effect of the bifurcation is to destroy or
create a stable periodic orbit. The period of this orbit
as a function of the parameter is sketched in figure 3(a).
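
The approximate return map $x_{n+1} \approx \mu + a x_n^{\delta}$ with $\delta > 1$ is easy to explore numerically. The sketch below iterates it to its fixed point, evaluates the slope there, and computes the leading-order period estimate $T \sim -(1/\lambda_1)\log\mu$. The constants $a = 1$, $\delta = 2$, and $\lambda_1 = 1$ are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def return_map(x, mu, a=1.0, delta=2.0):
    """Approximate return map near a planar homoclinic orbit (defined for x > 0)."""
    if x <= 0:
        raise ValueError("the return map is only defined for x > 0")
    return mu + a * x ** delta


def fixed_point(mu, a=1.0, delta=2.0, iterations=200):
    """Find the fixed point near x = mu by direct iteration (mu > 0 small)."""
    x = mu
    for _ in range(iterations):
        x = return_map(x, mu, a, delta)
    return x


def slope_at(x, a=1.0, delta=2.0):
    """Derivative of the return map, of order x**(delta - 1)."""
    return a * delta * x ** (delta - 1.0)


def approx_period(mu, lambda1=1.0):
    """Leading-order period estimate T ~ -(1 / lambda1) * log(mu)."""
    return -math.log(mu) / lambda1


if __name__ == "__main__":
    for mu in (1e-2, 1e-3, 1e-4):
        x_star = fixed_point(mu)
        print(mu, x_star, slope_at(x_star), approx_period(mu))
```

As $\mu$ decreases toward zero the fixed point approaches $x = 0$, its slope shrinks, and the period estimate grows logarithmically.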


4.2 Homoclinic Bifurcations to Stationary Points:
Lorenz-Like Flows

In $\mathbb{R}^3$, the bifurcation of a homoclinic orbit to a stationary
point with a Jacobian matrix that has real eigenvalues
is typically similar to the planar case. A periodic
orbit is created or destroyed by the bifurcation.
An interesting exception occurs if the system has the
symmetry $(x,y,z) \to (-x,-y,z)$ with linearization

$$\begin{aligned}
\dot{x} &= \lambda_1 x + f_1(x,y,z;\mu), \\
\dot{y} &= -\lambda_2 y + f_2(x,y,z;\mu), \\
\dot{z} &= -\lambda_3 z + f_3(x,y,z;\mu),
\end{aligned}$$

with $\lambda_2 > \lambda_3 > 0$ and $\lambda_1 > 0$, and again the functions $f_i$ vanish
at the origin for all $\mu$ and denote nonlinear terms.
In this case the second branch of the unstable manifold
(the branch in $x < 0$) is the symmetric image of
the branch in $x > 0$, and analysis similar to that of
the previous subsection implies that the return map is
defined in both $x > 0$ and $x < 0$ by symmetry:

$$x_{n+1} \approx \begin{cases} \mu + a x_n^{\delta}, & x_n > 0, \\ -\mu - a|x_n|^{\delta}, & x_n < 0, \end{cases}$$

with $\delta = \lambda_3/\lambda_1 < 1$. If $a > 0$ then the slope of the map
tends to infinity as $x$ tends to 0 from above or below,
so there are no stable periodic orbits. In fact, if $\mu > 0$
all solutions leave a neighborhood of $x = 0$. However,
if $\mu < 0$ then there is an unstable chaotic set of solutions
similar to the chaotic set for the map $2x \pmod 1$
in the article on dynamical systems [IV.20 §3.2]. It is
this chaotic set, or more accurately a subset of this set,
that stabilizes at higher parameter values to become
the LORENZ ATTRACTOR [III.20].

4.3 Homoclinic Bifurcations to Stationary Points:
Shilnikov Flows

In generic systems a homoclinic orbit will approach the
stationary point tangential to the eigenspace associated
with the eigenvalue of the Jacobian matrix with
smallest positive real part as $t \to -\infty$ and the eigenvalue
with negative real part having the smallest modulus
as $t \to \infty$. These are called the leading eigenvalues
of the stationary point. In higher dimensions it is only
the leading eigenvalues that determine the behavior of
generic systems. The results can be phrased for general
flows in $\mathbb{R}^p$, but for simplicity we consider only $p = 3$
and $p = 4$.

This means that there are two extra cases: the saddle-focus,

$$\begin{aligned}
\dot{x} &= \lambda x + f_1(x,y,z;\mu), \\
\dot{y} &= -\rho y + \omega z + f_2(x,y,z;\mu), \\
\dot{z} &= -\omega y - \rho z + f_3(x,y,z;\mu),
\end{aligned}$$

where $\rho$ and $\lambda$ are positive; and the focus-focus

$$\begin{aligned}
\dot{x} &= \rho x + \omega y + f_1(x,y,z,w;\mu), \\
\dot{y} &= -\omega x + \rho y + f_2(x,y,z,w;\mu), \\
\dot{z} &= -R z + \Omega w + f_3(x,y,z,w;\mu), \\
\dot{w} &= -\Omega z - R w + f_4(x,y,z,w;\mu),
\end{aligned}$$

with $R, \rho > 0$, and $\Omega$ and $\omega$ nonzero. In both sets of
equations the functions $f_i$ are nonlinear and vanish at
the origin for all $\mu$.

The dynamics associated with the existence of homoclinic
orbits to stationary points described locally
by the saddle-focus or focus-focus was analyzed by
Leonid Shilnikov in the mid-to-late 1960s. For the
saddle-focus with $\rho/\lambda > 1$, the situation is similar to
the planar case: a periodic orbit can be continued with
changing parameter and approaches the homoclinic
orbit monotonically as in figure 3(a). In the remaining
cases the situation is more exotic. Shilnikov proved
that at the parameter value for which the homoclinic
orbit exists there are a countable number of (unstable)
periodic orbits and the dynamics contains horseshoes
(and therefore unstable chaotic sets: see dynamical systems
[IV.20]). Concentrating on the simplest periodic orbits
(those that visit a neighborhood of the origin once in
each period), it is possible to show that in the saddle-focus
case with $\rho/\lambda < 1$ these exist with period $T$ at
parameters $\mu$, where (to lowest order)

$$\mu \sim A e^{-\rho T} \cos(\omega T + \phi),$$

which is an oscillatory approach to $\mu = 0$ as $T \to \infty$,
as shown in figure 3(b). The turning points here
correspond to saddle-node bifurcations, and on every
branch involving stable orbits in the saddle-node bifurcations
the orbit loses stability via period doubling
before it reaches $\mu = 0$. Moreover, there are infinite
sequences of more complicated homoclinic orbits
accumulating on $\mu = 0$; these homoclinic orbits pass
through a neighborhood of the stationary point more
than once before tending to that point, and they are
called multipulse homoclinic orbits. The focus-focus
case is similar, but the accumulations can be more
complicated.

4.4 Homoclinic Tangles 

Whereas the intersection of stable and unstable mani- 
folds is generally a codimension-one phenomenon for 
flows, it is persistent for maps. This is because the 
intersection must be at least one dimensional for flows, 
whereas for maps the existence of a point of inter- 
section implies the existence of only a countable set 
of intersections. If a point $p$ is on the intersection
of the stable and unstable manifolds of a fixed point
of a smooth invertible map $f$, then so are its preimages
$f^{-n}(p)$, $n = 1, 2, 3, \ldots$, which accumulate on
the fixed point along its unstable manifold, and its
images $f^{n}(p)$, which accumulate on the fixed point
along its stable manifold. This gives rise to homoclinic
tangles, as illustrated in figure 4(a). Transverse intersections
of stable and unstable manifolds imply that
chaotic sets exist, although the chaotic dynamics is
not stable. This phenomenon also occurs in periodically
forced systems, which can be analyzed via stroboscopic
(or Poincaré) maps and provides one of the few
ways of proving that a system has chaotic solutions. A 
range of techniques, known as Melnikov methods, have 
been developed to prove the existence of intersections 
between stable and unstable manifolds. 




Figure 4 (a) Homoclinic tangles for maps. (b) Homoclinic tangency.

4.5 Homoclinic Tangencies 

The creation of persistent transverse intersections of 
stable and unstable manifolds as a parameter varies 
involves the existence of homoclinic tangencies at a 
critical parameter value as illustrated in figure 4(b). On 
one side of this value of the parameter there is no inter- 
section locally, and on the other there is a transverse 
intersection. The existence of a homoclinic tangency 
implies all sorts of exotic dynamics at nearby parame- 
ters. In the 1970s Sheldon Newhouse proved that there 
are parameters close to the bifurcation point at which 
the system has a countably infinite number of stable 
periodic orbits, “infinitely many sinks,” and by the end 
of the 1990s it had been established that there are 
nearby parameter values (a positive measure of them) 
at which the system has a strange attractor. Moreover, 
the existence of a homoclinic tangency implies the exis- 
tence of homoclinic tangencies for higher iterates of 
the map at nearby parameter values at which the same 
analysis holds, so the whole picture repeats for higher 
periods. 


5 Cascades of Bifurcations 

By the mid-1970s numerical simulations had shown 
that simple families of maps such as the quadratic (or 
logistic) map 

X n +1 = PX n ( 1 - X n ) 

can have many complicated bifurcations (see the lo- 
gistic equation [III.19]). Mitchell Feigenbaum showed 
that a much stronger and more precise statement is 
possible. For smooth families of maps there are infinite 
sequences (cascades) of period-doubling bifurcations. 
Thus there are parameter values p n at which an orbit 
of period 2” has a period-doubling bifurcation, creat- 
ing an orbit of period 2 n+1 , n = 0, 1, 2, ... , and these 


IV.21. Bifurcation Theory 


399 


accumulate on a limit p* at a rate 

lim hn - 1 - Pn = s 

n — co p n p n + 1 

or \p n - p*| ~ CS~ n . Most remarkably, 5 is a con- 
stant independent of the details of the map, a prop- 
erty referred to as quantitative universality. In fact, this 
“Feigenbaum constant” 5 should really be thought of as 
a function of the order of the turning point of the map; 
if it behaves like \x\ a with a > 1 , then the rate is a 
function of a. For the typical quadratic turning point, 

5 ~ 4.669201609.... 

This quantitative universality also manifests itself in 
the phase space as a sort of self-similarity under scal- 
ing. Feigenbaum was able to understand this by appeal- 
ing to arguments based on renormalization analysis 
from statistical physics. At the accumulation point 
p*, the second iterate of the map, restricted to a 
smaller interval and suitably rescaled, behaves in 
almost exactly the same way as the map itself: they 
both have periodic orbits of period $2^n$ for all $n = 0, 1, 2, 3, \ldots$. This idea can be exploited by defining a
map $T$ on unimodal functions $f\colon [-1, 1] \to [-1, 1]$
with a maximum at $x = 0$ and with $f(0) = 1$:

$$Tf(x) = -\alpha^{-1} f \circ f(-\alpha x),$$

where $\alpha = -f(1)$. Seen as a map on an appropriately
defined function space, it turns out that $T$ has a fixed
point $f^*$ (so $Tf^* = f^*$) with a codimension-one stable
manifold consisting of maps corresponding to the
special accumulation values of parameters $p^*$ for some
family, and a one-dimensional unstable manifold with
eigenvalue $\delta > 1$ (it is actually a little more complicated
than this, but this statement contains the essential
structure). The universal accumulation rate $\delta$ is
therefore the unstable eigenvalue of a map in function
space, and the universal scaling is due to the fact that
under renormalization, maps on the stable manifold of
$f^*$ accumulate on $f^*$ and so the spatial scaling $\alpha$ tends
to $-f^*(1)$.
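
As a brief consistency check, assuming the renormalization operator takes the standard form $Tf(x) = -\alpha^{-1} f(f(-\alpha x))$ with $\alpha = -f(1)$, the choice of $\alpha$ is exactly what is needed for $T$ to preserve the normalization $f(0) = 1$:

```latex
\[
  (Tf)(0) = -\alpha^{-1} f\bigl(f(0)\bigr)
          = -\alpha^{-1} f(1)
          = -\alpha^{-1}(-\alpha)
          = 1 .
\]
```

So $T$ maps the class of normalized functions to itself, which is what allows one to look for a fixed point $f^*$ within that class.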

These period-doubling cascades are not limited to 
the doubling of fixed points. They can also be associated
with periodic orbits of any period, creating cascades
of period $2^n p$ for some fixed $p$. Period-doubling
cascades also occur in higher dimensions and have 
been observed numerically in simulations of partial dif- 
ferential equations and experimentally in, for example, 
the changing convection patterns of liquid helium. The 
Lorenz maps of section 4.2 can also display cascades
of homoclinic bifurcations if the saddle index $\lambda_3/\lambda_1$
is greater than one, due to a simple correspondence
between the Lorenz maps and unimodal maps based
on the symmetry of the system.

Many codimension-two bifurcations involve the exis- 
tence of infinitely many bifurcation curves, and these 
again can lead to cascades of bifurcations along appro- 
priately chosen paths in parameter space. The special 
role played by period-doubling of orbits of period $2^n$ is
that it is a natural route to chaos: on one side of the
accumulation point of the period-doubling sequence
associated with a fixed point of the map, the map is
simple, in the sense that it has only a finite number of 
periodic orbits, while on the other side it is chaotic with 
infinitely many periodic orbits of different periods (this 
is not true of cascades associated with other periods in 
general). 

6 Codimension-Two Bifurcations 

Codimension-two bifurcations are associated with spe- 
cial properties of a flow that can be observed in two- 
parameter systems but not (typically) in one-parameter 
families. Since there is often scope to vary more than 
one parameter in systems of interest, and because 
they act as useful organizing centers for describing the 
dynamics of one-parameter families and are the obvi- 
ous next step after the codimension-one cases, many 
codimension-two bifurcations have been catalogued. 
Here we give two representative examples. One, the 
Takens-Bogdanov bifurcation, involves two zero eigen- 
values at a stationary point and therefore generalizes 
the idea of local bifurcations to two-parameter systems. 
The other, a gluing bifurcation, occurs when there is a 
special value in parameter space at which two homo- 
clinic orbits exist without a symmetry. This provides 
an example with an infinite set of bifurcations but no 
chaos. 

6.1 Takens-Bogdanov Bifurcations 

If a stationary point of a flow has two zero eigenvalues,
then the linear part of the differential equation
on the center manifold in appropriate coordinates will
typically have the form

$$\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \tag{3}$$

after a linear change of coordinates (there is obviously 
another possibility with every coefficient zero but this 
is more complicated and nongeneric). A natural way 
to unfold this singularity is to introduce two small
parameters, $p$ and $v$, and consider the linear part

$$\begin{pmatrix} 0 & 1 \\ p & v \end{pmatrix}$$

with characteristic equation $s^2 - vs - p = 0$. If $p = 0$
there is a simple root (so a bifurcation such as a saddle 
node or pitchfork), and if v = 0 and p < 0 there is a 
purely imaginary pair of eigenvalues, suggesting a Hopf 
bifurcation (see section 2). By considering the nonlin- 
ear terms of lowest order after successive near-identity 
changes of coordinates, it is possible to show that the 
local behavior is typically modeled by the normal form 

$$\dot{x} = y, \qquad \dot{y} = p + vy + x^2 + bxy,$$

where we treat p and v as small parameters and, after 
scaling, b = ±1. In what follows we treat the case b = 1 
as an example. If p = v = 0 then the origin is a sta- 
tionary point and the Jacobian is (3). The bifurcation at 
p = 0 is a saddle node (except in the degenerate case 
$v = 0$), and if $p = -v^2$ ($p < 0$), there is a Hopf bifurcation
creating a periodic orbit that exists in $v^2 < -p$
immediately after the bifurcation. Thinking about a circular
path in parameter space enclosing the origin, it is
clear that there must be another bifurcation (the periodic
orbit is created but never destroyed), and in fact
there is a homoclinic bifurcation on a curve starting at
the origin given to leading order by $p = -\frac{49}{25}v^2$ that
creates/destroys the periodic orbit as expected.
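
The Hopf condition can be checked directly from the normal form with $b = 1$. The sketch below (helper names are illustrative, not from the text) finds the stationary points, which satisfy $y = 0$ and $x^2 = -p$, and computes the eigenvalues of the Jacobian at each of them. On the curve $p = -v^2$ with $v > 0$, the left-hand stationary point should have a purely imaginary pair, while the right-hand one is a saddle.

```python
import cmath
import math


def equilibria(p):
    """Stationary points (x, 0) of x' = y, y' = p + v*y + x**2 + b*x*y (needs x**2 = -p)."""
    if p > 0:
        return []
    r = math.sqrt(-p)
    return [(-r, 0.0), (r, 0.0)]


def jacobian(x, y, v, b=1.0):
    """Jacobian of the normal form at (x, y); it does not depend on p explicitly."""
    return ((0.0, 1.0), (2.0 * x + b * y, v + b * x))


def eigenvalues(j):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    trace = j[0][0] + j[1][1]
    det = j[0][0] * j[1][1] - j[0][1] * j[1][0]
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0


if __name__ == "__main__":
    v = 0.1
    p = -v * v  # on the Hopf curve
    for x, y in equilibria(p):
        print((x, y), eigenvalues(jacobian(x, y, v)))
```

This is a linear check only; it does not locate the homoclinic curve, which requires a global calculation.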

6.2 A Gluing Bifurcation 

Suppose that at some parameter a system has a pair 
of homoclinic orbits in the same configuration as the 
Lorenz equations but without the symmetry and with 
saddle index $\lambda_3/\lambda_1 = \delta > 1$ (see section 4.2). There
is then a natural two-parameter unfolding locally such 
that if one parameter is zero, one of the homoclinic 
orbits persists, and if the other parameter is zero, then 
the other homoclinic orbit exists, with both therefore 
existing at the intersection of the two axes. In the same 
way that an approximate return map was derived for 
the Lorenz-type flows of section 4.2, the analysis of this 
more general configuration leads to the approximate 
map 

$$x_{n+1} = \begin{cases} -p + a x_n^{\delta}, & x_n > 0, \\ v - a|x_n|^{\delta}, & x_n < 0, \end{cases}$$

with $\delta > 1$ and $a > 0$. There are now two small parameters,
$p$ and $v$, and since $\delta > 1$ the derivative of the
map is small if $|x|$ is small and nonzero. These maps
have many similarities with maps of the circle that can 
be exploited to show that as the parameter varies in a 
path from ( p , v) = (f , 0 ) to (e , 0 ) for some small e > 0 , 
the proportion of points of the attractor varies contin- 
uously from zero to one, and if it is rational then there 
is a stable periodic orbit with that proportion of points 
in x > 0. These stable periodic orbits exist on small 
intervals of the parameters in a form of mode locking. 

7 Bits and Pieces 

There are many other areas that could be covered in 
a review such as this, and in this final section we 
summarize just five of them.

7.1 Saddle-Node on Invariant Circle 

Suppose that a saddle-node bifurcation of a station- 
ary point occurs on a periodic orbit for a flow. At first 
sight this might appear to be a codimension-two bifur- 
cation: one parameter for the saddle node and the other 
to locate it on a periodic orbit. However, if the peri- 
odic orbit is stable, then at the point of saddle-node 
bifurcation the connection between the weakly unsta- 
ble direction and the stable manifold is persistent, so 
it is in fact codimension one. These bifurcations, called 
SNICs (saddle-node on invariant circle), create a sta- 
ble periodic orbit from a stable-saddle pair of station- 
ary points. The period of the orbit tends to infinity as 
the bifurcation point is approached like the inverse of 
the square root of the parameter (cf. the logarithmic
divergence for homoclinic bifurcations in section 4). 
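
The inverse-square-root law can be seen in a simple model. The sketch below assumes, purely for illustration, the flow $\dot\theta = p + 1 - \cos\theta$ on a circle; this model is not taken from the text. For $p < 0$ it has a pair of stationary points, which merge at $p = 0$, and for $p > 0$ the motion is periodic with period $T(p) = \int_0^{2\pi} d\theta/(p + 1 - \cos\theta) = 2\pi/\sqrt{p(p+2)}$. The code compares a midpoint-rule evaluation of the integral with this closed form.

```python
import math


def period_numeric(p, n=200_000):
    """Midpoint-rule value of the integral of 1 / (p + 1 - cos(theta)) over one turn."""
    h = 2.0 * math.pi / n
    return h * sum(1.0 / (p + 1.0 - math.cos((k + 0.5) * h)) for k in range(n))


def period_exact(p):
    """Closed form 2*pi / sqrt(p*(p + 2)) of the same integral, valid for p > 0."""
    return 2.0 * math.pi / math.sqrt(p * (p + 2.0))


if __name__ == "__main__":
    for p in (1e-2, 1e-3, 1e-4):
        # The last column tends to pi*sqrt(2) as p -> 0, i.e. T ~ p**(-1/2).
        print(p, period_numeric(p), period_exact(p), period_exact(p) * math.sqrt(p))
```

The number of quadrature points must be large compared with $p^{-1/2}$, since the integrand is sharply peaked near $\theta = 0$ where the orbit lingers.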

The SNIC in discrete-time systems is again of codi- 
mension one, but this time chaotic solutions exist in a 
neighborhood of the bifurcation and there are complex 
foldings of the unstable manifold as it approaches the 
fixed point. This was described in detail by Newhouse, 
Palis, and Takens in the early 1980s. 

7.2 Intermittency 

Intermittency describes motion that is close to being 
periodic for a long time, called the laminar phase, 
and that then has a chaotic burst before returning 
to the laminar, almost periodic, phase. It can also be 
seen as a prototype for the more general mixed-mode 
oscillations observed in neuroscience and elsewhere. 
Although not strictly a bifurcation, the phenomenon 
of intermittency is seen as parameters vary and this 
makes it worth mentioning here. There are several dif- 
ferent types of intermittency. The simplest case can 
be observed in the quadratic (or logistic) map $x_{n+1} = p x_n(1 - x_n)$. There are parameter values with stable
periodic orbits that can lose stability via saddle-node
bifurcations as the parameter is decreased through a
critical value $p_c$. Just before the saddle-node bifurcation
creates the stable periodic orbit, solutions that
pass close to the locus of the periodic orbit will spend
a long time, proportional to $|p - p_c|^{-1/2}$, close to the
points where the periodic orbit will eventually be created,
before moving away. If there is some reinjection
mechanism due to global properties of the map,
then after moving away the solution may find its way 
back into the laminar region and the process repeats. 
The SNIC is a special case in which the reinjection is 
always at the same point. This sequence of long lam- 
inar regions interspersed by bursts is generalized by 
the idea of mixed-mode oscillations, where the laminar 
region may be replaced by much more complicated but 
localized behavior. 
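
The length of the laminar phase can be estimated from a local model. The sketch below assumes that, near the point where the periodic orbit is about to be created, the relevant iterate of the map is approximated by $y \mapsto y + \epsilon + y^2$, with $\epsilon > 0$ measuring the distance from the bifurcation; this local form, the window $[-0.5, 0.5]$, and the function name are assumptions for illustration. Counting the iterations needed to cross the window gives a value close to $\pi/\sqrt{\epsilon}$.

```python
import math


def laminar_length(eps, y_start=-0.5, y_end=0.5):
    """Count iterations of y -> y + eps + y**2 needed to pass from y_start beyond y_end."""
    y, steps = y_start, 0
    while y <= y_end:
        y = y + eps + y * y  # each step increases y because eps + y**2 > 0
        steps += 1
    return steps


if __name__ == "__main__":
    for eps in (1e-2, 1e-4, 1e-6):
        n = laminar_length(eps)
        # n * sqrt(eps) approaches pi as eps -> 0.
        print(eps, n, n * math.sqrt(eps))
```

Reducing $\epsilon$ by a factor of 100 lengthens the passage by roughly a factor of 10, which is the square-root law. The reinjection mechanism is not modeled here.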


7.3 Hamiltonian Flows 


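If the defining differential equation can be derived from
a Hamiltonian, so the phase space can be divided into
two sets of variables $(q, p) \in \mathbb{R}^n \times \mathbb{R}^n$ with a function
$H(q, p)$ such that

$$\dot{q}_i = \frac{\partial H}{\partial p_i}, \qquad \dot{p}_i = -\frac{\partial H}{\partial q_i}, \qquad i = 1, \ldots, n,$$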


then this special structure (and generalizations of it) 
implies that the dynamics has very special features. 
These equations are important because they include 
many Newtonian models where friction is ignored. The 
equations of motion imply that H is constant on solu- 
tions, so solutions lie on level sets of the function 
H. This in turn implies that features that are spe- 
cial (that is, have codimension one) in generic systems 
are robust in Hamiltonian systems. A good example 
is the existence of homoclinic orbits, so the global
bifurcations of Hamiltonian systems are very differ- 
ent from those of the more general systems described 
in section 4. (Note that the global bifurcation theo- 
rems given above assumed that the homoclinic orbits 
exist on codimension-one surfaces in parameter space.) 
Even local bifurcations are special: a stationary point 
of a two-dimensional Hamiltonian system is typically 
either a saddle or a center, and this means that the 
saddle-node bifurcation involves the creation of a cen- 
ter and a saddle, with a set of periodic motions around 
the center, bounded by a robust homoclinic orbit! The
differential equation with Hamiltonian

$$H(q, p) = \tfrac{1}{2}p^2 + \mu q - \tfrac{1}{3}q^3$$

has just such a transition as the parameter $\mu$ passes through zero.
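
This example is easy to explore numerically. In the sketch below the parameter is written `mu` to keep it distinct from the momentum `p` (a naming choice made here), so that Hamilton's equations read $\dot q = p$, $\dot p = -\mu + q^2$. For $\mu > 0$ the stationary points are $q = \pm\sqrt{\mu}$, $p = 0$, and the linearization has eigenvalues satisfying $\lambda^2 = 2q$, giving a saddle at $q = \sqrt{\mu}$ and a center at $q = -\sqrt{\mu}$. The code integrates an orbit near the center with a classical fourth-order Runge-Kutta step (an assumed, non-symplectic choice of integrator) and records how far $H$ drifts from its initial value.

```python
def hamiltonian(q, p, mu):
    """H(q, p) = p**2 / 2 + mu*q - q**3 / 3."""
    return 0.5 * p * p + mu * q - q ** 3 / 3.0


def vector_field(q, p, mu):
    """Hamilton's equations: q' = dH/dp = p, p' = -dH/dq = -(mu - q**2)."""
    return p, -(mu - q * q)


def rk4_step(q, p, mu, dt):
    """One classical Runge-Kutta step of size dt."""
    k1q, k1p = vector_field(q, p, mu)
    k2q, k2p = vector_field(q + 0.5 * dt * k1q, p + 0.5 * dt * k1p, mu)
    k3q, k3p = vector_field(q + 0.5 * dt * k2q, p + 0.5 * dt * k2p, mu)
    k4q, k4p = vector_field(q + dt * k3q, p + dt * k3p, mu)
    q += dt * (k1q + 2.0 * k2q + 2.0 * k3q + k4q) / 6.0
    p += dt * (k1p + 2.0 * k2p + 2.0 * k3p + k4p) / 6.0
    return q, p


def energy_drift(q0, p0, mu, dt=1e-3, steps=20_000):
    """Largest deviation of H from its initial value along the computed orbit."""
    q, p = q0, p0
    h0 = hamiltonian(q, p, mu)
    worst = 0.0
    for _ in range(steps):
        q, p = rk4_step(q, p, mu, dt)
        worst = max(worst, abs(hamiltonian(q, p, mu) - h0))
    return worst


if __name__ == "__main__":
    # Start close to the center at q = -1 for mu = 1.
    print(energy_drift(-0.9, 0.0, 1.0))
```

The small residual drift is an artifact of the integrator; on exact solutions $H$ is constant.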

7.4 Homoclinic Snaking 

Since the existence of homoclinic orbits can be persis- 
tent with changing parameters in Hamiltonian systems, 
homoclinic orbits may be continued in parameter space 
in the same way that periodic orbits can be continued 
in more general systems. One interesting feature that 
is observed is homoclinic snaking, where the curve of 
homoclinic orbits oscillates as some measure (such as 
a norm) increases with parameters. Models of pattern 
formation such as the Swift-Hohenberg equations
[IV.27 §2] exhibit this snaking if the time-dependent
solutions are considered in one spatial dimension. 
Localized solutions (solitary waves) can be found that 
tend to zero at large spatial amplitudes but that have 
more and more complicated oscillations between these 
limits. In this case, the continuations of these solutions 
oscillate in parameter space, gaining an extra spatial 
maximum or minimum on each monotonic branch, in 
a pattern that is reminiscent of the oscillation of the 
periodic solution near the Shilnikov homoclinic orbits 
of figure 3, except that the oscillations have a bounded 
amplitude in parameter space rather than decreasing 
amplitude as in the figure. 


7.5 Piecewise-Smooth Systems 


Many applications in mechanics (friction and impacts, 
for example), biology (gene switching), and control 
(threshold control) are modeled by piecewise-smooth 
systems, where the dynamics evolves continuously 
until some switching surface is reached, and then there 
is a transition to a different (but still smooth) dynamics, 
possibly with a reset to another part of phase space. 
These are examples of hybrid systems [II.18]. The
evolution is therefore a sequence of behaviors deter- 
mined by smooth dynamical systems that are stitched 
together across the switching surfaces. New bifurca- 
tions occur when a periodic orbit intersects a switch- 
ing surface tangentially, introducing a new segment to 
the dynamics, or if a fixed point or periodic orbit of a 
map intersects the switching surface. In the example of 
impacting systems, a standard model return map has a 
square root singularity: 


$$x_{n+1} = \begin{cases} p - \sqrt{x_n}, & x_n > 0, \\ p + a x_n, & x_n < 0. \end{cases} \tag{4}$$
The maps obtained in these situations have much in
common with the Lorenz maps discussed earlier with
$\delta < 1$ as the derivative is unbounded as $x$ tends to zero
from above, but they are also continuous unimodal 
maps and so general theorems such as Sharkovskii’s 
theorem (which describes how the existence of a peri- 
odic orbit of a given period implies the existence of 
other periodic orbits in continuous maps) hold (see
the logistic equation [III.19]). When applied to the special
case of unimodal maps this restricts the order
in which periodic orbits are created. However, while 
the logistic map has many windows of stable periodic 
orbits, each with its own period-doubling cascade and 
associated chaotic motion, the stable periodic orbits of 
the square root map (4) are much more constrained. For 
example, if $0 < a < \frac{1}{4}$, the stable periodic orbits form
a period-adding sequence as $p$ decreases through zero.
There is a parameter interval on which a period-$n$ orbit
is the only attractor, followed by a bifurcation creating
a stable period-$(n+1)$ orbit so that the two stable orbits
of period $n$ and period $(n+1)$ coexist. Then there is
a further bifurcation at which the period-$n$ orbit loses
stability and the stable period-$(n+1)$ orbit is the only
attractor. This sequence of behavior is repeated with n 
tending to infinity as p tends to zero. 
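
The period-adding behavior can be observed by iterating the square root map directly. The sketch below is a rough numerical experiment, not a bifurcation analysis: the function names, the starting point, the transient length, and the tolerance are all assumptions, and the value $a = 0.2$ is just one choice in the range discussed. The point $x = 0$ is assigned to the linear branch, which is harmless since both branches give $p$ there.

```python
import math


def square_root_map(x, p, a):
    """One step of the map: p - sqrt(x) for x > 0, p + a*x otherwise."""
    return p - math.sqrt(x) if x > 0 else p + a * x


def attracting_period(p, a, x0=0.0, transient=5000, max_period=64, tol=1e-9):
    """Period of the orbit reached from x0 after a transient, or None if none is detected."""
    x = x0
    for _ in range(transient):
        x = square_root_map(x, p, a)
    orbit = [x]
    for _ in range(max_period):
        orbit.append(square_root_map(orbit[-1], p, a))
    # The smallest k with x_k close to x_0 is taken as the period.
    for k in range(1, max_period + 1):
        if abs(orbit[k] - orbit[0]) < tol:
            return k
    return None


if __name__ == "__main__":
    a = 0.2
    for p in (-0.1, 0.1, 0.01, 0.001):
        print(p, attracting_period(p, a))
```

For $p < 0$ the orbit settles on the fixed point $p/(1 - a)$ of the linear branch; for small positive $p$ the detected period grows as $p$ approaches zero. Because stable orbits of neighboring periods can coexist, the period found may depend on the starting point.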

8 Afterview 

Bifurcation theory provides insights into why certain 
types of dynamics occur and how they arise. In cases 
such as period-doubling cascades it provides a frame- 
work in which to understand the changes in complexity 
of dynamics even if the behavior at a single param- 
eter value might appear nonrepeatable. The number 
of different cases that may need to be considered can 
proliferate, and there is currently a nonuniformity of 
nomenclature that means that it is hard to tell whether 
a particular case has been studied previously in the lit- 
erature. Given that the techniques are useful in any 
discipline that uses dynamic modeling, this aspect is 
unfortunate and leads to many reinventions of the 
same result. However, this only underlines the central 
role played by bifurcation theory in understanding the 
dynamics of mathematical models wherever they occur. 
IV.22 Symmetry in Applied
Mathematics 

Ian Stewart 

We tend to think of the applied mathematical toolkit as 
a collection of specific techniques for precise calcula- 
tions, each intended for a particular kind of problem, 
solving a system of algebraic or differential equations 
numerically, for example. But some of the most power- 
ful ideas in mathematics are so broad that at first sight 
they seem too vague and nebulous to have practical 
implications. Among them are probability, continuity, 
and symmetry. 

Probability started out as a way to capture uncer- 
tainty in gambling games, but it quickly developed into 
a vitally important collection of mathematical tech- 
niques used throughout applied science, economics, 
sociology, even for formulating government policy.

Continuity proved such an elusive concept that it was 
used intuitively for centuries before it could be defined 
rigorously; now it underpins calculus, perhaps the most 
widely used mathematical tool of them all, especially in 
the form of ordinary and partial differential equations. 
But continuity is also fundamental to topology, a rel- 
ative newcomer from pure mathematics that is start- 
ing to demonstrate its worth in the design of efficient 
trajectories for spacecraft, improved methods for fore- 
casting the weather, frontier investigations in quantum 
mechanics, and the structure of biologically important 
molecules, especially deoxyribonucleic acid (DNA). 

Symmetry, the topic of this article, was initially a 
rather ill-defined feeling that certain parts of shapes 
or structures were much the same as other parts of 
those shapes or structures. It has since become a vital 
method for understanding pattern formation through- 
out the scientific world, with applications that range 
from architecture to zoology. Symmetry, it turns out, 
underlies many of the deepest aspects of the natural 
world. Our universe behaves the way it does because of 
its symmetries: of space, time, and matter. Both relativity
and particle physics are based on symmetry principles.
Symmetry methods shed light on difficult problems
by revealing general principles that can help us
find solutions. Symmetry can be static, dynamic, even
chaotic. It is a concept of great generality and deep
abstraction, where beauty and power go hand in hand.

Figure 1 Which shapes are symmetric?

1 What Is Symmetry? 

Symmetry is most easily understood in a geometric set- 
ting. Figure 1 shows a variety of plane figures, some 
symmetric, some not. Which are which? 

The short answer is that the top row consists of sym- 
metric shapes and the bottom row consists of asym- 
metric shapes. But there is more. Not only can shapes 
be symmetric or not, they can have different kinds of
symmetry. The heart (the middle of the top row) has the 
most familiar symmetry, one we encounter every time 
we look at ourselves in a mirror: bilateral symmetry. 
The left-hand side of the figure is an exact copy of the 
right-hand side, but flipped over. Three other shapes in 
the figure are bilaterally symmetric: the circle, the pen- 
tagon, and the circle with a square hole. The pentagon 
with a tilted square hole is not; if you flip it left-right, 
the pentagonal outline does not change but the square 
hole does because of the way it is tilted. 

However, the circle has more than just bilateral sym- 
metry. It would look the same if it were reflected in 
any mirror that runs through its center. The pentagon 
would look the same if it were reflected in any mir- 
ror that runs through its center and passes through a 
vertex. 

What about the fourth shape in the top row, the flow- 
ery thing? If you reflect it in a mirror, it looks different, 
no matter where the mirror is placed. However, if you 
rotate it through a right angle about its center, it looks 
exactly the same as it did to start with. So this shape has 
rotational symmetry for a right-angle rotation. Thus 
primed, we notice that the pentagon also has rotational 
symmetry, for a rotation of 72°, and the circle has 
rotational symmetry for any angle whatsoever. 

Most of the shapes on the bottom row look com- 
pletely asymmetric. No significant part of any of them 
looks much like some other part of the same shape. The
possible exception is the pentagon with a square hole.
The pentagon is symmetric, as we have just seen, and 
so is the square. Surely combining symmetric shapes 
should lead to a symmetric shape? On the other hand, 
it does look a bit lopsided, which is not what we would 
expect from symmetry. With the current definition of 
symmetry, this shape is asymmetric, even though some 
pieces of it are symmetric. 

In the middle of the nineteenth century mathemati- 
cians finally managed to define symmetry, for geomet- 
ric shapes, by abstracting the common idea that unifies 
all of the above discussion. The background involves 
the idea of a transformation (or function or map). Some 
of the most common transformations, in this context, 
are rigid motions of the plane. These are rules for 
moving the entire plane so that the distances between 
points do not change. There are three basic types. 

Translations. Slide the plane in a fixed direction so
that every point moves the same distance.

Rotations. Choose a point, keep this fixed, and spin the
entire plane around it through some angle.

Reflections. Choose a line, think of it as a mirror, and
reflect every point in it.

These transformations do not exhaust the rigid motions
of the plane, but every rigid motion can be obtained
by combining them. One of the new transformations
produced in this way is the glide reflection:
reflect the plane in a line and then translate it along 
the direction of that line. 

Having defined rigid motions, we can now define 
what a symmetry is. Given some shape in the plane, 
a symmetry of that shape is a rigid motion of the plane 
that leaves the shape as a whole unchanged. Individ- 
ual points in the shape may move, but the end result 
looks exactly the same as it did to start with, not just
in terms of shape, but also location. 

For example, if we reflect the pentagon about a line 
through its center and a vertex, then points not on the 
line flip over to the other side. But they swap places in 
pairs, each landing where the other one started from, so 
the final position of the pentagon fits precisely on top 
of the initial position. This is false for the pentagon with 
a square hole, and this is the reason why that shape is 
not considered to possess symmetry. 
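
This definition translates directly into a test. The sketch below represents a regular polygon by its vertices, stored as complex numbers with the center at the origin, and declares a rigid motion to be a symmetry if it maps the set of vertices onto itself. Using the vertex set as a stand-in for the whole shape is an assumption that is adequate for a convex polygon and motions fixing its center; all names are illustrative.

```python
import cmath
import math


def regular_polygon(n):
    """Vertices of a regular n-gon centered at the origin, one vertex on the real axis."""
    return [cmath.exp(2j * math.pi * k / n) for k in range(n)]


def same_point_set(a, b, tol=1e-9):
    """True if every point of a lies within tol of a point of b, and vice versa."""
    return (all(any(abs(z - w) < tol for w in b) for z in a)
            and all(any(abs(z - w) < tol for z in a) for w in b))


def rotate(points, angle):
    """Rotate about the origin through the given angle (radians)."""
    return [z * cmath.exp(1j * angle) for z in points]


def reflect_in_real_axis(points):
    """Reflect in the line through the center and the vertex on the real axis."""
    return [z.conjugate() for z in points]


def is_symmetry(transform, points):
    """A transformation is a symmetry if it leaves the point set as a whole unchanged."""
    return same_point_set(transform(points), points)


if __name__ == "__main__":
    pentagon = regular_polygon(5)
    print(is_symmetry(lambda pts: rotate(pts, 2 * math.pi / 5), pentagon))
    print(is_symmetry(lambda pts: rotate(pts, math.pi / 2), pentagon))
    print(is_symmetry(reflect_in_real_axis, pentagon))
```

Individual vertices move under the 72° rotation, but the set of vertices does not, which is exactly the distinction drawn in the definition.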

The shapes in figure 1 are all of finite size, so they 
cannot possess translational symmetry. Translational 
symmetries require infinite patterns. A typical example 
is a square tiling of the entire plane, like bathroom tiles 
but continued indefinitely. If this pattern is translated
sideways by the width of one tile, it remains unchanged. 
The same is true if it is translated upward by the width 
of one tile. It follows that, if the tiling is translated an 
integer number of widths in either of these two direc- 
tions, it again remains unchanged. The symmetry group 
of translations is a lattice, consisting of integer com- 
binations of two basic translations. The square tiling 
also has rotational symmetries, through multiples of 
90° about the center of a tile or a corner where four 
tiles meet. Another type of rotational symmetry is rota- 
tion through 180° about the center of the edge of a tile. 
The square tiling pattern also has various reflectional 
symmetries. 

Lattices in the plane can be interpreted as wallpa- 
per patterns. In 1891 the pioneer of mathematical crys- 
tallography Yevgraf Fyodorov proved that there are 
exactly 17 different symmetry classes of wallpaper pat- 
terns. George Polya obtained the same result indepen- 
dently in 1924. Lattice symmetries, often in combina- 
tion with rotations and reflections, are of crucial impor- 
tance in crystallography, but now the “tiling” is the 
regular atomic structure of the crystal, and it repeats 
in three-dimensional space along integer combinations 
of three independent directions. Again there may also 
be rotations and reflections. The physics of a crystal 
is strongly influenced by the symmetries of its atomic 
lattice. In the 1890s Fyodorov, Arthur Schoenflies, and
William Barlow proved that there are 230 symmetry 
types of lattice, or 219 if certain mirror-image pairs are 
considered to be the same. 

Of course, no physical crystal can be of infinite 
extent. However, the idealization to infinite lattices can 
be a very accurate model for a real crystal because the 
size of the crystal is typically much larger than the lat- 
tice spacing. Real crystals differ from this ideal model 
in many ways: dislocations, where the lattice fails to 
repeat exactly; grain boundaries, where local lattices 
pointing in different directions meet; and so on. All 
applied mathematics involves a modeling step, repre- 
senting the physical system by a simplified and ideal- 
ized mathematical model. What matters is the extent 
to which the model provides useful insights. Its failure 
to include certain features of reality is not, of itself, a 
valid criticism. In fact, such a failure can be a virtue if 
it makes the analysis simpler without losing anything 
important. 

Symmetries need not be rigid motions. Another geometric
symmetry is dilation, that is, change of scale. A logarithmic
spiral, found in nature as a nautilus shell,
remains unchanged if it is dilated by some fixed amount
and also rotated through an appropriate angle.
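
To make the "appropriate angle" explicit, write the spiral in polar coordinates as $r = c\,e^{b\theta}$, where $c > 0$ and $b \neq 0$ are constants (notation introduced here for illustration). Dilating by a factor $k > 0$ sends the point $(r, \theta)$ to $(kr, \theta)$, and

```latex
\[
  k\,r = c\,e^{b\theta + \ln k} = c\,e^{b(\theta + \phi)},
  \qquad \phi = \frac{\ln k}{b} .
\]
```

So after a further rotation through the angle $\phi = (\ln k)/b$ the point $(kr, \theta + \phi)$ lies on the original spiral again: the dilation combined with this rotation is a symmetry.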

Symmetry does not apply only to shapes; it is 
equally evident in mathematical formulas. For example,
$x + y + z$ treats the three variables $x$, $y$, and $z$
in exactly the same manner. But the expression $x^3 + y - 2z^2$
does not. The first formula is symmetric in
$x$, $y$, and $z$; the second is not. This time the transformations
concerned are permutations of the three
symbols, that is, ways to swap them around. However we permute
them, $x + y + z$ stays the same. For instance, if
we swap $x$ and $y$ but leave $z$ the same, the expression
becomes $y + x + z$. By the laws of algebra, this equals
$x + y + z$. But the same permutation applied to the
other expression yields $y^3 + x - 2z^2$, which is clearly
different.
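
The same comparison can be made mechanically. The sketch below evaluates an expression at every permutation of a few sample triples and reports whether the value ever changes. This is a numerical spot check, not a proof of symmetry, and the helper name and sample values are arbitrary choices.

```python
from itertools import permutations


def looks_symmetric(expr, samples=((2.0, 3.0, 5.0), (-1.0, 4.0, 0.5))):
    """True if expr gives the same value under every permutation of each sample triple."""
    for sample in samples:
        base = expr(*sample)
        if any(abs(expr(*perm) - base) > 1e-12 for perm in permutations(sample)):
            return False
    return True


if __name__ == "__main__":
    print(looks_symmetric(lambda x, y, z: x + y + z))
    print(looks_symmetric(lambda x, y, z: x ** 3 + y - 2 * z ** 2))
```

At the sample $(2, 3, 5)$ the second expression takes the value $-39$, whereas swapping $x$ and $y$ gives $-21$, so it fails the test.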

Symmetry thereby becomes a very general concept. 
Given any mathematical structure, and some class of 
transformations that can act on the structure, we define 
a symmetry to be any transformation that preserves the 
structure, that is, leaves it unchanged.

If a physical system has symmetry, then most sensi- 
ble mathematical models of that system will have corre- 
sponding symmetries. (I say “most” because, for exam- 
ple, numerical methods cannot always incorporate all 
symmetries exactly. No computer model of a circle can 
be unchanged by all rotations. This inability of numer- 
ical methods to capture all symmetries can sometimes 
cause trouble.) The precise formulation of symmetry 
for a given model depends on the kind of model being 
used and its relation to reality. 

2 Symmetry Groups 

The above definition tells us that a symmetry of some 
structure (shape, equation, process) is not a thing but 
a transformation. However, it also tells us something 
deeper: structures may have several different sym- 
metries. Indeed, some structures, such as the circle, 
have infinitely many symmetries. So there is a shift of 
emphasis from symmetry to symmetries, not symme- 
try as an abstract property of a structure but the set of 
all symmetries of the structure. 

This set of transformations has a simple but vital fea- 
ture. Transformations can be combined by performing 
them in turn. If two symmetries of some structure are 
combined in this way, the result is always a symmetry 
of that structure. It is not hard to see why: if you do not 
change something, and then you again do not change 
it . . . you do not change it. This feature is known as the 
“group property,” and the set of all symmetry transfor-
mations, together with this operation of composition, 
is the symmetry group of the structure. 

The shapes in figure 1 illustrate several common
kinds of symmetry group. The rotations of the circle
form the circle group or special orthogonal group in
the plane, denoted by $S^1$ or SO(2). The reflections and
rotations of the circle form the orthogonal group in
the plane, denoted by O(2). The rotations of the pentagon
form the cyclic group $\mathbb{Z}_5$ of order 5 (the order
of a group is the number of transformations that it
contains). The rotations of the flower shape form the
cyclic group $\mathbb{Z}_4$ of order 4. There is an analogous cyclic
group $\mathbb{Z}_n$ whose order is any positive integer $n$; it can
be defined as the group of rotational symmetries of
a regular $n$-sided polygon. If we also include the five
reflectional symmetries of the pentagon, we obtain the
dihedral group $D_5$, which has order 10. There is an
analogous dihedral group $D_n$ of order $2n$ for any positive
integer $n$: the rotations and reflections of a regular
$n$-sided polygon. Finally, we mention the symmetric
group $S_n$ of all permutations of a set with $n$ elements,
such as $\{1, 2, 3, \ldots, n\}$. This has order $n!$, that
is, $n(n-1)(n-2)\cdots 3 \cdot 2 \cdot 1$.
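
These finite groups are small enough to enumerate. The sketch below represents each symmetry of a regular $n$-sided polygon by the permutation it induces on the vertex labels $0, \ldots, n-1$, builds the cyclic, dihedral, and symmetric groups as sets of such permutations, and checks the orders quoted above together with the group property (closure under composition). The representation and the function names are choices made for illustration, and the dihedral construction assumes $n \geq 3$ so that rotations and reflections give distinct permutations.

```python
from itertools import permutations
from math import factorial


def compose(g, h):
    """The permutation 'first h, then g'; a permutation is a tuple of images."""
    return tuple(g[h[i]] for i in range(len(h)))


def cyclic_group(n):
    """Rotations of a regular n-gon, acting on vertex labels by i -> (i + k) mod n."""
    return {tuple((i + k) % n for i in range(n)) for k in range(n)}


def dihedral_group(n):
    """Rotations together with the reflections i -> (k - i) mod n (assumes n >= 3)."""
    reflections = {tuple((k - i) % n for i in range(n)) for k in range(n)}
    return cyclic_group(n) | reflections


def symmetric_group(n):
    """All permutations of the labels 0..n-1."""
    return set(permutations(range(n)))


def is_closed(group):
    """Group property: composing any two members gives another member."""
    return all(compose(g, h) in group for g in group for h in group)


if __name__ == "__main__":
    print(len(cyclic_group(5)), len(dihedral_group(5)), len(symmetric_group(4)))
    print(is_closed(cyclic_group(5)), is_closed(dihedral_group(5)))
    print(factorial(4))
```

The reflections on their own are not closed under composition, since two reflections combine to give a rotation, which is why the dihedral group must contain both kinds of transformation.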

Even asymmetric shapes have some symmetry: the
identity transformation “leave everything as it is.” The
symmetry group contains only this trivial but useful
transformation, and it is symbolized by 1.

We mention one useful piece of terminology. Often 
one group sits inside a bigger one. For example, SO(2)
is contained in O(2), and $\mathbb{Z}_n$ is contained in $D_n$. In such
cases, we say that the smaller group is a subgroup of 
the bigger one. (They may also be equal, a trivial but 
sensible convention.) 

The study of groups has led to a huge area of 
mathematics known as group theory. Some of it is 
part of abstract algebra, especially when the group is 
finite, that is, contains a finite number of transformations.
Examples are the cyclic, dihedral, and symmetric
groups. Another area, involving analysis and topology, 
is the theory of Lie groups, such as the circle group, 
the orthogonal group, and their analogues in spaces of 
any dimension. Here the main emphasis is on continu- 
ous families of symmetry transformations, which cor- 
respond to all choices of some real number. For exam- 
ple, a circle can be rotated through any real angle. Yet 
another important area is representation theory, which 
studies all the possible ways to construct a given group 
using matrices, linear transformations of some vector 
space. 


One of the early triumphs of group theory in applied 
mathematics was Noether’s theorem, proved by Emmy 
Noether in 1918. This applies to a special type of 
differential equation known as a Hamiltonian system, 
which arises in models of classical mechanics in the 
absence of frictional forces. Celestial mechanics (the
motion of the planets) is a significant example. The
theorem states that whenever a Hamiltonian system 
has a continuous family of symmetries, there is an asso- 
ciated conserved quantity. “Conserved” means that this 
quantity remains unchanged as the system moves. 

The laws of nature are the same at all times: if you 
translate time from t to t + 6, the laws do not look 
any different. These transformations form a continu- 
ous family of symmetries, and the corresponding con- 
served quantity is energy. Translational symmetry in 
space (the laws are the same at every location) corre- 
sponds to conservation of momentum. Rotations about 
some axis in three-dimensional space provide another 
continuous family of symmetries; here the conserved 
quantity is angular momentum about that axis. All 
of the conservation laws of classical mechanics are 
consequences of symmetry. 

3 Pattern Formation 

Symmetry methods come into their own, and nowadays 
are almost mandatory, in problems about pattern for- 
mation. Often the most striking feature of some natural 
or experimental system is the occurrence of patterns. 
Rainbows are colored circular arcs of light. Ripples 
caused by a stone thrown into a pond are expanding cir- 
cles. Sand dunes, ocean waves, and the stripes on a tiger 
or an angelfish are all patterns that can be modeled 
using repeating parallel features. Crystal lattices are 
repeating patterns of atoms. Galaxies form vast spirals, 
which rotate without (significantly) changing shape— a 
group of symmetries combining time translation with 
spatial rotation. 

Many of these patterns arise through a general mech- 
anism called “symmetry breaking.” This is applicable 
whenever the equations that model a physical system 
have symmetry. I say “equations” here, even though 
I have already insisted that the symmetries of the 
system should appear in the equations, because it is 
not unusual for the model equations to have more 
symmetry than the pattern under consideration. Our 
theories of symmetry breaking and pattern formation 
rest on the structure of the symmetry group and its 
implications for mathematical models of symmetric 
systems. 

Figure 2 Taylor-Couette apparatus, showing 
flow pattern for Taylor vortices. 

A classic example is Taylor-Couette flow, in which 
fluid is confined between two rotating cylinders (fig- 
ure 2). In experiments this system exhibits a bewilder- 
ing variety of patterns, depending on the angular veloc- 
ities of the cylinders and the relative size of the gap 
between them. The experiment is named after Mau- 
rice Couette, who used a fixed outer cylinder and a 
rotating inner one to measure the viscosity of fluids. 
At the low velocities he employed, the flow is feature- 
less, as in figure 3(a). In 1923 Geoffrey Ingram Taylor 
noted that when the angular velocity of the inner cylin- 
der exceeds a critical threshold, the uniform pattern 
of Couette flow becomes unstable and instead a stack 
of vortices appears; see figures 2 and 3(b). The vor- 
tices spiral round the cylinder, and alternate vortices 
spin the opposite way in cross section (small circles 
with arrows). Taylor calculated this critical velocity and 
used it to test the Navier-Stokes equations [III.23] 
for fluid flow. 

Further experimental and theoretical work followed, 
and the apparatus was modified to allow the outer 
cylinder to rotate as well. This can make a difference 
because, in a rotating frame of reference, the fluid 
is subject to additional centrifugal forces. In these 
more general experiments, many other patterns were 
observed. Figure 3 shows a selection of them. 

Figure 3 Some of the numerous flow patterns in the Taylor-Couette 
system: (a) Couette flow; (b) Taylor vortices; 
(c) wavy vortices; (d) spiral vortices; (e) twisted vortices; and 
(f) turbulent vortices. Source: Andereck et al. (1986). 

The most obvious symmetries of the Taylor-Couette 
system are rotations about the common axis of the 
cylinders. These preserve the structure of the apparatus. 
But notice that not all patterns have full rotational 
symmetry. In figure 3 only the first two, Couette flow 
and Taylor vortices, are symmetric under all rotations. 
Another family of symmetries arises if the system is 
modeled (as is common for some purposes) by two 
infinitely long cylinders but restricted to patterns that 
repeat periodically along their lengths. In effect, this 
wraps the top of the cylinder round and identifies it 
with the bottom. Mathematically, this trick employs a 
modeling assumption: “periodic boundary conditions” 
that require the flow near the top to join smoothly to 
the flow near the bottom. With periodic boundary con- 
ditions, the equations are symmetric under all vertical 
translations. But again, not all patterns have full trans- 
lational symmetry. In fact, the only one that does is 
Couette flow. So all patterns except Couette flow break 
at least some of the symmetries of the system. 

On the other hand, most of the patterns retain some 
of the symmetries of the system. Taylor vortices are 
unchanged by vertical translations through distances 
equal to the width of a vortex pair (see figure 2). The 
same is true of wavy vortices, spirals, and twisted vor- 
tices. We will see in section 7 how each pattern in the 
figure can be characterized by its symmetry group. 

There are at least two different ways to try to under- 
stand pattern formation in the Taylor-Couette system. 
One is to solve the Navier-Stokes equations numeri- 
cally. The computations are difficult and become infea- 
sible for more complex patterns. They also provide lit- 
tle insight into the patterns beyond showing that they 
are (rather mysterious) consequences of the Navier- 
Stokes equations. The other is to seek theoretical 
understanding, and here the symmetry of the appa- 
ratus is of vital importance, explaining most of the 
observed patterns. 

4 Symmetry of Equations 

To understand the patterns that arise in the Taylor- 
Couette system, we begin with a simpler example and 
abstract its general features. We then explain what 
these features imply for the dynamics of the system. 
In a later section we return to the Taylor-Couette sys- 
tem and show how the general theory of dynamics with 
symmetry classifies the patterns and shows how they 
arise. This theory is based on a mathematical conse- 
quence of the symmetry of the system being modeled; 
the differential equations used to set up the model have 
the same symmetries as the system. A symmetry of an 
equation is defined to be a transformation of the vari- 
ables that sends solutions of the equation to (usually 
different) solutions. 

The appropriate formulation of symmetry for an 
ordinary differential equation is called equivariance. 

Figure 4 (a) Cube under compression (force at back 
not shown), (b) Rod solution, (c) Plate solution. 

Suppose that $\Gamma$ is a group of symmetries acting on the 
variables $x = (x_1, \ldots, x_m)$ in the equation 

$$\frac{dx}{dt} = F(x).$$

Then F is equivariant if 

$$F(\gamma x) = \gamma F(x) \quad \text{for all symmetries } \gamma \text{ in } \Gamma.$$

It follows immediately that, if x(t) is any solution of 
the system, so is $\gamma x(t)$ for any symmetry $\gamma$. In fact, 
this condition is logically equivalent to equivariance. 
In simple terms, solutions always occur in symmet- 
rically related sets. We will see an example of this 
phenomenon below. 
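
The equivariance condition is easy to test numerically. The following sketch is only an illustration: the vector field `F` is a hypothetical example chosen to commute with permutations of three coordinates, and the forward Euler integrator `euler_orbit` is a deliberately crude assumption. It checks the condition F(γx) = γF(x) for all six permutations, and then checks the consequence that permuting an initial condition permutes the whole computed solution.

```python
import itertools
import numpy as np

def F(x):
    """Hypothetical S_3-equivariant vector field on R^3: each component
    depends on its own coordinate and on symmetric functions of all three."""
    return x - x**3 + 0.5 * np.sum(x) - 0.1 * np.sum(x**2) * x

def act(perm, x):
    """Apply a permutation of the coordinate axes to the vector x."""
    return x[list(perm)]

rng = np.random.default_rng(0)
x0 = rng.normal(size=3)
perms = list(itertools.permutations(range(3)))

# Equivariance: F(gamma x) equals gamma F(x) for every gamma in the group.
equivariant = all(np.allclose(F(act(g, x0)), act(g, F(x0))) for g in perms)

def euler_orbit(x, dt=1e-3, steps=1000):
    """Forward Euler approximation to the solution at time dt * steps."""
    for _ in range(steps):
        x = x + dt * F(x)
    return x

# Consequence: the solution through gamma x0 is gamma applied to the
# solution through x0.
solutions_related = all(
    np.allclose(euler_orbit(act(g, x0)), act(g, euler_orbit(x0)))
    for g in perms
)
print(equivariant, solutions_related)
```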

A central concept in symmetric dynamics is bifur- 
cation [IV.21], for which the context is a family of dif- 
ferential equations— an equation that contains one or 
more parameters. These are variables that are assumed 
to remain constant when solving the equation but can 
take arbitrary values. In a problem about the motion 
of a planet, for example, the mass of the planet may 
appear as a parameter. Such a family undergoes a 
bifurcation at certain parameter values if the solutions 
change in a qualitative manner near those values. For 
example, the number of equilibria might change, or 
an equilibrium might become a time-periodic oscilla- 
tion. Bifurcations have a stronger effect than, say, mov- 
ing an equilibrium continuously or slightly changing 
the shape and period of an oscillation. They provide a 
technique for proving the existence of interesting solu- 
tions by working out when simpler solutions become 
unstable and what happens when they do. 
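
A minimal example may help to fix ideas. The sketch below uses the one-variable family dx/dt = λx - x³, a standard textbook normal form that is assumed here for illustration and does not come from this article; the helper names `equilibria` and `is_stable` are hypothetical. The equation is symmetric under x → -x, and as the parameter λ passes through zero the number of equilibria changes from one to three.

```python
import numpy as np

def equilibria(lam):
    """Real equilibria of dx/dt = lam*x - x**3 (assumed example family)."""
    roots = np.roots([-1.0, 0.0, lam, 0.0])
    real = roots[np.abs(roots.imag) < 1e-9].real
    return np.unique(np.round(real, 9))

def is_stable(x, lam):
    """An equilibrium is linearly stable when d/dx (lam*x - x**3) < 0."""
    return lam - 3.0 * x**2 < 0.0

for lam in (-1.0, 1.0):
    states = equilibria(lam)
    print(lam, [(float(x), bool(is_stable(x, lam))) for x in states])
```

For negative λ the only equilibrium is the symmetric state x = 0, which is stable. For positive λ that state is unstable and two new equilibria appear, related to each other by the symmetry x → -x.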

To show how symmetries behave at a bifurcation 
point, we consider a simple model of the deformation 
of an elastic cube when it is compressed by six equal 
forces acting at right angles to its faces (figure 4(a)). 
We consider deformations into any cuboid shape with 
sides ( a,b,c ). When the forces are zero, the shape of 
the body is a cube, with a = b = c. The symmetry 
group consists of the permutations of the coordinate 
axes, changing (a,b,c) to (a,c,b), (b,c,a), and so on. 
Physically, these rotate or reflect the shape, and they 
constitute the symmetric group $S_3$. 

Figure 5 Bifurcation diagram (solid lines, stable; 
dashed lines, unstable). Horizontal axis: force. 

As the forces increase, always remaining equal, the 
cube becomes smaller. Initially, the sides remain equal, 
but analysis of a simplified but reasonable model shows 
that, when the forces become sufficiently large, the fully 
symmetric “cube” state becomes unstable. Two alterna- 
tive shapes then arise: a rod shape (a,b,c) with two 
sides equal and smaller than the third, and a plate 
shape (a, b,c) with two sides equal and larger than the 
third (figure 4). The bifurcation diagram in figure 5 is a 
schematic plot of the shape of the deformed cube, plot- 
ted vertically, against the forces, plotted horizontally. 
The diagram shows how the existence and stability of 
the deformed states relate to the force. In this model 
only the plate solutions are stable. 

In principle there might be a third shape, in which 
all three sides are of different lengths; however, such 
a solution does not occur (even unstably) in the model 
concerned. 

Earlier, I remarked that solutions of symmetric equa- 
tions occur in symmetrically related sets. Here, rod 
solutions occur in three symmetrically related forms, 
with the longer side of the rod pointing in any of the 
three coordinate directions. Algebraically, these solu- 
tions satisfy the symmetrically related conditions a = 
b < c, a = c < b, and b = c < a. The same goes for 
plate solutions. 

5 Symmetry Breaking 

The symmetry group for the buckling cube model 
contains six transformations: 

$$\begin{aligned}
I &: (a,b,c) \mapsto (a,b,c),\\
X &: (a,b,c) \mapsto (a,c,b),\\
Y &: (a,b,c) \mapsto (c,b,a),\\
Z &: (a,b,c) \mapsto (b,a,c),\\
R &: (a,b,c) \mapsto (b,c,a),\\
S &: (a,b,c) \mapsto (c,a,b).
\end{aligned}$$

We can consider the symmetries of the possible 
states, that is, the transformations that leave the shape 
of the buckled cube unchanged. Rods and plates have 
square cross section, and the two lengths in those direc- 
tions are equal. If we interchange those two axes, the 
shape remains the same. Only the cube state has all 
six symmetries in $S_3$. The rod with a = b < c is symmetric 
under the permutations that leave z fixed, namely 
$\{I, Z\}$. The rod with a = c < b is symmetric under the 
permutations that leave y fixed, namely $\{I, Y\}$. The rod 
with b = c < a is symmetric under the permutations 
that leave x fixed, namely $\{I, X\}$. The same holds for the 
plates. If a solution existed in which all three sides had 
different lengths, its symmetry group would consist of 
just $\{I\}$. All of these groups are subgroups of $S_3$. 

Notice that the subgroups {I,X}, {I,Y}, { I,Z } are 
themselves related by symmetry. For example, a solu- 
tion ( a,b,c ) with a = b < c becomes ( b,c,a ) with 
b = c < a when the coordinate axes are permuted 
using R. In the terminology of group theory, these three 
subgroups are conjugate in the symmetry group. 

The buckling cube is typical of many symmetric sys- 
tems. Solutions need not have all of the symmetries 
of the system itself. Instead, some solutions may have 
smaller groups of symmetries— subgroups of the full 
symmetry group. Such solutions are said to break sym- 
metry. For an equivariant system of ordinary differen- 
tial equations, fully symmetric solutions may exist, but 
these may be unstable. If they are unstable, the system 
will find a stable solution (if it can), and this typically 
breaks the symmetry. 

In general, the symmetry group of a solution is a 
subgroup of the symmetry group of the differential 
equation. Some subgroups may not occur here. Those 
that do are known as isotropy subgroups. For the buckling 
cube, the group $S_3$ has the five isotropy subgroups 
listed above. One subgroup is missing from that 
list, namely {I,R,S}. This is not an isotropy subgroup 
because a shape with these symmetries must satisfy the 
condition (a,b,c) = (b,c,a), which forces a = b = c. 
This shape is the cube, which has additional symme- 
tries X, Y, and Z. This situation is typical of subgroups 
that are not isotropy subgroups; the symmetries of 
such a subgroup force additional symmetries that are 
not in the subgroup. 
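
The isotropy subgroups of the buckling cube can be found by brute force. The sketch below encodes the six transformations I, X, Y, Z, R, S listed earlier as index permutations; the function names `act` and `isotropy` and the sample side lengths are hypothetical choices made for illustration.

```python
# The six elements of S_3 acting on (a, b, c): entry i of the image
# is entry perm[i] of the original triple.
S3 = {
    "I": (0, 1, 2), "X": (0, 2, 1), "Y": (2, 1, 0),
    "Z": (1, 0, 2), "R": (1, 2, 0), "S": (2, 0, 1),
}

def act(name, shape):
    """Apply the named symmetry to the triple of side lengths."""
    return tuple(shape[i] for i in S3[name])

def isotropy(shape):
    """Names of the symmetries that leave the shape (a, b, c) unchanged."""
    return frozenset(n for n in S3 if act(n, shape) == shape)

samples = {
    "cube": (1, 1, 1),
    "rod a=b<c": (1, 1, 2),
    "rod a=c<b": (1, 2, 1),
    "rod b=c<a": (2, 1, 1),
    "generic": (1, 2, 3),
}
isotropy_subgroups = {label: isotropy(s) for label, s in samples.items()}
for label, group in isotropy_subgroups.items():
    print(label, sorted(group))

# A shape fixed by R must satisfy (a, b, c) = (b, c, a), so a = b = c;
# hence {I, R, S} never occurs as the full symmetry group of a shape.
shapes = [(a, b, c) for a in range(1, 4) for b in range(1, 4) for c in range(1, 4)]
rotation_group_occurs = any(isotropy(s) == frozenset("IRS") for s in shapes)
print(rotation_group_occurs)
```

The search over a small grid of integer side lengths is of course not a proof, but it reproduces the five isotropy subgroups named in the text and never produces the subgroup {I, R, S}.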

A first step toward classifying the possible symmetry-breaking 
solutions of a differential equation with 
a known symmetry group is to use group theory to 
list the isotropy subgroups. Techniques then exist to 
find solutions with a given isotropy subgroup. Only one 
isotropy subgroup from each conjugacy class need be 
considered because solutions come in symmetrically 
related sets. 

There are general theorems that guarantee the exis- 
tence of solutions with certain types of isotropy sub- 
group, but they are too technical to state here. Roughly 
speaking, symmetry-breaking solutions often occur for 
large isotropy subgroups (but ones smaller than the 
full symmetry group). The theorems make this state- 
ment precise. In particular, they explain why rod and 
plate solutions occur for the buckling cube. Other gen- 
eral theorems help to determine whether a solution is 
stable or unstable. 

6 Time-Periodic Solutions 

A famous example of pattern formation is the Belou- 
sov-Zhabotinskii (or BZ) chemical reaction. This in- 
volves three chemicals together with an indicator that 
changes color from red to blue depending on whether 
the reaction is oxidizing or reducing. The chemicals are 
mixed together and placed in a shallow circular dish. 
They turn blue and then red. For a few minutes noth- 
ing seems to be happening; then tiny blue spots appear, 
which expand and turn into rings. As each ring grows, 
a new blue dot appears at its center. Soon the dish 
contains several expanding “target patterns” of rings. 
Unlike water waves, the patterns do not overlap and 
superpose. Instead, they meet to form angular junc- 
tions (figure 6(a)). Different target patterns expand at 
different rates, giving rings with differing thicknesses. 
However, each set of rings has a specific uniform speed 
of expansion, and the rings in that set all have the same 
width. 

Another pattern can be created by breaking up a 
ring — by dragging a paperclip across it, for example. 
This new pattern curls up into a spiral (close to an 
Archimedean spiral with equally spaced turns), and the 
spiral slowly rotates about its center, winding more and 
more turns as it does so. 

Neither pattern is an equilibrium, so the theorems 
alluded to above are not applicable. Instead, we can 
employ a different but related series of techniques. 
These apply to time-periodic solutions, which repeat 
their form at fixed intervals of time. 


Figure 6 Patterns in the BZ reaction: (a) snapshot of several 
coexisting target patterns; (b) snapshot of two coexisting 
spirals. 


There are two main types of bifurcation in which the 
fully symmetric steady state loses stability as a parame- 
ter is varied, with new solutions appearing nearby. One 
is a steady-state bifurcation (which we met above), for 
which these new solutions are equilibria. The other is 
Hopf bifurcation [IV.21 §2], for which the new solutions 
are time-periodic oscillations. Hopf bifurcation 
can occur for equations without symmetry, but there is 
a generalization to symmetric systems. The main new 
ingredient is that symmetries can now occur not just 
in space (the shape of the pattern) and time (integer 
multiples of the period make no difference), but in a 
combination of both. 

A single target pattern in the BZ reaction has a purely 
spatial symmetry: at any instant in time it is unchanged 
under all rotations about its center. It also has a purely 
temporal symmetry: in an ideal version where the pat- 
tern fills the whole plane, it looks identical after a time 
that is any integer multiple of the period. 

A single spiral, occupying the entire plane, has no 
nontrivial spatial symmetry, but it has the same purely 
temporal symmetry as a target pattern. However, it has 
a further spatiotemporal symmetry that combines both. 
As time passes, the spiral slowly rotates without chang- 
ing form. That is, an arbitrary translation of time, com- 
bined with a rotation through an appropriate angle, 
leaves the spiral pattern (and how it develops over time) 
unchanged. 
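
The spatiotemporal symmetry of a rotating spiral can be checked on an idealized formula. In the sketch below the pattern u(r, θ, t) = cos(θ - kr - ωt) is an assumed caricature of a one-armed Archimedean spiral wave, not a model of the BZ chemistry, and the constants `OMEGA` and `PITCH` are arbitrary. A time shift τ combined with a rotation through the angle ωτ leaves the pattern unchanged, whereas a rotation on its own does not.

```python
import numpy as np

OMEGA = 0.3   # assumed angular speed of the rotating spiral
PITCH = 2.0   # assumed radial wave number (equally spaced turns)

def spiral(r, theta, t):
    """Idealized one-armed rotating Archimedean spiral wave."""
    return np.cos(theta - PITCH * r - OMEGA * t)

r = np.linspace(0.1, 10.0, 50)[:, None]
theta = np.linspace(0.0, 2 * np.pi, 64)[None, :]
t, tau = 1.7, 0.9

reference = spiral(r, theta, t)
# Time translation by tau combined with rotation through OMEGA * tau.
combined = spiral(r, theta + OMEGA * tau, t + tau)
# Rotation alone, through an arbitrary angle of one radian.
rotated_only = spiral(r, theta + 1.0, t)
# Time translation through one full period.
one_period_later = spiral(r, theta, t + 2 * np.pi / OMEGA)

print(np.allclose(combined, reference))          # spatiotemporal symmetry
print(np.allclose(rotated_only, reference))      # no purely spatial symmetry
print(np.allclose(one_period_later, reference))  # purely temporal symmetry
```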

For symmetric equations there is a version of the 
Hopf bifurcation theorem that applies to spatiotem- 
poral symmetries when there is a suitable symmetry- 
breaking bifurcation. Again, its statement is too tech- 
nical to give here, but it helps to explain the BZ patterns. 
In conjunction with the theory for steady-state bifurca- 
tion, it can be used to understand pattern formation in 
many physical systems. 

7 Taylor-Couette Revisited 

The Taylor-Couette system is a good example of 
the symmetry-breaking approach. A standard model, 
derived from the Navier-Stokes equations for fluid 
flow, involves three types of symmetry. 

Spatial: rotation about the common axis of the cylin- 
ders through any angle. 

Temporal: shift time by an integer multiple of the 
period. 

Model: translation along the common axis of the cylin- 
ders through any distance (a consequence of the 
assumption of periodic boundary conditions). 

Using bifurcation theory and a few properties of 
the Navier-Stokes equations, the dynamics of this 
model for the Taylor-Couette system can be reduced 
to an ordinary differential equation (the so-called cen- 
ter manifold reduction) in six variables. Two of these 
variables correspond to a steady-state bifurcation from 
Couette flow to Taylor vortices. This is steady in the 
sense that the fluid velocity at any point remains con- 
stant. The other four variables correspond to a Hopf 
bifurcation from Couette flow to spirals. These are the 
two basic “modes” of the system, and their combination 
is called a mode interaction. 

Symmetric bifurcation theory’s general theorems 
now prove the existence of numerous flow patterns, 
each with a specific isotropy subgroup. For example, 
if the rotational symmetry remains unbroken, but the 
group of translational symmetries breaks, a typical 
solution will have a specific translational symmetry, 
through some fraction of the length of the cylinder. All 
integer multiples of this translation are also symme- 
tries of that solution. This combination of symmetries 
corresponds precisely to Taylor vortices; it leads to a 
discrete, repetitive pattern vertically, with no change 
in the horizontal direction. 

A further breaking of the rotational symmetry pro- 
duces a discrete set of rotational symmetries; these 
characterize wavy vortices. Symmetry under any rota- 
tion, if combined with a vertical translation through a 
corresponding distance, characterizes spiral vortices, 
and so on. 

The only pattern in figure 3 that is not explained 
in this manner is the final one: turbulent vortices. 
This pattern turns out to be an example of symmetric 
chaos, chaotic dynamics that possesses “symmetry 
on average.” At each instant the turbulent vortex state 
has no symmetry. But if the fluid velocity at each point 
is averaged over time, the result has the same symme- 
try as Taylor vortices. This is why the picture looks like 
Taylor vortices with random disturbances. A general 
theory of symmetric chaos also exists. 

8 Conclusion 

The symmetries of physical systems appear in the equa- 
tions that model them. The symmetry affects the solu- 
tions of the equations, but it also provides systematic 
ways to solve them. One general area of application 
is pattern formation, and here the methods of sym- 
metric bifurcation theory have been widely used. Top- 
ics include animal locomotion, speciation, hallucina- 
tion patterns, the balance-sensing abilities of the inner 
ear, astrophysics, liquid crystals, fluid flow, coupled 
oscillators, elastic buckling, and convection. 

In addition to a large number of specific applica- 
tions, there are many other ways to exploit symmetry in 
applied mathematics. This article has barely scratched 
the surface. 

IV.23 Quantum Mechanics 

David Griffiths 


Quantum mechanics makes use of a wide variety 
of tools and techniques from applied mathematics — 
linear algebra, complex variables, differential equa- 
tions, Fourier analysis, group representations— but the 
mathematical core of the subject is the theory of eigen- 
vectors and eigenvalues of self-adjoint operators in 
Hilbert space. 


1 A Single Particle in One Dimension 


Let us begin with the simplest mechanical system: a 
particle (which is physicists' jargon for an object so 
small, compared with other relevant distances, that it 
can be considered to reside at a single point) of mass 
m, constrained to move in one dimension (say, along 
the x-axis) under the influence of a specified force F 
(which may depend on position and time). The program 
of classical mechanics [IV.19] is to determine the 
location of the particle at time t: x(t). How do you calculate 
it? By applying Newton’s second law of motion, 
$F = ma$ (or $d^2x/dt^2 = F/m$). If the force is conservative 
(nonconservative forces, such as friction, do not 
occur at a fundamental level), it can be expressed as 
the negative gradient of the potential energy, $V(x,t)$, 
and Newton’s law becomes $d^2x/dt^2 = -m^{-1}\,\partial V/\partial x$. 
Together with appropriate initial conditions (typically, 
$x$ and $dx/dt$ at $t = 0$), this determines $x(t)$. From there 
it is easy to obtain the velocity ($v(t) = dx/dt$), momentum 
($p = mv$), kinetic energy ($T = \tfrac{1}{2}mv^2$), or any other 
dynamical quantity. 

The program of quantum mechanics is quite different. 
Instead of x(t), we seek the wave function, $\Psi(x,t)$, 
which we obtain by solving Schrödinger's equation 
[III.26], 

$$i\hbar\frac{\partial\Psi}{\partial t} = -\frac{\hbar^2}{2m}\frac{\partial^2\Psi}{\partial x^2} + V(x,t)\Psi, \tag{1}$$

where $\hbar = 1.05457 \times 10^{-34}$ J s is Planck's constant, the 
fingerprint of all quantum phenomena. Together with 
appropriate initial conditions (typically, $\Psi(x,0)$) this 
determines $\Psi(x,t)$. 

But what is this wave function, and how can it be said 
to describe the state of the particle? After all, as its 
name suggests, $\Psi(x,t)$ is spread out in space, whereas 
a particle, by its nature, is localized at a point. The 
answer is provided by Born’s statistical interpretation 
of the wave function: $\Psi(x,t)$ tells you the probability 
of finding the particle at the point x if you conduct 
a (competent) measurement at time t. More precisely, 
$\int_a^b |\Psi(x,t)|^2\,dx$ is the probability of finding the particle 
between point a and point b. Evidently, the wave function 
must be normalized: $\int_{-\infty}^{\infty} |\Psi(x,t)|^2\,dx = 1$ (the 
particle has got to be somewhere). Once normalized, 
at t = 0, $\Psi$ remains normalized for all time. 

The statistical interpretation introduces a kind of 
indeterminacy into the theory, in the sense that even 
if you know everything quantum mechanics has to tell 
you about the particle (to wit: its wave function), you 
still cannot predict with certainty the outcome of a 
measurement to determine its position — all quantum 
mechanics has to offer is statistical information about 
the results from an ensemble of identically prepared 
systems (each in the state $\Psi$). 

The average, or expectation value, of x can be written 
in the form 

$$\langle x\rangle = \int_{-\infty}^{\infty} \Psi(x,t)^*\, x\, \Psi(x,t)\,dx;$$

similarly, the expectation value of momentum is 

$$\langle p\rangle = \int_{-\infty}^{\infty} \Psi(x,t)^* \left[-i\hbar\frac{\partial}{\partial x}\right] \Psi(x,t)\,dx.$$

This is a consequence of de Broglie’s formula relating 
momentum to the wavelength ($\lambda$) of $\Psi$: $\lambda = 2\pi\hbar/p$. We 
say that the observables x and p are “represented” by 
the operators 

$$\hat{x} = x, \qquad \hat{p} = -i\hbar\frac{\partial}{\partial x}. \tag{2}$$

In general, a classical dynamical quantity $Q(x,p)$ is 
represented by the quantum operator 

$$\hat{Q} = Q(x, -i\hbar(\partial/\partial x))$$

(simply replace every p by $\hat{p}$), and its expectation value 
(in the state $\Psi$) is obtained by “sandwiching” $\hat{Q}$ between 
$\Psi$ and its conjugate $\Psi^*$, and integrating: 

$$\langle Q\rangle = \int_{-\infty}^{\infty} \Psi(x,t)^*\, \hat{Q}\, \Psi(x,t)\,dx.$$

Thus, for example, kinetic energy $T = \tfrac{1}{2}mv^2 = p^2/2m$ 
is represented by the operator 

$$\hat{T} = -\frac{\hbar^2}{2m}\frac{\partial^2}{\partial x^2},$$

and the Schrödinger equation can be written 

$$i\hbar\frac{\partial\Psi}{\partial t} = \hat{H}\Psi,$$

where $\hat{H} = \hat{T} + V$ is the Hamiltonian operator, representing 
the total energy. 

If the potential energy is independent of time, the 
Schrödinger equation can be solved by separation of 
variables: $\Psi(x,t) = e^{-iEt/\hbar}\psi(x)$, where E is the separation 
constant and $\psi(x)$ satisfies the time-independent 
Schrödinger equation: 

$$\hat{H}\psi = -\frac{\hbar^2}{2m}\frac{d^2\psi}{dx^2} + V\psi = E\psi. \tag{3}$$

The expectation value of the total energy is 

$$\langle H\rangle = \int_{-\infty}^{\infty} \Psi(x,t)^*\,\hat{H}\,\Psi(x,t)\,dx = \int_{-\infty}^{\infty} \Psi(x,t)^*\,E\,\Psi(x,t)\,dx = E,$$

hence the choice of the letter E. 
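
On a computer, the time-independent Schrödinger equation becomes a matrix eigenvalue problem once the second derivative is replaced by a finite difference. The following sketch makes several assumptions for the sake of illustration: units in which ħ = m = 1, the sample potential V(x) = x²/2, a uniform grid on which the wave function is taken to vanish at both ends, and the hypothetical helper name `hamiltonian_matrix`.

```python
import numpy as np

def hamiltonian_matrix(x, potential, hbar=1.0, mass=1.0):
    """Second-order finite-difference approximation to H = T + V on the
    uniform grid x (wave function assumed to vanish at the grid ends)."""
    dx = x[1] - x[0]
    n = len(x)
    # -hbar^2/(2m) d^2/dx^2 becomes a tridiagonal matrix.
    kinetic = (hbar**2 / (2 * mass * dx**2)) * (
        2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    )
    return kinetic + np.diag(potential(x))

x = np.linspace(-8.0, 8.0, 801)
H = hamiltonian_matrix(x, lambda x: 0.5 * x**2)   # assumed sample potential
energies = np.linalg.eigvalsh(H)[:4]              # four lowest eigenvalues
print(energies)
```

For this sample potential the lowest computed eigenvalues lie close to 1/2, 3/2, 5/2, and 7/2 in the chosen units, and they move closer as the grid is refined.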

Suppose, for example, the particle is attached to 
a spring of force constant k. Classically, this would 
be a simple harmonic oscillator, with potential energy 
$V(x) = \tfrac{1}{2}kx^2$, and the (time-independent) Schrödinger 
equation reads 

$$-\frac{\hbar^2}{2m}\frac{d^2\psi}{dx^2} + \tfrac{1}{2}kx^2\psi = E\psi. \tag{4}$$

The normalized solutions have energies $E_n = (n + \tfrac{1}{2})\hbar\omega$ 
($n = 0, 1, 2, \ldots$) and are given by 

$$\psi_n(x) = \frac{1}{\sqrt{2^n n!\, a\sqrt{\pi}}}\, e^{-x^2/2a^2} H_n(x/a),$$

where $\omega = \sqrt{k/m}$ is the classical (angular) frequency 
of oscillation, $a = \sqrt{\hbar/m\omega}$, and $H_n$ is the nth Hermite 
polynomial [II.29]. (As a second-order ordinary differential 
equation, (4) admits two linearly independent 
solutions for every value of E, but all the other solutions 
are not normalizable, so they do not represent 
possible physical states.) The separable solutions of the 
time-dependent Schrödinger equation in this case are 
$\Psi_n(x,t) = e^{-iE_nt/\hbar}\psi_n(x)$. For instance, if the particle 
happens to be in the state n = 0, the probability of 
finding it outside the classical range ($\pm\sqrt{2E/k}$) is 

$$2\int_a^{\infty} |\Psi_0(x,t)|^2\,dx = 1 - \operatorname{erf}(1) = 0.1573,$$

where $\operatorname{erf}(x) = (2/\sqrt{\pi})\int_0^x e^{-t^2}\,dt$ is the standard error 
function. 
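
The numbers quoted for the oscillator are easy to reproduce. The sketch below assumes units in which the length scale a equals 1, uses simple Riemann sums on a hypothetical grid, and relies on NumPy's `hermval`, which evaluates the physicists' Hermite polynomials. It checks that the first few eigenfunctions are orthonormal and compares the ground-state probability of lying outside the classical range with 1 - erf(1).

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval

def psi(n, x, a=1.0):
    """Normalized oscillator eigenfunction psi_n(x) with length scale a."""
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0                        # selects the Hermite polynomial H_n
    norm = 1.0 / math.sqrt(2.0**n * math.factorial(n) * a * math.sqrt(math.pi))
    return norm * np.exp(-x**2 / (2 * a**2)) * hermval(x / a, coeffs)

x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]

# Overlap integrals of the first four eigenfunctions (Riemann sums).
overlaps = np.array([[np.sum(psi(n, x) * psi(m, x)) * dx for m in range(4)]
                     for n in range(4)])

# Ground-state probability of being found beyond the turning points |x| = a.
p_outside_numeric = 2.0 * np.sum(psi(0, x[x >= 1.0])**2) * dx
p_outside_exact = 1.0 - math.erf(1.0)
print(np.round(overlaps, 6))
print(p_outside_numeric, p_outside_exact)
```

The Riemann sum for the tail probability carries an endpoint error of the order of the grid spacing, so agreement with the exact value is only expected to about three decimal places.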

Of course, not all solutions to the Schrödinger equation 
are separable, but the separable solutions do constitute 
a complete set, in the sense that the most general 
(normalizable) solution can be expressed as a linear 
combination of them, $\Psi(x,t) = \sum_{n=0}^{\infty} c_n\Psi_n(x,t)$. 
Furthermore, they are orthonormal, 

$$\int_{-\infty}^{\infty} \Psi_n(x,t)^*\,\Psi_m(x,t)\,dx = \delta_{nm},$$

where $\delta_{nm}$ is the Kronecker delta [I.2 §2, table 3], so 
the expansion coefficients can be determined from the 
initial wave function in the usual way: 

$$c_n = \int_{-\infty}^{\infty} \psi_n(x)^*\,\Psi(x,0)\,dx.$$
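
As an illustration of the expansion, the sketch below computes the coefficients for one assumed initial wave function: the oscillator ground state displaced by one unit of length, in units where a = 1. The choice of initial state, the truncation at 15 terms, and the grid are all hypothetical; only the formula for the coefficients comes from the text.

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval

def psi(n, x):
    """Oscillator eigenfunction psi_n(x) in units where a = 1."""
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0
    norm = 1.0 / math.sqrt(2.0**n * math.factorial(n) * math.sqrt(math.pi))
    return norm * np.exp(-x**2 / 2) * hermval(x, coeffs)

x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]
initial = psi(0, x - 1.0)      # assumed initial state: displaced ground state

# c_n = integral of psi_n(x)^* Psi(x, 0) dx; all functions here are real.
n_terms = 15
c = np.array([np.sum(psi(n, x) * initial) * dx for n in range(n_terms)])

# The probabilities |c_n|^2 should sum to 1, and the truncated series
# should reproduce the initial wave function.
total_probability = np.sum(c**2)
series = sum(c[n] * psi(n, x) for n in range(n_terms))
reconstruction_error = np.max(np.abs(series - initial))
print(total_probability, reconstruction_error)
```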

2 States and Observables 

Formally, the state of a system is represented, in quantum 
mechanics, by a vector in Hilbert space [I.2 §19.4]. 
In Dirac notation, vectors are denoted by kets $|s\rangle$ and 
their duals by bras $\langle s|$; the inner product of $|s_a\rangle$ and 
$|s_b\rangle$ is written as a “bra(c)ket” $\langle s_a|s_b\rangle = \langle s_b|s_a\rangle^*$. 
For instance, the wave function $\Psi(x,t)$ resides in 
the (infinite-dimensional) space $L^2$ of square-integrable 
functions $f(x)$ on $-\infty < x < \infty$, with the inner product 

$$\langle f_a|f_b\rangle = \int_{-\infty}^{\infty} f_a(x)^* f_b(x)\,dx.$$

But there exist much simpler quantum systems in 
which the vector space is finite dimensional, and it pays 
to explore the general theory in this context first. 

In an n-dimensional space a vector is conveniently 
represented by the column of its components (with 
respect to a specified orthonormal basis): $|s\rangle = (c_1, c_2, \ldots, c_n)$ 
(as a column), $\langle s|$ is its conjugate transpose 
(a row), and $\langle s_a|s_b\rangle = \sum_{i=1}^{n} (c_a)_i^* (c_b)_i$ is their matrix 
product. Observables (measurable dynamical quantities) 
are represented by linear operators; for the system 
considered in section 1 they involve multiplication 
and differentiation, but in the finite-dimensional 
case they are matrices, $Q \in \mathbb{C}^{n\times n}$ (with respect to a 
specified basis). In its most general form, the statistical 
interpretation reads as follows. 

If you measure observable Q on a system in the state 
$|s\rangle$, you will get one of the eigenvalues of Q; the probability 
of getting the particular eigenvalue $\lambda_i$ is $|\langle v_i|s\rangle|^2$, 
where $|v_i\rangle$ is the corresponding normalized eigenvector, 
$Q|v_i\rangle = \lambda_i|v_i\rangle$ with $\langle v_i|v_i\rangle = 1$. In the act of 
measurement the state “collapses” to the eigenvector, 
$|s\rangle \to |v_i\rangle$. (This ensures that an immediately repeated 
measurement, on the same particle, will return the 
same value.) 

Of course, the outcome of a measurement must 
be a real number, and the probabilities must add 
up to 1. This is guaranteed if we stipulate that the 
operators representing observables be self-adjoint, i.e., 
$\langle s_a|Qs_b\rangle = \langle Qs_a|s_b\rangle$ for all vectors $|s_a\rangle$ and $|s_b\rangle$, which 
is to say that the matrix Q is Hermitian (equal to its 
transpose conjugate). (Physicists’ notation is sloppy 
but harmless: technically, “s” is just the name of the 
vector, and you cannot multiply a matrix by a name. 
But $|Qs\rangle$ means $Q|s\rangle$, of course, and $\langle Qs|$ is its dual.) 

The expectation value of Q, for a system in the 
state $|s\rangle$, is $\langle Q\rangle = \langle s|Q|s\rangle$, and the standard deviation 
$\sigma_Q$ (known informally as the uncertainty in Q, and 
sometimes written $\Delta Q$) is given by 

$$\sigma_Q^2 = \langle (Q - \langle Q\rangle)^2\rangle = \langle Q^2\rangle - \langle Q\rangle^2.$$

It is zero if and only if $|s\rangle$ is an eigenvector of Q (in 
which case a measurement is certain to return the corresponding 
eigenvalue). Because noncommuting operators 
do not admit common eigenvectors, it is not possible, 
in general, to construct states with definite values 
of two different observables (say, A and B). This is 
expressed quantitatively in the uncertainty principle: 

$$\sigma_A\sigma_B \ge \tfrac{1}{2}\,|\langle [A,B]\rangle|, \tag{5}$$

where square brackets denote the commutator, 

$$[A,B] = AB - BA.$$
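
In the finite-dimensional setting the uncertainty principle can be spot-checked with random matrices. The sketch below is a numerical sanity check rather than a proof: it draws hypothetical 4 × 4 Hermitian matrices A and B and normalized states at random, and records the difference between the product of the standard deviations and half the magnitude of the expected commutator. The helper names and the random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_hermitian(n):
    """A random n x n Hermitian matrix (arbitrary test observable)."""
    m = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (m + m.conj().T) / 2

def expectation(op, s):
    """<s|op|s> for a normalized state vector s."""
    return np.vdot(s, op @ s)

def std_dev(op, s):
    """Standard deviation of a Hermitian operator in the state s."""
    mean = expectation(op, s).real
    variance = expectation(op @ op, s).real - mean**2
    return np.sqrt(max(variance, 0.0))

def uncertainty_gap(A, B, s):
    """sigma_A * sigma_B minus the lower bound (1/2)|<[A, B]>|."""
    commutator = A @ B - B @ A
    return std_dev(A, s) * std_dev(B, s) - 0.5 * abs(expectation(commutator, s))

gaps = []
for _ in range(200):
    A, B = random_hermitian(4), random_hermitian(4)
    s = rng.normal(size=4) + 1j * rng.normal(size=4)
    s /= np.linalg.norm(s)
    gaps.append(uncertainty_gap(A, B, s))

print(min(gaps))   # should not be negative, up to rounding error
```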


Consider, for example, a two-dimensional space in 
which the state is represented by a (normalized) spinor, 
$|s\rangle = (c_1, c_2)$, such that $|c_1|^2 + |c_2|^2 = 1$. The most 
general 2 × 2 Hermitian matrix has the form 

$$Q = \begin{pmatrix} a_0 + a_3 & a_1 - ia_2 \\ a_1 + ia_2 & a_0 - a_3 \end{pmatrix} = \sum_{j=0}^{3} a_j\sigma_j,$$

where the $a_j$s are real numbers, $\sigma_0$ is the unit matrix, 
and the other three $\sigma_j$s are the Pauli spin matrices: 

$$\sigma_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \sigma_2 = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}, \quad \sigma_3 = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}. \tag{6}$$

The Pauli matrices constitute a basis for the Lie algebra 
of SU(2), the group of “special” (unimodular) unitary 
2 × 2 matrices. 

Suppose we measure the observable represented by 
$\sigma_3$. According to the statistical interpretation, the outcome 
could be $+1$ with probability $P_+ = |c_1|^2$ or $-1$ 
with probability $P_- = |c_2|^2$, since the eigenvalues and 
eigenvectors in this case are obviously $\lambda_+ = +1$, with 
$|s_+\rangle = (1,0)$, and $\lambda_- = -1$, with $|s_-\rangle = (0,1)$. But 
what if we chose instead to measure $\sigma_1$? The eigenvalues 
are again $\pm 1$, but the (normalized) eigenvectors 
are now $|s_+\rangle = \tfrac{1}{\sqrt{2}}(1,1)$ and $|s_-\rangle = \tfrac{1}{\sqrt{2}}(1,-1)$. 
The possible outcomes are the same as for $\sigma_3$, but the 
probabilities are quite different: $P_+ = \tfrac{1}{2}|c_1 + c_2|^2$ and 
$P_- = \tfrac{1}{2}|c_1 - c_2|^2$. Clearly, a system cannot simultaneously 
be in an eigenstate of $\sigma_3$ and of $\sigma_1$, and hence 
it cannot have definite values of both observables at 
the same time. What if I measure $\sigma_3$ and get (say) $+1$, 
and then you measure $\sigma_1$ and get (say) $-1$? Does this 
not mean that the system has definite values of both? 
No, because each measurement altered (collapsed) the 
state: $(c_1,c_2) \to (1,0) \to \tfrac{1}{\sqrt{2}}(1,-1)$. If I now repeat my 
measurement of $\sigma_3$, I am just as likely to get $-1$ as $+1$. 
It is not that I am ignorant (I know the state of the 
system precisely), but it simply does not have a definite 
value of $\sigma_3$ if it is in an eigenstate of $\sigma_1$. Indeed, 
since $[\sigma_3,\sigma_1] = 2i\sigma_2$, the uncertainty principle says 
that $\Delta\sigma_1\,\Delta\sigma_3 \ge |\langle\sigma_2\rangle|$. 
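
The two-state example can be reproduced in a few lines. In the sketch below the spinor (0.6, 0.8) is an arbitrary normalized example, and the helper `outcome_probabilities` is a hypothetical name. Note that `numpy.linalg.eigh` returns eigenvalues in ascending order, so each probability array lists the outcome -1 first and +1 second.

```python
import numpy as np

sigma1 = np.array([[0, 1], [1, 0]], dtype=complex)
sigma2 = np.array([[0, -1j], [1j, 0]], dtype=complex)
sigma3 = np.array([[1, 0], [0, -1]], dtype=complex)

def outcome_probabilities(observable, state):
    """Eigenvalues (ascending) and the probabilities |<v_i|s>|^2."""
    eigenvalues, eigenvectors = np.linalg.eigh(observable)
    amplitudes = eigenvectors.conj().T @ state
    return eigenvalues, np.abs(amplitudes)**2

state = np.array([0.6, 0.8], dtype=complex)   # arbitrary normalized spinor

vals3, probs3 = outcome_probabilities(sigma3, state)
vals1, probs1 = outcome_probabilities(sigma1, state)
print(vals3, probs3)   # probabilities |c2|^2 and |c1|^2
print(vals1, probs1)   # probabilities |c1 - c2|^2 / 2 and |c1 + c2|^2 / 2

# If sigma3 returned +1 the state has collapsed to (1, 0); a subsequent
# measurement of sigma1 is then equally likely to give either outcome.
collapsed = np.array([1.0, 0.0], dtype=complex)
_, probs_after = outcome_probabilities(sigma1, collapsed)
print(probs_after)

# The commutator that enters the uncertainty principle.
commutator = sigma3 @ sigma1 - sigma1 @ sigma3
print(np.allclose(commutator, 2j * sigma2))
```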

3 Continuous Spectra 

In a finite-dimensional vector space every Hermitian 
matrix has a complete set of orthonormal eigenvectors, 
and implementation of the statistical interpretation is 
straightforward. But in an infinite-dimensional Hilbert 
space some or all of the eigenvectors may reside outside 
the space. For example, the eigenvectors of the 
position operator (2) are Dirac delta functions [III.7] 

$$x f_\lambda(x) = \lambda f_\lambda(x) \;\Rightarrow\; f_\lambda(x) = A\,\delta(x - \lambda), \tag{7}$$

which are not square integrable, so they do not live in $L^2$ 
and cannot represent physically realizable states. The 
same goes for momentum, 

$$-i\hbar\frac{df_\lambda}{dx} = \lambda f_\lambda \;\Rightarrow\; f_\lambda = A' e^{i\lambda x/\hbar}. \tag{8}$$

Evidently, a particle simply cannot have a definite position 
or momentum in quantum mechanics. Moreover, 
since x and p do not commute ($[x,p] = i\hbar$), (5) says 
that $\sigma_x\sigma_p \ge \tfrac{1}{2}\hbar$, which is the original Heisenberg 
uncertainty principle. 

Even though the eigenfunctions of x and p are not 
possible physical states, they are complete and orthog- 
onal; the wave function can be expanded as a linear 
combination of them, and the (absolute) square of the 
coefficient represents the probability density for a mea- 
surement outcome. Note that nonnormalizable eigen- 
functions are associated with a continuous spectrum of 
eigenvalues; probabilities become probability densities, 
and discrete sums are replaced by integrals. Adopting 
the convenient Dirac convention
$\langle f_\lambda | f_{\lambda'} \rangle = \delta(\lambda - \lambda')$
(so that $A = 1$ in (7) and $A' = 1/\sqrt{2\pi\hbar}$ in (8)), we have

\[ \Psi(x,t) = \int c_\lambda f_\lambda(x)\,d\lambda \;\Leftrightarrow\; c_\lambda = \int f_\lambda(x)^* \Psi(x,t)\,dx. \]

For eigenfunctions of position we get

\[ c_\lambda = \int_{-\infty}^{\infty} \delta(x - \lambda)\Psi(x,t)\,dx = \Psi(\lambda,t), \]

which is to say that the expansion coefficient is precisely
the wave function itself (and we recover Born’s
original statistical interpretation). For eigenfunctions
of momentum,

\[ c_\lambda = \frac{1}{\sqrt{2\pi\hbar}} \int_{-\infty}^{\infty} e^{-i\lambda x/\hbar}\Psi(x,t)\,dx = \Phi(\lambda,t), \]

which is the so-called momentum space wave function
(mathematically, the Fourier transform of the position
space wave function $\Psi$). Of course, $\Psi$ and $\Phi$ describe
the same state vector, referred to two different bases.

In general, $E \ge \min\{V(x)\}$; in the case of a free
particle ($V = 0$) of energy $E \ge 0$, we have, from (3),

\[ -\frac{\hbar^2}{2m}\frac{d^2\psi}{dx^2} = E\psi \;\Rightarrow\; \psi(x) = Ae^{ikx}, \]

so $\Psi(x,t) = Ae^{i(kx - Et/\hbar)}$, where $k^2 = 2mE/\hbar^2$. If $k$ is
positive, $\Psi$ is a wave propagating in the $+x$ direction (if
$k$ is negative, it travels in the $-x$ direction). However, $\Psi$
is not normalizable, so it does not represent a possible
physical state: there is no such thing as a free particle
with precisely determined energy. But we can construct
a normalizable linear combination of these states:

\[ \Psi(x,t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} c(k)\,e^{i(kx - \hbar k^2 t/2m)}\,dk. \]

For instance, we might start with a Gaussian wave
packet of width $\sigma_0$:

\[ \Psi(x,0) = \frac{1}{(2\pi\sigma_0^2)^{1/4}}\, e^{-(x/2\sigma_0)^2} e^{ilx}. \]

Then $c(k) = (8\pi\sigma_0^2)^{1/4} e^{-\sigma_0^2(k - l)^2}$, so

\[ \Psi(x,t) = \frac{e^{-l^2\sigma_0^2}}{(2\pi\sigma_0^2)^{1/4}\sqrt{1 + i\hbar t/2m\sigma_0^2}} \exp\left\{\frac{[(ix + 2l\sigma_0^2)/2\sigma_0]^2}{1 + i\hbar t/2m\sigma_0^2}\right\}, \]

and hence

\[ |\Psi(x,t)|^2 = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\{-2[(x - l\hbar t/m)/2\sigma]^2\}, \]

where $\sigma(t) = \sigma_0\sqrt{1 + (\hbar t/2m\sigma_0^2)^2}$. The wave packet
travels at speed $l\hbar/m$, spreading out as it goes. This is
the quantum description of a free particle in motion.
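
As a quick numerical check of the Gaussian wave packet, the sketch below evaluates the closed-form $\Psi(x,t)$ and compares $|\Psi|^2$ with the spreading Gaussian of width $\sigma(t)$. The units $\hbar = m = 1$, the sample values $\sigma_0 = 1$ and $l = 2$, and the function names are all assumptions made for illustration.

```python
import cmath
import math

HBAR = 1.0  # assumed units: hbar = m = 1
MASS = 1.0

def psi(x, t, sigma0, l):
    """Closed-form Gaussian wave packet Psi(x, t)."""
    z = 1 + 1j * HBAR * t / (2 * MASS * sigma0 ** 2)
    prefactor = math.exp(-(l * sigma0) ** 2) / (
        (2 * math.pi * sigma0 ** 2) ** 0.25 * cmath.sqrt(z))
    return prefactor * cmath.exp(
        ((1j * x + 2 * l * sigma0 ** 2) / (2 * sigma0)) ** 2 / z)

def width(t, sigma0):
    """sigma(t) = sigma0 * sqrt(1 + (hbar t / 2 m sigma0^2)^2)."""
    return sigma0 * math.sqrt(1 + (HBAR * t / (2 * MASS * sigma0 ** 2)) ** 2)

def density(x, t, sigma0, l):
    """Gaussian probability density centred at x = l hbar t / m."""
    s = width(t, sigma0)
    x0 = l * HBAR * t / MASS
    return math.exp(-2 * ((x - x0) / (2 * s)) ** 2) / math.sqrt(2 * math.pi * s ** 2)

sigma0, l = 1.0, 2.0
max_error = max(abs(abs(psi(x, t, sigma0, l)) ** 2 - density(x, t, sigma0, l))
                for t in (0.0, 1.0, 3.0) for x in (-2.0, 0.0, 2.5, 6.0))
```

In these units the packet has doubled its variance by $t = 2$, where `width(2.0, 1.0)` equals $\sqrt{2}$.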

Some Hamiltonians (e.g., the harmonic oscillator) 
have a discrete spectrum, with normalizable eigen- 
states; some (e.g., the free particle, V = 0) have a contin- 
uous spectrum and nonnormalizable eigenstates. Many 
systems have some of each. For example, the delta-function
well, $V(x) = -\alpha\delta(x)$ (for some positive constant $\alpha$),
admits a single normalizable state
$\psi(x) = (\sqrt{m\alpha}/\hbar)\,e^{-m\alpha|x|/\hbar^2}$ with energy
$E = -m\alpha^2/2\hbar^2$. This represents a bound state: the particle
is “stuck” in the well. If (as here) $V(x) \to 0$ as $|x| \to \infty$,
a bound state is typically characterized by a negative energy. But the
delta-function well also admits scattering states, which
have the form

\[ \psi(x) = \begin{cases} Ae^{ikx} + Be^{-ikx}, & x \le 0, \\ Fe^{ikx} + Ge^{-ikx}, & x \ge 0, \end{cases} \]

where $k = \sqrt{2mE}/\hbar$ with $E > 0$. Here $A$ is the amplitude
of an incoming wave from the left, and $G$ is the amplitude
of an incoming wave from the right; $B$ is an outgoing
wave to the left, and $F$ an outgoing wave to the
right (remember, $\psi$ is to be combined with the
time-dependent factor $\exp(-iEt/\hbar)$). The outgoing amplitudes
are determined from the incoming amplitudes
by the boundary conditions at $x = 0$:

\[ \Delta\psi = 0, \qquad \Delta\left(\frac{d\psi}{dx}\right) = -\frac{2m\alpha}{\hbar^2}\psi(0). \]

In the typical case of a particle incident from the left,
$G = 0$, and the reflection coefficient (the probability of
reflection back to the left) is

\[ R = \frac{|B|^2}{|A|^2} = \frac{1}{1 + (2\hbar^2 E/m\alpha^2)}, \]

while the transmission coefficient (the probability of
transmission through to the right) is

\[ T = \frac{|F|^2}{|A|^2} = \frac{1}{1 + (m\alpha^2/2\hbar^2 E)}. \]

Naturally, $R + T = 1$. Of course, these scattering states
are not normalizable, and to represent an actual particle
we should, in principle, form normalizable linear
combinations of them.
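
To see how the two matching conditions fix the outgoing amplitudes, one can solve them explicitly for $G = 0$: writing $\beta = m\alpha/\hbar^2 k$, continuity gives $F = A + B$ and the derivative jump gives $B = i\beta A/(1 - i\beta)$. The sketch below, in assumed units $\hbar = m = 1$ and with hypothetical sample values for $E$ and $\alpha$, compares $|B|^2/|A|^2$ and $|F|^2/|A|^2$ with the closed-form coefficients.

```python
import math

HBAR = 1.0  # assumed units: hbar = m = 1
MASS = 1.0

def scatter_from_left(energy, alpha, a_in=1.0 + 0j):
    """Solve the matching conditions at x = 0 for B and F, with G = 0."""
    k = math.sqrt(2 * MASS * energy) / HBAR
    beta = MASS * alpha / (HBAR ** 2 * k)
    b_out = 1j * beta / (1 - 1j * beta) * a_in  # reflected amplitude B
    f_out = a_in / (1 - 1j * beta)              # transmitted amplitude F
    return b_out, f_out

def reflection(energy, alpha):
    return 1 / (1 + 2 * HBAR ** 2 * energy / (MASS * alpha ** 2))

def transmission(energy, alpha):
    return 1 / (1 + MASS * alpha ** 2 / (2 * HBAR ** 2 * energy))

energy, alpha = 0.7, 1.3  # hypothetical sample values
B, F = scatter_from_left(energy, alpha)
R_from_amplitudes = abs(B) ** 2
T_from_amplitudes = abs(F) ** 2
```

Raising the energy at fixed $\alpha$ pushes `transmission` toward 1, as one would expect for a fast particle passing a narrow well.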


4 Three Dimensions 


Quantum mechanics extends to three dimensions in the
obvious way: $d^2/dx^2 \to \nabla^2$ (the three-dimensional
Laplacian), and the time-independent Schrödinger
equation becomes

\[ -\frac{\hbar^2}{2m}\nabla^2\psi + V\psi = E\psi. \]

We will use spherical coordinates $(r, \theta, \phi)$, where $\theta$ is
the polar angle, measured down from the $z$-axis
($0 \le \theta \le \pi$), and $\phi$ is the azimuthal angle around from the
$x$-axis ($0 \le \phi < 2\pi$). In the typical case where $V$ is a
function only of $r$, we solve by separation of variables;
$\psi(r, \theta, \phi) = (1/r)\,u(r)\,Y_l^m(\theta, \phi)$, where

\[ Y_l^m(\theta, \phi) = (-1)^m \sqrt{\frac{(2l+1)}{4\pi}\frac{(l-m)!}{(l+m)!}}\; e^{im\phi} P_l^m(\cos\theta) \tag{9} \]

is a spherical harmonic ($l = 0, 1, 2, \ldots$, $m = -l, \ldots, l$),
and $P_l^m$ is an associated Legendre function.
(Equation (9) applies for $m \ge 0$; for negative values,
$Y_l^{-m} = (-1)^m (Y_l^m)^*$.) Meanwhile, $u(r)$ satisfies the radial
equation

\[ -\frac{\hbar^2}{2m}\frac{d^2u}{dr^2} + \left[V + \frac{\hbar^2}{2m}\frac{l(l+1)}{r^2}\right]u = Eu, \tag{10} \]

which is identical to the one-dimensional Schrödinger
equation, except that the effective potential energy
carries an extra centrifugal term.

The spherical harmonics are eigenfunctions of the
angular momentum operator, $L = r \times p$
($L_z = -i\hbar(x\,\partial/\partial y - y\,\partial/\partial x)$, and cyclic
permutations thereof). The components of angular momentum do not
commute,

\[ [L_i, L_j] = i\hbar L_k \quad \text{with } (ijk) = (xyz), \text{ etc.}, \tag{11} \]

so it would be futile to look for simultaneous eigenstates
of all three. However, each component does commute
with the square of the total angular momentum,
$L^2 = L_x^2 + L_y^2 + L_z^2$, so it is possible to construct
simultaneous eigenfunctions of $L^2$ and (say) $L_z$. In spherical
coordinates, $L^2$ is $-\hbar^2$ times the Laplacian restricted to
the unit sphere ($r = 1$), $L_z = -i\hbar\,\partial/\partial\phi$, and spherical
harmonics are their eigenstates:

\[ L^2 Y_l^m = \hbar^2 l(l+1)\,Y_l^m, \qquad L_z Y_l^m = \hbar m\,Y_l^m. \tag{12} \]


In addition to its orbital angular momentum, $L$, every
particle carries an intrinsic spin angular momentum, $S$
(distantly analogous to the daily rotation of the Earth
on its axis, while the orbital angular momentum
corresponds to its annual revolution about the sun). The
components of $S$ satisfy the same fundamental
commutation relations as $L$, namely, $[S_x, S_y] = i\hbar S_z$ (and
its cyclic permutations), and admit the same eigenvalues
(with $s$ and $m_s$ in place of $l$ and $m$),
$S^2|s\,m_s\rangle = \hbar^2 s(s+1)|s\,m_s\rangle$, and
$S_z|s\,m_s\rangle = \hbar m_s|s\,m_s\rangle$. However,
whereas $l$ can only be an integer, $s$ can be an integer or
a half-integer; it is a fixed number, for a given type of
particle (0 for $\pi$ mesons, $\frac{1}{2}$ for protons and electrons,
1 for photons, $\frac{3}{2}$ for the $\Omega$, and so on); we call it the
spin of the particle. In the case of spin $\frac{1}{2}$ there are just
two possible values for $m_s$: $\frac{1}{2}$ (“spin up”) and
$-\frac{1}{2}$ (“spin down”), and the resulting two-dimensional state space
is precisely the one we studied in section 2, with the
association $S_x = \frac{1}{2}\hbar\sigma_1$, $S_y = \frac{1}{2}\hbar\sigma_2$,
$S_z = \frac{1}{2}\hbar\sigma_3$.

Hydrogen is the simplest atom, consisting of a sin- 
gle electron bound to a single proton by the electrical 
attraction of opposite charges. The proton is almost 
2000 times heavier than the electron, so it remains 
essentially at rest, and the potential energy of the electron
is given by Coulomb's law, $V(r) = -e^2/(4\pi\epsilon_0 r)$,
where $e = 1.602 \times 10^{-19}$ C is the charge of the proton
($-e$ is the charge of the electron). The radial equation
(10) reads

\[ -\frac{\hbar^2}{2m}\frac{d^2u}{dr^2} + \left[-\frac{1}{4\pi\epsilon_0}\frac{e^2}{r} + \frac{\hbar^2}{2m}\frac{l(l+1)}{r^2}\right]u = Eu. \]

It admits a discrete set of bound states, with energies

\[ E_n = -\left[\frac{m}{2\hbar^2}\left(\frac{e^2}{4\pi\epsilon_0}\right)^2\right]\frac{1}{n^2} = -\frac{13.606\ \text{eV}}{n^2} \tag{13} \]

for $n = 1, 2, 3, \ldots$ (as well as a continuum of scattering
states with $E > 0$). The normalized wave functions are

\[ \psi_{nlm} = \ldots, \tag{14} \]

where $L_q^p(x)$ is an associated Laguerre polynomial
[II.29], and $a = 4\pi\epsilon_0\hbar^2/me^2 = 0.5292 \times 10^{-10}$ m is the
Bohr radius.

The principal quantum number, n, tells you the 
energy of the state (13); the (misnamed) azimuthal 
quantum number l, which ranges from 0 to n - 1, spec- 
ifies the magnitude of the orbital angular momentum, 
and $m$ gives its $z$ component (12). Since $s = \frac{1}{2}$ for the
electron, there are two linearly independent orientations
for its spin. All told, the degeneracy of $E_n$ (that
is, the number of distinct states sharing this energy)
is $2n^2$.
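
The count $2n^2$ follows from simply listing the allowed labels: $l$ runs from 0 to $n - 1$, $m$ from $-l$ to $l$, and $m_s = \pm\frac{1}{2}$. A minimal sketch of that enumeration is given below; the function names are assumptions for the example, and the energy uses the value quoted in (13).

```python
from fractions import Fraction

E1_EV = -13.606  # ground-state energy in eV, as quoted in (13)

def energy(n):
    """Bound-state energy E_n in eV."""
    return E1_EV / n ** 2

def states(n):
    """All (n, l, m, m_s) labels sharing the energy E_n."""
    half = Fraction(1, 2)
    return [(n, l, m, ms)
            for l in range(n)          # l = 0, ..., n - 1
            for m in range(-l, l + 1)  # m = -l, ..., l
            for ms in (half, -half)]   # spin up or spin down

degeneracies = {n: len(states(n)) for n in range(1, 5)}
```

For $n = 1, 2, 3, 4$ the enumeration yields 2, 8, 18, and 32 states respectively.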

If the atom makes a transition, or “quantum jump,” 
to a state with lower energy, a photon (quantum of 
light) is emitted, carrying the energy released. (With the 
absorption of a photon it could make a transition in the 
other direction.) The energy of a photon is related to its 
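frequency $\nu$ by the Planck equation $E = 2\pi\hbar\nu$, so the
spectrum of hydrogen is given by

\[ \frac{1}{\lambda} = R\left(\frac{1}{n_{\text{final}}^2} - \frac{1}{n_{\text{initial}}^2}\right), \tag{15} \]

with

\[ R = \frac{m}{4\pi c\hbar^3}\left(\frac{e^2}{4\pi\epsilon_0}\right)^2 = 1.097 \times 10^7\ \text{m}^{-1}, \tag{16} \]

where $\lambda = c/\nu$ is the wavelength (color) of the light.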
The Rydberg formula (15), with R as an empirical con- 
stant, was discovered experimentally in the nineteenth 
century; Bohr derived it (and obtained the expression 
for R) in 1913, using a serendipitous mixture of inappli- 
cable classical physics and primitive quantum theory. 
Schrödinger put it on a rigorous theoretical footing in
1926.
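
As an illustration of the Rydberg formula (15), the following sketch computes the first few wavelengths emitted in jumps down to $n = 2$ (the visible Balmer lines), using the rounded value of $R$ from (16). The function name and the choice of lines are assumptions made for the example.

```python
RYDBERG = 1.097e7  # m^-1, rounded value quoted in (16)

def wavelength_nm(n_initial, n_final):
    """Wavelength (nm) of the photon emitted in the jump n_initial -> n_final."""
    if n_final >= n_initial:
        raise ValueError("emission requires n_final < n_initial")
    inverse_wavelength = RYDBERG * (1 / n_final ** 2 - 1 / n_initial ** 2)
    return 1e9 / inverse_wavelength

# Jumps 3 -> 2, 4 -> 2, 5 -> 2, 6 -> 2.
balmer = [wavelength_nm(n, 2) for n in (3, 4, 5, 6)]
```

The first entry comes out near 656 nm (red) and the second near 486 nm (blue-green), with the lines crowding together toward shorter wavelengths.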






5 Composite Structures 


The theory generalizes easily to systems with more 
than one particle. For example, two particles in one 
dimension would be described by the wave function
$\Psi(x_1, x_2, t)$, where

\[ \int_{x_1=a_1}^{b_1}\int_{x_2=a_2}^{b_2} |\Psi(x_1, x_2, t)|^2\,dx_1\,dx_2 \]

is the probability of finding particle 1 between $a_1$ and
$b_1$ and particle 2 between $a_2$ and $b_2$, if a measurement
is made at time $t$.

But here quantum mechanics introduces a new twist: 
suppose the two particles are absolutely identical, so 
there is no way of knowing which is #1 and which 
#2. In classical physics such indistinguishability is 
unthinkable — you could always stamp a serial number 
on each particle. But you cannot put labels on electrons; 
they simply do not possess independent identities— the 
theory must treat the two particles on an equal footing. 
There are essentially two possibilities (others occur in 
special geometries): under interchange, $\Psi$ can be symmetric,
$\Psi(x_2, x_1) = \Psi(x_1, x_2)$ (bosons), or antisymmetric,
$\Psi(x_2, x_1) = -\Psi(x_1, x_2)$ (fermions). Some elementary
particles (electrons and quarks, for example) are
fermions; others (photons and pions, for instance) are 
bosons. The distinction is related to the spin of the par- 
ticle: particles of integer spin are bosons, whereas parti- 
cles of half-integer spin are fermions. (This “connection 
between spin and statistics” can be proved using rela- 
tivistic quantum mechanics, but in the nonrelativistic 
theory it is simply an empirical fact.) 

Suppose we have two particles, one in state $\psi_a(x)$
and the other in state $\psi_b(x)$. If the particles are
distinguishable (an electron and a proton, say) and the
first is in state $\psi_a$, then the composite wave function
is $\psi(x_1, x_2) = \psi_a(x_1)\psi_b(x_2)$. But if they are identical
bosons (two pions, say), we must use the symmetric
combination,

\[ \psi(x_1, x_2) = \frac{1}{\sqrt{2}}[\psi_a(x_1)\psi_b(x_2) + \psi_a(x_2)\psi_b(x_1)], \]

and for identical fermions (two electrons, say),

\[ \psi(x_1, x_2) = \frac{1}{\sqrt{2}}[\psi_a(x_1)\psi_b(x_2) - \psi_a(x_2)\psi_b(x_1)]. \]

(The normalization factor $1/\sqrt{2}$ assumes that $\psi_a$ and
$\psi_b$ are orthonormal.)

For example, we might calculate the expectation 
value of the square of the separation distance between 
the two particles. If they are distinguishable, then

\[ \langle (x_1 - x_2)^2 \rangle = \langle x^2 \rangle_a + \langle x^2 \rangle_b - 2\langle x \rangle_a \langle x \rangle_b \equiv \Delta^2, \]

where $\langle x \rangle_a$ is the expectation value of $x$ for a particle
in the state $\psi_a$, and so on. If, however, the particles
are identical bosons, then there is an extra term,
$\langle (x_1 - x_2)^2 \rangle = \Delta^2 - 2|\langle x \rangle_{ab}|^2$, and if they are
identical fermions, then
$\langle (x_1 - x_2)^2 \rangle = \Delta^2 + 2|\langle x \rangle_{ab}|^2$, where

\[ \langle x \rangle_{ab} = \int x\,\psi_a(x)^* \psi_b(x)\,dx. \]

Thus bosons tend to be closer together, and fermions 
farther apart, than distinguishable particles in the 
same two states. It is as if there were an attractive 
force between identical bosons and a repulsive force 
between identical fermions. We call these exchange 
forces, though there is no actual force involved— it is 
just an artifact of the (anti)symmetrization requirement.
Exchange forces are responsible for ferromagnetism,
and they contribute to covalent bonding. (If you
include spin, it is the total wave function that must 
be antisymmetrized; if the spin part is antisymmetric, 
then the position wave function must actually be sym- 
metric (for fermions), and the exchange force is attrac- 
tive. Covalent bonding occurs when shared electrons 
cluster in the region between two nuclei, tending to pull 
them together.) 
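
The size of the exchange effect is easy to estimate numerically. The sketch below takes, purely as an illustrative assumption, the two lowest states of an infinite square well on $0 \le x \le 1$, $\psi_n(x) = \sqrt{2}\sin(n\pi x)$, and evaluates the mean square separation for distinguishable particles, identical bosons, and identical fermions; the Simpson-rule integrator and all names are choices made for the example.

```python
import math

def psi(n, x):
    """Stationary state of an infinite square well on [0, 1] (illustrative choice)."""
    return math.sqrt(2) * math.sin(n * math.pi * x)

def integrate(f, steps=2000):
    """Composite Simpson rule on [0, 1]; steps must be even."""
    h = 1.0 / steps
    total = f(0.0) + f(1.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3

def mean_x(n):
    return integrate(lambda x: x * psi(n, x) ** 2)

def mean_x2(n):
    return integrate(lambda x: x ** 2 * psi(n, x) ** 2)

a, b = 1, 2
x_ab = integrate(lambda x: x * psi(a, x) * psi(b, x))  # real wave functions
distinguishable = mean_x2(a) + mean_x2(b) - 2 * mean_x(a) * mean_x(b)
bosons = distinguishable - 2 * abs(x_ab) ** 2
fermions = distinguishable + 2 * abs(x_ab) ** 2
```

With these states the overlap integral is $-16/9\pi^2$, so the boson value falls well below the distinguishable one and the fermion value well above it.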

The (anti)symmetrization requirement gives rise to
very different statistical mechanics for bosons and
fermions. In particular, two identical fermions cannot
occupy the same state, since if $\psi_a = \psi_b$ the composite
wave function $\psi(x_1, x_2)$ vanishes. This is the
famous Pauli exclusion principle. There is no such rule 
for bosons, and at extremely low temperatures iden- 
tical bosons tend to congregate in the ground state, 
forming a Bose-Einstein condensate. In a large sample 
at (absolute) temperature T, the most probable number 
of particles in a (one-particle) state with energy E is 


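\[ n(E) = \begin{cases} e^{-(E - \mu)/kT}, & \text{Maxwell-Boltzmann}, \\[4pt] \dfrac{1}{e^{(E - \mu)/kT} + 1}, & \text{Fermi-Dirac}, \\[4pt] \dfrac{1}{e^{(E - \mu)/kT} - 1}, & \text{Bose-Einstein}. \end{cases} \]

Here, $k = 1.3807 \times 10^{-23}$ J K$^{-1}$ is Boltzmann’s constant,
and $\mu$ is the chemical potential; it is a function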
of temperature and depends on the nature of the parti- 
cles. The Maxwell-Boltzmann distribution (the classical 
result) applies to distinguishable particles, the Fermi- 
Dirac distribution is for identical fermions, and the 
Bose-Einstein distribution is for identical bosons. 
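
The three occupation formulas differ only in the $\pm 1$ in the denominator, which matters when $(E - \mu)/kT$ is of order one and becomes negligible when it is large. A small sketch, with function and label names chosen for illustration only:

```python
import math

def occupation(energy, mu, kT, statistics):
    """Most probable occupation n(E) of a one-particle state."""
    x = (energy - mu) / kT
    if statistics == "maxwell-boltzmann":
        return math.exp(-x)
    if statistics == "fermi-dirac":
        return 1 / (math.exp(x) + 1)
    if statistics == "bose-einstein":
        if x <= 0:
            raise ValueError("Bose-Einstein occupation requires E > mu")
        return 1 / (math.exp(x) - 1)
    raise ValueError("unknown statistics: " + statistics)

KINDS = ("maxwell-boltzmann", "fermi-dirac", "bose-einstein")
near = {kind: occupation(1.0, 0.0, 1.0, kind) for kind in KINDS}   # (E - mu)/kT = 1
far = {kind: occupation(10.0, 0.0, 1.0, kind) for kind in KINDS}   # (E - mu)/kT = 10
```

At $(E - \mu)/kT = 1$ the Bose-Einstein occupation exceeds the classical value and the Fermi-Dirac occupation lies below it; at $(E - \mu)/kT = 10$ all three nearly coincide.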

The Pauli principle is crucial in accounting for the 
periodic table of the elements. Atoms are labeled by 
their atomic number Z, the number of protons in the 
nucleus, which is also the number of electrons in orbit
(the number of neutrons distinguishes different isotopes).
Since the nucleus remains essentially at rest,
only the behavior of the electrons is at issue in atomic 
physics. Hydrogen has one electron, helium has two 
(and two protons in the nucleus), lithium three, and so 
on. Ordinarily, each electron will settle into the lowest- 
energy accessible state. But the Pauli principle forbids 
any given state from being occupied by more than one 
electron, so they fill up the allowed states in succes- 
sion, and this means that the outermost (valence) elec- 
trons (which are responsible for chemical behavior) are 
differently situated for different elements. 

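The Hamiltonian for an atom with atomic number $Z$
is

\[ H = \sum_{j=1}^{Z}\left\{-\frac{\hbar^2}{2m}\nabla_j^2 - \frac{1}{4\pi\epsilon_0}\frac{Ze^2}{r_j}\right\} + \frac{e^2}{8\pi\epsilon_0}\sum_{j \ne k}\frac{1}{|\mathbf{r}_j - \mathbf{r}_k|}, \]

where $\mathbf{r}_j$ is the position of the $j$th electron,
$r_j = |\mathbf{r}_j|$, and $\nabla_j^2$ is the Laplacian with respect to
$\mathbf{r}_j$. The term in curly brackets represents the kinetic and
potential energy of the $j$th electron in the electric field of the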
nucleus, and the other term is the potential energy 
associated with the mutual repulsion of the electrons. 
(A tiny relativistic correction and magnetic coupling 
between the spin of the electron and the orbital motion 
account for fine structure, while an even smaller mag- 
netic coupling between the spin of the electron and the 
spin of the nucleus leads to hyperfine structure.) 

If we ignore the electron-electron repulsion alto- 
gether, we simply have Z independent electrons, each 
in the field of a nucleus of charge Ze; we simply copy 
the results for hydrogen, replacing $e^2$ by $Ze^2$. The first
two electrons would occupy the orbital, $n = 1$, $l = 0$,
m = 0, the next eight fill out the n = 2 orbitals, and 
so on. This simple scheme explains the first two rows 
of the periodic table (up through neon). One would 
anticipate a third row of eighteen, but after argon 
the electron-electron repulsion finally catches up, and 
potassium skips to n = 4, l = 0 for the next electron, 
in preference to n = 3, l = 2. The details get com- 
plicated and are handled by sophisticated approxima- 
tion schemes (hydrogen is the only atom for which the 
Schrodinger equation can be solved analytically). 

The addition of angular momenta is an interesting 
problem in quantum mechanics. Suppose I want to 
combine $|l_1\,m_1\rangle$ with $|l_2\,m_2\rangle$; what is the resulting
combined state, $|l\,m\rangle$? For instance, the electron in a
hydrogen atom has orbital angular momentum and spin
angular momentum; what is its total angular momen- 
tum? (I will use l as the generic letter for the angular 


momentum quantum number — it could be orbital, or 
spin, or total, as the case may be.) Because $L = L_1 + L_2$,
the $z$ components add: $m = m_1 + m_2$. But the total
angular momentum quantum number $l$ depends on the
relative orientation of $L_1$ and $L_2$, and it can range from
$|l_1 - l_2|$ (roughly speaking, when they are antiparallel)
to $l_1 + l_2$ (parallel), in integer steps:

\[ l = |l_1 - l_2|,\ |l_1 - l_2| + 1,\ \ldots,\ l_1 + l_2 - 1,\ l_1 + l_2. \]

Specifically,

\[ |l_1\,m_1\rangle|l_2\,m_2\rangle = \sum_{l = |l_1 - l_2|}^{l_1 + l_2} C^{l_1 l_2 l}_{m_1 m_2 m}\,|l\,m\rangle, \]

where $C^{l_1 l_2 l}_{m_1 m_2 m}$ are the so-called Clebsch-Gordan
coefficients, which are tabulated in many handbooks.
Mathematically, what we are doing is decomposing the direct
product of two irreducible representations of SU(2) 
(the covering group for SO(3), the rotation group in 
three dimensions) into a direct sum of irreducible 
representations. For instance, 

If you had a hydrogen atom in the state V 2 , and the 
electron had spin up ( m s = |), the probability is | that 
a measurement of the total l would return the value | 
and | that it would yield | (a measurement of m would 
be certain to give §). The Clebsch-Gordan coefficients 
work the other way, too. If I have a system in the angular 
momentum state $|l\,m\rangle$, composed of two particles with
$l_1$ and $l_2$, and I want to know the possible values of $m_1$
and $m_2$, I would use

\[ |l\,m\rangle = \sum_{m_1 = -l_1}^{l_1}\sum_{m_2 = -l_2}^{l_2} C^{l_1 l_2 l}_{m_1 m_2 m}\,|l_1\,m_1\rangle|l_2\,m_2\rangle, \]

where the sum is over all combinations of $m_1$ and $m_2$
such that $m_1 + m_2 = m$.
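
The rule for the allowed values of the total $l$ can be checked by counting states: the product space has $(2l_1 + 1)(2l_2 + 1)$ states, and this must equal the sum of $2l + 1$ over the allowed totals. The sketch below does the bookkeeping with exact fractions; the function names and the sample values $l_1 = 2$, $l_2 = \frac{1}{2}$ are assumptions for the example, and no Clebsch-Gordan coefficients are computed.

```python
from fractions import Fraction

def allowed_totals(l1, l2):
    """Total angular momentum values |l1 - l2|, ..., l1 + l2 in integer steps."""
    l1, l2 = Fraction(l1), Fraction(l2)
    l = abs(l1 - l2)
    totals = []
    while l <= l1 + l2:
        totals.append(l)
        l += 1
    return totals

def multiplicity(l):
    """Number of m values, 2l + 1."""
    return int(2 * l + 1)

l1, l2 = Fraction(2), Fraction(1, 2)
totals = allowed_totals(l1, l2)
product_dimension = multiplicity(l1) * multiplicity(l2)
sum_dimension = sum(multiplicity(l) for l in totals)
```

Here the allowed totals are $\frac{3}{2}$ and $\frac{5}{2}$, and both counts give ten states, which is the dimension bookkeeping behind the decomposition into irreducible representations.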

6 Implications, Applications, Extensions 

In quantum mechanics the state of a system is represented
by a (normalized) vector ($|s\rangle$) in Hilbert space;
observables are represented by (self-adjoint) operators
acting on vectors in this space. The theory rests on
three pillars: the Schrödinger equation, which in its
most general form tells you how the state vector $|s\rangle$
evolves in time,

\[ i\hbar\frac{d}{dt}|s\rangle = H|s\rangle; \tag{17} \]

the statistical interpretation, which tells you how $|s\rangle$
determines the outcome of a measurement; and the
(anti)symmetrization requirement, which tells you how
to construct $|s\rangle$ for a collection of identical particles.
Although the calculational procedures are unambigu- 
ous and the results are in spectacular agreement with 
experiment, the physical interpretation of quantum 
mechanics (the story we tell ourselves about what we 
are doing) has always been controversial. Some of the 
major issues are listed below. 

Indeterminacy. Einstein argued, most persuasively in
the famous EPR (Einstein-Podolsky-Rosen) paradox,
that quantum indeterminacy means the theory
is incomplete; other information (so-called hidden
variables) must supplement the state vector to provide
a full description of physical reality. By contrast,
Bohr’s Copenhagen interpretation holds that systems
simply do not have definite properties (such as position
or angular momentum) prior to measurement.

Nonlocality. The EPR argument is predicated on an
assumption of locality: no influence can propagate
faster than light. But in quantum mechanics, two par- 
ticles can be entangled such that a measurement on 
one determines the state of the other and, if they are 
far apart, then that “influence” (the collapse of the 
wave function) is instantaneous. In 1964 John Bell 
proved that no local deterministic (hidden-variable) 
theory can be compatible with quantum mechanics. 
In subsequent experiments the quantum mechani- 
cal predictions were sustained. Evidently, nonlocality 
is a fact of nature, though the “influences” in ques- 
tion are not strictly causal— they produce no effects 
that can be discerned by examining the second par- 
ticle alone. (A causal nonlocal influence would be 
incompatible with special relativity.) 

Measurement. It has never been clear what consti- 
tutes a “measurement,” as the word is used in the 
statistical interpretation. Why does a measurement 
force the system (which previously did not possess 
a determinate value of the observable in question) 
to “take a stand,” and how does it collapse the wave 
function? Does “measurement” mean something peo- 
ple in white coats do in the laboratory, or can it 
occur in a forest, when no one is looking? Does mea- 
surement require the leaving of a permanent record 
or the interaction of a microscopic system with a 
macroscopic device? Does it involve the intervention 
of human consciousness? Or does it perhaps split 
reality into many worlds? After nearly a century of 
debate, there is no consensus on these questions.

Decoherence. If an electron can have an indeterminate
position, what about a baseball? Could a baseball be 


in a linear combination of Seattle and San Francisco 
(until, I guess, the batters swing, and there is a home 
run in Seattle and nothing at all in San Francisco)? 
There is something absurd about the very idea of a 
macroscopic object being in two places at once (or an 
animal, in Schrödinger's famous cat paradox, being
both alive and dead). Why do macroscopic objects 
obey the familiar classical laws, while microscopic 
objects are subject to the bizarre rules of quantum 
mechanics? Presumably if you could put a baseball 
into a state that stretched from Seattle to San Fran- 
cisco, its wave function would very rapidly decohere 
into a state localized at some specific place. By what 
mechanism? Maybe multiple interactions with ran- 
dom impinging particles (photons left over from the 
Big Bang, perhaps) constitute a succession of “mea- 
surements” and collapse the wave function. But the 
details remain frustratingly elusive. 

I have mentioned two kinds of solutions to the
Schrödinger equation, for time-independent potentials:
(normalizable) bound states, at discrete negative energies,
and (nonnormalizable) scattering states, at a continuum
of positive energies. There is a third important
case, which occurs for periodic potentials, $V(x + a) = V(x)$,
as in a crystal lattice. For these the spectrum
forms bands, with continua of allowed energies sepa- 
rated by forbidden gaps. This band structure is crucial 
in accounting for the behavior of solids and underlies 
most of modern electronics. 

Very few quantum problems can be solved exactly, 
so a number of powerful approximation methods have 
been developed. Here are some examples. 

The variational principle. The lowest eigenvalue of a 
Hamiltonian H is less than or equal to the expectation 
value of $H$ in any normalized state $|s\rangle$:

\[ E_0 \le \langle s|H|s\rangle. \tag{18} \]

To get an upper bound on the ground-state energy of
a system, then, you simply pick any (normalized) vector
$|s\rangle$ and calculate $\langle s|H|s\rangle$ in that state. Ordinarily,
you will get a tighter bound if your “trial state” bears 
some resemblance to the actual ground state. In a 
typical application the trial state carries a number of 
adjustable parameters, which are then chosen so as 
to minimize (H). The binding energy of helium, for 
example, has been calculated in this way and is con- 
sistent with the measured value to better than eight 
significant digits.

Time-independent perturbation theory. Suppose you
know the eigenstates $\{|n\rangle\}$ and eigenvalues $\{E_n\}$ of
some (time-independent) Hamiltonian $H$. Now you
perturb that Hamiltonian slightly, $H \to H + H'$. The
resulting change in the $n$th eigenvalue is approximately
equal to the expectation value of $H'$ in the
unperturbed state $|n\rangle$:

\[ \Delta(E_n) \approx \langle n|H'|n\rangle. \tag{19} \]

For example, the fine structure of hydrogen can be 
calculated very accurately in this way, starting with 
the wave functions in (14). 

Time-dependent perturbation theory. Again, suppose 
you know the eigenstates and eigenvalues of some 
time-independent Hamiltonian, and now you (briefly) 
turn on a small time-dependent perturbation, $H'(t)$.
What is the transition rate (probability per unit time)
to go from state $|n\rangle$ to state $|m\rangle$? The answer is given
by

\[ R_{n \to m} = \frac{2\pi}{\hbar}|H'_{mn}|^2\rho(E_m), \]

where $H'_{mn} = \langle m|H'|n\rangle$ is the so-called matrix
element for the transition, and $\rho$ is the density of
states (the number of states per unit energy). Fermi
called this the golden rule. It is used, for example, to 
calculate the lifetime of an excited state. 
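
As a small worked instance of the variational principle (18), consider the delta-function well $V(x) = -\alpha\delta(x)$ of section 3 with a Gaussian trial state $(2b/\pi)^{1/4}e^{-bx^2}$, for which $\langle H\rangle = \hbar^2 b/2m - \alpha\sqrt{2b/\pi}$. The sketch below minimizes this over the adjustable parameter $b$ and compares the result with the exact bound-state energy $-m\alpha^2/2\hbar^2$. The choice of trial state, the units $\hbar = m = \alpha = 1$, the golden-section minimizer, and all names are assumptions made for illustration.

```python
import math

HBAR = 1.0   # assumed units: hbar = m = alpha = 1
MASS = 1.0
ALPHA = 1.0

def trial_energy(b):
    """<H> for the normalized Gaussian trial state (2b/pi)^(1/4) exp(-b x^2)."""
    kinetic = HBAR ** 2 * b / (2 * MASS)
    potential = -ALPHA * math.sqrt(2 * b / math.pi)  # -alpha |psi(0)|^2
    return kinetic + potential

def golden_section_minimize(f, lo, hi, tol=1e-10):
    """Minimize a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi = d
        else:
            lo = c
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
    return (lo + hi) / 2

best_b = golden_section_minimize(trial_energy, 1e-6, 5.0)
variational_bound = trial_energy(best_b)
exact_energy = -MASS * ALPHA ** 2 / (2 * HBAR ** 2)
```

The minimum sits at $b = 2m^2\alpha^2/\pi\hbar^4$ and gives $-m\alpha^2/\pi\hbar^2$, which lies above the exact value, as (18) requires. A trial state with the cusp of the true ground state would do better than a smooth Gaussian.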

In two important respects quantum mechanics is 
obviously not the end of the story. First of all, the 
Schrödinger equation as it stands is inconsistent with
special relativity: as a differential equation it is second
order in $x$ but first order in $t$, whereas relativity
requires that they be treated on an equal footing.
Dirac introduced the eponymous relativistic equation
for particles of spin $\frac{1}{2}$ (the Dirac equation [III.9]) and
others followed (Klein-Gordon for spin 0, Proca for 
spin 1, and so on). Second, while the particles have been 
treated quantum mechanically, the fields have not. In 
the hydrogen atom, for example, the electric potential 
energy was taken from classical electrostatics. In a fully 
consistent theory the fields, too, must be quantized. 
The quantum of the electromagnetic field is the pho- 
ton, but although I have used this word once or twice, it 
has no place in quantum mechanics; it belongs instead 
to quantum field theory (specifically quantum electro- 
dynamics). In the standard model of elementary par- 
ticles, all known interactions save gravity are success- 
fully handled by relativistic quantum field theory. But a 
fully consistent union of quantum mechanics and gen- 
eral relativity [IV.40] (Einstein’s theory of gravity) 
still does not exist.

IV.24 Random-Matrix Theory

Jonathan Peter Keating 


1 Introduction 

Linear algebra and the analysis of systems of linear 
equations play a central role in applied mathemat- 
ics; for example, quantum mechanics [IV.23] is a lin- 
ear wave theory in which observables are represented 
by linear operators; linear models are fundamental in 
electromagnetism, acoustics, and water waves; and lin- 
earization is important in stability analysis. When a 
linear system is intrinsically complex, either by virtue 
of external stochastic forcing or because of an under- 
lying, persistent self-generated instability, it is natu- 
ral to model it statistically by assuming that the ele- 
ments of the matrices that appear in its mathematical 
description are in some sense random. This is simi- 
lar, philosophically, to the way in which statistical fea- 
tures of long trajectories in complex dynamical sys- 
tems are modeled statistically via notions of ergodic- 
ity and mixing. An example is when classical dynamics 
is modeled by statistical mechanics, that is, where one 
deduces statistical properties of the solutions of sys- 
tems of equations, in this case Newton’s equations of 
motion, by analyzing ensembles of similar trajectories 
and invoking notions of ergodicity (that time averages 
equal ensemble averages in the appropriate limit). 

This is one significant motivation for exploring the 
properties of random matrices. There are, however, 
many others: linear algebra and probabilistic models 
both play a foundational role in mathematics, and 
it is therefore not surprising that they combine in 
a wide range of applications, including mathematical 
biology, financial mathematics, high-energy physics, 
condensed matter physics, numerical analysis, neuro- 
science, statistics, and wireless communications. 

In many of the examples listed above, one has a sys- 
tem of linear equations that can be written in matrix
form. One might then be interested in the eigenvalues
or eigenvectors of the matrix in question, if it is square. 
For example, one might want to know the range of 
the spectrum, its density within that range, and the 
nature of fluctuations/correlations in the positions of 
the eigenvalues. Alternatively, one might want an esti- 
mate of the condition number of the matrix or the 
values taken by its characteristic polynomial. 

If the matrix possesses no special features— that is, 
if its entries do not obey simple rules— it is natural to 
consider whether it behaves like one that is a “typical” 
member of some appropriate class of matrices. This 
again motivates the study of random matrices, that is, 
matrices whose entries are random variables. 

The basic formulation of random-matrix theory is 
that one has a space of matrices X endowed with 
a probability measure P(X). This is termed a matrix 
ensemble. We shall here focus on square matrices of 
dimension N. One can then seek to determine the prob- 
ability distribution of the eigenvalues and eigenvectors, 
and of other related quantities of interest. The motiva- 
tion is that in many cases one can prove that spectral 
averages for a given matrix coincide, with probability 
1, with ensemble averages, when $N \to \infty$.

The following are examples of such matrix ensem- 
bles. 

• Let $X$ be an $N \times N$ matrix with real or complex-valued
entries $X_{mn}$ satisfying $X_{mn} = \overline{X_{nm}}$ (so the
matrix is either real symmetric or complex Hermitian).
Take the entries $X_{nm}$, $n < m$, to be independent
zero-mean real or complex-valued random
variables, and the entries $X_{nn}$ to be independent,
identically distributed, centered real-valued
random variables, so $P(X)$ factorizes as
$P(X) = \prod_{m \le n} P_{mn}(X_{mn})$. This defines the ensemble of
Wigner random matrices.

• Let X be an N x N real symmetric matrix. When the 
probability measure is invariant under all orthogonal
transformations of $X$, i.e., when $P(OXO^{T}) = P(X)$
for all orthogonal ($OO^{T} = I$) matrices $O$, then
the ensemble is called orthogonal invariant. 

• Let $X$ be an $N \times N$ complex Hermitian matrix. When
the probability measure is invariant under all unitary
transformations of $X$, i.e., when $P(UXU^{\dagger}) = P(X)$
for all unitary ($UU^{\dagger} = I$) matrices $U$, then the
ensemble is called unitary invariant.

• The Gaussian orthogonal ensemble (GOE) is the
(unique) orthogonal invariant ensemble of Wigner
random matrices. It has $P(X) \propto \exp(-\frac{1}{4}\operatorname{Tr} X^2)$.
Similarly, the Gaussian unitary ensemble (GUE) is
the (unique) unitary invariant ensemble of Wigner
random matrices. It has $P(X) \propto \exp(-\frac{1}{2}\operatorname{Tr} X^2)$.

• The most general class of random N x N matri- 
ces corresponds to taking the matrix elements 
X mn to be real or complex-valued independent 
identically distributed random variables with, for 
example, zero mean and unit variance. When these 
random variables have a Gaussian distribution, 
the matrices form what is known as the Ginibre 
ensemble. 

• The circular ensembles correspond to $N \times N$
unitary matrices with probability measures that are
invariant under all orthogonal (COE) or unitary
(CUE) transformations. Matrices in the COE are
unitary and symmetric. Alternatively, one can
consider ensembles corresponding to the Haar
measure on the classical compact groups (i.e., the
measure that is invariant under the group action):
for example, the orthogonal group $O(N)$, comprising
$N \times N$ orthogonal matrices, or the unitary
group $U(N)$, which coincides with the CUE, or the
symplectic group $Sp(2N)$.

• Let $X$ be an $N \times k$ matrix, each row of which is
drawn independently from a $k$-variate normal
distribution with zero mean. The Wishart ensemble
corresponds to the matrices $X^{T}X$.

These are far from being the only examples that
are important (one can also define random matrices
whose elements are quaternionic, or which have
additional structure, i.e., invariance under additional
symmetries), but they will suffice to illustrate a
number of general questions and themes.
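Several of the ensembles listed above are straightforward to sample numerically. The following is a minimal NumPy sketch, not a reference implementation: the function names are hypothetical, the Gaussian normalizations are chosen to match the densities quoted for the GOE and GUE, and the Wishart sampler assumes an identity covariance matrix for simplicity.

```python
import numpy as np


def sample_goe(n, rng):
    """Real symmetric matrix with density proportional to exp(-Tr(X^2)/4)."""
    a = rng.standard_normal((n, n))
    # Off-diagonal entries get variance 1, diagonal entries variance 2.
    return (a + a.T) / np.sqrt(2.0)


def sample_gue(n, rng):
    """Complex Hermitian matrix with density proportional to exp(-Tr(X^2)/2)."""
    a = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    # Off-diagonal entries satisfy E|X_mn|^2 = 1; the diagonal is real with variance 1.
    return (a + a.conj().T) / np.sqrt(2.0)


def sample_ginibre(n, rng):
    """Complex matrix with independent Gaussian entries, E|X_mn|^2 = 1."""
    return (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)


def sample_wishart(n, k, rng):
    """k x k matrix X^T X, with the n rows of X drawn from N(0, I_k) (assumed covariance)."""
    x = rng.standard_normal((n, k))
    return x.T @ x


rng = np.random.default_rng(0)
goe, gue = sample_goe(5, rng), sample_gue(5, rng)
ginibre = sample_ginibre(5, rng)
wishart = sample_wishart(8, 3, rng)
```

The symmetrization step is what turns a matrix of independent Gaussians into a member of an invariant ensemble; the Ginibre sampler simply omits it.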

For any of the above ensembles of random matrices, 
one can first ask where the eigenvalues typically lie, 
and what their mean density is. In ensembles where the 
eigenvalues are real, for example, how many can one 
expect to find in a given interval on average? One can, 
in particular, ask whether there is a limiting density 
when the matrix size tends to infinity. 

Second, one can seek to understand fluctuations 
about the mean in the eigenvalue distribution. For 
example, in an interval of length such that the expected 
number of eigenvalues is k, what is the distribution of 
the actual numbers in the interval for different matrices 
in the ensemble? Again, is there a limiting distribution 
when the matrix size tends to infinity? Or, similarly, 
how do the gaps between adjacent eigenvalues fluctuate
around their mean? How, if at all, are eigenvalues
correlated on the scale of their mean spacing? How
are other functions over the ensemble distributed? For
example, the values of the characteristic polynomials
of the matrices, their traces, their condition numbers,
the sizes of the largest and smallest eigenvalues?

Figure 1 The spectrum of a randomly chosen complex
Hermitian (GUE) matrix of dimension N = 20.

Third, how do the answers to these questions depend 
on the particular choice of probability measure intrin- 
sic to the definition of a given ensemble, and on the 
symmetries (if any) of the matrices involved? If they 
depend on the choice of the probability measure, then 
how do they? Is there a class of probability measures 
for which they can be said to be universal in an appro- 
priate limit? Are there choices of probability measure 
that are in some sense exactly solvable? 

Fourth, given answers to questions such as these, 
how can they be used to shed light on applications? 

Our aim here is to give an introductory overview of 
some of these issues. 

In order to motivate and illustrate the theory, it may 
be helpful to anticipate it with the results of some 
numerical computations. Figure 1 shows the spectrum 
of a complex Hermitian matrix of dimension N = 20, 
picked at random from the GUE by generating normal 
random variables to fill the independent matrix entries. 
One sees that the spectrum is denser at the center than 
at the edges and that the eigenvalues do not often lie
close to each other. 

To illustrate these points further, figure 2 shows the
spectral measure (i.e., the local eigenvalue density; see
equation (1) below) of a GUE matrix of dimension N =
500 000, with the eigenvalues divided by $\sqrt{N}$.
Figure 3 shows a similar plot with data averaged over
the spectra of $2 \times 10^{6}$ matrices with N = 500. One
sees that after dividing by $\sqrt{N}$ these spectra appear
to lie between -2 and +2, that the picture for an
individual matrix is very similar (indeed, on the scale
used, figure 2 is indistinguishable from figure 3) to that
of an average over a large number of matrices, and that
the eigenvalue density appears to have a simple form
that is well described by the result of an analytical
random-matrix calculation (represented by curves) to be
explained later. Similarly, figures 4 and 5 show the
probability densities of the spacings between adjacent
eigenvalues, further scaled to have unit mean separation
(see below), for individual GOE and GUE matrices of
dimension N = 500 000 and separately averaged over GOE
and GUE matrices of dimension N = 500, respectively. One
again sees a marked similarity between the behavior of
individual matrices and that of ensemble averages. In
both the GOE and GUE examples, the probability density
vanishes in the limit of small spacings; that is, the
eigenvalues behave as if they repel each other. Note
that the degree of the repulsion differs between the
two ensembles. Once more, these figures also show the
results of analytical random-matrix calculations to be
explained later.

Figure 2 The spectral measure (i.e., eigenvalue density (see
(1))) of a GUE matrix of dimension N = 500 000, with the
eigenvalues divided by $\sqrt{N}$ (crosses). The curve is a
prediction of random-matrix theory, which is described
later in the text.

Figure 3 The average of the spectral measures of
$2 \times 10^{6}$ GUE matrices with dimension N = 500, with
the eigenvalues divided by $\sqrt{N}$ (crosses). The curve is
a prediction of random-matrix theory, which is described
later in the text. Note that, on the scale shown, these data
look identical to those in figure 2 despite the fact that
they are in fact different (see section 7).

Figure 4 The probability density of the spacings between
adjacent eigenvalues for the GUE spectrum represented in
figure 2 (asterisks, black line) (when these eigenvalues are
scaled to have unit mean spacing), and the corresponding
distribution for a GOE matrix of the same dimension
(diamonds, gray line). The two curves represent the
corresponding random-matrix formulas, which are described
later in the text.

2 History 

The origins of random-matrix theory go back to work 
on multivariate statistics by Wishart, who introduced 
the Wishart ensemble in 1928 in his calculation of 
the maximum likelihood estimator for the covariance 
matrix of the multivariate normal distribution. Wishart 
worked at the Rothamsted Experimental Station, an 
agricultural research facility. 

Many of the most significant developments in the 
field were stimulated by ideas of Wigner in the 1950s
that related to nuclear physics. Wigner was interested 
in modeling the statistical distribution of the energy 
levels of heavy nuclei. He introduced the Wigner ensem- 
ble and the invariant Gaussian ensembles; calculated 
the mean eigenvalue density in these cases; and started 
to investigate spectral correlations, particularly the dis- 
tribution of spacings between adjacent eigenvalues. In 
focusing on heavy nuclei, Wigner had in mind applica- 
tions to many-body quantum systems, that is, systems 
with a large number of degrees of freedom. 

In the 1960s Wigner’s program was developed into
a systematic area of mathematical physics by Dyson,
Gaudin, and Mehta, who developed techniques for
calculating eigenvalue statistics in the Gaussian and
circular ensembles, and Marcenko and Pastur, who
extended the analysis to the Wishart ensemble.

Figure 5 The probability density of the spacings between
adjacent eigenvalues, averaged over the GUE spectra
represented in figure 3 (asterisks, black line) (when these
eigenvalues are scaled to have unit mean spacing), and the
corresponding distribution for $2 \times 10^{6}$ GOE matrices
with dimension N = 500 (diamonds, gray line). The two
curves represent the corresponding random-matrix formulas,
which are described later in the text. Note that, on the
scale shown, these data look identical to those in figure 4,
despite the fact that they are in fact different (again see
section 7).

The idea of applying random-matrix theory to sys- 
tems with a small number of degrees of freedom, in 
which complexity arises due to the internal dynamics, 
arose in the field of quantum chaos in the late 1970s 
and early 1980s in the work of Berry, Bohigas, and their 
coworkers. Here, the philosophy is that, when the clas- 
sical dynamics of a system is chaotic, the quantum 
dynamics may manifest that in the semiclassical limit 
(i.e., in the limit as the de Broglie wavelength tends 
to zero) by the corresponding matrix elements behav- 
ing like random variables. This same philosophy imme- 
diately generalizes, mutatis mutandis, to other wave 
theories such as optics, acoustics, and so on. 

The years since 1990 have seen a rapid devel- 
opment of interest in connections between random- 
matrix theory and, among other subjects, quantum 
field theory, high-energy physics, condensed matter
physics, lasers, biology, finance, growth models, wire- 
less communication, and number theory. There has 
also been considerable progress in proving universality 
and establishing links with the theory of integrable
systems. 


3 Eigenvalue Density 


In the case of matrix ensembles in which the
eigenvalues are real (e.g., Hermitian or real symmetric
matrices), one can ask how many lie in a given interval.
Let us consider first the case of $N \times N$ Wigner random
matrices $X$. It turns out that when the matrix elements
have finite second moment, the eigenvalues $\lambda_n(X)$
are, with extremely high probability, of the order of
$\sqrt{N}$. This fact is illustrated in figures 2 and 3.
Hence, defining the spectral measure by

$$\Lambda_N(z; X) = \frac{1}{N} \sum_{n=1}^{N} \delta\!\left(z - \frac{\lambda_n(X)}{\sqrt{N}}\right), \tag{1}$$

where $\delta(x)$ is the Dirac delta function, the proportion
of eigenvalues (normalized by $\sqrt{N}$) lying in the
interval $[a, b]$ is

$$\frac{1}{N}\, \#\{n : \lambda_n(X) \in \sqrt{N}\,[a, b]\} = \int_a^b \Lambda_N(z; X)\, dz.$$

As $N \to \infty$, $\Lambda_N$ converges, in the distributional
sense, to a limiting function of $z$. Specifically,
$\Lambda_N$ converges weakly to the semicircle law $\sigma$
given by

$$d\sigma(z) = \frac{1}{2\pi} \sqrt{4 - z^{2}}\; dz$$

when $-2 \le z \le 2$, and by 0 when $|z| > 2$.

This was first established for convergence in
probability by Wigner in 1955 under the condition that
all the moments of the matrix elements are finite.
However, it has subsequently been refined so that
convergence holds almost surely and under the condition
that the matrix elements have finite second moment. The
semicircle law is the curve plotted in figures 2 and 3.
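The comparison between a single spectrum and the semicircle law can be reproduced on a small scale. The sketch below is illustrative only: the matrix size, bin count, and helper names are arbitrary choices, and it uses a GUE matrix normalized so that the off-diagonal entries have unit variance.

```python
import numpy as np


def semicircle_density(z):
    """Density of the semicircle law on [-2, 2], zero outside."""
    z = np.asarray(z, dtype=float)
    inside = np.sqrt(np.clip(4.0 - z**2, 0.0, None)) / (2.0 * np.pi)
    return np.where(np.abs(z) <= 2.0, inside, 0.0)


def scaled_gue_eigenvalues(n, rng):
    """Eigenvalues of one GUE matrix, divided by sqrt(n)."""
    a = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    x = (a + a.conj().T) / np.sqrt(2.0)
    return np.linalg.eigvalsh(x) / np.sqrt(n)


rng = np.random.default_rng(1)
eigs = scaled_gue_eigenvalues(400, rng)

# Histogram of the spectral measure of a single matrix.
hist, edges = np.histogram(eigs, bins=20, range=(-2.0, 2.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_deviation = float(np.max(np.abs(hist - semicircle_density(centers))))
print(f"largest |histogram - semicircle| over bins: {max_deviation:.3f}")
```

Even at this modest size the histogram of one matrix tracks the limiting curve closely, which is the behavior the figures display at much larger scale.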

In the case of general Wigner matrices, one can estab- 
lish the semicircle law by considering moments of 
traces of powers of X. In the special cases of the GOE 
and the GUE, one can also use techniques that rely on 
the invariance properties of the measure (see below) to 
obtain precise estimates for the rate of convergence. 
In the case of the other invariant ensembles, one can
also establish a limiting density, but typically this is
not semicircular; it may be calculated from the specific
form of the probability.

For the Wishart ensemble and its generalizations, 
which play a major role in many applications, the ana- 
logue of the Wigner semicircle law is known as the 
Marcenko-Pastur law. 


For the circular ensembles and matrices from the 
classical compact groups, the analogue of the semicir- 
cle law is that in the large matrix limit, the eigenvalues 
become uniformly dense on the unit circle (on which 
they are constrained to lie by unitarity). 

In the case of non-Hermitian matrices, one can
similarly calculate the density of eigenvalues in the
complex plane. For example, in the Ginibre ensemble, the
eigenvalues $\lambda_n(X)$, when divided by $\sqrt{N}$, have
a limiting density that is given by the uniform measure
on the unit disk.
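The uniform limiting density for the Ginibre ensemble can be checked in the same spirit. This is a rough sketch under stated assumptions: it uses the complex Ginibre ensemble with unit-variance entries, and the radius 0.5 is an arbitrary test value. For a uniform measure on the unit disk, the fraction of eigenvalues within radius $r$ should be close to $r^{2}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Complex Ginibre matrix: independent Gaussian entries with E|X_mn|^2 = 1.
x = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)

# Eigenvalues are complex in general, so use the non-Hermitian solver.
radii = np.abs(np.linalg.eigvals(x)) / np.sqrt(n)

fraction_inside_half = float(np.mean(radii <= 0.5))
print(f"fraction with radius <= 0.5: {fraction_inside_half:.3f} (uniform disk: 0.25)")
print(f"largest radius: {radii.max():.3f}")
```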

4 Joint Eigenvalue Distribution 

A more refined question about the eigenvalues
$\lambda_n(X)$ of a random matrix $X$ than that of their
limiting density relates to the probability that they
all lie in some given set $S$ in the complex plane or,
when they are constrained to be real, on the real line.
For definiteness, we will focus here on the latter case.
Essentially, the issue is then to calculate the
probability that the first eigenvalue lies in the
interval $(x_1, x_1 + dx_1)$, the second in the interval
$(x_2, x_2 + dx_2)$, the third in the interval
$(x_3, x_3 + dx_3)$, etc. In general, this probability is
hard to derive in a useful form. For the invariant ensem- 
bles, however, it can be computed. The idea is that the 
probability measure defined in terms of the matrix ele- 
ments can be reexpressed in terms of the eigenvalues 
and the eigenvectors. In the invariant ensembles, the 
dependence on the eigenvectors may be straightfor- 
wardly integrated out, leaving just the dependence on 
the eigenvalues. The result is proportional to the
product of a Vandermonde determinant factor
$|\det(V)|^{\beta}$, with $V_{ij} = x_i^{j-1}$ and therefore

$$|\det(V)|^{\beta} = \prod_{1 \le i < j \le N} |x_j - x_i|^{\beta},$$

and a factor $w(x_1, \dots, x_N)$ coming from the precise
form of the probability measure associated with the
matrices. Here, $\beta = 1$ for orthogonal invariant
ensembles (e.g., the GOE) and $\beta = 2$ for unitary
invariant ensembles (e.g., the GUE). ($\beta = 4$ for
quaternionic matrices.) The Vandermonde factor comes from
the Jacobian associated with the change of variables from
the matrix elements to the eigenvalues. For the Gaussian
ensembles the weight factor is
$w(x_1, \dots, x_N) = \exp\bigl(-\tfrac{1}{4}\beta \sum_{n=1}^{N} x_n^{2}\bigr)$.
In the case of the Ginibre ensemble one has a similar
result, but the eigenvalues are generally complex. For
the circular ensembles, in which the eigenvalues lie on
the unit circle, the Vandermonde factor takes the form
$\prod_{1 \le i < j \le N} |e^{i\theta_j} - e^{i\theta_i}|^{\beta}$
and the weight factor $w$ is a constant. The joint
eigenvalue distribution for the Wishart ensemble has a
similar structure, but it also reflects the fact that the
eigenvalues are nonnegative.

Perhaps the most significant consequence of the 
structure of the joint eigenvalue distribution functions 
outlined in the previous paragraph is that the eigen- 
values of random matrices behave as if they repel each 
other. This follows from the Vandermonde factors, 
which vanish as any two eigenvalues approach each 
other, so the probability of having a pair of eigenvalues 
in close vicinity vanishes as their separation is reduced. 
Importantly, the rate of vanishing depends only on the
symmetry of the matrix ensemble (via the value of $\beta$)
and not on the detailed form of the probability
measure (which determines the multiplying weight factor).
By way of contrast, independent random numbers, for
which $\beta = 0$, do not share this behavior.
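The repulsion can be seen explicitly in the smallest nontrivial case. The following worked calculation is an illustration only: it takes $N = 2$ in the Gaussian joint distribution described above and introduces the spacing $s$ and center $c$ as auxiliary variables. The resulting $\beta = 1$ formula is usually called the Wigner surmise; it is a small-matrix approximation, not the exact large-$N$ spacing law.

```latex
% Joint density for N = 2 (Vandermonde factor times Gaussian weight):
P(x_1, x_2) \propto |x_1 - x_2|^{\beta}
    \exp\!\left(-\tfrac{1}{4}\beta\,(x_1^{2} + x_2^{2})\right).

% Change variables to s = x_1 - x_2 and c = (x_1 + x_2)/2, so that
% x_1^2 + x_2^2 = 2c^2 + s^2/2 and the Jacobian is 1:
P(s, c) \propto |s|^{\beta}\, e^{-\beta s^{2}/8}\, e^{-\beta c^{2}/2}.

% Integrating out c leaves the density of the spacing s > 0:
p(s) \propto s^{\beta}\, e^{-\beta s^{2}/8},
\qquad p(s) \sim s^{\beta} \ \text{as } s \to 0.

% For beta = 1, rescaling s to unit mean and normalizing gives
p(s) = \frac{\pi}{2}\, s\, e^{-\pi s^{2}/4}.
```

The exponent of $s$ comes entirely from the Vandermonde factor, while the Gaussian weight only sets the scale, in line with the remark that the rate of vanishing depends on $\beta$ alone.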

A key feature of the formulas for the joint eigenvalue 
distribution functions for the invariant ensembles is 
that they have a determinantal structure. This is the 
starting point for the analysis of the statistical distri- 
bution of the eigenvalues. The essential idea is that, by 
forming linear combinations of the rows and columns 
of $V$, one can generate polynomials in the variables
$x_n$ that are orthogonal [II.29] with respect to the
(ensemble-dependent) weight $w$. This often allows
integrals involving the joint eigenvalue distribution
functions to be computed, either exactly or asymptotically.
In the case of the GUE, this goes through straight- 
forwardly for any matrix size N (the polynomials in 
this case being the classical Hermite polynomials). The 
GOE requires more sophisticated analysis, but again 
the theory can be developed for any matrix size. In 
both cases the semicircle law can be obtained via this 
approach by integrating over all but one of the vari- 
ables in the joint eigenvalue distribution function. The 
CUE and COE can also be analyzed in this way. In the 
case of more general invariant ensembles, the orthog- 
onal polynomials that arise are nonclassical, and exact 
calculations for a finite matrix size are difficult.
However, the large-$N$ asymptotics can be evaluated by an
application of the steepest-descent analysis for the
Riemann-Hilbert problem.

5 Eigenvalue Statistics 

One theme that has been central to random-matrix
theory has been to seek statistical measures that,
unlike the full joint eigenvalue distribution, have a
well-defined limit as $N \to \infty$. In order to compare
different ensembles, and different parts of the spectra
for a given ensemble, it is natural to rescale (or
unfold) the eigenvalues to have unit-mean
nearest-neighbor spacing. If we denote the mean
eigenvalue density by $\rho(\lambda)$ (e.g., in the case of
the Gaussian ensembles this would be the semicircle law
for the range $-2\sqrt{N} \le \lambda \le 2\sqrt{N}$), then
the scaled eigenvalues $\tilde{\lambda}_n(X)$ are given by

$$\tilde{\lambda}_n(X) = N \int_{-\infty}^{\lambda_n(X)} \rho(\lambda)\, d\lambda.$$

One can then seek to calculate the probability distri- 
bution of the spacings between adjacent scaled eigen- 
values, correlations between pairs (or more generally 
n-tuples) of eigenvalues, and more exotic measures of 
the eigenvalue distribution. 
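For the Gaussian ensembles the unfolding map can be written in closed form, because the semicircle law has an explicit cumulative distribution function. The sketch below is one possible implementation, with hypothetical helper names; it assumes the normalization in which the spectrum fills $[-2\sqrt{N}, 2\sqrt{N}]$ and restricts attention to the middle half of the spectrum to stay away from the edges.

```python
import numpy as np


def semicircle_cdf(z):
    """Cumulative distribution function of the semicircle law on [-2, 2]."""
    z = np.clip(np.asarray(z, dtype=float), -2.0, 2.0)
    return 0.5 + z * np.sqrt(4.0 - z**2) / (4.0 * np.pi) + np.arcsin(z / 2.0) / np.pi


def unfold(eigenvalues, n):
    """Map raw eigenvalues to scaled eigenvalues with unit mean spacing."""
    return n * semicircle_cdf(np.sort(eigenvalues) / np.sqrt(n))


rng = np.random.default_rng(3)
n = 400
a = rng.standard_normal((n, n))
x = (a + a.T) / np.sqrt(2.0)  # one GOE matrix

unfolded = unfold(np.linalg.eigvalsh(x), n)
bulk_spacings = np.diff(unfolded[n // 4 : 3 * n // 4])
print(f"mean bulk spacing after unfolding: {bulk_spacings.mean():.3f}")
```

After unfolding, the mean spacing is close to one regardless of where in the bulk the eigenvalues sit, which is what makes different ensembles and different parts of a spectrum comparable.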

For the invariant ensembles, one can compute the 
eigenvalue statistics from the joint eigenvalue distribu- 
tion function using the method of orthogonal polyno- 
mials. For example, the correlation function for pairs of 
scaled eigenvalues in the bulk of the spectrum (i.e., far 
from edges, such as ±2 in the case of the GOE and GUE) 
may be computed by integrating over all but two of the 
variables. The result encapsulates the eigenvalue repul- 
sion discussed above, i.e., that the probability of find- 
ing pairs of eigenvalues close together vanishes with 
their separation, in a way determined by the symmetry 
of the ensemble. Higher correlation functions,
involving n-tuples of scaled eigenvalues, may be computed
in the same way.

Significantly, all correlation functions may be ex- 
pressed in determinantal form. From the correlation 
functions one can then deduce, via a combinatorial
calculation, the probability that nearest-neighbor
eigenvalues in the bulk of the spectrum have a given
spacing. The result may be expressed either as a
Fredholm determinant or as a solution of a Painlevé
equation [III.24]. The probability of finding
nearest-neighbor scaled eigenvalues a distance $s$ apart
vanishes like $s^{\beta}$, where $\beta = 1$ for orthogonal
invariant ensembles (e.g., the GOE) and $\beta = 2$ for
unitary invariant ensembles (e.g., the GUE). This is in
contrast to the spacings between uncorrelated random
numbers (i.e., generated by a Poisson process and
unfolded like the eigenvalues), for which the
corresponding probability is $e^{-s}$, and so increases as
$s \to 0$. The GOE and GUE spacing distributions are
plotted in figures 4 and 5.
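The different degrees of repulsion can be compared numerically by counting how often a small spacing occurs. The sketch below is a rough experiment rather than a precise computation: the matrix size, number of trials, and threshold `s0` are arbitrary, the helper names are hypothetical, and unfolding uses the semicircle law restricted to the middle half of each spectrum.

```python
import numpy as np


def semicircle_cdf(z):
    z = np.clip(np.asarray(z, dtype=float), -2.0, 2.0)
    return 0.5 + z * np.sqrt(4.0 - z**2) / (4.0 * np.pi) + np.arcsin(z / 2.0) / np.pi


def bulk_spacings(x):
    """Unfolded nearest-neighbor spacings from the middle half of the spectrum."""
    n = x.shape[0]
    unfolded = n * semicircle_cdf(np.linalg.eigvalsh(x) / np.sqrt(n))
    return np.diff(unfolded[n // 4 : 3 * n // 4])


def goe(n, rng):
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2.0)


def gue(n, rng):
    a = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    return (a + a.conj().T) / np.sqrt(2.0)


rng = np.random.default_rng(4)
n, trials, s0 = 200, 30, 0.2
goe_s = np.concatenate([bulk_spacings(goe(n, rng)) for _ in range(trials)])
gue_s = np.concatenate([bulk_spacings(gue(n, rng)) for _ in range(trials)])
# Spacings of an unfolded Poisson process are exponential with unit mean.
poisson_s = rng.exponential(1.0, size=goe_s.size)

small = {name: float(np.mean(s < s0))
         for name, s in (("poisson", poisson_s), ("goe", goe_s), ("gue", gue_s))}
print(small)
```

Small spacings should be common for the uncorrelated points, rare for the GOE, and rarer still for the GUE, reflecting the ordering $\beta = 0, 1, 2$.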

From the correlation functions one can also char- 
acterize spectral fluctuations over longer ranges. For 
example, in a given interval of length L one expects 
to find, on average, L scaled (i.e., unfolded) eigen- 
values. For each matrix the actual number will fluctuate 
around L. For random matrices the variance of these
fluctuations is proportional to $\log L$ as $L \to \infty$, while
for uncorrelated random numbers the variance is L.
The much smaller variance in the random-matrix case 
demonstrates a distinctive rigidity in the spectrum. 
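Spectral rigidity is also easy to observe in a small experiment. The following sketch counts unfolded GUE eigenvalues in a window of length $L$ at the center of the spectrum and compares the variance of that count with a Poisson sample of the same mean. All parameters (matrix size, number of trials, $L = 10$) are illustrative assumptions.

```python
import numpy as np


def semicircle_cdf(z):
    z = np.clip(np.asarray(z, dtype=float), -2.0, 2.0)
    return 0.5 + z * np.sqrt(4.0 - z**2) / (4.0 * np.pi) + np.arcsin(z / 2.0) / np.pi


def unfolded_gue_spectrum(n, rng):
    a = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    x = (a + a.conj().T) / np.sqrt(2.0)
    return n * semicircle_cdf(np.linalg.eigvalsh(x) / np.sqrt(n))


rng = np.random.default_rng(5)
n, trials, length = 200, 200, 10.0
lo, hi = n / 2 - length / 2, n / 2 + length / 2  # window centered in the bulk

counts = np.array([
    np.sum((u > lo) & (u < hi))
    for u in (unfolded_gue_spectrum(n, rng) for _ in range(trials))
])
rmt_variance = float(counts.var())
poisson_variance = float(rng.poisson(length, size=trials).var())
print(f"mean count: {counts.mean():.2f}")
print(f"variance (GUE): {rmt_variance:.2f}, variance (Poisson): {poisson_variance:.2f}")
```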

In the case of the Gaussian ensembles, the spectral 
statistics can be computed explicitly for any size of 
matrix. The formulas simplify considerably in the large- 
matrix limit. The expressions that emerge in that limit 
are universal for invariant ensembles of the same sym- 
metry type; that is, they do not depend on the specific 
probability weighting (unlike for the mean density, the 
dependence on which is counterbalanced by unfolding 
the spectrum). 

This universality extends beyond the invariant en- 
sembles to Wigner random matrices, which, in the 
large-matrix limit, have the same spectral statistics as 
the invariant ensembles, independently of the specific 
choice of distribution of the matrix elements. In this 
case the method of orthogonal polynomials is not avail- 
able, and to establish the result one proceeds indirectly, 
by showing that the spectral statistics of a Wigner 
matrix are sufficiently close to those of some GOE/GUE 
matrix. 

Proving universality for the invariant and Wigner 
ensembles has been one of the most significant devel- 
opments of the past two decades. Another important 
development is the determination of the statistics of 
eigenvalues at a spectral edge. For example, one can 
determine the distribution of the largest eigenvalue of 
a GOE/GUE matrix. Typically, this will lie close to the 
upper limit of the range in which the semicircle law 
applies. The issue then is to establish the scale and 
nature of the fluctuations around this point. This
problem was solved by Tracy and Widom, who showed that
the distribution in question is also given in terms of the
solution of a Painlevé equation.
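The scale of the edge fluctuations can be explored numerically. The sketch below assumes the GUE normalization with unit-variance off-diagonal entries, for which the largest eigenvalue sits near $2\sqrt{N}$ and fluctuates on the scale $N^{-1/6}$; it samples the largest eigenvalue and rescales accordingly. It does not evaluate the Tracy-Widom distribution itself, and the sample sizes are arbitrary.

```python
import numpy as np


def largest_gue_eigenvalue(n, rng):
    a = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    x = (a + a.conj().T) / np.sqrt(2.0)
    return np.linalg.eigvalsh(x)[-1]  # eigvalsh returns ascending order


rng = np.random.default_rng(6)
n, trials = 100, 300
largest = np.array([largest_gue_eigenvalue(n, rng) for _ in range(trials)])

# Center at the edge of the semicircle and magnify by N^(1/6).
scaled = (largest - 2.0 * np.sqrt(n)) * n ** (1.0 / 6.0)
print(f"mean of scaled largest eigenvalue: {scaled.mean():.2f}")
print(f"std of scaled largest eigenvalue:  {scaled.std():.2f}")
```

After this rescaling the sample mean and standard deviation are of order one and change little with $N$, which is the sense in which a limiting edge distribution exists.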

6 Characteristic Polynomials 

An alternative way of representing spectral statistics is
via the value distribution of the characteristic
polynomials of random matrices. For example, if $X$
denotes an $N \times N$ unitary matrix, its characteristic
polynomial may be denoted $p(s; X) = \det(Is - X)$. The
eigenvalues of $X$ are the zeros of this polynomial and
so, by unitarity, they lie on the unit circle in the
complex $s$-plane. On the unit circle one can determine
the moments of, for example, $|p(s; X)|$, with respect to
an average over $X$ in the CUE, by representing $p(s; X)$
as a product over the eigenvalues and then integrating
over these using the joint eigenvalue distribution.
Remarkably, the multiple integrals in question can be
evaluated exactly by relating them to an integral
computed by Selberg. From these moments one can prove
that the values of the real and imaginary parts of
$\log p(s; X)$, when divided by
$(\tfrac{1}{2} \log N)^{1/2}$, are independent and normally
distributed (with mean 0 and variance 1) in the limit as
$N \to \infty$. Similar results hold for other ensembles of
random matrices. The analysis of the value distribution
of characteristic polynomials has been central to
applications of random-matrix theory to number theory.
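The value distribution of $\log |p(s; X)|$ over the CUE can be sampled directly. The sketch below draws Haar-distributed unitary matrices using a QR decomposition of a complex Ginibre matrix with a phase correction (a standard construction, used here as an assumption rather than anything specified in the text) and evaluates $\log |p(1; X)|$ from the eigenvalues. Because the normalization grows only like $\log N$, agreement with the limiting Gaussian at modest $N$ should be expected to be rough.

```python
import numpy as np


def sample_cue(n, rng):
    """Haar-distributed unitary matrix via QR with phase correction."""
    z = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / np.sqrt(2.0)
    q, r = np.linalg.qr(z)
    d = np.diagonal(r)
    # Multiply column j of q by the phase of r_jj to remove the QR sign ambiguity.
    return q * (d / np.abs(d))


def log_abs_char_poly(u, s=1.0):
    """log |det(I s - U)| computed as a sum over eigenvalues."""
    return float(np.sum(np.log(np.abs(s - np.linalg.eigvals(u)))))


rng = np.random.default_rng(7)
n, trials = 30, 300
values = np.array([log_abs_char_poly(sample_cue(n, rng)) for _ in range(trials)])

print(f"sample mean:     {values.mean():.3f}")
print(f"sample variance: {values.var():.3f}")
print(f"(1/2) log N:     {0.5 * np.log(n):.3f}")
```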

7 Ergodicity 

The primary strategy in random-matrix calculations is 
to fix a point in the (scaled/unfolded) spectrum and 
average with respect to an ensemble. That is, one cal- 
culates average properties for a class of matrices. Cru- 
cially, however, in many cases of interest it can be 
proved that this averaging procedure is ergodic in the 
limit of large matrix size. That is, for a typical large 
matrix, the difference between the fluctuation statis- 
tics obtained by averaging over its spectrum and those 
calculated from the ensemble average vanishes in the 
limit as $N \to \infty$. To be more specific, if one considers
sequences of matrices of increasing size, the probabil- 
ity of encountering a sequence for which spectral aver- 
ages do not converge to the ensemble average vanishes 
as $N \to \infty$. In this sense, the ensemble averages one
computes describe the properties of typical individual 
large matrices. 

8 Connections 

There are many significant connections between ran- 
dom-matrix theory and other areas of applied mathe- 
matics. I briefly note the following examples. 

First, the fact that some ensembles are exactly solv- 
able using orthogonal polynomials, and that the results 
can be expressed in determinantal form and are related 
to Fredholm theory and Painlevé transcendents, is
indicative of deep underlying connections with the 
theory of integrable systems. Much of the theory for 
this has been developed in the last twenty years. 
The powerful techniques associated with integrable 
systems, such as the inverse scattering method and 
Riemann-Hilbert methods, have played a central role 
in the modern development of the subject. 
Second, writing the Vandermonde factor in the joint
eigenvalue distribution as

$$\prod_{1 \le i < j \le N} |x_j - x_i|^{\beta} = \exp\Bigl(\beta \sum_{1 \le i < j \le N} \log |x_j - x_i|\Bigr),$$

one sees that it takes the form of a Boltzmann weight
associated with a one-dimensional gas of particles with
positions $x_i$, interacting via the Coulomb potential
$\sum_{1 \le i < j \le N} \log |x_j - x_i|$, at a temperature
proportional to $\beta^{-1}$. This Coulomb gas analogy
allows one to calculate many spectral properties using
methods from equilibrium statistical mechanics. The
connection is reinforced by an important idea due to
Dyson: one can interpret the $x_j$ as positions of
particles undergoing Brownian motion. They therefore
satisfy stochastic partial differential equations, the
precise forms of
which (in particular, the conditioning on the solutions) 
depend on the ensemble in question. This picture has 
been further refined by relating the motion of the eigen- 
values to nonintersecting random walks, allowing many 
detailed properties of the spectrum to be determined 
using techniques from probability theory. 

Other important connections (to free probability
theory, between products of random matrices and
Anderson localization of waves in random media, to
random growth models, to enumerative geometry and
topology, and to random permutations, for example)
lie outside the scope of this article.

9 Applications 

As mentioned in the introduction, random-matrix the- 
ory has applications in virtually every area where matri- 
ces play a role, and in many others too. I outline below 
a few representative examples. 

9.1 Condition Numbers

Condition numbers [IV.10 §1] are of fundamental
significance in characterizing the stability of numerical
computations involving linear algebra. The question of
determining the condition number of a random matrix
was raised by von Neumann and Goldstine, and refined
by Smale, in the context of characterizing the typical
size of the condition number. For example, consider an
$N \times N$ matrix $X$ with elements from a standard
normal distribution. The 2-norm condition number
$\kappa_X$ is the square root of the ratio of the largest
eigenvalue of the (Wishart) matrix $X^{T}X$ to its smallest
eigenvalue. In this context the question was resolved by
Edelman in 1988, who determined the distribution of the
largest and smallest eigenvalues of $X^{T}X$ from the
joint eigenvalue distribution for the Wishart ensemble.
The result is that the expected value of $\log \kappa_X$
increases like $\log N$ when $N \to \infty$ and that the
distribution of $\kappa_X / N$ has a limit when
$N \to \infty$ that can be written explicitly. The
analysis extends to rectangular matrices.
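The growth of the typical condition number with matrix size can be observed with a short experiment. This sketch is illustrative: the sizes and number of trials are arbitrary, the helper name is hypothetical, and `numpy.linalg.cond` is used with its default 2-norm. It also checks, on one sample, that the condition number equals the square root of the ratio of the extreme eigenvalues of $X^{T}X$.

```python
import numpy as np


def log_condition_numbers(n, trials, rng):
    """Log of the 2-norm condition number for `trials` Gaussian n x n matrices."""
    return np.array([
        np.log(np.linalg.cond(rng.standard_normal((n, n))))
        for _ in range(trials)
    ])


rng = np.random.default_rng(8)
means = {n: float(log_condition_numbers(n, 300, rng).mean()) for n in (25, 50, 100)}
# If E[log kappa] grows like log N, then mean - log N should be roughly constant.
offsets = {n: m - np.log(n) for n, m in means.items()}
for n in means:
    print(f"N={n:4d}  mean log kappa={means[n]:.2f}  minus log N={offsets[n]:.2f}")

# Wishart characterization on a single sample.
x = rng.standard_normal((20, 20))
wishart_eigs = np.linalg.eigvalsh(x.T @ x)
kappa_from_wishart = float(np.sqrt(wishart_eigs[-1] / wishart_eigs[0]))
kappa_direct = float(np.linalg.cond(x))
```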

9.2 Analysis of Large Data Sets 

One of the canonical problems of the analysis of 
time series is that of determining correlations from a 
finite data set (see, for example, portfolio theory 
[V.10 §2.2]). Consider the situation of N fluctuating 
quantities sampled at T points in time. Let us denote 
these quantities by $q_n^{t}$, with $1 \le n \le N$ and
$1 \le t \le T$. For example, these might be N stocks
sampled on each of T days, or N physiological variables
measured at T separate times. A key goal is to
approximate any correlations underlying the dynamics
from the finite sample of data available. Defining
$X_{tn} = q_n^{t} / \sqrt{T}$, so that

$$E_{ij} = \frac{1}{T} \sum_{t=1}^{T} q_i^{t} q_j^{t} = (X^{T}X)_{ij},$$

the question is how well the empirical correlation
matrix $E_{ij}$ represents any genuine underlying
correlations. One might anticipate that the empirical
correlations will be representative when N is fixed and T
grows. However, in many applications, N may not be
small compared with T. This is particularly the case in 
financial data, in data from social science experiments, 
and in biological data. In this case it is natural to use 
random Wishart matrices to establish the typical differ- 
ence between the limit r = N/T -> 0 and when r » 1. 
The Marcenko-Pastur law is a key tool as it describes 
the density of eigenvalues as a function of r. An impor- 
tant feature of this law is that, like the semicircle law, 
it is characterized by having edges. Thus the
(Tracy-Widom) distribution of the largest eigenvalue near to
the (soft) edge at the upper end of the spectrum and 
the corresponding distribution for the smallest eigen- 
value (near to the hard edge at 0, associated with the 
nonnegativity of the eigenvalues) play central roles. 
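The effect of a finite ratio $r = N/T$ on an empirical correlation matrix can be illustrated with pure noise. The sketch below assumes independent standard normal data with no genuine correlations, so the true correlation matrix is the identity. For such data the Marcenko-Pastur density is supported on $[(1 - \sqrt{r})^{2}, (1 + \sqrt{r})^{2}]$ (the standard form for unit-variance entries with $r \le 1$). The variable names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
n_series, n_times = 200, 800          # N quantities sampled at T times
r = n_series / n_times

q = rng.standard_normal((n_times, n_series))  # uncorrelated data (assumption)
x = q / np.sqrt(n_times)
e = x.T @ x                                   # empirical correlation matrix
eigs = np.linalg.eigvalsh(e)

lower_edge = (1.0 - np.sqrt(r)) ** 2
upper_edge = (1.0 + np.sqrt(r)) ** 2
print(f"r = {r:.2f}")
print(f"empirical eigenvalue range: [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marcenko-Pastur edges:      [{lower_edge:.3f}, {upper_edge:.3f}]")
```

Although every true eigenvalue equals one, the empirical eigenvalues spread across the whole Marcenko-Pastur interval, so an empirical eigenvalue inside that interval is not by itself evidence of a genuine correlation.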

Applications to financial data often require one to 
consider random matrices where the distribution of the 
matrix elements is highly non-Gaussian, e.g., those pos- 
sessing heavy tails, and where the matrix elements may 
be correlated. These remain major challenges.

9.3 Quantum Chaos 

Wigner’s contributions in the 1950s were motivated 
by the need to understand the quantum spectra of 
heavy atomic nuclei. This is a many-body problem in
which the nucleons interact strongly with each other. 
The quantum Hamiltonian (i.e., the matrix whose eigen- 
values are the energy levels) may thus be expected to be 
a complex object. Wigner’s insight was to realize that 
statistical properties of the spectrum could be modeled 
by those of random matrices and that these could be 
determined by averaging over ensembles of matrices. 
This resembles the methodology of classical statistical 
mechanics, where one models statistical properties of 
solutions of the equations of motion for many interact- 
ing particles by averaging over all possible states or tra- 
jectories, subject to conservation of energy, etc., with 
each state/trajectory given its Boltzmann weight. 

The same philosophy also underpins the applications 
of random-matrix theory to condensed matter physics. 
There, one often encounters the problem of an electron 
moving in a disordered medium, that is, in a medium 
where the forces behave like a random function of posi- 
tion. In this case, too, it is natural to model the quantum 
Hamiltonian by a random matrix. 

One of the key ideas of chaos theory is that one does 
not need many interacting particles or random forces 
to generate complex dynamics. Instead, this can be (and 
typically is) found in systems with a small number of 
degrees of freedom where the dynamics is not max- 
imally constrained by the symmetries but where the 
forces acting may, nevertheless, be simple. In such sys- 
tems, typical long trajectories may also be ergodic, in 
that their statistical properties coincide with averages 
of ensembles of appropriately weighted trajectories 
and consequently with phase-space averages. Examples 
include the three-body problem [VI.16] and billiards
(i.e., a freely moving particle making specular reflec- 
tions at boundaries) in domains that are neither rectan- 
gular nor elliptical (more generally, in which the dynam- 
ics is not integrable). In the 1970s and 1980s it was sug- 
gested that the quantum eigenvalue statistics in such 
systems should, on the scale of the mean level spacing, 
be modeled by random-matrix theory. Specifically, this 
should hold in quantum systems whose classical limit 
is chaotic, in the semiclassical limit as Planck’s constant
$\hbar \to 0$, or, physically, in the limit of vanishing de Broglie
wavelength. For example, in the case of billiard sys- 
tems it should hold for the high-lying eigenvalues of 
the Laplacian with appropriate boundary conditions. 

The idea that random-matrix theory should, in the 
semiclassical limit, describe quantum spectral statis- 
tics in classically chaotic systems has its origins in work 
of Berry and Tabor in 1977 (whose primary focus was
on classically integrable systems, for which the energy
levels are generically believed to have Poisson statistics, 
but who also speculated on chaotic systems) and was 
developed into a precise conjecture by Bohigas, Gian- 
noni, and Schmit in 1984. It has been verified numeri- 
cally in a very wide range of systems and is believed 
to hold generically (it can be subverted by the pres- 
ence of symmetries). A related conjecture, due to Berry 
in 1977, is that the eigenfunctions behave like linear 
superpositions of randomly directed plane waves in 
the semiclassical limit and so exhibit the statistical fea- 
tures of Gaussian random functions. Taken together, 
these conjectures underpin the statistical modeling of 
quantum chaotic systems. However, neither has been 
proved mathematically in any system, and achieving 
this remains one of the outstanding open problems in 
the field. 
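
The contrast between Poisson and random-matrix level statistics can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it uses the Gaussian orthogonal ensemble as the random-matrix model, and it swaps in the ratio of consecutive level spacings as the diagnostic (a statistic not discussed in the text, chosen here because it needs no rescaling by the mean level spacing). The matrix size, sample count, and function names are arbitrary choices.

```python
import numpy as np

def spacing_ratio_mean(levels):
    """Mean of r_n = min(s_n, s_{n+1}) / max(s_n, s_{n+1}) over consecutive spacings s_n."""
    s = np.diff(np.sort(levels))
    r = np.minimum(s[:-1], s[1:]) / np.maximum(s[:-1], s[1:])
    return r.mean()

def goe_levels(n, rng):
    """Eigenvalues of a real symmetric matrix with Gaussian entries (GOE up to scaling)."""
    a = rng.standard_normal((n, n))
    h = (a + a.T) / 2.0
    return np.linalg.eigvalsh(h)

rng = np.random.default_rng(0)
n, samples = 400, 20

# Random-matrix levels repel each other, so small spacing ratios are rare.
r_goe = np.mean([spacing_ratio_mean(goe_levels(n, rng)) for _ in range(samples)])

# Independent uniform points model Poisson statistics (no level repulsion).
r_poisson = np.mean([spacing_ratio_mean(rng.uniform(0.0, 1.0, n)) for _ in range(samples)])

print(f"GOE mean ratio:     {r_goe:.3f}")
print(f"Poisson mean ratio: {r_poisson:.3f}")
```

For Poisson levels the mean ratio is 2 log 2 - 1, about 0.39; for the Gaussian orthogonal ensemble it is about 0.53, reflecting level repulsion.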

While we may not have a proof of the random- 
matrix conjecture for quantum chaotic systems, it is 
supported by a highly sophisticated and subtle heu- 
ristic semiclassical analysis based on a relationship 
between quantum spectra and classical periodic orbits. 
This relationship emerges from a saddle-point evalu- 
ation, valid in the semiclassical limit, of the Feynman 
path integral representation of the energy-dependent 
Green function in terms of a sum over all possible paths 
weighted by $e^{iS(\mathrm{path})/\hbar}$, where S denotes the action. The
saddle-point condition picks out those paths for which 
the action is stationary, that is, the classical trajecto- 
ries. As first shown by Gutzwiller, a further applica- 
tion of the saddle-point method in evaluating the trace 
of the Green function, which determines the spectrum, 
selects out the periodic (i.e., self-retracing) orbits. Sta- 
tistical properties of the quantum spectrum are thus 
linked semiclassically to statistical properties of the 
classical periodic orbits and hence to the chaotic nature 
of the classical dynamics. It was first demonstrated 
in this way by Hannay and Ozorio de Almeida, and 
later in more generality by Berry, that ergodicity of the 
classical dynamics governs some key features of the 
pair correlation function of the quantum spectrum and 
that these coincide with the corresponding features of 
the random-matrix results. This approach has subse- 
quently been refined by a number of researchers, par- 
ticularly Bogomolny and Keating, who extended it to 
all the main key features, and Sieber and Richter, who 
introduced the seminal idea of including pairs of orbits 
that experience a close encounter, which turn out to 
make a significant contribution.

To gain complete agreement with random-matrix
expressions using semiclassical periodic orbit formu- 
las requires either the use of sophisticated resumma- 
tion techniques introduced by Berry and Keating, which 
relate long orbits to shorter ones, or a knowledge of 
the correlations between the actions of distinct pairs of 
long orbits. These correlations are not known a priori, 
but they can be shown to be equivalent to the random- 
matrix conjecture. In this sense, as shown by Argaman 
et al., the random-matrix conjecture for quantum spec- 
tra leads, remarkably, to an important prediction relat- 
ing to the classical dynamics in chaotic systems. It is 
a major open problem to verify this prediction using 
classical mechanics. 

Another key theme in quantum chaos has been the 
analysis of open chaotic systems; that is, scattering 
systems in which the interactions generate exponen- 
tial instability in the dynamics. If the system is weakly 
open, in the sense that trajectories are trapped for a 
long time before escaping, quantum statistical proper- 
ties (e.g., of the scattering resonances) can be modeled 
by random matrices that are close to being Hermitian. 
In chaotic systems that are strongly open, the semi- 
classical analysis is much less developed and remains 
a major challenge. The semiclassical theory of quan- 
tum chaotic scattering was developed by Smilansky and 
coworkers. 

It is worth emphasizing that the ideas concerning 
quantum chaos outlined here apply in general to all 
complex wave problems and so have applications to 
the statistical analysis of optical, acoustic, and vibra- 
tional systems, as well as to essentially quantum phe- 
nomena (e.g., to lasers, superconductors, the motion of 
electrons in atomic, molecular, and solid-state systems, 
nuclei, etc.). 

Further Reading 

Among the suggestions for further reading below, 
Mehta’s book is the classic text on random-matrix 
theory; the book edited by Akemann, Baik, and Di Fran- 
cesco contains excellent review articles covering much 
of the material included in this article (and more) and is 
an ideal first point of contact with the subject; Porter’s 
book contains reprints of many of the important early 
physics papers in the subject; Haake’s book is a useful 
introduction to quantum chaos; and the book edited 
by Wright and Weaver contains review articles aimed 
at applications to linear acoustics and vibration.

IV.25 Kinetic Theory

Cédric Villani and Clément Mouhot


1 The Birth of Kinetic Theory 

Modern physics can be traced back to Newton and 
the advent of differential equations to substantiate the 
laws of classical mechanics. In the following centuries 
this was followed by more comprehensive theories of 
physical phenomena in the surrounding world: elec- 
tric and magnetic forces were captured by the theory 
of electromagnetism (Ampère, Faraday, Maxwell); large
velocities were handled by the theory of relativity
(Lorentz, Poincaré, Minkowski, Einstein); small-scale
particle physics could be taken care of by quantum
mechanics (Planck, Einstein, Bohr, Heisenberg, Born,
Jordan, Pauli, Fermi, Schrödinger, Dirac, de Broglie,
Bose); and so on.

However, all of these theories are classically devised
to study one physical system (a planet, a ship, a motor, 
a battery, an electron, a spaceship, etc.) or a small num- 
ber of systems (planets in the solar system, electrons 
in a molecule, etc.). In many situations, though, one 
needs to deal with an assembly made up of elements so
numerous that their individual tracking is neither useful
nor possible: galaxies made up of hundreds of billions
of stars, fluids made of more than $10^{20}$ molecules,
crowds made of thousands of individuals, etc. Tak- 
ing such large numbers into account leads to new 
effective laws of physics, requiring different models 
and concepts. This passage from microscopic rules to 
macroscopic laws is the founding principle of statisti- 
cal physics. All branches of physics (classical, quantum, 
relativistic, etc.) can be studied from the point of view 
of statistical physics, in both stationary and dynamical 
perspectives. Classical mechanics was, naturally, one of 
the first laboratories for statistical physics, and thus in 
the nineteenth century kinetic theory was born. 

Before we describe the key concepts of kinetic theory, 
let us recall the basic notion of phase space, which 
should be thought of as the space of all possible states 
occurring in a mathematical model of some physical 
system. If one studies a deterministic system obeying 
an evolution equation, then the phase space is, in prin- 
ciple, the “smallest” space in which the equation deter- 
mines a unique “well-behaved” solution. For instance, 
the evolution of a classical point particle is governed by 
a second-order differential equation (Newton’s law); so 
the position of the particle is not sufficient to predict 
its future positions, but the pair (position, velocity) is 
sufficient to predict future positions and velocities. The 
phase space of a classical particle is therefore made up 
of positions and velocities. On the other hand, if the 
physical system is, for instance, a rigid body with a cer- 
tain shape, then the phase space should also include 
extra parameters related to the orientation of the body. 

The main idea in kinetic theory is to replace a huge 
number of objects, whose physical states are com- 
pletely described by points in a certain phase space and 
whose properties are otherwise identical, by a statistical 
distribution over that phase space. In particular, a large 
crowd of classical point particles will be described by 
a statistical distribution on the space of positions and 
velocities. 

In retrospect, the conceptual leap from Newtonian 
mechanics to kinetic theory was quite significant: the 
new formalism involved a set of invisible variables, 
namely, the velocities of particles, that are inaccessible
to observation. It was even counterintuitive; for instance,
kinetic theory replaces the model of a fluid at
rest (zero velocity) by a huge number of particles mov- 
ing in all directions with great speed. This increase in 
complexity was not easy to justify, since at the time 
there was no way to measure any of these velocities (it
is still barely possible today). This fundamental role of
velocities accounts for the name kinetic theory. 

With kinetic theory came the distinction between 
three scales: the macroscopic scale of phenomena that 
are accessible to observation, the microscopic scale of 
molecules and infinitesimal constituents, and an inter- 
mediate scale that is loosely defined and is often called 
mesoscopic. This is the scale of phenomena that are 
not accessible to macroscopic observation but already 
involve a large number of particles, so that statistical 
effects are significant. 

With only a little stretch of the imagination, one can 
liken the principles of kinetic theory to those of cer- 
tain contemporary models of theoretical physics, such 
as string theory, in which a set of hypothetical hidden 
variables is also taken into account (here we put aside 
any debates about the value of, and the possibility of 
validating, string theory). 

The basic scheme of kinetic theory leaves room for 
obvious variations. If there are several species, one can 
consider several statistical distributions. (Think of air, 
which is mainly made up of a mixture of two gases; the 
two species have different properties, but within each 
species the molecules can be considered as identical.) If 
the position and velocity are not sufficient to describe 
the state of one object, one can enlarge the phase space. 
(In the case of air, one might wish to keep track of the 
orientation of a molecule of nitrogen or oxygen.) 

Kinetic theory was first discussed by Daniel Bernoulli 
in the eighteenth century. The notions of mean free
path and mean free time (the typical distance and
typical time, respectively, that a particle can travel
without hitting another particle) were studied
by various authors (Herapath, Waterston, Joule, König,
Clausius) between 1820 and 1860. At the same time,
the very important notion of cross section emerged; this 
measures the likelihood of interaction between two par- 
ticles, and it can be interpreted as an effective colli- 
sion surface. The field as we know it, though, was really 
founded by Maxwell in a celebrated paper of 1867. 
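
To make the notions of mean free path and cross section concrete, here is a small worked estimate. It is a sketch under assumptions that are not in the text: the standard hard-sphere formula, mean free path = 1/(sqrt(2) n sigma) with sigma = pi d^2, and rough illustrative values for the number density and effective molecular diameter of air at room conditions.

```python
import math

def mean_free_path(number_density, diameter):
    """Hard-sphere estimate 1 / (sqrt(2) * n * sigma), with cross section sigma = pi * d**2."""
    sigma = math.pi * diameter ** 2          # effective collision surface
    return 1.0 / (math.sqrt(2.0) * number_density * sigma)

# Illustrative, assumed values (not taken from the text):
n_air = 2.5e25    # molecules per cubic metre
d_air = 3.7e-10   # effective molecular diameter in metres

lam = mean_free_path(n_air, d_air)
print(f"mean free path ~ {lam:.2e} m")
```

With these inputs the estimate is a few hundredths of a micrometre, far larger than the molecular diameter itself, which is one way to see why a dilute-gas description is reasonable for air.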

This theory was strongly influenced by two major ear- 
lier scientific developments. The first was the rise of 
thermodynamics throughout the eighteenth and nineteenth
centuries. The laws governing exchanges of
energy and variations of heat, density, pressure, and
temperature did not seem to rely on fundamental equa- 
tions and were discovered through a slow and confus- 
ing process; it was therefore desirable to grasp some 
more fundamental laws that would underlie thermo- 
dynamics. The second influence was the development 
of statistics, especially in the field of social sciences, 
with the empirical discovery by Galton and Quetelet 
of the omnipresence of simple statistical laws derived 
from probability theory, one prime example being the
recognition that fluctuations in the size of individuals
were essentially governed by Gaussian distributions.
Both of these influences guided Bernoulli (whose father 
was one of the founders of probability theory), but 
by the time of Maxwell they had become much more 
mature. 

In those days, the atomistic nature of matter was still 
largely hypothetical, and kinetic theory could be con- 
sidered a thought experiment. Maxwell discussed the 
problem of the derivation of macroscopic laws from 
microscopic physics. He worked in a dilute regime to 
neglect collisions involving more than two bodies, and 
he assumed a clear separation between the inhomo- 
geneity scale (a mesoscopic concept) and the interac- 
tion scale (at microscopic level). He then computed the 
effect of collisions on the distribution function via the 
solution of a classical scattering problem. In this way he 
came up with an evolution equation, equivalent to what 
we now call the Boltzmann equation. The unknown is a 
density function f(t,x,v), standing for the density of 
particles at time t in the phase space (x, v) (equipped 
with the reference Liouville measure dx dv); and the
equation, in modern notation, is 

$$\frac{\partial f}{\partial t} + v\cdot\nabla_x f + F(t,x)\cdot\nabla_v f = Q(f,f). \tag{1}$$

Here, the left-hand side describes the evolution of f 
under the action of the force F(t,x), while the action of 
elastic collisions is described by the nonlinear operator 
Q on the right-hand side: 

$$Q(f,f) = \int_{\mathbb{R}^3}\int_{S^2} B(v - v_*, \omega)\,\bigl(f(t,x,v')\,f(t,x,v'_*) - f(t,x,v)\,f(t,x,v_*)\bigr)\,dv_*\,d\omega. \tag{2}$$

Note that this operator is localized in t and x, it is quadratic,
and it has the structure of a tensor product with
respect to $f(t,x,\cdot)$. The velocities $v'$ and $v'_*$ should be
thought of as the velocities of a pair of particles before
collision, while $v$ and $v_*$ are the velocities after that
collision; the formulas are

$$v' = v - \langle v - v_*, \omega\rangle\,\omega, \qquad v'_* = v_* + \langle v - v_*, \omega\rangle\,\omega.$$

When one computes $(v, v_*)$ from $(v', v'_*)$ (or does the
reverse), conservation laws are not enough to yield the
result, with only four scalar conservation laws for six
degrees of freedom. The unit vector $\omega \in S^2$ removes
this ambiguity; in the case of colliding hard spheres,
it can be thought of as the direction of the line joining
the centers of the two particles. The kernel $B(v - v_*, \omega)$
describes the relative frequency of vectors $\omega$, depending
on the relative impact velocity $v - v_*$; it depends
on only the modulus $|v - v_*|$ and the deflection angle
$\theta$ between $v - v_*$ and $v' - v'_*$. Maxwell computed it
for hard spheres ($B \sim |v - v_*| \sin\theta$) and for inverse
power forces. In the latter case, the kernel factorizes
as the product of $|v - v_*|^{\gamma}$ with a function $b(\theta)$;
Maxwell showed that, if the force is repulsive, proportional
to $r^{-s}$ ($r$ being the interparticle distance), then
$\gamma = (s - 5)/(s - 1)$ and $b(\theta) \sim \theta^{-(1+\nu)}$ as $\theta \to 0$, where
$\nu = 2/(s - 1)$. In particular, the kernel is usually nonintegrable
as a function of the angular variable: this is a
general feature of long-range interactions and is nowadays
called the “noncutoff property.” Maxwell further
noticed that the inverse power $s = 5$ leads to simplified
formulas, which could lend themselves to more explicit
computations.
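
The collision rule v' = v - <v - v_*, w> w, v'_* = v_* + <v - v_*, w> w (with w the unit vector on the sphere) and the exponents for inverse power forces can be checked with a few lines of code. This is a minimal sketch: the function names `collide` and `kernel_exponents` are hypothetical, and the random test velocities carry no physical meaning.

```python
import numpy as np

def collide(v, v_star, omega):
    """Apply the collision rule for a given direction omega (normalized here)."""
    omega = omega / np.linalg.norm(omega)
    a = np.dot(v - v_star, omega)
    return v - a * omega, v_star + a * omega

def kernel_exponents(s):
    """Exponents (gamma, nu) for a repulsive force proportional to r**(-s), s > 1."""
    gamma = (s - 5.0) / (s - 1.0)
    nu = 2.0 / (s - 1.0)
    return gamma, nu

rng = np.random.default_rng(1)
v, v_star, omega = (rng.standard_normal(3) for _ in range(3))
vp, vsp = collide(v, v_star, omega)

# The four scalar conservation laws: momentum (3 components) and kinetic energy.
momentum_error = np.linalg.norm((vp + vsp) - (v + v_star))
energy_error = abs((vp @ vp + vsp @ vsp) - (v @ v + v_star @ v_star))

# Applying the rule twice with the same omega returns the original pair.
v_back, v_star_back = collide(vp, vsp, omega)

print(momentum_error, energy_error)
print(kernel_exponents(5.0))   # the special inverse power s = 5 gives gamma = 0
```

For s = 5 the factor |v - v_*|^gamma drops out of the kernel, which is consistent with the remark that this case leads to simplified formulas.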

Maxwell went on to discuss possible boundary conditions.
Particles arriving at a point x in the boundary
with velocity v may be assumed to acquire a new velocity
$R_x v$, determined either by the model of specular
reflection ($R_x v = v - 2\langle v, n_x\rangle n_x$, where $n_x$ is the unit
ingoing normal vector at x) or by the model of bounce-back
reflection ($R_x v = -v$). Either way, the boundary
condition reads $f(x, R_x v) = f(x, v)$. In more sophisticated
models, particles are assumed to be absorbed by
the boundary and reemitted at a given rate, for instance
according to a Gaussian distribution whose dispersion is
dictated by the temperature of the wall:

$$f(x,v) = \rho_-(x)\,M_w(v), \qquad v\cdot n_x > 0,$$

where

$$\rho_-(x) = \int_{v\cdot n_x<0} f(x,v)\,|v\cdot n_x|\,dv, \qquad M_w(v) = \frac{e^{-|v|^2/(2T_w)}}{2\pi T_w^2}.$$
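
The two deterministic reflection rules described above, specular (v minus twice its normal component) and bounce-back (v replaced by -v), are simple enough to sketch directly. The function names and the sample wall normal below are illustrative assumptions.

```python
import numpy as np

def specular(v, n):
    """Specular reflection: reverse the component of v along the unit normal n."""
    n = n / np.linalg.norm(n)
    return v - 2.0 * np.dot(v, n) * n

def bounce_back(v):
    """Bounce-back reflection: reverse the whole velocity."""
    return -v

v = np.array([1.0, -2.0, 0.5])   # velocity of a particle heading into the wall
n = np.array([0.0, 1.0, 0.0])    # unit normal of a flat wall, pointing into the gas

v_spec = specular(v, n)
v_back = bounce_back(v)
print(v_spec)   # tangential components kept, normal component reversed
print(v_back)
```

Both maps preserve the speed, and each is its own inverse, which is why the same condition f(x, R v) = f(x, v) can be written for either model.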


Maxwell, understanding that the boundary behavior of 
a gas was a very complex matter, also considered com- 
binations of the above models; these are nowadays 
called Maxwell conditions. In order to find the stationary
solutions, that is, time-independent solutions of (2), he
identified certain particular hydrodynamic solutions, 
which make the collision contribution vanish. These are 
Gaussian distributions with a scalar covariance: 


$$f(v) = \rho\,\frac{e^{-|v-u|^2/(2T)}}{(2\pi T)^{3/2}},$$


where the parameters $\rho > 0$, $u \in \mathbb{R}^3$, and $T > 0$ can be
identified, respectively, as the density, mean velocity, 
and temperature of the fluid. These parameters can be 
fixed throughout the whole domain (providing, in this 
case, an equilibrium distribution) or they can depend 
on the position x and time t; in both cases, collisions 
will have no effect. It was remarkable that Maxwell 
could recover in this way the Gaussian distributions 
that already played a central role in probability theory; 
in the context of kinetic theory, these distributions are 
thus called Maxwellian. 
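
The identification of the three parameters of a Maxwellian with density, mean velocity, and temperature can be verified numerically by computing velocity moments. The sketch below assumes the normalization rho * exp(-|v - u|^2 / (2T)) / (2 pi T)^(3/2) in three dimensions; the parameter values and the grid are arbitrary choices made for illustration.

```python
import numpy as np

def maxwellian(v, rho, u, T):
    """Gaussian velocity distribution with density rho, mean velocity u, temperature T."""
    v = np.atleast_2d(v)
    sq = np.sum((v - u) ** 2, axis=1)
    return rho * np.exp(-sq / (2.0 * T)) / (2.0 * np.pi * T) ** 1.5

rho, u, T = 2.0, np.array([0.5, 0.0, -1.0]), 1.5

# Uniform velocity grid, wide enough that the truncated tails are negligible.
g = np.linspace(-12.0, 12.0, 81)
dv = (g[1] - g[0]) ** 3
V = np.stack(np.meshgrid(g, g, g, indexing="ij"), axis=-1).reshape(-1, 3)
f = maxwellian(V, rho, u, T)

# Zeroth, first, and (centered) second velocity moments.
density = f.sum() * dv
mean_velocity = (f[:, None] * V).sum(axis=0) * dv / density
temperature = (f * np.sum((V - mean_velocity) ** 2, axis=1)).sum() * dv / (3.0 * density)

print(density, mean_velocity, temperature)
```

The factor 3 in the temperature reflects the scalar covariance: each of the three velocity components has variance T.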

Maxwell went further and made the connection with 
classical fluid mechanics [IV.28], in which the equations
are expressed in terms of $\rho$, $u$, and $T$. He
suggested that one could go from the kinetic equa- 
tions to hydrodynamic equations in certain regimes 
and therefore make some predictions about hydro- 
dynamic behavior from kinetic theory. Let us give, 
for instance, two counterintuitive effects that Maxwell 
guessed through kinetic formalism. One is that the vis- 
cosity of a low-density fluid hardly depends on density. 
Another is the paradoxical “thermal creep effect”: a gas 
that has a temperature gradient parallel to a fixed wall 
will have a tendency to flow from cold to hot near the 
wall. 

A few years after Maxwell’s masterpiece was pub- 
lished, Boltzmann rewrote and deepened the theory, 
completing the foundations of modern physical kinetic 
theory. 


2 Boltzmann’s Entropy and 
Collisional Relaxation 

The word “entropy” was coined by Clausius to desig- 
nate a certain quantity associated with the tendency 
to relax or achieve equilibrium. The properties of 
entropy in relation to exchanges of heat and energy 
were established empirically; in particular, the formula 
for infinitesimal variation of entropy was determined: 
$dS = \delta Q/T$ (variation in entropy is proportional to
the exchanged heat divided by the temperature). In 
this vein came the well-known second law of thermo- 
dynamics, which states that entropy can never decrease 
in an isolated system. Even though there were rules
to compute the entropy of an equilibrium system, the
interpretation of that quantity remained somewhat elusive,
and the second law was considered more or less
as an axiom. 

That changed radically with Boltzmann’s contribu- 
tion to the field (1872-77). In one of the most dramatic 
events in the history of statistical physics, Boltzmann 
introduced the following breakthroughs. 

• A general mathematical definition of entropy: it 
is the logarithm of the volume of microscopic 
states that are compatible with the (observable) 
macrostate. This is the celebrated Boltzmann for- 
mula: 

$$S = k \log W, \tag{3}$$

where k is Boltzmann’s constant (notation intro- 
duced by Planck, who was the first to estimate its 
value) and W stands for the volume of microscopic 
states. Here, the volume may be computed with 
some natural measure on the phase space, which 
may be discrete or continuous, depending on the 
situation. 

• A practical formula for computing the entropy 
of a kinetic system: if f(x,v) is the distribution
function, then $S = -\iint f \log f \,dx\,dv$. This
is derived from Boltzmann’s formula (3) through 
a discretization procedure; it can also be seen 
as an infinite-dimensional analogue of Liouville’s 
volume measure. 

• A theorem showing that the entropy of a gas that 
obeys the equations discovered by Maxwell can 
never decrease. 
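
The discretization procedure linking the two entropy formulas in the list above can be sketched numerically. Under the assumption that phase space is cut into finitely many cells and that W counts the ways of distributing N labeled particles with prescribed occupation numbers (a multinomial coefficient), the quantity (1/N) log W approaches the discrete analogue of the integral of -f log f as N grows. The cell fractions below are hypothetical, and Boltzmann's constant is set to 1.

```python
import math

def log_multiplicity(counts):
    """log W for W = N! / (n_1! n_2! ... n_k!), computed with log-gamma to avoid overflow."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)

def entropy(probs):
    """Discrete analogue of the integral of -f log f."""
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.3, 0.2]   # hypothetical fractions of particles in three cells
gaps = {}
for n in (10, 1000, 100000):
    counts = [round(p * n) for p in probs]
    gaps[n] = abs(log_multiplicity(counts) / n - entropy(probs))
    print(n, gaps[n])
```

The gap shrinks roughly like (log N)/N, which is the Stirling correction; this is the sense in which the practical formula is a large-N limit of the counting formula.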


More precisely, Boltzmann’s H theorem states that 
for a rarefied gas modeled by a kinetic distribution 
f(t,x,v), governed by the Boltzmann equation with 
appropriate boundary conditions, the functional H = 
-S satisfies (i) $dH/dt \le 0$ and (ii) $dH/dt = 0$ if and
only if f(t,x,v) is a Maxwellian distribution with possibly
variable parameters $\rho$, $u$, $T$. Such a distribution
can be called hydrodynamic, since it depends on only 
hydrodynamical quantities. 

In fact, an exact formula can be given for the entropy 
production: for, say, specular reflection, 


$$\frac{dS}{dt} = \int D(f(t,x,\cdot))\,dx,$$

$$D(f) = \frac{1}{4}\iint_{\mathbb{R}^3\times\mathbb{R}^3}\int_{S^2} B(v - v_*, \omega)\,\bigl(f(v')f(v'_*) - f(v)f(v_*)\bigr)\,\log\frac{f(v')f(v'_*)}{f(v)f(v_*)}\,dv\,dv_*\,d\omega. \tag{4}$$

Establishing this expression was one of Boltzmann’s
motivations for rewriting Maxwell’s kinetic equation 
(expressed in some sort of weak formulation) in the 
modern form (1). 
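
The sign of the entropy production can be read off from the structure of (4). The following is a short sketch of the standard argument, assuming only that the kernel B is nonnegative and positive almost everywhere:

```latex
\[
  (a-b)\,\log\frac{a}{b} \;\ge\; 0 \quad\text{for all } a,b>0,
  \qquad\text{with equality if and only if } a=b .
\]
% Apply this pointwise with a = f(v')f(v'_*) and b = f(v)f(v_*):
\[
  D(f)\;\ge\;0,
  \qquad
  D(f)=0 \iff f(v')\,f(v'_*) = f(v)\,f(v_*)
  \ \text{ for almost all } v,\,v_*,\,\omega .
\]
```

The equality case says that log f is conserved in every collision, and this is the condition that singles out the Maxwellian distributions.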

The H theorem implies, in particular, that station- 
ary states have to be hydrodynamic at all times; Boltz- 
mann showed that in the absence of special symme- 
tries this forces the distribution to be spatially homo- 
geneous. This homogeneous Maxwellian is the only 
equilibrium state, and in this case the entropy coin- 
cides with Clausius's entropy. Boltzmann’s beautiful 
proof was a giant conceptual leap. First, it provided 
a general definition of entropy, covering nonequilib- 
rium situations: entropy in the Boltzmann theory can 
be considered as the typical uncertainty that remains 
in the state of a random particle taken in the sys- 
tem. Boltzmann then showed that the second law of 
thermodynamics could be considered as a logical con- 
sequence of fundamental postulates instead of simply 
being accepted as a God-given fact. He also showed 
that equilibrium thermodynamics could in principle 
follow from nonequilibrium dynamics, and he iden- 
tified entropy increase, together with a scale separa- 
tion, as the major factors behind the emergence of 
hydrodynamics. 

Still, the most dramatic consequence of Boltzmann’s 
work was the discovery of the irreversibility contained 
in (1), even though that model was derived from New- 
ton’s reversible equations of motion. This emergence of 
irreversibility in the many-particle limit would trigger 
a heated controversy involving preeminent scientists 
such as Poincaré, Loschmidt, and Zermelo; it is still considered
as the classical explanation of the irreversibility
of time at the macroscopic level of description, in spite 
of the reversibility of the full-scale evolution. 

Entropy increase in the Boltzmann equation shows 
that particle configurations, loosely speaking, always 
evolve from unlikely to likely, from exceptional to typ- 
ical. Information is therefore continuously lost (the 
gas may have started in a very interesting, exceptional 
configuration, but it soon becomes quite uninterest- 
ing). Actually, the information is gradually transferred 
from the macroscopic observable degrees of freedom 
to the microscopic invisible ones. This loss of informa- 
tion can be related to the separation of scales inherent 
in the derivation of the Boltzmann equation: at each 
encounter between particles, the parameters of the col- 
lision (say, the orientation of the colliding pair) are 
invisible because they occur on a scale much finer than 
the spatial scale, so collisions are treated as perfectly
localized and the impact parameter is treated probabilistically.
The inexorable increase in entropy can also
be attributed to the huge numbers involved in the computation
of probability: $N = 10^{20}$ is a large number,
but when one enumerates possible configurations, this
number appears in a combinatorial way, leading to
numbers such as $2^{10^{20}}$, which are so large that they defy
any human attempts to grasp them.
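
A little arithmetic gives a feeling for the size of a number like 2 raised to the power 10^20. The sketch below only counts its decimal digits; the printing rate of one billion digits per second is an arbitrary assumption introduced for illustration.

```python
import math

N = 10 ** 20

# Number of decimal digits of 2**N is floor(N * log10(2)) + 1, about 3.0e19.
digits = N * math.log10(2.0)

# Time needed just to write the digits down at an assumed 1e9 digits per second.
seconds = digits / 1e9
years = seconds / (3600 * 24 * 365.25)

print(f"about {digits:.3e} digits")
print(f"about {years:.0f} years to print them at 1e9 digits per second")
```

Even listing the digits of such a number would take on the order of a thousand years at that rate, which makes the overwhelming dominance of typical configurations plausible.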

After being imported into mathematics, entropy was 
extraordinarily successful in helping to solve problems, 
both related and unrelated to kinetic theory. It was 
rediscovered by Shannon when he was building the 
theory of communication, and it still plays a central 
role in information theory [IV.36]. It is the basis 
of Sanov’s formula in the theory of large deviations 
for empirical measures. It was a key concept behind 
Nash’s proof of the celebrated de Giorgi-Nash theorem 
of continuity of solutions of nonsmooth divergence 
parabolic equations. It lies at the core of the theory 
of logarithmic Sobolev inequalities, introduced by Nel- 
son and Gross as an infinite-dimensional replacement 
for Sobolev inequalities. It was one of the key technical 
tools in the theory of probabilistic hydrodynamic limits 
that was initiated by Varadhan and his colleagues and 
students; the role of entropy in hydrodynamical lim- 
its was further reinforced with Yau's relative entropy 
method. It was an important tool in the DiPerna- 
Lions theory of weak solutions of the Boltzmann equa- 
tion. Much further away from physics, entropy was 
adapted by Voiculescu in the context of free probabil- 
ity to help solve elusive problems from the theory of 
von Neumann algebras. 

One of the reasons for the ubiquity of entropy is its 
extensivity property, 

$$S(f \otimes g) = S(f) + S(g),$$

which is natural from the physical point of view 
(information associated with two independent vari- 
ables should add up) and implies an additive depend- 
ence on dimension, eventually leading to dimension- 
independent inequalities. 
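
The additivity of entropy over independent variables is easy to check in a discrete setting, where the tensor product of two distributions is their outer product. This is a minimal sketch with made-up probability vectors, using the discrete analogue of the integral of -f log f.

```python
import numpy as np

def entropy(p):
    """Discrete entropy -sum p log p, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

f = np.array([0.2, 0.5, 0.3])   # hypothetical distribution of a first variable
g = np.array([0.6, 0.4])        # hypothetical distribution of an independent second variable

joint = np.outer(f, g)          # tensor product: joint law of the independent pair

print(entropy(joint), entropy(f) + entropy(g))
```

The identity follows from log(f_i g_j) = log f_i + log g_j together with the normalization of f and g.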

Nowadays, entropy is still actively being used to 
develop new tools and techniques. A few recent works 
in which it has played a crucial role are the infinite-dimensional
interpolation inequalities of Otto and Villani,
the amazing solution of the Poincaré conjecture
by Perelman, the theory of synthetic Ricci curvature 
bounds by Lott, Sturm, and Villani, and the solution 
of Kac’s problem of propagation of chaos for the spatially
homogeneous Boltzmann equation by Mischler
and Mouhot (after preliminary work by Carlen, Villani,
and others). 


3 Landau Damping and 
Collisionless Relaxation 


For the first forty years after its birth, kinetic theory 
mostly focused on the effect of collisions, which are 
brutal encounters between particles. A notable devel- 
opment, starting with Lorentz in 1905, was the intro- 
duction of transport equations to describe the motion 
of particles wandering in an array of scattering obsta- 
cles, such as beams of electrons or neutrons in metals. 
The resulting linear collisional equations would later 
be found to have considerable importance in nuclear 
physics. 

However, around that time it was also realized that, 
in various cases, the collective effect of particles on 
each other is more important than collisions and leads 
to a rich variety of behaviors. This was the beginning 
of mean-field (noncollisional) theory.

Mean-field theory was first introduced in galactic
dynamics. In 1915 Jeans discussed the use of the Boltzmann
equation to model the evolution of galaxies over
millions or billions of years, with each star considered
as a particle. He came to the conclusion that, as a first
approximation, collisions can be dropped and one can
model the interaction by letting each particle feel a
force field that is the resultant of all other particles.

The story was then repeated in plasma physics, which 
is primarily governed by the Coulomb interactions 
between electrons. For such interactions the collision 
kernel can be computed explicitly but leads to a diverg- 
ing collision operator. In 1936 Landau remedied this 
situation by replacing the Boltzmann operator by an 
integro-differential collision operator: 


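$$Q_L(f,f) = \frac{\log\Lambda}{2\pi\Lambda}\,\nabla_v\cdot\left(\int_{\mathbb{R}^3} a(v - v_*)\,\bigl[f(v_*)\,\nabla_v f(v) - f(v)\,\nabla_v f(v_*)\bigr]\,dv_*\right), \tag{5}$$

where $\Lambda$ (the plasma parameter) is a large constant and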
where $a_{ij}(z) = (\delta_{ij} - z_i z_j/|z|^2)/|z|$. But in 1938 Vlasov
pointed out that the effect of collisions can also be 
disregarded (except in long-time analysis), so the dis- 
tribution of electrons is mainly subject to the force 
generated by electrons, coupled with the electrostatic 
equations of Maxwell. 


Be it for galaxies or for plasmas, in both situations 
the basic evolution equation is 

$$\frac{\partial f}{\partial t} + v\cdot\nabla_x f + F[f]\cdot\nabla_v f = 0,$$

$$F[f](t,x) = -\iint_{\Omega\times\mathbb{R}^3} f(t,y,w)\,\nabla W(x - y)\,dy\,dw, \tag{6}$$

where $\Omega$ is the position domain and W is the interaction
potential, which is assumed to be even. This equation, 
which comes with various admissible boundary condi- 
tions, is usually called the Vlasov equation, although 
the collisionless Boltzmann equation might be a his- 
torically more appropriate name. The most important 
cases are when W is the fundamental solution of the 
Poisson equation [III.18] $-\Delta W = \delta$ (Coulomb interactions,
positive type) or $\Delta W = \delta$ (Newton interaction,
negative type). Here, the notion of “type,” coming
from harmonic analysis, refers to the sign of the
Fourier transform of the fundamental solution. These 
cases give rise to the so-called Vlasov-Poisson equation. 

With collisions absent, the most striking features of 
the Boltzmann equation disappear; thus, (6) does not 
possess any meaningful Lyapunov functional, except 
that its solutions satisfy the conservation of energy, 

$$E = \frac{1}{2}\iiiint f(x,v)\,f(y,w)\,W(x - y)\,dx\,dy\,dv\,dw + \iint f(x,v)\,\frac{|v|^2}{2}\,dx\,dv,$$

and conservation of all nonlinear functions of the density
$\iint C(f)\,dx\,dv$. In particular, the entropy is constant.

Another way in which (6) contrasts with (1) is that 
it has a surprisingly large collection of steady states; 
in the absence of an external field or boundaries, 
this includes, in particular, homogeneous distributions 
$f^0(v)$ but also many inhomogeneous periodic stationary
solutions.

All of this seems to oppose the idea that solutions of 
(6) should display definite long-time behavior. It there- 
fore came as a huge surprise when, in 1946, Landau 
showed that the linearized analysis of (6) for Coulomb 
interactions led to the exponential decay of pertur- 
bations for a large class of equilibria and perturba- 
tions (e.g., if both the equilibrium and the perturba- 
tion are analytic distributions, and if the equilibrium 
has only one maximum in dimension 1 or depends 
only on $|v|$ in dimension 3). This effect has been
dubbed Landau damping. Because it seemed to find 
irreversibility where everything looked reversible, it 
was probably as striking to contemporary physicists
as Boltzmann’s discovery of the collisional increase of
kinetic entropy, even though its physical impact was 
much more restricted than that of the H theorem. 

In contrast with Boltzmann’s H theorem, which is 
genuinely nonlinear, Landau damping was based on a 
linearized computation. Landau’s results were refined 
and extended by a number of physicists, including 
O’Neil, Penrose, Backus, Maslov, Fedoryuk, and others. 
When experiments became accessible, Landau’s com- 
putations were verified with a good degree of accu- 
racy, and Landau damping became one of the corner- 
stones of modern classical plasma physics. It was later 
exported to galactic dynamics by Lynden-Bell. 

Having been discovered by mathematical computa- 
tion, Landau damping has led to considerable specula- 
tion about its driving mechanism, and a number of mis- 
leading ideas have been generated. The most convinc- 
ing interpretation is that collisionless transport phe- 
nomena involve a mixing of the distribution function 
via very fast kinetic oscillations, which in the stable 
case have a tendency to wipe out inhomogeneities. 
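
The mixing mechanism can be seen in the simplest possible setting, which is a deliberate simplification and not Landau's computation: free transport on a periodic domain, with the force term dropped altogether. If the initial datum is g(v)(1 + eps cos(kx)) with g a Gaussian of temperature T, the solution is g(v)(1 + eps cos(k(x - vt))), and the k-th Fourier mode of the spatial density is proportional to the integral of g(v) exp(-ikvt), which equals exp(-k^2 t^2 T / 2). The sketch below checks this by quadrature; the function name, grid, and parameter values are assumptions.

```python
import numpy as np

def density_mode(t, k=1.0, T=1.0, vmax=12.0, nv=4001):
    """Integral of g(v) * exp(-i k v t) over v, for a Gaussian g of temperature T."""
    v = np.linspace(-vmax, vmax, nv)
    g = np.exp(-v ** 2 / (2.0 * T)) / np.sqrt(2.0 * np.pi * T)
    dv = v[1] - v[0]
    # Fast oscillations in v at large t cancel out in the integral: this is phase mixing.
    return np.sum(g * np.exp(-1j * k * v * t)) * dv

times = np.array([0.0, 1.0, 2.0, 4.0])
numeric = np.array([abs(density_mode(t)) for t in times])
exact = np.exp(-0.5 * times ** 2)   # closed form for k = 1, T = 1

for t, a, b in zip(times, numeric, exact):
    print(f"t = {t:.1f}: quadrature {a:.3e}, closed form {b:.3e}")
```

The distribution function itself never relaxes (it keeps oscillating ever faster in v), yet the observable density inhomogeneity is wiped out; with a self-consistent force the decay is modified but the mechanism is of the same kind.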

Even though the collisional and collisionless analy- 
ses are both idealizations, they constitute the basis of 
most of our current understanding of kinetic theory. 
They can also interact with each other: for exam- 
ple, the tendency to homogenize faster than expected 
can enhance the impact of diffusion or collision on 
relaxation phenomena. 

At the same time as all this analysis was being devel- 
oped, mean-field analysis started to be applied to a 
number of situations outside kinetic theory, in both 
equilibrium and nonequilibrium systems. In particular, 
in a famous discussion of turbulence, Onsager stud- 
ied the incompressible two-dimensional Euler equation 
in vorticity form as a mean-field system of “vortices,” 
presenting many similarities with collisionless kinetic 
theory. 

4 Driving Problems 

The development of kinetic theory, and Boltzmann’s 
very influential book, quickly attracted the attention of 
mathematicians, starting with Hilbert, who formulated 
his sixth problem (related to items (I) and (IV) in the 
list below) under the inspiration of Boltzmann. Hilbert 
himself did some early mathematical study of the Boltz- 
mann equation, as did Carleman in the 1930s and then 
Grad and Kac in the 1950s. These works focused on 
Boltzmann’s collision operator. 


As the theory of partial differential equations (PDEs) 
was making progress, the effects of the tricky operator 
$v \cdot \nabla_x$ in the equation started to be analyzed, both in
the case when the operator stands on its own and when 
it is coupled with other typical operators appearing in 
statistical evolution equations. This can be traced back 
to Kolmogorov’s work on the fundamental solution of 
the kinetic Fokker-Planck equation. 

It took longer for the Vlasov (collisionless) theory to 
make its way into mathematics; this task was under- 
taken only at the end of the 1970s in Russia with the 
work of Arsen’ev and Dobrushin. Soon after, Braun, 
Hepp, and Neunzert followed in the Western world. 

A number of problems emerged from these works, at 
the interface between mathematics and physics; they 
have been driving the field for decades and they trig- 
gered far-reaching developments. For the most part, 
these problems fall into five general themes, all of 
which are related to each other. 

(I) Derivation from first-law principles. Starting from 
fundamental equations such as Newton's laws or 
certain simple diffusive microscopic models, derive 
kinetic statistical equations. To derive collisional 
equations one is often led to justify, directly or 
indirectly, Boltzmann’s chaos assumption: that precollisional
configurations are uncorrelated. Chaos
assumptions also play an important role in noncol- 
lisional models and more generally in the deriva- 
tion of any deterministic equation on the distribution 
function. 

(II) The Cauchy problem and qualitative analysis.
Starting from an initial datum that satisfies certain
assumptions about smoothness and decay at large 
velocities, prove that the solution is well behaved, 
make precise the way in which it solves the kinetic 
equation, and establish whether bounds of smooth- 
ness, large-velocity decay, and strict positivity are 
preserved in time. Is there regularization, or at least 
decay of the amplitude of singularities? In many sit- 
uations, a lack of understanding of the Cauchy prob- 
lem precludes progress in the derivation problem. 

(III) Long-time behavior. Starting close to some equi- 
librium, does the solution remain close to the equi- 
librium for all times (orbital stability)? Does it con- 
verge to the equilibrium or to some other equilibrium 
(dynamical stability)? Starting far from equilibrium, 
does it converge to some equilibrium, and can one 
identify that equilibrium? Are there mixing properties
(e.g., oscillations developing in time and leading
to some weak convergence mechanism)? 

(IV) Relationships with other models. Can one re- 
place, in suitable asymptotic regimes, the kinetic 
equations with reduced models, such as compress- 
ible or incompressible hydrodynamic equations (in 
the hydrodynamic limit, that is when the mean free 
path becomes negligible with respect to the spatial 
scale), or boundary-layer equations (when one looks 
very close to an interface)? Can one couple kinetic 
models with other models? Or reduce the descrip- 
tion by using a multiscale analysis? Can one use 
the kinetic equations to retrieve observable proper- 
ties of fluids, such as thermodynamic laws of pres- 
sure, viscosity dependencies, or phase-transition dia- 
grams? An important limitation of Boltzmann’s clas- 
sical theory is that it covers only perfect fluids, that 
is, those with a pressure law that is proportional to 
the product of the density and the temperature. 

(V) Numerical simulation. Can one devise numerical 
methods that are fast and accurate? Ones that are 
particularly suitable to predict the value of a given
quantity? Ones that satisfy given constraints? Can 
one prove that these schemes converge to the solu- 
tions of the corresponding kinetic equations? The lat- 
ter problem may be strongly related to the derivation 
problem because a number of schemes are based on 
particle simulations. It is also obviously related to the 
analysis of the Cauchy problem. 

In view of their archetypical nature and the role 
they play in fundamental issues such as the arrow of 
time, the basic equations of kinetic theory have aroused 
interest among theoretical physicists, going far beyond 
the range of application of these models. 

We will describe some of these problems in more 
detail after discussing the models more precisely. 

5 The Many Models of Kinetic Theory 

Initially, kinetic theory was devised to model rar- 
efied gas dynamics (the Boltzmann equation), galactic 
dynamics (the mean-field model with Newton interac- 
tion), and ideal plasma dynamics (the mean-field model 
with Coulomb interactions). All three domains of appli- 
cation are important; for instance, the Boltzmann equa- 
tion is crucial in high-altitude aerodynamics, since the 
upper atmosphere is not dense enough for the laws of 
hydrodynamics to apply satisfactorily. More recently,
Boltzmann equations have been found to be useful in
the modeling of nanofluids. 

The kinetic formalism is versatile, and its range 
of application has been widened considerably beyond 
these situations. The many resulting variants of the 
basic equations can be grouped into several categories. 

(1) Classical models with an interaction kernel derived 
from various molecular interactions or modified from 
those that come from the laws of classical physics. 
In particular, since Grad’s work on the properties of 
the linearized Boltzmann operator, one often truncates 
small deviation angles to ensure the angular integra- 
bility of the collision kernel. Under this assumption of 
angular cutoff, the collision operator can be split into 
two parts: 

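$$
\begin{aligned}
Q(f,f) &= Q^+(f,f) - Q^-(f,f) \\
&= \iint B(v - v_*, \omega)\, f(t,x,v')\, f(t,x,v'_*)\, dv_*\, d\omega \\
&\quad - \iint B(v - v_*, \omega)\, f(t,x,v)\, f(t,x,v_*)\, dv_*\, d\omega,
\end{aligned}
$$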

which are called the gain and loss parts of the operator, 
respectively. By contrast, a kernel that is nonintegrable 
in the angular variable is called “noncutoff.” This ker- 
nel corresponds to long-range interactions. Moreover, 
the interaction is called hard if the corresponding col- 
lision kernel is proportional to a positive power of the 
relative velocity, and it is called soft if the kernel is 
proportional to a negative power of the relative veloc- 
ity. In between these extremes lies the Maxwellian case, 
where the kernel does not depend on the relative veloc- 
ity. Hard, Maxwellian, and soft potentials often enjoy 
distinctive properties. A particular case is that of hard 
spheres, in which the kernel is simply proportional to 
$|\langle v - v_*, \omega \rangle|$.

(2) Models obtained from large particle systems by 
putting emphasis on various interactions according to 
physical conditions (density, strength of interaction, 
etc.). Popular and versatile models in this category are 
the Fokker-Planck equations, which date back to the 
1930s and describe the evolution of a crowd of par- 
ticles undergoing stochastic diffusion and determinis- 
tic drift. Systematic derivation of statistical models for 
particle systems goes back to Bogolyubov. It is especially
in the field of plasma physics that this approach
has led to a large number of variants. The Balescu- 
Lenard and Vlasov-Fokker-Planck-Landau equations 
are among the best known of these models; they incor- 
porate both mean-field and collisional interactions, 
with collision operators that behave like nonlinear diffusions
in velocity space while bearing a resemblance
to the integral and bilinear structure of the Boltzmann 
operator (recall (5)). These models try to reduce the 
great variety of processes that go on in plasmas to 
tractable equations. 

(3) Linear models describing the interaction of a parti- 
cle system with a given (deterministic or random) envi- 
ronment. Important examples that fall into this cat- 
egory include the linear Boltzmann equation, which 
describes the scattering of particles by a cloud of ran- 
domly located obstacles; the archetypal kinetic equa- 
tion of Fokker-Planck type (studied by Kolmogorov), 

$$\partial_t f + v \cdot \nabla_x f - \Delta_v f = 0; \qquad (7)$$

and the equations of electron transport, which are use- 
ful in neutronics and in semiconductor theory. More 
generally, we can also include a variety of equations 
describing combinations of transport, scattering, dif- 
fusion, and so on. 

(4) Spatially homogeneous models, in which one studies
solutions that do not depend on the position variable,
only on the kinetic variable. The most important
of the resulting models is the spatially homogeneous
Boltzmann equation $\partial_t f = Q(f,f)$, the study of
which is very well developed; this equation allows the 
understanding of fine properties of the collision oper- 
ator. Additional structure can be achieved by restrict- 
ing the setting even further, e.g., by considering only 
Maxwellian interactions, in which the collision kernel 
$B(v - v_*, \omega)$ depends on only the deflection angle
$\theta$. The dimension can also be reduced, leading for
instance to Kac's one-dimensional caricature of a Boltz- 
mann gas, in which velocities are one dimensional 
and the conservation of energy has been kept but the 
conservation of momentum has been dropped. 

(5) Linearized equations, obtained by looking at first- 
order perturbations. For the Boltzmann equation near a 
homogeneous Maxwellian M = M(v), the linearization 
is 

$$\partial_t h + v \cdot \nabla_x h = Q(h, M) + Q(M, h),$$

which is often further transformed by conjugation with 
a multiplication operator. For the Vlasov equation near 
a homogeneous equilibrium $f^0 = f^0(v)$, this is

$$\partial_t h + v \cdot \nabla_x h + F[h] \cdot \nabla_v f^0 = 0.$$

In both cases, spectral properties depend strongly on 
the interaction potential and have been the object of 
numerous studies. Many variants of these archetypal 
models are available. 


(6) Delocalized models, in which particles are allowed 
to have a nonnegligible interaction range. While this 
procedure is logically inconsistent with the many- 
particle limit, it does produce some useful equations, 
such as the Povzner equation and, especially, the 
Enskog equation, which is used in the description of 
granular matter. 

(7) Models incorporating different physical laws: inelas- 
ticity (replacing the energy conservation by a dissipa- 
tion law; this approach is especially important in the 
modeling of granular matter), quantum physics (either 
by modeling quantum phenomena in the interaction 
terms, thus leading to Boltzmann-Bose or Boltzmann-Fermi
models for boson or fermion collisions, or by
keeping a classical description of collisions but incor- 
porating quantum effects in the computation of the 
cross section), relativity (either by incorporating the 
geometry of special relativity into the laws of inter- 
action, or by coupling a kinetic equation to the con- 
stitutive equations of general relativity), and so on. 
In relation to relativity it should be noted that the 
Einstein equations of general relativity cannot “stand 
on their own” unless one studies the vacuum: these 
equations need to be coupled to an evolution equa- 
tion for matter, satisfying certain conditions. Since the 
pioneering works of Choquet-Bruhat on the Cauchy 
problem in general relativity, the Vlasov equation has 
been studied in this context, giving rise to the so-called 
Vlasov-Einstein model. 

(8) Coagulation-fragmentation models incorporating 
crude modeling of chemical reactions, drop formation 
from molecules via larger and larger gatherings, gela- 
tion problems, etc. The Smoluchowski equation is one 
of the most popular models in this respect. 

(9) Discrete velocity models and lattice models, devised 
to simplify the geometry of collisions and the phase 
space, e.g., for numerical simulations. 

(10) Models appearing in various other physical con- 
texts, such as interactions between waves in models of 
weak turbulence. Another example here is the collision- 
less kinetic equation that is obtained by application of 
the Wigner transform to the Schrödinger equation.

(11) Kinetic equations in interaction with other physical 
phenomena: coupling of radiative transfer and hydro- 
dynamics (in astrophysics or nuclear physics), of par- 
ticles and hydrodynamic fluids (e.g., in sprays), and 
so on. 

(12) Phenomenological models for various interaction
phenomena that are difficult to classify or evade precise
physical modeling: crowds, traffic, disease transmission,
sexual reproduction, etc.
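
As a concrete instance of one entry in this list, Kac's one-dimensional caricature from item (4) can be simulated in a few lines. The sketch below is a minimal illustration, not a reference implementation: the function name, the way pairs are drawn, and the initial data are assumptions made here. Each collision rotates a randomly chosen pair of scalar velocities by a uniformly distributed angle, which preserves the energy $v_i^2 + v_j^2$ of the pair but not its momentum $v_i + v_j$.

```python
import numpy as np

def kac_collisions(v, n_collisions, rng):
    """Apply n_collisions random binary Kac collisions to a copy of v."""
    v = np.array(v, dtype=float)
    n = v.size
    for _ in range(n_collisions):
        # Pick an ordered pair i != j uniformly at random.
        i = rng.integers(n)
        j = (i + 1 + rng.integers(n - 1)) % n
        # Rotate the pair (v_i, v_j) by a uniformly distributed angle.
        theta = rng.uniform(0.0, 2.0 * np.pi)
        c, s = np.cos(theta), np.sin(theta)
        v[i], v[j] = c * v[i] + s * v[j], -s * v[i] + c * v[j]
    return v

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    v0 = rng.uniform(1.0, 2.0, size=1000)   # arbitrary non-equilibrium initial velocities
    v1 = kac_collisions(v0, 20_000, rng)
    print("energy before/after:  ", np.sum(v0**2), np.sum(v1**2))
    print("momentum before/after:", np.sum(v0), np.sum(v1))
```

The total energy agrees before and after up to rounding, while the total momentum drifts away from its initial value, as the description of the model in item (4) indicates.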

6 The Many Mathematical Faces of Kinetic Theory

Modern kinetic theory enjoys an enviable position in 
the mathematical landscape: standing on top of it, the 
curious observer can view most of the regions of analy- 
sis, as well as significant territories of probability and 
geometry. In the past thirty years this theory has inter- 
acted with many other fields and displayed a number 
of sophisticated developments. In the same period, it 
has moved from being a rather minority field to being 
one that is center stage. 

Some of the particular features of kinetic theory 
that differentiate it from other areas of mathematical 
physics are 

• the presence of two variables (position and veloc- 
ity); 

• the omnipresence of large velocities, which cannot 
be truncated in the model; 

• the degeneracy of most equations in the spatial 
variable; 

• the intricate geometry of collisions; 

• the fact that kinetic theory is at a cross-point 
between several areas of modeling; and 

• the interplay of deterministic and chaotic behav- 
ior. 

With these things in mind, here are some of the main 
mathematical tools and trends in kinetic theory. 

Spectral theory. The linearized Boltzmann equation 
was one of the first model cases of study for integro- 
differential operators. Some of the important notions 
here are spectral gap estimates, the Fredholm alter- 
native, self-adjointness, the localization of the essen- 
tial spectrum, compact perturbations, compactness 
of the resolvent, accretivity, etc. 

Nonlinear analysis of the Cauchy problem. Tools 
involve a priori estimates (starting with mass, energy, 
and entropy controls), the Cauchy-Kowalevskaya 
theorem (especially in the short-time derivation of 
the Boltzmann equation, or in early theories of the 
Vlasov-Poisson equation), Kolmogorov-Nash-Moser 
perturbation techniques, the Moser scheme (especially
for the spatially homogeneous Boltzmann
equation for long-range interactions, which has dissipative
features), weak compactness theorems (particularly
in the DiPerna-Lions theory of weak solutions),
Sobolev trace theorems, and weighted functional
spaces of Lebesgue, Sobolev, analytic, and
Gevrey type. Bilinear and trilinear estimates with a 
strong input from harmonic analysis have recently 
been developed for the study of long-range interac- 
tions, in which the collision operator behaves more 
or less like a nonlinear fractional derivation. This 
list must also include both the nonlinear changes of 
variables used by DiPerna and Lions in their notion 
of “renormalized” solutions, and the “gliding” regu- 
larity analysis (regularity obtained after composing 
the function with a transport equation) used for the 
study of fast oscillations of the Vlasov equation in 
large time, etc. 

Harmonic analysis. Fourier analysis, either in the po- 
sition variable or the velocity variable, has played 
a crucial role in various parts of kinetic theory, 
most notably in the analysis of the spatially homo- 
geneous Boltzmann equation with Maxwellian inter- 
actions (for which the Fourier transform of the col- 
lision operator is particularly simple and tractable), 
in the long-time perturbative analysis of the nonlin- 
ear Vlasov equation (Landau damping being analyzed 
mode by mode), in the regularity of the gain part of 
the Boltzmann operator (which is more regular, by a 
fractional amount, than the density function), and in 
velocity-averaging estimates. The latter are intended 
to answer the following type of question. Given an 
equation like $v \cdot \nabla_x f = g$, with certain regularity
information on $f$ and $g$, show that, if a smooth test
function $\varphi$ is given, then $\int f(x,v)\,\varphi(v)\,dv$ enjoys
more regularity than can be predicted solely from the 
regularity of f. This point of view has been extremely 
fruitful in modern studies of the Cauchy problem and 
is based mainly on Fourier or X-ray transforms. 

Entropic inequalities. The analysis of the long-time 
behavior of collisional kinetic equations naturally 
leads to the study of inequalities relating Boltz- 
mann’s entropy to its rate of production. Kac and 
McKean were the first to address these issues from 
a mathematical point of view, and they made the 
connection with information-theoretical inequalities, 
involving Fisher information, for example. A central 
topic in the field came to be known as Cercignani’s 
conjecture: is it true that, under certain conditions
of normalization or regularity, the entropy production
(4) satisfies the functional inequality $D(f) \geq K[S(M) - S(f)]$,
where $M$ is a Maxwellian distribution?
This problem, which is the entropic variant
of a spectral gap inequality, has led to unexpected
and rich developments related to logarithmic
Sobolev inequalities and to the Shannon-Stam and 
Blachman-Stam inequalities. 

Semigroup arguments. At the end of the 1990s, semi- 
group arguments made their way into kinetic theory, 
either through the use of auxiliary diffusion equa- 
tions or via the second variation method introduced 
by Bakry and Émery in their study of logarithmic
Sobolev inequalities. The ideas of Bakry and Émery
were adapted in the context of PDEs by Toscani, 
Arnold, Markowich, and others; a large body of works 
followed on the entropic analysis of the convergence 
to equilibrium after a long time, for both linear and 
nonlinear models in kinetic theory, especially those 
of Fokker-Planck type. Some of the key concepts 
are the $\Gamma_2$ calculus, curvature-dimension inequalities,
and dissipation of entropy production (i.e., the
second time-derivative of the entropy). 

Specific techniques for degenerate operators. Kolmogorov
in the 1930s and Hörmander in the 1960s
founded the theory of hypoellipticity, according to
which certain degenerate operators, like $-v \cdot \nabla_x + \Delta_v$,
generate a regularizing semigroup. This situation,
which occurs frequently in linear-dissipation kinetic 
models, is often treated by commutator estimates. 
The more recent theory of hypocoercivity deals with 
the time decay of semigroups generated by degen- 
erate operators, typically of the form T + A, where 
A is coercive in some appropriate subspace and T 
is skew-symmetric. Paradigmatic examples are
$-v \cdot \nabla_x + \Delta_v - v \cdot \nabla_v$ in $L^2(M\,dx\,dv)$ and $-v \cdot \nabla_x + \Pi_M - \mathrm{Id}$
in $L^2(M\,dx\,dv)$, where $M$ is a Gaussian and $\Pi_M$ is the
orthogonal projection on constant functions.
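
The regularizing effect of a degenerate operator can be made concrete on the one-dimensional version of Kolmogorov's equation (7). A standard way to read that equation, assumed in this sketch, is as the law of the process $dX = V\,dt$, $dV = \sqrt{2}\,dW$, whose covariance from a point start is $\Sigma(t) = \int_0^t e^{As} Q\, e^{A^{T}s}\,ds$. The matrices `A` and `Q`, the function names, and the midpoint quadrature are choices made here for illustration.

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])  # drift of (x, v): dx/dt = v, no drift in v
Q = np.array([[0.0, 0.0], [0.0, 2.0]])  # noise acts on v only (Delta_v with coefficient 1)

def covariance(t, n_steps=2000):
    """Midpoint-rule approximation of the integral of exp(A s) Q exp(A s)^T over [0, t]."""
    h = t / n_steps
    total = np.zeros((2, 2))
    for m in range(n_steps):
        s = (m + 0.5) * h
        exp_As = np.eye(2) + A * s  # exact matrix exponential, because A @ A = 0
        total += (exp_As @ Q @ exp_As.T) * h
    return total

def covariance_exact(t):
    """Closed form of the same integral: [[2 t^3 / 3, t^2], [t^2, 2 t]]."""
    return np.array([[2.0 * t**3 / 3.0, t**2], [t**2, 2.0 * t]])

if __name__ == "__main__":
    for t in (0.1, 1.0):
        S = covariance(t)
        print(f"t = {t}: det = {np.linalg.det(S):.6g}, exact det = {t**4 / 3:.6g}")
```

Although the noise matrix `Q` is singular, the covariance has determinant $t^4/3 > 0$ for every $t > 0$, so the fundamental solution is a smooth Gaussian in both $x$ and $v$. The position variance grows like $t^3$ rather than $t$, which reflects the fact that smoothness in $x$ is obtained only indirectly, through transport.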

Qualitative studies of solutions. The Vlasov equation 
is of hyperbolic type, but the Boltzmann equation is 
of mixed hyperbolic/parabolic type in some sense; a 
number of works and techniques have been devoted 
to the study of the qualitative behavior of solu- 
tions, including regularization, propagation, or decay 
of singularities (often studied in Sobolev spaces), 
wave patterns (in connection with systems of con- 
servation laws and compressible Navier-Stokes equa- 
tions), harmonic analysis, pseudodifferential opera- 
tors, Littlewood-Paley analysis, the Radon transform, 
quantitative uncertainty principles, the self-similar 
ansatz, concentration analysis, and so on. 


Singular limits. These limits, in which a term of the 
equation is enhanced by a diverging coefficient, are 
studied via ansatzes, expansions, spectral theory, 
ergodic theory, etc. They appear in particular in con- 
nection with (a) inviscid or viscous hydrodynamic 
limits, in which the Knudsen number (the ratio of 
the mean free path to the typical length) goes to 0, 
typically leading to an enhanced collision operator,
$\varepsilon^{-1} Q(f,f)$; (b) the homogenization of transport
models, typically leading to an enhanced transport
term, $\varepsilon^{-1} v \cdot \nabla_x$ or $\varepsilon^{-1} F \cdot \nabla_v$; (c) high-frequency
semiclassical limits of Schrödinger equations, via the
Wigner transform; (d) small mass ratio limits, e.g., 
in plasmas, where electrons are much lighter than 
nuclei. 

Differential geometry. Curved phase spaces appear 
naturally in relativistic kinetic theory, either through 
the rules of collisions between particles or because 
the system is considered in a Lorentzian ambient 
space. 

Calculus of variations. When stability is not ensured 
by a Lyapunov functional such as the entropy, sta- 
bility issues can be very tricky. Convexity properties 
can then be crucial in studying the dynamic stabil- 
ity of particular equilibria that are energy minimiz- 
ers. This approach, introduced into hydrodynamics 
in the 1960s, was systematically used from the 1980s 
on in the study of the Vlasov-Poisson equation, with 
the help of notions of concentration-compactness, 
rearrangement, etc. 

Many-particle techniques. The quest for a rigorous 
foundation for the Boltzmann and Vlasov equa- 
tions from particle systems has led to the analysis 
of many-particle systems, obeying the fundamental 
laws of classical or quantum mechanics, in the limit 
where the number $N$ of particles diverges to infinity.
The microscopic equations then depend on all
positions and velocities, say $(x_1, v_1), \ldots, (x_N, v_N)$.
The problem can be formulated in terms of the likely
behavior of the empirical distribution, say,
$\mu^N = N^{-1} \sum_{i=1}^{N} \delta_{(x_i, v_i)}$, or in terms of the limit behavior of
the first-particle marginal of an $N$-particle distribution
$f^N$, satisfying the $N$-particle Liouville equation

$$\partial_t f^N + \sum_i v_i \cdot \nabla_{x_i} f^N - \sum_{i \neq j} \nabla W(x_i - x_j) \cdot \nabla_{v_i} f^N = 0,$$

in various asymptotic regimes where time, space, 
mass, and the strength of interaction may be re- 
scaled. Popular scalings are the Boltzmann-Grad 
limit, pioneered by Grad, Cercignani, and Lanford, 
which led to the nonlinear Boltzmann equation, and 
the mean-field limit, established by Braun, Hepp,
Dobrushin, and Neunzert for smooth interactions, 
which led to the Vlasov equation. Famous variants are 
the probabilistic approach of Kac, in which the deter- 
ministic Newton equations are replaced by a phe- 
nomenological, stochastic microscopic model, and 
the derivation of the linear Boltzmann equation (for 
a so-called Knudsen gas) pioneered by Gallavotti. Key 
concepts in this field are notions of molecular chaos
(asymptotic independence of particles, i.e., low correlations),
perturbative series, particle histories, functional
inequalities in infinite dimension, quantitative
laws of large numbers, central limit theorems, and 
orthogonal polynomials. For quantum particle sys- 
tems, density matrices and Wigner transforms play a 
crucial role. 

Numerical analysis. The simultaneous presence of 
very diverse terms in the equations, the high-dimen- 
sional phase space, the complexity of collisions, 
the presence of large velocities and small densi- 
ties, and the difficulty of accurate experiments have 
all made numerical simulation of kinetic models a 
challenging area. Transport phenomena are most 
often simulated with the help of the method of
characteristics, that is, following particles in phase 
space; but the reconstruction of the density from 
one time step to the next leads to many subtleties, 
since the number of particles used in the simula- 
tion is always much smaller than the actual num- 
ber of particles and since particle trajectories do 
not preserve grids or other discretizations of the 
phase space. Collisional phenomena are tricky to 
compute and were initially handled by stochastic 
methods, based on particle systems obeying more or 
less realistic interaction rules. These schemes were 
founded by Bird in the 1960s and remained dom- 
inant for more than forty years. It is only in the 
last decade that progress in algorithms and computer
power has made more accurate deterministic
methods competitive in cost, at least in certain
situations. Keywords in this area are the splitting 
method, Monte Carlo simulation, consistency analy- 
sis, Lagrangian and semi-Lagrangian methods, spec- 
tral analysis, the Fourier transform, the fast Fourier 
transform algorithm, finite elements, lattice simulation,
conservative schemes, adaptive grids, etc. Specific
methods were developed by Cheng and Knorr,
Sone, Aoki, Babovsky, Neunzert, Wagner, Degond,
Bobylev, Rjasanow, Sonnendrücker, Pareschi, Filbet,
and many others in aeronautics, in astrophysics, and
in plasma physics; there is an enormous amount
of literature, and it is barely touched upon in this
article.
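
The idea of following particles in phase space can be sketched for a collisionless mean-field model. The code below is a minimal illustration under assumptions made here, not one of the schemes cited above: the one-dimensional periodic setting, the smooth potential $W(x) = -\cos x$, the $1/N$ scaling of the force, the leapfrog time stepping, and all function names are hypothetical choices. Each particle obeys $\dot{x}_i = v_i$, $\dot{v}_i = -N^{-1} \sum_j W'(x_i - x_j)$.

```python
import numpy as np

def mean_field_force(x):
    """Force -(1/N) * sum_j sin(x_i - x_j), i.e. -W' averaged over particles for W(x) = -cos(x)."""
    c, s = np.mean(np.cos(x)), np.mean(np.sin(x))
    # sin(x_i - x_j) = sin(x_i) cos(x_j) - cos(x_i) sin(x_j), so the sum costs O(N), not O(N^2).
    return -(np.sin(x) * c - np.cos(x) * s)

def leapfrog(x, v, dt, n_steps):
    """Advance all particles with a kick-drift-kick (leapfrog) scheme; inputs are not modified."""
    x = np.array(x, dtype=float)
    v = np.array(v, dtype=float)
    f = mean_field_force(x)
    for _ in range(n_steps):
        v += 0.5 * dt * f      # half kick
        x += dt * v            # drift along the characteristic
        f = mean_field_force(x)
        v += 0.5 * dt * f      # half kick
    return x, v

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.uniform(0.0, 2.0 * np.pi, size=500)
    v0 = rng.standard_normal(500)
    x1, v1 = leapfrog(x0, v0, dt=0.05, n_steps=100)
    print("mean velocity before/after:", v0.mean(), v1.mean())
```

Because the pairwise forces are antisymmetric, the total momentum is conserved up to rounding, which gives a cheap sanity check. Reconstructing a density on a grid from the particle positions, the step that the text identifies as delicate, is deliberately left out of this sketch.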

Let us conclude our list with two subjects that were
partly motivated by kinetic theory but where the main
impact was made in other parts of mathematics.

Ordinary differential equations (ODEs) with rough 
coefficients. The classical theory of ODEs, say
$\dot{x} = g(t,x)$, requires continuity of $g$ for the local existence
of a flow, and Lipschitz regularity of $g$ for
local uniqueness and continuous dependence. This 
requirement of Lipschitz regularity is often a strong 
restriction in applications to PDEs, especially when g 
depends on the solution and its regularity is a pri- 
ori unknown. As a by-product of their studies of 
the Cauchy problem in kinetic theory, DiPerna and 
Lions came up with a theory of ODEs that provides 
local existence and uniqueness for almost every initial
datum, under a more lenient assumption of Sobolev
regularity (e.g., $g \in W^{1,1}_{\mathrm{loc}}$, $\operatorname{div} g \in L^\infty$, and some
growth condition at infinity). The original proof was
based on the analysis of the transport equation
$\partial_t f + g \cdot \nabla f = 0$ and the renormalization technique; more
recently, the theory was refined by Ambrosio, De Lellis,
and others to include the limit case of bounded
variation regularity and to provide alternative, tra- 
jectorial proofs. This theory has been used in var- 
ious types of PDEs, such as hyperbolic systems of 
conservation laws. 

Optimal transport and metric geometry. In the 1970s 
it was shown by Tanaka that the spatially homo- 
geneous Boltzmann equation with Maxwellian inter- 
actions is contracting for the Wasserstein (optimal 
transport) distance 

$$W_2(\mu,\nu) = \Big( \inf_{\pi \in \Pi(\mu,\nu)} \int |v - v_*|^2 \, d\pi(v, v_*) \Big)^{1/2},$$

where $\mu$ and $\nu$ are two probability measures on $\mathbb{R}^d$,
and the infimum is over all joint probability measures
$\pi(dv\,dv_*)$ with marginals $\mu$ and $\nu$. After a
period in which these connections sank more or less 
into oblivion, the links between optimal transport 
and kinetic equations were renewed in the 1990s and 
led to various uniqueness and stability results. Later, 
the interplay between optimal transport and Boltz- 
mann’s entropy played a key role in the theory of 
nonsmooth Ricci curvature. 
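
In one dimension the Wasserstein distance between two empirical measures is easy to compute, which makes the definition above concrete. The sketch below assumes two samples of equal size with equal weights; in that special case the optimal coupling pairs the sorted samples, so no optimization is needed. The function name is a choice made here, and the shortcut does not extend to dimension $d > 1$.

```python
import numpy as np

def wasserstein2_1d(a, b):
    """W_2 distance between two equally weighted 1D empirical measures of equal size."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("expected two one-dimensional samples of equal size")
    # Monotone rearrangement: the i-th smallest atom of a is matched with the i-th smallest of b.
    return float(np.sqrt(np.mean((a - b) ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.standard_normal(1000)
    print(wasserstein2_1d(sample, sample + 3.0))  # a pure translation by 3
```

Translating a sample by a constant $c$ gives distance $|c|$, and the result does not depend on the order in which the atoms are listed.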

7 Landmarks

A list of fifty striking works, arranged chronologically, 
appears below. These works have punctuated progress 
in kinetic theory and have become classical references. 
Some of these works have opened up a new field of 
research, while others have closed an existing one; 
some are gathered together if they are very close in 
terms of subject matter. The dates that are given are 
those of publication, which sometimes came a few 
years after the actual work was carried out. 

This list is not intended to convey overall importance, 
and the selection is partly a personal matter. It is biased 
in favor of theoretical issues and does not do justice to 
some subjects that are of great importance in industrial 
applications, such as neutron transport or coagulation- 
fragmentation. Similarly, it does not even touch on the 
enormous and inventive body of work that has been 
undertaken in the field of numerical simulations.

The list also partly reflects the history of mathe- 
matical kinetic theory: at first, it was quite a minority 
subject, with rare contributions; but a few centers of 
wider study emerged after World War II (New York (at 
the Courant Institute), Osaka/Kyoto, Göteborg, Rome,
Moscow, and Zurich). In the 1980s and 1990s, the study 
of nonlinear problems started to flourish, with the 
French and Italian schools taking the lead, and the com- 
munity became quite organized; at the same time, new 
research groups were emerging, most notably in Ger- 
many, the United States, Austria, Spain, Taiwan, and 
Canada. The past twenty years have been character- 
ized by a growing and fruitful interplay between kinetic 
theory and other mathematical fields, by an emphasis 
on quantitative results and constructive methods, and 
by renewed study of the linearized and perturbative 
settings. 

(1) Hilbert (1912). The first study of the linearized 
Boltzmann operator and the first formal expansion of 
the solution of the Boltzmann equation in powers of 
the (small) Knudsen number near the hydrodynamic 
regime. 

(2) Chapman and Enskog (1917). Systematic formulas 
for deriving macroscopic transport coefficients from 
microscopic interactions and for perturbative expan- 
sion near the hydrodynamic regime (as an alternative 
to Hilbert’s work). 

(3) Carleman (1932). The first solution of the nonlinear 
Cauchy problem for the spatially homogeneous Boltzmann
equation, for hard spheres; this included a qualitative
study of the lower bound for the density and a
study of the convergence to equilibrium. 

(4) Kolmogorov (1934). Computation of the fundamental
solution of the kinetic Fokker-Planck equation
$\partial_t f + v \cdot \nabla_x f = \Delta_v f$, displaying (hypoelliptic) regularity
properties. 

(5) Grad (1949). A thirteen-moment system describing 
a high-order approximation to the hydrodynamic limit 
of the Boltzmann equation. 

(6) Kac (1954). The probabilistic foundation of kinetic 
theory through a phenomenological stochastic many- 
particle model of the spatially homogeneous Boltz- 
mann equation; this led to conjectures on quantitative 
relaxation rates. 

(7) Backus (1960), Penrose (1960). A mathematical treatment
of linear Landau damping, discovered by Landau
in 1946, with sharp criteria for stability; statement of 
the nonlinear damping problem. 

(8) Carleman (1949, 1957), Grad (1963-65). The mod- 
ern spectral theory of the linearized Boltzmann equa- 
tion with cutoff, for hard interactions, in the homoge- 
neous and inhomogeneous settings. 

(9) McKean (1965). Probabilistic study of Kac’s cari- 
cature of the Boltzmann equation through molecular 
chaos, Fisher information estimates, and the quantita- 
tive central limit theorem for Maxwell interactions. 

(10) Bird (1966). The first numerical scheme for the 
stochastic simulation of the Boltzmann equation; an 
alternative scheme was later introduced by Nanbu 
(1983). 

(11) Hörmander (1967). General criteria for the hypoellipticity
of degenerate diffusion equations, including a
precursor to velocity-averaging lemmas. 

(12) Gallavotti (1969). Derivation of the linear Boltz- 
mann equation from the Lorentz gas through averag- 
ing over a random environment. Many developments, 
by Pulvirenti, Desvillettes, and others, followed from 
this work. 

(13) Arkeryd (1972). The Cauchy problem for the spa- 
tially homogeneous Boltzmann equation with hard 
potentials, in weighted $L^1$ spaces, including weak
compactness properties in $L^1$.

(14) Tanaka (1973). Contraction properties of the spa- 
tially homogeneous Boltzmann equation with Maxwell 
kernel in the Wasserstein $W_2$ distance.

(15) Lanford (1974). Short-time derivation of the Boltz- 
mann equation from deterministic Newton laws. 





(16) Ukai (1974). Perturbative solutions of the full 
inhomogeneous Boltzmann equation, based on the 
spectral theory of the linearized equation. 

(17) Bobylev (1976-88). Systematic study of the spa- 
tially homogeneous Boltzmann equation with Maxwell 
interactions via the Fourier transform. 

(18) Braun and Hepp (1977), Dobrushin (1979), Neun- 
zert (1984). A rigorous mean-field limit for the Vlasov 
equation with smooth interactions. 

(19) Sznitman (1984). Propagation of chaos and a prob- 
abilistic derivation of the spatially homogeneous Boltz- 
mann equation with hard spheres. 

(20) Golse, Perthame, and Sentis (1985). The start of
the systematic study of velocity-averaging lemmas in 
Sobolev spaces, which had been independently intro- 
duced by Agoshkov (1984) shortly before. 

(21) Glassey and Strauss (1986). The Cauchy prob- 
lem for the relativistic Vlasov-Maxwell equation, con- 
ditional to a conjectured property of compact support. 

(22) Bony (1987). A new Lyapunov functional for the 
discrete-velocity inhomogeneous Boltzmann equation 
in one space dimension; this was the starting point 
for various Lyapunov functionals for the Boltzmann 
equation in one space dimension. 

(23) DiPerna and Lions (1989). The existence and sta- 
bility of weak solutions (“renormalized solutions”) in 
the large, for the nonlinear inhomogeneous Boltzmann 
equation. 

(24) Bardos, Golse, and Levermore (1991). A systematic 
program for the proof of hydrodynamic limits of weak 
solutions, in particular in the incompressible regime; 
this program would take twenty years to complete. 

(25) Lions and Perthame (1991), Pfaffelmoser (1992). 
The first proofs of existence and uniqueness of classi- 
cal solutions for the three-dimensional Vlasov-Poisson 
equation, by two different approaches. 

(26) Desvillettes (1989), Carlen and Carvalho (1992, 
1994). The first lower bounds on the instantaneous 
rate of entropy production in the Boltzmann equation 
through the quantitative H theorem and information 
theory. 

(27) Desvillettes (1993). Refined moment estimates for 
the spatially homogeneous Boltzmann equation; in par- 
ticular, their immediate appearance in the case of hard 
potentials. 

(28) Lions (1994). The regularity of the gain term of the 
Boltzmann collision operator, which is shown to have
the structure of a singular integral operator, gaining
up to one derivative in three dimensions for smooth 
kernels. 

(29) Desvillettes (1995). The first evidence of regulariza- 
tion due to long-range interactions in the Boltzmann 
equation on a spatially homogeneous caricature; the 
start of a long series of works on such regularization 
effects. 

(30) Gérard, Markowich, Mauser, and Poupaud (1997),
Lions and Paul (1993). The systematic study of high- 
frequency limits through the Wigner transform, with 
applications to quantum kinetic theory. 

(31) Erdős and Yau (1998). Derivation of the linear
quantum Boltzmann equation, in the weak coupling limit,
for the Wigner distribution of a quantum particle in a 
random environment. 

(32) Mischler and Wennberg (1999), Lu (1999). Optimal 
conditions for the well-posedness of the Cauchy prob- 
lem of the spatially homogeneous Boltzmann equation 
with hard interaction and cutoff. 

(33) Carlen, Gabetta, and Toscani (1999). Optimal rates 
of convergence to equilibrium for the spatially homo- 
geneous Boltzmann equation with Maxwell interaction 
and angular cutoff (the removal of the cutoff was later 
obtained in subsequent works involving Wennberg, 
Dolera, and Regazzini). 

(34) Toscani and Villani (1999), Villani (2003). Sharp 
entropy production bounds for the Boltzmann equa- 
tion, solving, or nearly solving (depending on assump- 
tions), Cercignani’s conjecture; this work was based 
on semigroup methods, information theory, and the 
Landau equation. 

(35) Guo (2002). The first of a series of works using 
energy methods to work out robust perturbative the- 
ories of the Boltzmann equation and other kinetic 
models. 

(36) Carlen, Carvalho, and Loss (2003), Maslen (2003). 
Determination of the $L^2$ spectral gap for Kac’s random
walk in arbitrarily large dimension, after a uniform
lower bound was established by Janvresse. 

(37) Carlen and Lu (2003). Examples of arbitrarily slow 
convergence to equilibrium for the Boltzmann equation 
with Maxwell interactions. 

(38) Bobylev, Gamba, and Panferov (2004), Gamba, Pan- 
ferov, and Villani (2004). Moment estimates and the 
Cauchy problem for the inelastic spatially homoge- 
neous Boltzmann equation with hard interactions. 

(39) Alexandre and Villani (2004). Weak solutions for
the spatially inhomogeneous Boltzmann equation with- 
out cutoff, and the asymptotic regime of predominantly 
grazing collisions in the spatially inhomogeneous case. 
This came after much progress in the understanding 
of grazing collisions by Alexandre, Desvillettes, Villani, 
and Wennberg. 

(40) Liu and Yu (2004). The first works on pointwise 
stability and the Green function in Boltzmann theory, 
and in kinetic shock wave analysis (motivated by an 
earlier work of Caflisch), by analogy with systems of 
conservation laws; the start of a long series of works. 

(41) Golse and Saint-Raymond (2004). A rigorous proof 
of the incompressible hydrodynamic limit for weak 
solutions of the Boltzmann equation; this came after 
works by Golse, Levermore, Masmoudi, and others. 

(42) Desvillettes and Villani (2005). Quantitative
convergence to equilibrium for the Boltzmann equation,
far from equilibrium, by entropy methods, under a 
conjectural condition of regularity bounds. 

(43) Baranger and Mouhot (2005), Mouhot (2006), Gualdani,
Mischler, and Mouhot (2013). Optimal rates of convergence
for the Boltzmann equation (homogeneous
and inhomogeneous), coupling quantitative spectral 
analysis to entropy methods, conditional on regularity. 

(44) Mischler and Mouhot (2006). A proof of Haff’s law 
of decay of temperature and self-similar stability in the 
theory of granular (inelastic) gases. 

(45) Villani (2009). General criteria for hypocoercivity, 
in both linear and nonlinear situations. 

(46) Gressman and Strain (2011), Alexandre, Morimoto,
Ukai, Xu, and Yang (2011). Construction of smooth 
solutions for the noncutoff spatially homogeneous 
Boltzmann equation, for potentials that are hard or not 
too soft. 

(47) Lemou, Méhats, and Raphaël (2011). Orbital stability
of spherical monotone equilibria of the gravitational
Vlasov-Poisson equation; the culmination of a
long series of works on the stability of the Vlasov- 
Poisson equation by Antonov, Wolansky, Strauss, Guo, 
Rein, and others. 

(48) Mouhot and Villani (2011). A proof of Landau 
damping for the nonlinear Vlasov equation, near sta- 
ble homogeneous equilibria, in analytic or Gevrey regu- 
larity, via phase mixing and gliding regularity; this was 
later adapted by Bedrossian and Masmoudi to inviscid 
damping near Couette flow. 


(49) Mischler and Mouhot (2013). Significant progress 
on Kac’s program: relaxation estimates for particle sys- 
tems, quantitative and uniform in the number of parti- 
cles, in the limit of the spatially homogeneous Boltz- 
mann equation, using quantitative chaos properties 
and entropic estimates. 

(50) Escobedo and Velázquez (2013). A rigorous proof of
blow-up (Bose-Einstein condensation) for the quantum 
spatially homogeneous Boltzmann-Bose equation. 

8 Challenges 

While kinetic theory has come a tremendous distance, 
the field is still driven, among other motivations, by the 
ambition to understand certain famous, monstrously 
difficult problems. A list of some of these problems
follows below, gathered together under a few main 
themes. 

8.1 The Cauchy Problem, Regularity, Singularities, 
and Finite-Time Qualitative Behavior 

The most important and annoying open problems in 
this list are certainly the related questions of the reg- 
ularity and well-posedness of the Boltzmann equation 
when no perturbative or spatial homogeneity assump- 
tions are imposed. Parallels could be drawn between 
this and the Millennium Prize Problem on the incom- 
pressible Navier-Stokes equation in three dimensions. 
For collisionless kinetic equations, even if the Cauchy 
problem for the Vlasov-Poisson equations has been 
tamed, other Herculean tasks concerning more intricate
models remain. The Cauchy problems for the
Vlasov-Maxwell and Vlasov-Einstein equations are of 
particular interest. Actually, for these two equations 
even the perturbative theory is far from well under- 
stood. In a completely different direction, the stability 
of homogeneous solutions remains almost untouched 
in the theory of the inhomogeneous inelastic Boltz- 
mann equation; the annoying issue here is that nobody 
has been able to prove that clustering is possible, which 
is well accepted in physics. Finally, much remains to be 
understood about very soft interactions (when the col- 
lision kernel behaves like a large negative power of the 
relative velocity). 

8.2 Long-Time Behavior 

The entropic relaxation for the Boltzmann equation 
now seems quite well understood without boundaries, 
with robust estimates and recipes applying far from 
equilibrium as well as optimal decay from quantitative 
linearized arguments. Collisionless relaxation in Vlasov
theory, on the other hand, is understood only near 
a stable homogeneous equilibrium, and the stability 
of inhomogeneous stationary solutions, such as so- 
called BGK waves (named after Bernstein, Greene, and 
Kruskal), remains a famous open problem. The math- 
ematical theory of instability phenomena is also wide 
open. The study of the long-time behavior of “typical” 
data, e.g., via a statistical approach, is untouched. 

A long-term goal is the combination of entropic and 
mixing effects in the study of convergence, e.g., for 
the so-called Vlasov-Poisson-Fokker-Planck equations, 
which combine mean-field mixing and hypoelliptic dif- 
fusion. 

For dissipative equations with nonreversible station- 
ary states, corresponding to nonlocal cancelations in 
the equation, long-time study is in its infancy. For non- 
linear models, even when the existence of equilibria is 
proven, there is generally no Lyapunov-type approach 
to the long-time behavior. 

8.3 Meso-Macro Limits: Hydrodynamic Limits 

Huge progress has been made in understanding the 
hydrodynamic limit of Boltzmann equations since this 
problem was expressed by Hilbert more than a century 
ago. And yet many questions remain unanswered. The 
incompressible limit is now rather well understood, 
but this limit is quite specific, and one would like to 
understand the more natural compressible limit better. 
Examples of problems in this area include the long-time 
stability of Boltzmann solutions near a smooth solu- 
tion of compressible Navier-Stokes equations, and the 
handling of shocks in the large. 

As already mentioned, the hydrodynamic limit of the 
Boltzmann equation leads only to perfect fluids, unless 
the equations are modified in a phenomenological way. 
To retrieve alternative pressure laws from basic prin- 
ciples of classical mechanics, the most natural plan 
is to go directly from the equations of microscopic 
many-particle systems to hydrodynamic models, with- 
out passing through the mesoscopic scale. This strat- 
egy was first made precise in a program sketched by 
Morrey in the 1950s that proved to be extraordinarily 
difficult and is still largely open in spite of substantial 
progress by Varadhan, Yau, Olla, and others. 

8.4 Microscopic Derivation 

The derivation of kinetic models from the laws of atomistic
matter is also part of Hilbert’s sixth problem, and it is an
emblematic issue in both kinetic theory and statistical physics.

In the collisional case, the most important open prob- 
lem is certainly the validity of the Boltzmann-Grad 
limit for hard spheres in large time (i.e., in time sig- 
nificantly larger than the mean free time) and without 
any assumption of very small mass. This problem prob- 
ably includes an understanding of the regularity of the 
inhomogeneous Boltzmann equation, so it can be con- 
sidered a Holy Grail in the field. Another open problem 
is the low-density limit in the case of long-range colli- 
sional interactions; in this case, not even a short-time 
result has been established. 

In the collisionless case, the main open problem is the 
rigorous justification of the mean-field limit in the case 
of Coulomb and Newton interactions. The best results 
so far were obtained by Hauray and Jabin around 2007, 
but they still require smoothing or cutoff of the inter- 
action at small scales. A further goal is the understand- 
ing of the microscopic derivation of the many involved 
models that appear in plasma physics, one instance 
being the Balescu-Lenard equation (for which even the 
short-time well-posedness is still unclear). 

Finally, in the case of diffusive kinetic equations, one 
of the most appealing open problems is the derivation 
of the heat equation from a set of interacting oscilla- 
tors, as studied, for instance, by Rey-Bellet. While pre- 
liminary works have established, among other things, 
the existence of relevant equilibria, the derivation of 
the heat equation has been understood only in partic- 
ular cases, with the help of hypoelliptic and hypocoer- 
cive tools. 

8.5 The Challenge of Boundary Conditions 

Questions about the interaction of gases with bound- 
aries or external forces were raised in the early days 
of kinetic theory by both Maxwell and Boltzmann. 
In the real world, most phenomena involving many- 
particle systems include nontrivial geometries, bound- 
ary effects, or external fields. But the Boltzmann and 
Vlasov equations are still poorly understood in this 
respect, in particular concerning the geometry-driven 
asymptotic behavior. Even for the hypoelliptic kinetic 
Fokker-Planck operators in a domain, there is no equiv- 
alent to the huge body of work on the eigenvalue prob- 
lem of the Laplace equation in a domain. The major- 
ity of the results and challenges discussed above lend 
themselves to boundary-driven formulations, which are 
mostly open. 

Specific boundary-related problems raise beautiful
challenges: propagation of singularities according to 
the shape of boundaries, ergodicity, relaxation to equi- 
librium, and so on. 

An even more ambitious goal is the understanding of
self-induced nontrivial geometry, as observed in particular
in galactic dynamics, where the geometry of the
confinement is influenced by the gravitational mean
field of the system itself.

IV.26 Continuum Mechanics

Richard D. James 


1 What Is Continuum Mechanics? 

Matter is composed of atoms. Atoms are composed 
of protons, neutrons, and electrons. Protons and neu- 
trons are composed of elementary particles. Every- 
thing is discrete and, at the finest level, indivisible. It 
is quite surprising, then, that continuum mechanics, 
possibly the most successful theory of general use in 
applied mathematics, does not explicitly recognize the 
existence of atoms. 

We will discuss the relationship between atoms and 
continuum mechanics later, but the subject is indeed 
used a lot; well over half the articles in this volume 
directly use one or another special case of the basic 
equations of continuum mechanics. Of course, these 
cases exist as subjects in their own domains, and the 
focus of these special cases may be on the beautiful 
phenomena they describe, as one can see. However,
there is a structure that is common to all of them. From 
the perspective of continuum mechanics, theories that 
appear to be completely different become quite similar 
when viewed from within this structure. 

As such, continuum mechanics has the same advan- 
tage as any other unifying theory of mathematics: by 
knowing the structure, one can understand many spe- 
cial cases by remembering only a few key concepts. 
We can in fact go further: by knowing the structure,
one can more easily discover new special cases. This
activity usually takes the form of the discovery of new 
mathematical theories for emerging materials. 

Continuum mechanics usually gives rise to partial 
differential equations (PDEs). In modern research there 
is a healthy interaction between continuum mechanics 
and PDEs. Although linearization of these equations is 
possible and useful, the equations that arise are almost 
always nonlinear, and continuum mechanics has per- 
haps been the primary driving force behind the devel- 
opment of methods for solving nonlinear PDEs. Other 
subjects also give rise to PDEs, Maxwell’s equations in 
electromagnetism, for example. But electromagnetism 
describes the electric and magnetic fields between the 
atoms, as well as macroscopic fields, and it is not usu- 
ally considered a branch of continuum mechanics. On 
the other hand, micromagnetics — the theory of mag- 
netism that describes magnetic domains, the magneti- 
zation of an iron bar by an applied field, and the writing 
of a bit in magnetic recording— is a continuum theory. 

The three pillars of continuum mechanics are kine- 
matics, balance laws, and constitutive equations. The 
central philosophy behind the subject is to separate 
as much as possible the hypotheses that are satisfied 
by all materials (or, realistically, large classes of mate- 
rials) from those hypotheses that pertain to special 
materials, like elastic solids, viscous fluids, or magnetic 
materials. 

Part of the reason continuum mechanics does not 
explicitly recognize the presence of atoms is that 
its main structure was described before the exis- 
tence of atoms was accepted. The founder of con- 
tinuum mechanics was Euler, and he conceived its 
main assumptions in the period 1740-60, half a cen- 
tury before Dalton’s vague inferences about atoms. 
Other early contributors to continuum mechanics were 
Cauchy and Kirchhoff, and also Hooke, Navier, Poisson, 
Stokes, Maxwell, Saint-Venant, Kelvin, Gibbs, Duhem, 
and others. There was a resurgence of interest in the 
subject in the late 1940s and 1950s, coincident with 
the rise of materials science and polymer chemistry, 
as new solids and fluids emerged that were clearly not 
described at all well by the then-known equations of 
mechanics. This resurgence was led by Coleman, Erick- 
sen, Noll, Oldroyd, Markovitz, Reiner, Rivlin, Serrin, 
Toupin, and Truesdell (along with many others) and 
was also synergistic with the emergence of numerical 
analysis, scientific computation, and, as noted above, 
materials science and PDEs. Today, there is a second 
resurgence, and this is described below. 

2 Tensor Analysis

The conventional language of continuum mechanics is 
tensor analysis. To begin, this language concerns vec- 
tors. A vector in two or three dimensions is an arrow. 
Picture an arrow pointing at the sun, based at the cen- 
ter of Stonehenge, whose length is the intensity of light 
at noon on January 1, 2000. Its physical significance 
is clear, irrespective of how we describe it mathemat- 
ically. If you ask four people to describe this arrow in 
mathematical terms (without communicating with each 
other), one person might give a list of three numbers 
$(a,b,c)$. Someone else might write numbers $(d,e,f)$.
Each will likely have chosen a different basis. Inspired
by the shape of Stonehenge, maybe someone will have
used a cylindrical polar coordinate system, with a list
$(r,\theta,z)$ denoting the standard polar coordinates of the
tip of the arrow with respect to that person's choice
of basis. Someone else might simply say $v$. The purpose
of tensor analysis is to give precise rules that
relate $(a,b,c)$ and its basis to $(d,e,f)$ and its basis.
The underlying principle is that everybody describes 
the same arrow. 

In continuum mechanics one often deals with vector 
fields. These are familiar from weather maps. In fact, 
continuum mechanics has a lot to say about the pat- 
terns of vectors on those maps. To describe them quan- 
titatively one needs a basis erected at each point on the 
map. Of course, each of these bases could be the same 
(up to their choice of origin); this case is called the nat- 
ural basis of a Cartesian coordinate system. There are 
ways of constructing “natural” bases associated with 
other coordinate systems, i.e., systems of linearly inde- 
pendent vectors, erected at each point, that are parallel 
to the coordinate curves. Tensor analysis deals in an 
automatic (though somewhat laborious) way with this 
case too, giving laws for transforming lists $(a,b,c)$ that
now depend on each point in $\mathbb{R}^3$ and a choice of basis
at each point to new lists for another field of bases on
$\mathbb{R}^3$. Common vector fields in continuum mechanics are
position, velocity, acceleration, vorticity, and traction. 

Typically, the vector field on the weather map (the
velocity field) satisfies some equations of continuum
mechanics. In general, the set of equations satisfied by
those arrows is exceedingly complicated and actually
not fully known. That is because, to construct those
vector fields from measurements, there is a tremendous
amount of averaging going on, the winds are
decidedly turbulent, and the modeling of evaporation
and condensation as occurs in clouds is not fully
understood. Nevertheless, if one did know these equations
precisely, no matter how complicated their forms
or the methods needed to solve them, they would have
the following property that is shared by all equations
of continuum mechanics: if the components of the vector
field with respect to one coordinate system satisfy
these equations, and one changes the basis field, then
the form of the equations has to change in just the right
way to ensure that the components in the new basis
field automatically satisfy the new equations.

As a simple example, the arrows on the weather map
may approximately satisfy $\operatorname{div} v = 0$. This is
$\partial v_1/\partial x_1 + \partial v_2/\partial x_2 = 0$
expressed in the natural basis of a rectangular
Cartesian system. The same vector field (the same
arrows!) expressed in the natural (orthonormal) basis
of a polar coordinate system, $(v_r(r,\theta), v_\theta(r,\theta))$, then
automatically satisfies “the equation $\operatorname{div} v = 0$ in polar
coordinates,” namely,
$\partial(r v_r)/\partial r + \partial v_\theta/\partial\theta = 0$. These
two PDEs describe the same property of the arrows on
the map.
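
The agreement of the two forms can be checked numerically.
The following is a minimal Python sketch, not part of the
original article: it assumes an illustrative divergence-free
field $v = (x_1, -x_2)$, and the function names, step size,
and sample point are all hypothetical choices. Both
expressions are approximated by central differences.

```python
import math

def v_cart(x1, x2):
    """Cartesian components (v1, v2) of an assumed sample field."""
    return (x1, -x2)

def v_polar(r, theta):
    """Components (v_r, v_theta) of the same arrows in the orthonormal polar basis."""
    x1, x2 = r * math.cos(theta), r * math.sin(theta)
    v1, v2 = v_cart(x1, x2)
    v_r = v1 * math.cos(theta) + v2 * math.sin(theta)       # projection on e_r
    v_theta = -v1 * math.sin(theta) + v2 * math.cos(theta)  # projection on e_theta
    return (v_r, v_theta)

def div_cartesian(x1, x2, h=1e-5):
    """Central-difference estimate of dv1/dx1 + dv2/dx2."""
    d1 = (v_cart(x1 + h, x2)[0] - v_cart(x1 - h, x2)[0]) / (2 * h)
    d2 = (v_cart(x1, x2 + h)[1] - v_cart(x1, x2 - h)[1]) / (2 * h)
    return d1 + d2

def div_polar_form(r, theta, h=1e-5):
    """Central-difference estimate of d(r v_r)/dr + d(v_theta)/d(theta)."""
    d_r = ((r + h) * v_polar(r + h, theta)[0]
           - (r - h) * v_polar(r - h, theta)[0]) / (2 * h)
    d_t = (v_polar(r, theta + h)[1] - v_polar(r, theta - h)[1]) / (2 * h)
    return d_r + d_t

# Evaluate both forms at the same physical point.
r, theta = 1.7, 0.6
x1, x2 = r * math.cos(theta), r * math.sin(theta)
cart_residual = div_cartesian(x1, x2)
polar_residual = div_polar_form(r, theta)
print(cart_residual, polar_residual)  # both should be close to zero
```

For this field the polar components are $v_r = r\cos 2\theta$
and $v_\theta = -r\sin 2\theta$, so the two terms of the polar
form cancel exactly, as the two residuals indicate.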

The same ideas apply to linear transformations. The 
analogue of an “arrow” for a linear transformation con- 
sists of two pictures: a unit cube and a parallelepiped, 
together with a rule that says which corner of the cube 
goes to which corner of the parallelepiped. In short, a 
linear transformation is a cube-parallelepiped rule. In 
cases in which the linear transformation is not invert- 
ible, the parallelepiped might degenerate; it might lie 
in a plane, for example. 

We can describe any linear transformation quanti- 
tatively by introducing a basis, an orthonormal basis 
aligned with the cube we have chosen, say. Let us say 
that our rule says that the origin goes to itself (i.e., 
there is no translation). The vector associated with the 
(1,0,0) corner of the cube is transformed by our rule to 
a vector with components, say, (a, b, c). Similarly, say 
$(0,1,0) \to (d,e,f)$ and $(0,0,1) \to (g,h,i)$. The linear
transformation is then represented by a matrix

$$F = \begin{pmatrix} a & d & g \\ b & e & h \\ c & f & i \end{pmatrix}.$$

Matrix multiplication of F on any vector (expressed in 
the same basis) gives the components of the vector 
in the parallelepiped to which it is deformed (in the 
same basis). Just as for vectors, the important point is 
that many people will find many different matrices by 
choosing different bases, aligned with the cube or not 
aligned with it, but they all have to describe the same 
cube-parallelepiped rule. 
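
The statement that different matrices describe the same rule
can be illustrated with a short Python sketch. It is only an
illustration under stated assumptions: the corner images, the
second orthonormal basis (the first rotated about the third
axis), and all helper names are hypothetical. The columns of
`Q` hold the new basis vectors expressed in the old basis, so
components transform as $v' = Q^{\mathrm T} v$ and the matrix
as $F' = Q^{\mathrm T} F Q$.

```python
import math

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Assumed images of the cube corners (1,0,0), (0,1,0), (0,0,1).
img_e1 = [2.0, 0.0, 0.0]
img_e2 = [1.0, 1.0, 0.0]
img_e3 = [0.0, 0.5, 3.0]

# The images form the columns of F.
F = transpose([img_e1, img_e2, img_e3])

# A second orthonormal basis: the first one rotated about the third axis.
angle = 0.3
c, s = math.cos(angle), math.sin(angle)
Q = [[c, -s, 0.0],
     [s,  c, 0.0],
     [0.0, 0.0, 1.0]]

v = [1.0, 2.0, -1.0]                        # components in the first basis
v_new = matvec(transpose(Q), v)             # same arrow, second basis
F_new = matmul(matmul(transpose(Q), F), Q)  # same rule, second basis

image_new = matvec(F_new, v_new)                         # computed in second basis
image_old_in_new = matvec(transpose(Q), matvec(F, v))    # computed in first, then converted
print(image_new)
print(image_old_in_new)
```

The two printed lists agree up to rounding: both observers
describe the same deformed arrow, although their matrices
differ.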

In continuum mechanics, linear transformations are
often called tensors. Continuum mechanics has many 
tensors and, like vector fields, they can depend on 
position (“tensor fields”). Typical examples are the 
deformation gradient and stress tensor. 

The bottom line? All these components and bases 
and covariant components and Christoffel symbols at 
the beginning of a continuum mechanics course can 
be pretty daunting. Do they matter? Not really. The 
equations must already have the property of invariance 
under a change of coordinate system that was illustrated
above through the example div v = 0. A vector
is an arrow and a tensor is a cube-parallelepiped
rule. You can always just use one particular rectangular 
Cartesian coordinate system, say, and its natural basis 
field. In that particular system vectors are represented 
by lists and tensors by matrices. A simpler and more 
elegant approach is to think of a vector as an arrow and 
write v. Vectors are combined with each other, and with 
scalars, according to the rules of a vector space. If v(x), 
$x \in \Omega \subset \mathbb{R}^n$, is a vector field, its gradient $\nabla v$ is a tensor
field. Every calculation, every computation, can be done 
using only the abstract rules for derivatives and vec- 
tor spaces. Some researchers in continuum mechanics 
(myself, for example) do everyday calculations in this 
way, never writing a component. In this article we will 
write the equations of continuum mechanics in both an 
abstract way and in a rectangular Cartesian coordinate 
(RCC) system for convenience. 

3 The Essential Structure of 
Continuum Mechanics 

3.1 Kinematics 

Kinematics is the geometry of motion. In continuum 
mechanics, which does not explicitly recognize the 
presence of atoms, motions of bodies are represented 
by functions. These are often assumed to be smooth, 
though some of the most important branches of con- 
tinuum mechanics involve the study of the singulari- 
ties of motions. Motions are fundamental to continuum 
mechanics because they can be studied independently 
of the material from which the body is made. 

There are two ways of describing motions: Eulerian 
and Lagrangian. The terminology is standard but inac- 
curate: Euler introduced the Lagrangian description, 
while d’Alembert and Daniel Bernoulli introduced that 
called Eulerian! The Lagrangian description of motion
is a natural generalization of the description of the
motions of individual particles. We name the particles
$1, \dots, N$ and describe the motion of each particle
as a vector-valued function of time: $y_1(t), \dots, y_N(t)$,
$t > 0$, say. The vector $y_1(t^*)$ is the position vector
of particle 1 at time $t^*$. Inching closer to continuum
mechanics, we could equally well use the notation
$y(1,t), \dots, y(N,t)$, $t > 0$, for the same thing, where
$y : \{1, \dots, N\} \times (0,\infty) \to \mathbb{R}^3$. The Lagrangian
description of motion in continuum mechanics allows the set
of “particles” to belong to an open subset of $\mathbb{R}^3$, so that
the motion is described by $y : \Omega \times (0,\infty) \to \mathbb{R}^3$, where
$\Omega \subset \mathbb{R}^3$. In this form, $y(x,t)$ is the position of the
particle $x$ at time $t$. One can think of “the particle $x$” as a
small lump of solid or fluid, but more about that later.
The choice of $\Omega$ is essentially arbitrary (it just serves
as a way to label particles), but people often choose it to
be the shape of the body at $t = 0$, i.e., $y(\Omega, 0) = \Omega$. If a
motion does not depend on $t$, it is called a deformation.

Figure 1 The Lagrangian description of motion.

A picture of the Lagrangian description of motion is 
shown in figure 1. This description is particularly used 
in solid mechanics, as boundary conditions are usually 
idealized as the pushing or pulling of certain particles 
on the boundary. 

Assuming that the motion is sufficiently smooth, the
velocity is $\dot{y} = \partial y/\partial t$. It has exactly the same
interpretation as in particle mechanics: $\dot{y}(x,t)$ is the velocity
of the particle $x$ at time $t$.

Motions in the Lagrangian description are always
assumed to be invertible. That is, the mapping
$y(\cdot,t) : \Omega \to \mathbb{R}^3$ is invertible at each fixed $t$. The inverse,
$y^{-1}(y,t)$, $y \in \Omega_t$, is defined on the moving domain
$\Omega_t = y(\Omega,t)$. The failure of invertibility, i.e., the
possibility that $y(x_1,t) = y(x_2,t)$ for $x_1 \neq x_2$, would be
interpreted as the interpenetration of matter.

Now, using invertibility, construct the function 

$$v(y,t) = \dot{y}(y^{-1}(y,t), t).$$

In words, this is the velocity of the particle $x = y^{-1}(y,t)$
at time $t$; that is, it is the velocity of the particular
particle that happens to be at the location $y$ at the
same time $t$. This is the Eulerian description of motion,
and $v(y,t)$ is the Eulerian velocity field. An example of
an Eulerian velocity field is the arrows on the weather
map, so long as we can reasonably associate a particular
time with the map, i.e., so long as the velocities
represented by the arrows were measured at the same
time. Sometimes one sees weather videos in which the
arrows change with time during a day, showing how the
winds are changing; this is a direct visualization of the
Eulerian velocity field. Above, we calculated the Eulerian
description from the Lagrangian. The reverse can
be done by noting that, by the definition of $v(y,t)$ given
above, the Lagrangian description $y(x,t)$ satisfies the
ordinary differential equation

$$\dot{y}(x,t) = v(y(x,t),t),$$

$$y(x,0) = x \in \Omega,$$

where we have conveniently chosen $\Omega$ to be the shape
of the body at $t = 0$. This is an ordinary differential
equation in standard form, with x playing the role of a 
parameter. An interesting feature of this ordinary dif- 
ferential equation is that, even if the pattern of arrows 
in three dimensions is quite simple — if, for example, v 
is a low-order polynomial and has no explicit depend- 
ence on t— the solution y(x, t) can be exceedingly com- 
plicated with highly intertwined orbits. (See the article 
on dynamical systems [IV.20] for more on this.) Peo- 
ple have developed this idea into a theory of the mixing 
of substances, like fluids or granular materials. 
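
Recovering the Lagrangian description from an Eulerian
velocity field amounts to integrating this ordinary
differential equation for each label $x$. The sketch below is
a minimal Python illustration under stated assumptions: the
velocity field is a planar rigid rotation chosen only because
its exact flow is known, the integrator is the classical
fourth-order Runge-Kutta scheme (one common choice, not one
prescribed by the article), and the names `velocity` and
`flow_map` are hypothetical. It does not reproduce the
intertwined three-dimensional orbits mentioned above.

```python
import math

def velocity(y, t):
    """Eulerian velocity field v(y, t); an assumed rigid rotation about the origin."""
    return (-y[1], y[0])

def flow_map(x, t_final, steps=1000):
    """Approximate y(x, t_final) by integrating dy/dt = v(y, t), y(x, 0) = x, with RK4."""
    dt = t_final / steps
    y, t = tuple(x), 0.0
    for _ in range(steps):
        k1 = velocity(y, t)
        k2 = velocity(tuple(yi + 0.5 * dt * ki for yi, ki in zip(y, k1)), t + 0.5 * dt)
        k3 = velocity(tuple(yi + 0.5 * dt * ki for yi, ki in zip(y, k2)), t + 0.5 * dt)
        k4 = velocity(tuple(yi + dt * ki for yi, ki in zip(y, k3)), t + dt)
        y = tuple(yi + dt * (a + 2 * b + 2 * c + d) / 6
                  for yi, a, b, c, d in zip(y, k1, k2, k3, k4))
        t += dt
    return y

# The particle labeled x = (1, 0) should arrive near (0, 1) after a quarter turn.
x = (1.0, 0.0)
y_quarter_turn = flow_map(x, math.pi / 2)
print(y_quarter_turn)
```

Here $x$ enters only through the initial condition, which is
the sense in which it plays the role of a parameter.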

Many critically important kinematical quantities are 
calculated from the Eulerian or Lagrangian descrip- 
tions. We mention a few that appear later in this article. 
F = Vy (or, in RCC, Fij = dyi/dxj) is the deformation 
gradient. We can see directly from the abstract formula 
for the gradient, 

y(x + ez, t) - y(x, t) = eVy(x , t)z + °(e) 

= EFZ + o(f), 

that y(-,t ) deforms a tiny cube centered at x to a tiny 
parallelepiped centered at y(x), these being scaled ver- 
sions of the cube-parallelepiped rule associated with 
the tensor F (scaled by e ). As a local statement of invert- 
ibility it is always assumed in continuum mechanics 
that the parallelepiped is oriented and has positive vol- 
ume: detF > 0. In short, F describes local deformation. 
F has the polar decompositions [IV.10 §2] F = RU = 
VR, where R is the rotation tensor (R 1 R = I, det R = 1) 


and the positive-definite symmetric tensors U and V 
are the right and left stretch tensors, respectively. Again 
using the formula for the gradient above, we can think 
of the motion (at fixed time) as locally involving first 
stretching of the tiny cube by U and then rigid rota- 
tion by R to achieve the tiny parallelepiped associated 
with F. If the edges of the cube happen to be oriented 
along the eigenvectors of U, then this initial stretching 
by U produces a rectangular solid rather than a general
parallelepiped, which is subsequently rotated by R. In 
continuum mechanics people often associate with F a 
sphere-ellipsoid rule rather than a cube-parallelepiped 
rule. The resulting ellipsoid is called a strain ellipsoid. 
Its principal axes are eigenvectors of the left stretch 
tensor V. 
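
The decomposition F = RU can be computed in closed form in two dimensions, which gives a small sketch of the "stretch, then rotate" picture. This is an illustration only: the restriction to 2 × 2 matrices, the function name `polar_decomposition_2x2`, and the sample F (a plane shear) are assumptions made here.

```python
import math


def polar_decomposition_2x2(F):
    """Return (R, U) with F = R U, R a rotation, U symmetric positive definite.

    Assumes F is a 2x2 nested list with det F > 0.
    """
    (f11, f12), (f21, f22) = F
    # R^T F is symmetric exactly when tan(theta) = (f21 - f12) / (f11 + f22).
    a, b = f11 + f22, f21 - f12
    r = math.hypot(a, b)  # r^2 = |F|^2 + 2 det F > 0 when det F > 0
    c, s = a / r, b / r
    R = [[c, -s], [s, c]]
    U = [
        [c * f11 + s * f21, c * f12 + s * f22],
        [-s * f11 + c * f21, -s * f12 + c * f22],
    ]
    return R, U


F = [[1.0, 1.0], [0.0, 1.0]]  # hypothetical example: plane shear
R, U = polar_decomposition_2x2(F)

# Recombine to check that the rotation applied after the stretch gives back F.
RU = [[sum(R[i][k] * U[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
```

The eigenvectors of U give the cube orientation that is stretched into a rectangular solid before being rotated by R; the left stretch tensor follows as V = R U R^T.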

From the Eulerian description we get other kinemat- 
ical quantities more typically used in fluid mechanics. 
G = ∇v (or, in RCC, G_ij = ∂v_i/∂y_j) is the velocity gradient.
Its symmetric part D = ½(G + G^T) (or, in RCC,
D_ij = ½(G_ij + G_ji) = ½(∂v_i/∂y_j + ∂v_j/∂y_i)) is called
the stretching tensor or the strain-rate tensor. It plays 
a central role in fluid mechanics. These tensors can 
be given physical interpretations in terms of instant- 
aneous stretching of a small cube in space, as above. 
The key word here is “instantaneous” because the Eulerian
description describes the velocity v(y, t) of a particle
located at y at time t. A short time later, t + δ,
it is a different particle at y whose velocity is given by
v(y, t + δ).

3.2 Balance Laws 

The balance laws of continuum mechanics express the 
fundamental conservation laws of mass, momentum, 
and energy. The reason these are so central to contin- 
uum mechanics is that they can be stated in ways that 
are independent of the constitution of the body, just as 
Newton’s law $f_i = m_i \ddot{y}_i$ holds for a particle i of mass m_i,
regardless of whether it models a steel ball or a droplet,
or whether the force f_i is produced by water resistance
or air resistance. 

The balance of mass is straightforward. We introduce
a positive mass density ρ_0 : Ω → R on Ω with the
interpretation that

$$\int_{\mathcal{D}} \rho_0(x)\,dx, \quad \text{or, in RCC,} \quad \iiint_{\mathcal{D}} \rho_0(x_1,x_2,x_3)\,dx_1\,dx_2\,dx_3, \tag{2}$$

is the mass of D. Moving something around, or deforming
it, even severely, does not change its mass, at least
in classical mechanics. Therefore, since y(D, t) consists
of the same particles as D, its mass at every time
t must also be given by (2). To state this law it is also
useful to introduce a mass density ρ(y, t), y ∈ Ω_t,
on the deformed region Ω_t = y(Ω, t). This mass density
changes with time because the material is generally
compressed or expanded as it deforms. The balance
of mass is the statement that the mass of D never
changes:

$$\int_{\mathcal{D}} \rho_0(x)\,dx = \int_{y(\mathcal{D},t)} \rho(y,t)\,dy \quad \text{for all } \mathcal{D} \subset \Omega,\ t > 0.$$


We see that the right-hand side is in exactly the right
form to use the motion itself as a change of variables,
y → x:

$$\int_{\mathcal{D}} \rho_0(x)\,dx = \int_{y(\mathcal{D},t)} \rho(y,t)\,dy = \int_{\mathcal{D}} \rho(y(x,t),t)\,J\,dx, \tag{3}$$

where J is the Jacobian of the transformation y → x;
that is, J = |det F| = det F, where F = ∇y is the deformation
gradient introduced above. Combining the left
and right of (3) we get


$$\int_{\mathcal{D}} \bigl(\rho(y(x,t),t)\,J - \rho_0(x)\bigr)\,dx = 0, \tag{4}$$

which must hold for all domains D ⊂ Ω and all t > 0,
say. Suppose that for each fixed t the integrand of (4)
is continuous on the open region Ω. Arguing by contradiction,
suppose that the integrand is nonzero, say
positive, when evaluated at some particular x_0 ∈ Ω and
t_0 > 0. We fix t = t_0 and choose D to be a ball of radius
r centered at x_0, with r sufficiently small that this ball
is contained in Ω and that the integrand is positive on
this ball. Then, of course, the integral must be positive,
contradicting (4). The conclusion is that the integrand
must be zero at all x_0 ∈ Ω and all t > 0. This is the
local form of the balance of mass:

$$\rho(y(x,t),t)\,J(x,t) = \rho_0(x), \quad \text{or, briefly,} \quad \rho J = \rho_0. \tag{5}$$

The line of argument just presented, which is called 
localization, is common in continuum mechanics. It per- 
mits the passage from statements summarizing laws 
satisfied “by all subregions” to differential equations. 
The assumption of continuity of the integrand can be 
considerably relaxed by using Lebesgue’s differentia- 
tion theorem. 

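Careful differentiation of the statement in (5) with
respect to t, using the chain rule and the formula for
the differentiation of a determinant, yields

$$J\frac{\partial\rho}{\partial t} + J\,\nabla_y\rho\cdot\dot{y} + \rho\,\frac{\partial \det\nabla y}{\partial t} = J\frac{\partial\rho}{\partial t} + J\,\nabla\rho\cdot\dot{y} + \rho J\,\nabla y^{-T}\cdot\nabla\dot{y} = 0. \tag{6}$$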


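If we now differentiate the fundamental relation (1)
between the Eulerian and Lagrangian descriptions with
respect to x, we see that the term $\rho J\,\nabla y^{-T}\cdot\nabla\dot{y}$ in (6)
can be simplified to ρJ div_y v. Dividing the result by
J > 0 and replacing x by y^{-1}(y, t) everywhere, we get
the Eulerian form of the balance of mass:

$$\frac{\partial\rho}{\partial t} + \nabla\rho\cdot v + \rho\,\operatorname{div} v = \frac{\partial\rho}{\partial t} + \operatorname{div}(\rho v) = 0, \quad \text{or, in RCC,} \quad \frac{\partial\rho}{\partial t} + \sum_i \frac{\partial(\rho v_i)}{\partial y_i} = 0. \tag{7}$$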

The independent variables in (7) are y and t and (7) 
holds on Ω_t × (0, ∞). One can see that it is critically
important in continuum mechanics to keep track of 
independent and dependent variables and domains of 
functions. 
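
As a quick consistency check of the local forms of the balance of mass, here is a worked example under assumptions chosen purely for illustration: a uniform dilation of a body whose reference density ρ_0 is constant. Neither the motion nor the constant density comes from the text.

```latex
\begin{align*}
y(x,t) &= (1+t)\,x, \qquad F = \nabla y = (1+t)\,I, \qquad J = \det F = (1+t)^3, \\
\rho(y,t) &= \frac{\rho_0}{J} = \frac{\rho_0}{(1+t)^3}
  \qquad \text{(Lagrangian form, } \rho J = \rho_0\text{)}, \\
v(y,t) &= \frac{y}{1+t}, \qquad \operatorname{div} v = \frac{3}{1+t}, \\
\frac{\partial\rho}{\partial t} + \nabla\rho\cdot v + \rho\,\operatorname{div} v
  &= -\frac{3\rho_0}{(1+t)^4} + 0 + \frac{3\rho_0}{(1+t)^4} = 0
  \qquad \text{(Eulerian form)}.
\end{align*}
```

Note how the independent variables change between lines: the first line is written in terms of (x, t), the remaining ones in terms of (y, t).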

The condition div v = 0 mentioned earlier in the
context of weather maps can then be seen as a consequence
of the balance of mass in the Eulerian description
together with the assumption that the density ρ
is a constant. Materials for which all motions have 
constant density are called incompressible materials. 

We will not repeat the arguments above for other 
laws, but they follow a similar pattern of introducing 
“densities,” using the motion as a change of variables, 
localization, and passing between Eulerian and Lagran- 
gian descriptions. For example, in the Lagrangian and 
Eulerian forms the balance of linear momentum is 
respectively given by 
$$\begin{aligned}
\frac{d}{dt}\int_{\mathcal{D}} \rho_0\,\frac{\partial y}{\partial t}\,dx &= \text{force on } y(\mathcal{D},t), \\
\frac{d}{dt}\int_{y(\mathcal{D},t)} \rho\,v\,dy &= \text{force on } y(\mathcal{D},t).
\end{aligned} \tag{8}$$

In these two cases there are various useful expressions
for the “force on y(D, t).” We will focus on the form
typically used in the Eulerian description, and we will
omit so-called body forces:

$$\text{force on } y(\mathcal{D},t) = \int_{\partial y(\mathcal{D},t)} t\,dA. \tag{9}$$

The integrand t is called the traction. From this formula
one can see that it represents the force per unit area on
the boundary of the region y(D, t).

In principle, the traction could depend on lots of
things. Certainly, it depends on the material and the
motion. It is likely to be generally different at different
points in Ω_t. One could also imagine that it depends
on the surface S = ∂y(D, t) in some complicated way.
Fix t and fix a point y ∈ Ω_t. Imagine lots of surfaces
passing through y. How are the tractions at y on these 
different surfaces related? 

It was a brilliant insight of Cauchy to see that for a 
broad range of materials, both fluids and solids, the 
traction at y has a particularly simple dependence on 
the surface passing through y . Cauchy’s starting point 
was a plausible expression for t with a rather gen- 
eral dependence on S, from which he deduced, using 
directly the balance of linear momentum in Eulerian 
form, that the dependence of the traction on S is 
through the unit normal only and that this dependence 
is linear: t = Tn (or, in RCC, t_i = Σ_j T_ij n_j). The tensor
T is called the stress tensor. This part of continuum
mechanics is called the theory of stress. (See mechanics
of solids [IV.32 §2.3] for a helpful physical description
of stress.) People do sometimes worry about Cauchy’s 
starting point, i.e., whether some kind of exotic mate- 
rial might transmit forces across a surface in a more 
general way than by having just a linear dependence of 
the traction on the normal to the surface. But, gener- 
ally, Cauchy’s conclusion has been found to be widely 
applicable. 
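
Cauchy's relation t = Tn is easy to explore numerically. The sketch below is illustrative only: the stress components and the helper names `traction` and `normal_and_shear` are hypothetical, chosen here to show how one stress tensor yields different tractions on different surfaces through the same point.

```python
def traction(T, n):
    """Traction t = T n on a surface with unit normal n (3x3 nested list T)."""
    return [sum(T[i][j] * n[j] for j in range(3)) for i in range(3)]


def normal_and_shear(t, n):
    """Split t into its component along n and the remaining tangential part."""
    t_n = sum(ti * ni for ti, ni in zip(t, n))
    shear = [ti - t_n * ni for ti, ni in zip(t, n)]
    return t_n, shear


# Hypothetical symmetric stress at a point (arbitrary units).
T = [[-2.0, 1.0, 0.0],
     [1.0, -3.0, 0.0],
     [0.0, 0.0, -1.0]]

t_top = traction(T, [0.0, 1.0, 0.0])    # surface with normal along y_2
t_side = traction(T, [1.0, 0.0, 0.0])   # surface with normal along y_1
normal_top, shear_top = normal_and_shear(t_top, [0.0, 1.0, 0.0])
```

Here the surface with normal (0, 1, 0) carries a compressive normal traction of magnitude 3 and a shear traction (1, 0, 0), while the surface with normal (1, 0, 0) carries a different traction, even though both pass through the same point.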

If we insert t = Tn into (9), then use the divergence 
theorem in Eulerian form in (8), and finally localize, we 
get the local form of the balance of linear momentum: 



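$$\rho\left(\frac{\partial v}{\partial t} + (\nabla v)\,v\right) = \operatorname{div} T, \quad \text{or, in RCC,} \quad \rho\left(\frac{\partial v_i}{\partial t} + \sum_j \frac{\partial v_i}{\partial y_j}\,v_j\right) = \sum_j \frac{\partial T_{ij}}{\partial y_j}. \tag{10}$$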


There are two more conservation laws: the balance 
of rotational momentum and the balance of energy. 
The former leads to the symmetry of the stress tensor,
T = T^T (or, in RCC, T_ij = T_ji), and the latter is, in
its simplest local form,

$$\rho\left(\frac{\partial e}{\partial t} + \nabla e\cdot v\right) = -\operatorname{div} q + \operatorname{tr}(TD),$$

where e is the internal energy and q is the heat flux. 

Finally, there is a formulation of the second law 
of thermodynamics in continuum mechanics. It is 
most commonly represented by the Clausius-Duhem 
inequality. This inequality embodies in some way the 
fundamental statements of the second law, such as 



it is impossible for a body to undergo a cyclic pro- 
cess that does work but emits no heat, which goes 
back to Carnot. In continuum mechanics the Clausius- 
Duhem inequality plays two important roles. The first 
is restricting constitutive relations so that they do not 
allow behavior as indicated by the italicized statement 
above; for instance, the restricted relations do not per- 
mit the existence of a cyclic energy conversion device 
that produces electricity while completely immersed 
in a container of hot water. Needless to say, the pre- 
cise form of these restrictions is exceedingly impor- 
tant these days. The other is as a restriction on pro- 
cesses for given constitutive relations. The latter is best 
represented by the theory of shock waves, where the 
Clausius-Duhem inequality declares some shock wave 
solutions to be “inadmissible.” 

3.3 Constitutive Equations 

Everything we have said so far holds for broad classes 
of materials, both solids and fluids. Since each new 
equation we have introduced has introduced at least 
one new unknown function, none of the equations we 
have written could be used to predict anything. There 
have to be special equations that quantify the behavior 
of special classes of materials. These are called constitu- 
tive equations. Discovering a simple constitutive equa- 
tion that, when combined with the balance laws, pro- 
vides a good description of a material in a broad class 
of motions is a big success in continuum mechanics. 

Often, constitutive equations take the form of a for- 
mula that relates the stress to the motion. Usually, this 
formula contains some constants that actually spec- 
ify the material. These material constants have to be 
measured experimentally for each particular material 
described by the constitutive equation, and continuum 
mechanics has a lot to say about the design of these 
experiments. An example of a constitutive relation is 
that for the Navier-Stokes fluid, which is defined by 

$$T = -pI + 2\mu\bigl(D - \tfrac{1}{3}(\operatorname{tr} D)I\bigr), \quad \text{or, in RCC,} \quad T_{ij} = -p\,\delta_{ij} + 2\mu\Bigl(D_{ij} - \tfrac{1}{3}\Bigl(\sum_k D_{kk}\Bigr)\delta_{ij}\Bigr), \tag{11}$$

where D is the stretching tensor introduced above, p
is the pressure, and μ > 0 is the viscosity. An important
related constitutive relation is the one for the
incompressible Navier-Stokes fluid, which is defined by

$$T = -pI + 2\mu D. \tag{12}$$

The absence of the last term of (11) in (12) is due to the
condition of incompressibility, which, as noted above,
gives div v = tr D = 0. But also, p means something
different in (11) and (12). In the former it is usually
specified as a function of the density (given in the
simplest case by the ideal gas law), while in the latter
it is treated as an independent function, unrelated
to the motion. In equations of motion for an incompressible
Navier-Stokes fluid, p(y) becomes one of the
unknowns, to be determined as part of the solution of 
a problem. This treatment of p is a consequence of a 
general theory of constrained materials in continuum 
mechanics that includes incompressible materials but 
also treats other kinds of constraints, such as a material 
being inextensible in a certain direction. 
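
To see the incompressible relation T = -pI + 2μD at work, here is a minimal sketch that evaluates it for a steady shear flow. Everything specific in it is an assumption made for illustration: the velocity gradient (that of v = (γ y_2, 0, 0)), the numerical values of p, μ, and γ, and the function names.

```python
def stretching_tensor(G):
    """Symmetric part D = (G + G^T) / 2 of a 3x3 velocity gradient G."""
    return [[0.5 * (G[i][j] + G[j][i]) for j in range(3)] for i in range(3)]


def incompressible_stress(p, mu, G):
    """Stress T = -p I + 2 mu D for an incompressible Navier-Stokes fluid."""
    D = stretching_tensor(G)
    return [
        [(-p if i == j else 0.0) + 2.0 * mu * D[i][j] for j in range(3)]
        for i in range(3)
    ]


gamma = 2.0            # hypothetical shear rate
G = [[0.0, gamma, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]  # velocity gradient of v = (gamma * y_2, 0, 0)

D = stretching_tensor(G)
trace_D = D[0][0] + D[1][1] + D[2][2]          # equals div v
T = incompressible_stress(p=0.5, mu=0.25, G=G)  # hypothetical p and mu
```

The trace of D vanishes, so this flow is compatible with incompressibility, and the only off-diagonal stress is the shear stress T_12 = μγ. The normal stresses are all equal to -p, a point that matters for the rod-climbing discussion later in the article.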

If we substitute (12) into the balance of linear momentum
(10), we get the famous Navier-Stokes equations:
in RCC,

$$\rho\left(\frac{\partial v_i}{\partial t} + \sum_j \frac{\partial v_i}{\partial y_j}\,v_j\right) = -\frac{\partial p}{\partial y_i} + \mu \sum_j \frac{\partial^2 v_i}{\partial y_j^2}, \qquad \sum_i \frac{\partial v_i}{\partial y_i} = 0.$$

These equations represent one of the profound successes
of continuum mechanics. Typically, one nondimensionalizes
the first equation by scaling space by ℓ
and time by τ (these being a length and a time arising
in a problem of interest) and dividing by the (constant)
density ρ, so that the resulting equation contains
a single nondimensional material constant, Re =
ρ(ℓ/τ)ℓ/μ, the Reynolds number. With only one material
constant, Re, an enormous body of quantitative
observation on the behavior of liquids (and even gases 
under some circumstances) can be understood with 
remarkable precision. 
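
The Reynolds number Re = ρ(ℓ/τ)ℓ/μ is simple to evaluate once a length and a time have been chosen. The sketch below uses rough room-temperature values for water (density about 1000 kg/m^3, viscosity about 10^-3 Pa s) together with a length of 1 cm and a time of 1 s; the scales and the function name `reynolds_number` are illustrative assumptions, not values from the text.

```python
def reynolds_number(density, length, time, viscosity):
    """Re = density * (length / time) * length / viscosity (dimensionless)."""
    return density * (length / time) * length / viscosity


# Rough values for water at room temperature, in SI units (assumed here).
re_water = reynolds_number(density=1000.0, length=0.01, time=1.0, viscosity=1.0e-3)

# Doubling the length scale at fixed time scale multiplies Re by four.
re_larger = reynolds_number(density=1000.0, length=0.02, time=1.0, viscosity=1.0e-3)
```

With these numbers Re is about 100. Two flows of different fluids and sizes that share the same Re (and the same geometry) are described by the same nondimensional equation.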

This does not mean that any fluid behaves exactly 
as predicted by the Navier-Stokes theory in all circum- 
stances. Any fluid, if compressed enough, will become 
a solid or, if subjected to a sufficiently strong elec- 
tric field, a plasma. Electrons can be ripped off nuclei 
and nuclei can be split, none of which is described 
by the Navier-Stokes theory. What is prized in con- 
tinuum mechanics is not the reductionist’s quest for 
“truth” but the elegance that comes with discovering an 
underlying simplicity that reveals many phenomena. 

Alas, solids are not so simple, but much behavior can 
be understood from a constitutive equation of the form 

$$T = \hat{T}(F) = R\,\hat{T}(U)\,R^{T}, \tag{13}$$

where F = RU is the polar decomposition discussed
above. This constitutive equation describes a (nonlinear)
elastic material. $\hat{T}(U)$, a symmetric tensor-valued
function of a symmetric tensor, can be pretty complicated;
in RCC, $\hat{T}(U)$ is six functions, each of six variables.
On the other hand, this constitutive relation covers
an enormous range of behavior that would intuitively
be considered “elastic,” as well as, incidentally,
some interesting behavior that would not be considered 
elastic. 

There are lots of ideas that are used to simplify 
the constitutive relation (13) of elasticity. One of the 
most powerful is symmetry. If one deforms a ball B
of rubber by a deformation y : B → R^3, one gets a
certain stress field. If we know the constitutive relation,
we get this stress field by using the formula
$T(x) = R(x)\,\hat{T}(U(x))\,R(x)^{T}$, where ∇y = R(x)U(x)
is the polar decomposition at each x ∈ B. If we take
this ball of rubber, rotate it rigidly in any way, and,
after doing so, place it exactly in the region B, and
deform it again using y : B → R^3, we will in general
get a different stress field. For example, if y : B → R^3
primarily describes an extension in a certain direction,
say “up,” and we happen to rotate it so that a stiff
direction of the rubber ball is oriented up, then we
expect to have to exert larger forces to give it exactly
this same deformation. In fact, rubber and many other
materials are often well described by the assumption of
isotropy. This means that if one undertakes the experiment
described here, one gets exactly the same stress
field, and this condition holds regardless of the rotation
or subsequent deformation. The assumption of
isotropy is exploited by phrasing all the steps in this 
paragraph in mathematical terms. The result is as follows:
a nonlinear elastic material is isotropic if there
are three functions φ_1(I, II, III), φ_2(I, II, III), φ_3(I, II, III)
such that

$$T = \varphi_1 I + \varphi_2 B + \varphi_3 B^2, \tag{14}$$

where B = V^2 = FF^T, and where I = tr B,
II = ½((tr B)^2 − tr B^2), and III = det B are the principal
invariants of B. (In (14) the symbol I multiplying φ_1 is
the identity tensor.) At first, it may look like (14) is not a
special case of (13), but it is; notice that by using the polar
decomposition, B = FF^T = R U^2 R^T. The form of (14) might suggest
that some kind of Taylor expansion is involved, but this 
is not the case. The constitutive equation (14) holds 
for arbitrarily large deformations of isotropic elastic 
materials. 

There is a set of general principles in continuum 
mechanics that are used to simplify constitutive relations.
Perhaps the most powerful, and controversial,
of these is the principle of material frame indifference
(PMFI). We will not explain this principle in detail, but
we will note that the particular kinematics used in the
constitutive equations given here (notice the appearance
of D in the Navier-Stokes relation, rather than
simply ∇v, and the unexpected explicit dependence on
R on the right-hand side of the constitutive relation
(13) of elasticity) are a direct consequence of the PMFI.
Even Stokes, when deriving the Navier-Stokes equations
(building on work of Navier, Cauchy, Poisson, and
Saint-Venant), did not begin with (12). Rather, he began
with the hypothesis that the stress is affine in the full
velocity gradient ∇v rather than in its symmetrization
D. He then argued in words (using what can be recognized
as a verbal, and less precise, form of the PMFI
than is now accepted) that the stress should actually
be affine in D = ½(∇v + ∇v^T). The ongoing controversy
surrounding the PMFI is not so much concerned
with its usefulness in continuum mechanics, which is 
nearly universally accepted, but rather with the absence 
of a direct experimental test of its validity (in the gen- 
eral case) and its obscure relation with atomistic and 
relativistic theories. 

4 Phenomena 

Even when it is restricted to the two constitutive rela- 
tions given above, continuum mechanics explains a 
diverse collection of phenomena. It often seems to have 
applicability to materials on length or timescales that is 
quite unexpected based on our current understanding. 

It is hopeless to try to give a representative glimpse 
of phenomena predicted by continuum mechanics. I
will instead give two examples of phenomena that have
been predicted as a direct result of research in the fun- 
damentals. They are old examples, having emerged in 
the 1950s during the resurgence of interest in contin- 
uum mechanics that occurred in that period, but they 
retain their vibrancy today. 

Under ordinary conditions water is described well as 
an incompressible Navier-Stokes fluid. Take a cup of 
water and vertically insert a rotating rod (of, say, 1 cm 
in diameter) spinning at a modest rotation rate (a few 
revolutions per second, say). By making some symme- 
try assumptions and undertaking some modest sim- 
plifications, this problem can be solved. Even without 
symmetry assumptions this problem can be solved to a 
quantifiable level of accuracy by any of several numerical
methods that have been developed for the Navier-Stokes
equations. The answer is the expected one. As
the water near the rod rotates, it is thrown outward.
This causes the surface of the water to distort, resisted
by the force of gravity, which tends to make the surface
flat, and by viscosity, which tends to smooth the
velocity field. At steady state the water level is slightly
depressed near the spinning rod.

Figure 2 The rod climbing of a viscoelastic fluid.

Entirely different behavior is observed when the fluid 
is viscoelastic. This subset of fluids includes many that 
have long-chain polymer molecules in solution, from 
paints to pancake batter. When one does the same 
experiment described above with these fluids, the fluid 
in fact climbs up the rod, as first demonstrated by Karl 
Weissenberg during what must have been a memorable 
meeting of the British Rheologists’ Club in 1946; see fig- 
ure 2, which is a sketch based on one of Weissenberg’s 
early demonstrations. With a small cup of liquid and a 
strongly viscoelastic liquid, more than half the cup of 
liquid can climb up the rod after a short time, defying 
the force of gravity. 1 

1. Commenting on a draft of this article, Oliver Penrose asked, “Is
that why paints are so messy?”

There are many related examples; a solid cylinder
falling in a container of Navier-Stokes fluid will turn
so its axis becomes horizontal. In a viscoelastic fluid it
turns so its axis is vertical!

What force pushes the fluid up the rod, against gravity?
This question puzzled continuum mechanicians,
especially Markus Reiner, the founder of the science
of rheology, and Ronald Rivlin. Their first hypothesis
was quite natural; guided by the developing principles
of continuum mechanics, particularly the then-available
form of PMFI, they theorized that a natural
generalization of the Navier-Stokes fluid, 

$$T = -pI + \alpha_1 D + \alpha_2 D^2 \tag{15}$$

with α_1, α_2 functions of the principal invariants of the
stretching tensor D , would work for viscoelastic fluids. 
In fact, the formal similarity between (14) and (15) is no 
accident: the same mathematics is used, but the physi- 
cal principles are different. For the former it is isotropy; 
for the latter it is the PMFI. 

It cannot be overestimated how natural (15) is. Not 
only is it an obvious generalization of the Navier-Stokes 
fluid, by including nonlinear terms in D, but it is in fact 
the most general form of the relation T = f(D) that 
is compatible with PMFI. Its only deficiency is that it 
turned out to be wrong! The Reiner-Rivlin relation does 
not describe Weissenberg's observations well. It turned 
out that the stress at time t in a viscoelastic fluid is 
sensitive to its deformation at past times, longer ago 
than is captured by the first time derivative in D. The 
mathematical formulation of this idea was developed 
by many researchers, and there were many twists and 
turns along the way, including the observation by Pip- 
kin and Tanner in 1969 that the standard method for 
interpreting measurements of traction on the boundary 
of the fluid was flawed, polluted by just those forces 
that drive the viscoelastic fluid up the rod. The result 
was that lots of measurements before that point were 
incorrectly interpreted. Now, of course, there are better 
constitutive relations, and the description of the “normal
stresses” that drive the fluid up the rod is pretty
well understood, but the accurate description of the
behavior of viscoelastic fluids remains an active area 
of research today. 

Another simple but influential example concerns 
the constitutive equation of isotropic elasticity (14). 
Consider the following deformation in RCC: 


$$\begin{aligned}
y_1(x_1,x_2,x_3) &= x_1 + \kappa x_2, \\
y_2(x_1,x_2,x_3) &= x_2, \\
y_3(x_1,x_2,x_3) &= x_3.
\end{aligned} \tag{16}$$

Figure 3 Simple shear of a nonlinear elastic solid. The same
deformation is applied to two different reference configurations,
Ω and Ω′, both of which represent relaxed configurations
of a certain solid. The deformed configurations
have been translated so as to be easily visible. Typically,
T_11 < 0 and T_22 < 0 in real materials, in which case these
components of traction are compressive.

This deformation is known as simple shear, and it is
represented in figure 3 using two different reference
configurations. One can think of the shearing of a rubber
block Ω aligned with this RCC basis, as shown in
figure 3(a). The constant κ is called the amount of shear.
The components of the tensor B in this same RCC basis
and the principal invariants I, II, III are particular functions
of the amount of shear that are easily worked
out. The coefficients φ_1, φ_2, φ_3 are then functions
of κ whose form depends on the material. Using the
constitutive equation (14), we can calculate the stress:


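$$T = \begin{pmatrix} T_{11} & T_{12} & 0 \\ T_{21} & T_{22} & 0 \\ 0 & 0 & T_{33} \end{pmatrix}, \tag{17}$$

where

$$\begin{aligned}
T_{11} &= \varphi_1 + (1 + \kappa^2)\varphi_2 + (\kappa^2 + (1 + \kappa^2)^2)\varphi_3, \\
T_{12} = T_{21} &= \kappa\varphi_2 + (2\kappa + \kappa^3)\varphi_3, \\
T_{22} &= \varphi_1 + \varphi_2 + (1 + \kappa^2)\varphi_3, \\
T_{33} &= \varphi_1 + \varphi_2 + \varphi_3.
\end{aligned} \tag{18}$$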


The balance of linear momentum (10) is satisfied for 
the motion (16) because the velocity is zero and, since 
the stress is independent of position, div T = 0.

A little study of (18) reveals that the stress components
T_11, T_22, and T_12 are not independent. In fact, by
direct observation of (18),

$$T_{11} - T_{22} = \kappa T_{12}, \tag{19}$$

which is a universal relation discovered by Rivlin 
(1948). It is called universal because it holds for all 
isotropic elastic materials. It contains no material con- 
stants. It is not obvious. 
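
Rivlin's relation can be checked numerically by evaluating the isotropic constitutive equation T = φ_1 I + φ_2 B + φ_3 B^2 directly for simple shear. The sketch below is an illustration under assumptions made here: the coefficients φ_1, φ_2, φ_3 are passed in as plain numbers (their values at the given amount of shear), and the sample values and function names are hypothetical.

```python
def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def simple_shear_stress(kappa, phi1, phi2, phi3):
    """Stress T = phi1*I + phi2*B + phi3*B^2 with B = F F^T for simple shear."""
    F = [[1.0, kappa, 0.0],
         [0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0]]  # deformation gradient of y_1 = x_1 + kappa * x_2
    Ft = [[F[j][i] for j in range(3)] for i in range(3)]
    B = matmul(F, Ft)
    B2 = matmul(B, B)
    return [
        [(phi1 if i == j else 0.0) + phi2 * B[i][j] + phi3 * B2[i][j]
         for j in range(3)]
        for i in range(3)
    ]


def rivlin_residual(kappa, phi1, phi2, phi3):
    """T_11 - T_22 - kappa * T_12, which should vanish for any coefficients."""
    T = simple_shear_stress(kappa, phi1, phi2, phi3)
    return T[0][0] - T[1][1] - kappa * T[0][1]


# Two unrelated hypothetical "materials" at different amounts of shear.
residual_a = rivlin_residual(kappa=0.7, phi1=-1.3, phi2=0.4, phi3=0.05)
residual_b = rivlin_residual(kappa=2.5, phi1=3.0, phi2=-0.8, phi3=0.2)
```

Both residuals vanish up to rounding error, whatever coefficients are supplied, which is the sense in which the relation contains no material constants.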

It is illuminating to interpret the relation T_11 − T_22 =
κT_12. For this purpose we use figure 3(b) and focus on
the choice of reference configuration Ω′, arranged so
that the deformed configuration is a rectangular solid.
(The stress in the deformed configurations y(Ω) and
y(Ω′) is the same, but it is easier to understand this
stress by calculating tractions on y(Ω′).) The components
of traction (force per unit area) needed to hold the
rubber block in the shape y(Ω′) are as shown. These
are obtained from the general formula discussed above:
t = Tn, with n chosen to be (1, 0, 0) and (0, 1, 0). On
the top surface there is a shear component of traction
T_12, as expected. Perhaps not so obvious is the fact that
one generally needs a normal traction T_22. T_22 is usually
negative, and, at large shears, it can be quite significant. 
Shearing a rectangular block of rubber horizontally typ- 
ically causes it to expand in the vertical direction; only 
by applying an appropriate compressive traction does 
the height remain the same. These results would have 
been surprising at the time they were derived, since 
only the linear theory of elasticity was widely used at 
that time, and the linear theory predicts that T_22 = 0 for
an isotropic material. Rivlin's relation (19) says, quite 
unexpectedly, that the difference between the normal 
tractions on the right and top faces is determined by 
the shear traction and the amount of shear, and it 
is independent of the (isotropic elastic) material being 
sheared. It also shows that at least one of these normal 
stresses, T_11 or T_22, is quite large, of the order of the
shear stress T_12 at κ ≈ 1.

5 Current Research 

Continuum mechanics is entering a new period of activ- 
ity revolving around the main feature of matter that 
it suppresses: atoms. This development is due to the 
convergence of several factors. 

One is that, following the successes of viscoelastic 
fluids, nonlinear elastic materials, the theory of liquid 
crystals, and several other similarly important exam- 
ples, the discovery of broadly useful new constitutive 


relations has slowed. It is not that continuum mechan- 
ics has in any way lost its validity or applicability 
(the workhorse constitutive equations of continuum 
mechanics continue to find new and exciting applica- 
tions) but the underlying theory has been less sugges- 
tive of truly new directions. 

It is certainly true that materials science is produc- 
ing a dizzying array of new materials, with properties 
that are not even named in any treatment of contin- 
uum mechanics. Biology, too, is identifying new, often 
highly heterogeneous materials, the critical properties 
of which are not as yet described within continuum 
mechanics. Increasingly, what matters in these subjects 
is the presence of certain atoms, arranged on a lattice 
in a particular way, or a specific biological molecule. 
Alternatively, it could be a specific kind of defect in an 
otherwise regular structure that produces the interest- 
ing behavior. The biologist says, “I do not care so much 
how a generic lump of soft matter deforms; I want to 
know how matter containing this particular molecule, 
critical to life itself, behaves.” This attitude certainly 
turns the Navier-Stokes paradigm of “one Reynolds
number needed to predict all behavior in all motions” 
on its head! 

In 1929, following the spectacular discovery of quan- 
tum mechanics, Dirac wrote: 

The underlying physical laws necessary for the mathe- 
matical theory of a large part of physics and the whole 
of chemistry are thus completely known, and the dif- 
ficulty is only that the exact application of these laws 
leads to equations much too complicated to be soluble. 

Perhaps this was a bit optimistic, but it remains true 
today that the laws needed at atomic scale to entirely 
predict macroscopic behavior that should be under the 
purview of continuum mechanics are known, and it is 
essentially a mathematical problem to figure out what 
are the macroscopic implications of atomic theory, with 
all its wonderful specificity. This problem is called the 
multiscale problem. Today, it is widely theorized that 
the solution of this problem will involve the identifica- 
tion of a certain number of length or timescales, with 
separate theories on each scale and input to each theory 
coming from the output of the one at the next lowest 
scale. The author is skeptical. 

The mathematical difficulties are easy to explain with 
examples. Consider just a single atom of carbon, atomic 
number 6, held at absolute zero temperature. To com- 
pute its electronic structure with quantum mechanics, 
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one needs to solve an equation for the wave function 
ip((x\,si),...,(x6,ss)). Si = ± 2 (the spins), describ- 
ing the positions of the electrons probabilistically, and 
depending on 3 x 6 = 18 independent spatial variables. 
If we modestly discretize each such variable by ten 
grid points, we obtain a mesh with 10 18 grid points! Of 
course, a single atom of carbon is open to the methods 
of a simplified version of quantum mechanics called 
density functional theory (DFT) as well as other meth- 
ods, and DFT would be considered accurate in this case 
for some purposes. But DFT has its own problems. Its 
status as an approximation of quantum mechanics is 
not understood beyond heuristics. And even for twenty 
carbon atoms, disturbingly, DFT does not get the most 
favorable geometries correct. 
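
The size of the mesh described above is easy to put in perspective with a little arithmetic. The sketch below counts grid points for six electrons and estimates the storage for one wave function; the figure of 16 bytes per value (one double-precision complex number) is an assumption made here, and the spin variables are ignored.

```python
electrons = 6             # carbon, atomic number 6
spatial_dimensions = 3    # each electron position has three coordinates
points_per_variable = 10  # the "modest" discretization

independent_variables = spatial_dimensions * electrons        # 18
grid_points = points_per_variable ** independent_variables    # 10**18

bytes_per_value = 16      # assumption: one complex double per grid point
total_bytes = grid_points * bytes_per_value
exabytes = total_bytes / 10**18

# Adding one more electron multiplies the mesh by a further factor of 10**3.
grid_points_seven = points_per_variable ** (spatial_dimensions * (electrons + 1))
```

Under these assumptions a single stored wave function would occupy about 16 exabytes, and every additional electron multiplies the count by a thousand, which is why the direct approach fails so quickly.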

These observations may make it seem hopeless to 
try to pass from full quantum mechanics to some kind 
of useable version of quantum mechanics for many 
atoms, never mind passing from quantum mechanics to 
continuum mechanics. But there are other suggestions 
of deep underlying simplicities in quantum mechanics 
that are neither exploited nor understood. One example 
is the so-called electron-to-atom (e/a) ratio. The impor- 
tance of e/a was recognized long ago by Hume-Rothery, 
and it was exploited in a particularly effective way in 
magnetism by Slater and Pauling. For an alloy such as 
Cu_x Mn_y Sn_z consisting of x atoms of copper, y atoms
of manganese, and z atoms of tin, the e/a ratio is simply
the valences of Cu, Mn, and Sn weighted by their
atomic fractions:

$$\frac{e}{a} = \frac{x V_{\mathrm{Cu}} + y V_{\mathrm{Mn}} + z V_{\mathrm{Sn}}}{x + y + z},$$

where V_Cu, V_Mn, V_Sn are, respectively, the numbers of
valence electrons of Cu, Mn, and Sn. When macroscopic
properties of materials are plotted against e/a, there 
is often a remarkable collapse of data. Of course, e/a 
is just one of the parameters that enters a quantum 
mechanics calculation under the Born-Oppenheimer 
approximation. Under this assumption the inputs to 
the quantum mechanics calculation are the positions 
and atomic numbers of the nuclei. Correlation with e/a 
means that somehow the positions hardly matter! The 
e/a ratio is often most successful in cases where the 
underlying lattices are somewhat similar as the concen- 
trations x, y, and z are varied. And it must be admit- 
ted that the definition of valence itself is the result of 
a (single-atom) quantum mechanics calculation. Never- 
theless, changing x, y, and z is a change of order 1: 
some neighbors of some atoms change from one ele- 


ment to another. And the correlation often persists 
if new elements are introduced and the chemical for- 
mula gets very long. The properties that correlate with 
e/a are some of the most difficult positive-temperature 
properties to predict, like magnetization, or free energy 
difference between two phases. As a modern exam- 
ple, the Heusler family of alloys is currently perhaps 
the most fertile area in materials science for discovery 
of new alloys with applications in diverse areas, such 
as microelectronics (especially “spintronics”), infor- 
mation storage, biomedicine, actuation, refrigeration, 
energy conversion, and energy storage. Most of the dis- 
covery of new alloys for these applications is guided 
simply by e/a; it is the main theoretical tool. Why is 
e/a so important? What is the e/a dependence of a 
constitutive equation of continuum mechanics? 
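The e/a formula above is a weighted average and can be sketched in a few lines of Python. The function name and the valence table below are illustrative assumptions, not taken from the text: valence-counting conventions differ between authors, and the values used here (outer s and d electrons for Ni and Mn, s and p electrons for Ga) are one common choice for Heusler alloys.

```python
def electron_to_atom_ratio(composition, valence):
    """Return e/a: valence electrons per atom, weighted by atomic fractions.

    composition maps element -> number of atoms in the formula unit,
    valence maps element -> assumed number of valence electrons.
    """
    total_atoms = sum(composition.values())
    if total_atoms <= 0:
        raise ValueError("composition must contain a positive number of atoms")
    total_electrons = sum(n * valence[el] for el, n in composition.items())
    return total_electrons / total_atoms


# Assumed valence counts (a convention, not data from the text).
ASSUMED_VALENCE = {"Ni": 10, "Mn": 7, "Ga": 3}

# Hypothetical example: the Heusler composition Ni2MnGa.
ni2mnga = {"Ni": 2, "Mn": 1, "Ga": 1}
print(electron_to_atom_ratio(ni2mnga, ASSUMED_VALENCE))  # (20 + 7 + 3) / 4
```

Only the composition and the valences enter; atomic positions appear nowhere, which is exactly the feature of e/a correlations that the text finds puzzling.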

A fundamental conceptual problem for multiscale 
methods concerns time dependence. A standard ap- 
proach at atomic level is to use the equations of molec- 
ular dynamics, based on Newton’s laws of motion for 
the nuclei, at positions $y_1(t), \dots, y_n(t)$: that is,

$$m_i \ddot{y}_i = f_i(y_1, \dots, y_n), \qquad y_i(0) = y_i^0, \quad \dot{y}_i(0) = v_i^0, \quad i = 1, \dots, n. \tag{20}$$

The constants $m_1, m_2, \dots, m_n$ are the corresponding
masses. The force $f_i$ on nucleus $i$ depends on the positions
of all the other nuclei, as reflected by the notation.
This force could be given by quantum mechanics
for all the electrons as described above, parametrized 
using the instantaneous nuclear positions. This is again 
the Born-Oppenheimer approximation. Of course, we 
would then have to do the very difficult quantum 
mechanical calculations described above at each time 
step, so in this case it would be essential to find sim- 
plifications. One such simplification would be to find 
accurate but simpler models of atomic forces. In any 
case, molecular dynamics is considered a rather gen- 
eral framework underlying continuum theories of many 
materials. 

A fundamental dilemma is that, regardless of the 
atomic forces, the equations of molecular dynamics are 
time reversible, while every accepted sufficiently gen- 
eral model in continuum mechanics is time irreversible. 
That is, if we define $\tilde{y}_i(t) = y_i(-t)$ and change the sign
of $v_i^0$, we see that $\tilde{y}_1(t), \dots, \tilde{y}_n(t)$ solves (20). If we do
the analogous change in, say, the Navier-Stokes equations,
i.e., if we begin with a solution $v(y,t)$, $p(y,t)$
and define $\tilde{v}(y,t) = -v(y,-t)$, $\tilde{p}(y,t) = p(y,-t)$,
we see that $\tilde{v}(y,t)$, $\tilde{p}(y,t)$ satisfies the Navier-Stokes
equations if and only if the term multiplying the viscos- 
ity is zero. That is, the form of solution is unaffected by 
the viscosity of the material, a degenerate case indeed. 
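The time reversibility of (20) can be observed numerically. The sketch below is only an illustration under stated assumptions: the forces are hypothetical linear springs between neighbors in a one-dimensional chain (not a model from the text), and the integrator is velocity Verlet, chosen because it shares the time-reversal symmetry of the continuous equations. Integrating forward, flipping the sign of the velocities, and integrating again returns the chain to its initial state up to rounding error.

```python
import numpy as np


def forces(y, k=1.0):
    """Hypothetical pair forces: springs of rest length 1 between chain neighbors."""
    f = np.zeros_like(y)
    stretch = np.diff(y) - 1.0   # deviation of each bond from its rest length
    f[:-1] += k * stretch        # a stretched bond pulls atom i toward atom i+1
    f[1:] -= k * stretch         # and atom i+1 back toward atom i
    return f


def velocity_verlet(y, v, m, dt, steps):
    """Integrate m_i y_i'' = f_i(y) with the (time-symmetric) velocity Verlet scheme."""
    y, v = y.copy(), v.copy()
    f = forces(y)
    for _ in range(steps):
        v += 0.5 * dt * f / m
        y += dt * v
        f = forces(y)
        v += 0.5 * dt * f / m
    return y, v


rng = np.random.default_rng(0)
n = 8
m = np.ones(n)
y0 = np.arange(n, dtype=float) + 0.1 * rng.standard_normal(n)
v0 = 0.1 * rng.standard_normal(n)

y1, v1 = velocity_verlet(y0, v0, m, dt=0.01, steps=2000)    # run forward
y2, v2 = velocity_verlet(y1, -v1, m, dt=0.01, steps=2000)   # flip velocities, run again

position_error = np.max(np.abs(y2 - y0))   # should be at rounding level
velocity_error = np.max(np.abs(v2 + v0))   # final velocities are the negated initial ones
print(position_error, velocity_error)
```

No analogous experiment works for a viscous fluid solver: reversing the velocity field does not retrace the flow unless the viscous term vanishes.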

There is no widely accepted solution to this dilemma, 
though many ideas have been suggested. From a math- 
ematical viewpoint, it is certainly difficult to imagine 
how any kind of rigorous averaging of the equations 
of molecular dynamics would somehow deliver a time- 
irreversible equation from a time-reversible one. And 
this is not just a quirk of molecular dynamics: every 
fundamental atomic-level equation of physics is dis- 
sipation free and time reversible. A suggestion made 
by Boltzmann that might in fact be consistent with a 
mathematical treatment has been elaborated in a recent 
article by Oliver Penrose. He says that flows seen in 
nature do not correspond to solutions of the general 
initial-value problem (20). Rather, any real flow corre- 
sponds to special “prepared” initial data of the equa- 
tions of molecular dynamics. Who does this prepara- 
tion in nature? Penrose suggests that merely running a 
dynamical system for some time could do this kind of 
preparation. (Admittedly, time does not begin at $t = 0$
in (20).) He discusses a method of weighted averag- 
ing, involving the initial conditions, that does intro- 
duce irreversibility. Penrose’s suggestion is appealing 
and will probably resonate with anyone who has done 
molecular dynamics simulations. With almost any way 
of choosing initial data short of running it through the 
dynamical system, solutions of the equations of molec- 
ular dynamics invariably begin with a transient that 
would be regarded as unphysical from a macroscopic 
point of view. 

Balancing these fundamental difficulties, there seems 
to be the possibility of tremendous simplification in 
some cases. Let us illustrate how easy it is to con- 
nect molecular dynamics with continuum theory by 
presenting a very simple way to average the equa- 
tions (20) of molecular dynamics. We use a method 
of R. J. Hardy, which has been recently analyzed (and 
extended), together with related approaches by Admal 
and Tadmor. Let $\varphi : \mathbb{R}^3 \to \mathbb{R}$ be a simple averaging function;
$\varphi$ is smooth, nonnegative, has compact support
containing the origin, and has total integral equal to 1.
Consider a solution $y_1(t), \dots, y_n(t)$, $t > 0$, of the equations
of molecular dynamics (20). Recenter $\varphi$ on each
instantaneous atomic position, multiply the equations
of molecular dynamics by $\varphi$, and sum over the atoms:

$$\sum_{i=1}^{n} m_i \ddot{y}_i(t)\,\varphi(y - y_i(t)) = \sum_{i=1}^{n} f_i(y_1(t), \dots, y_n(t))\,\varphi(y - y_i(t)).$$

With this kind of spatial averaging it is natural to define
the density as $\rho(y,t) = \sum_{i=1}^{n} m_i \varphi(y - y_i(t)) \ge 0$. We can
also define the linear momentum $p(y,t)$ by averaging
the linear momenta of the particles:

$$p(y,t) = \sum_{i=1}^{n} m_i \dot{y}_i(t)\,\varphi(y - y_i(t)). \tag{21}$$

The Eulerian velocity is then defined wherever $\rho > 0$ by
$v(y,t) = p(y,t)/\rho(y,t)$. Notice that we have already
avoided the tricky problem of averaging a product by 
simply defining it away. That is, for the purposes of 
averaging, the fundamental quantity is the momentum. 
If, instead of the above, we had defined the density and 
velocity first, in the obvious ways, then we would have 
had the nasty problem of trying to express the aver- 
age of a product in terms of the product of averages in 
order to get the momentum. Based on this method of 
averaging, the velocity of continuum mechanics is not 
the average velocity of the particles (as is suggested 
above and in most continuum mechanics books) but 
rather it is the average momentum divided by the aver- 
age density, which can be something quite different! 
On the other hand, we do get a balance of mass exactly 
as in continuum mechanics for free because by these 
definitions 

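$$\frac{\partial \rho}{\partial t} = \frac{\partial}{\partial t} \sum_{i=1}^{n} m_i \varphi(y - y_i(t)) = -\sum_{i=1}^{n} m_i \nabla\varphi(y - y_i(t)) \cdot \dot{y}_i(t) = -\operatorname{div} p = -\operatorname{div}(\rho v).$$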
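The averaged density, momentum, and mass balance can be checked numerically. The following is a minimal sketch under several assumptions that are not in the text: it works in one space dimension rather than $\mathbb{R}^3$, it uses the standard bump function as the averaging kernel, and the particles move at constant (hypothetical) velocities, which is enough because the mass balance does not involve the forces. Time and space derivatives are approximated by central differences.

```python
import numpy as np


def bump(r, h=2.0):
    """Smooth, nonnegative kernel supported on |r| < h (not yet normalized)."""
    out = np.zeros_like(r, dtype=float)
    inside = np.abs(r) < h
    out[inside] = np.exp(-1.0 / (1.0 - (r[inside] / h) ** 2))
    return out


def averaged_fields(grid, y, ydot, m, h=2.0):
    """Density rho and momentum p on the grid, following the definitions in the text."""
    dy = grid[1] - grid[0]
    norm = np.sum(bump(grid - grid.mean(), h)) * dy      # makes the kernel integrate to 1
    weights = bump(grid[:, None] - y[None, :], h) / norm  # phi(y - y_i) for all grid points
    rho = weights @ m
    p = weights @ (m * ydot)
    return rho, p


# Hypothetical particle data: masses, initial positions, constant velocities.
grid = np.linspace(-6.0, 6.0, 2401)
m = np.array([1.0, 2.0, 1.5, 1.0, 3.0])
y0 = np.array([-1.5, -0.4, 0.1, 0.9, 1.6])
ydot = np.array([0.5, -0.3, 0.2, -0.6, 0.1])


def positions(t):
    return y0 + ydot * t


eps = 1e-4
rho_plus, _ = averaged_fields(grid, positions(eps), ydot, m)
rho_minus, _ = averaged_fields(grid, positions(-eps), ydot, m)
rho, p = averaged_fields(grid, positions(0.0), ydot, m)

drho_dt = (rho_plus - rho_minus) / (2 * eps)   # central difference in time
div_p = np.gradient(p, grid)                   # central difference in space
residual = np.max(np.abs(drho_dt + div_p))     # mass balance: drho/dt + div p = 0

# Eulerian velocity: averaged momentum over averaged density, where rho > 0.
v = np.where(rho > 0, p / np.where(rho > 0, rho, 1.0), 0.0)
print(residual, np.max(np.abs(drho_dt)))
```

The residual is small compared with the size of $\partial\rho/\partial t$ and shrinks as the grid is refined, consistent with the identity holding exactly for the continuous fields.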

Continuing with our averaging, we bring the time 
derivative out of the right-hand side of (21) and intro- 
duce the definitions of velocity and density. We then 
get 

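$$\frac{\partial}{\partial t}(\rho v) + \operatorname{div}_y \sum_{i=1}^{n} m_i \bigl(\dot{y}_i(t) \otimes \dot{y}_i(t)\bigr)\,\varphi(y - y_i(t)) = \sum_{i=1}^{n} f_i\,\varphi(y - y_i(t)).$$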

We rewrite the second term by replacing $\dot{y}_i$ by $\dot{y}_i - v$
(which, incidentally, makes it insensitive to a Galilean
transformation), and then we compensate for this
insertion. After a few elementary manipulations we
obtain

$$\rho\left(\frac{\partial v}{\partial t} + (\nabla v)\,v\right) = \sum_{i=1}^{n} f_i\,\varphi - \operatorname{div} \sum_{i=1}^{n} m_i (\dot{y}_i - v) \otimes (\dot{y}_i - v)\,\varphi. \tag{22}$$

This takes exactly the same form as the balance of 
linear momentum (10) given above. 

It is of course compelling to define div T to be equal 
to the right-hand side of (22) and to solve for T to get an 
atomistic definition of stress. But specification of the 
divergence of a tensor is a rather weak restriction on 
the tensor; in RCC, we can add the curl of a vector field 
to each row of T, and the only restriction on these three 
vector fields comes from the symmetry of the stress. 
Various authors have implicitly made different choices 
of these curls, but there is no general agreement on 
which, if any, of the corresponding stresses ought to be 
the stress of continuum mechanics. And, of course, this 
may not be the best way to average. Averaging molecu- 
lar dynamics to get anything that vaguely resembles a 
constitutive equation is much less clear. 

There are additional hopeful directions of research. 
First, there is general mathematical experience with 
asymptotics. Many examples show that when doing 
asymptotics it is only certain quantities in the under- 
lying theory that actually affect the asymptotic result. 
Identification of these quantities can be the beginning 
of a solution of the multiscale problem or the start 
of a new branch of continuum mechanics. The circle 
of ideas surrounding the Cauchy-Born rule, the quasi- 
continuum method of Tadmor, Ortiz, and Phillips, and 
the asymptotic methods of Blanc, Cances, Le Bris, and 
Lions are successes in this direction but mainly so 
far only in the static case. In light of these develop- 
ments, one can almost imagine a finite-element method 
in which the subroutine that appeals to the constitutive 
relation is replaced by an efficient atomistic calculation. 

Probabilistic approaches are also promising. These 
recognize that the equations of molecular dynamics 
are sufficiently irregular that they might be amenable 
to a probabilistic treatment, as successfully under- 
taken by statistical mechanics in the case of macro- 
scopic equilibrium. Probability theory consists of an 
arsenal of highly developed techniques once a proba- 
bility measure is found, but it does not say much about 
where to get the probability measure in the first place. 
As a starting point, perhaps it is time to revisit the
kinetic theory of gases, the only truly nonequilibrium
statistical mechanics we have.

IV.27 Pattern Formation

Arnd Scheel 


1 Introduction 

Patterns in nature fascinate observers and challenge 
scientists. We are particularly intrigued when simple 
systems generate complex patterns or when simple, 
highly organized patterns emerge in complex systems. 
Similarities in patterns across many fields suggest 
underlying mechanisms that dictate universal rules for 
pattern formation. We notice stripe and spot patterns 
on animal coats, but also in convection patterns. Rotat- 
ing spiral waves organize collective behavior in bacte- 
rial motion, in chemical reaction, and on heart muscle 
tissue. Beyond simple observation, regular patterns are 
created in experiments, and a tremendous amount of 
theory has helped to predict phenomena. In this article 
we will discuss some of those phenomena and stress 
universality across the sciences. We focus on dissipa- 
tive, or damped driven, systems, where a free energy 
is dissipated yet complex spatiotemporal behavior is 
sustained far from thermodynamic equilibrium. Spe- 
cific applications arise in, but are not limited to, biol- 
ogy, chemistry, the social sciences, fluids, optics, and 
material science. 

Historically, much of the research on pattern for- 
mation was motivated by fluid experiments, such as 
those on Rayleigh-Benard convection. When a station- 
ary fluid is heated from below, heat conduction is 
replaced by convective heat transport above a certain
critical temperature gradient. Convective transport can
occur through convection cells, often arranged in typ- 
ical hexagonal arrays, or through convection rolls that 
form stripe patterns. Similar dichotomies were known 
from animal coats, between zebras and leopards, or, 
quite strikingly, in mutations of zebrafish (see figure 1). 

Turing noticed in 1952 that the interplay of very simple
mechanisms, diffusion and chemical reaction, can
be responsible for the emergence of patterns. His pre- 
dictions were experimentally realized only in 1990 by 
de Kepper and collaborators, who produced spot- and 
stripe-like patterns similar to figure 1 in an open-flow 
chemical reactor. Time-periodic rather than such spa- 
tially periodic behavior was observed much earlier, by 
Belousov, in the 1950s, and then rediscovered and pop- 
ularized later by Zhabotinsky. They observed sustained 
temporal oscillations in a chemical reaction. In addi- 
tion, these temporal oscillations in the reaction could 
cause complex spatial patterns. Yet much earlier than 
this, in 1896, Liesegang observed regular ring patterns 
when studying how electrolytes diffuse and precipitate 
in a gel. It is worth mentioning that Liesegang’s pat- 
terns and their characteristic scaling laws are poorly 
understood even today, particularly when compared 
with Belousov’s reaction or Turing-pattern formation. 

In this discussion we have avoided the most basic 
question: what is a pattern? In a system that is invariant
under translations in time and space, we naturally
expect unpatterned solutions, that is, solutions that 
do not depend on time or space. We refer to such 
solutions as spatiotemporally uniform states. In fact, 
such uniform states “should be” the thermodynamic 
equilibrium and hence should be observed in exper- 
iments after initial temporal transients. A pattern is 
the opposite: a solution that is not spatiotemporally 
uniform. Narrowing this definition of a pattern fur- 
ther, one might also want to rule out solutions that 
are nonconstant in space and time only for an initial 
temporal transient, or near boundaries of the spatial 
domain. Note that in this characterization we started 
with a system that does not depend explicitly on time or 
space. This excludes patterns forced by external influ- 
ences, such as masking or printing textures into sub- 
strates, and narrows our view to what one might refer 
to as self-organized patterns. Basic examples of spa- 
tial, temporal, and spatiotemporal states are shown in 
figure 3. 

Explanations of patterns therefore often start with a 
partial differential equation 

$$\partial_t u = F(\partial_x^m u, \dots, \partial_x u, u),$$

for the dependent variables $u \in \mathbb{R}^N$, on an idealized
unbounded, translation-invariant, spatial domain
$x \in \mathbb{R}^n$, $n = 1, 2, 3$. Since $F$ does not depend on time $t$ or
space $x$ explicitly, such systems typically support spatiotemporally
uniform states $u(t,x) \equiv \bar{u} \in \mathbb{R}^N$ for all
times $t \in \mathbb{R}$ and in the entire spatial domain $x \in \mathbb{R}^n$,
satisfying $F(0, \dots, 0, \bar{u}) = 0$. They may, however, also
accommodate solutions that depend on $x$ and $t$, even
in the limit as $t \to \infty$.

Figure 1 Four different pigment patterns for the homozygous
zebrafish corresponding to different alleles of the
leopard gene.

Prototypical examples are reaction-diffusion sys- 
tems, 

$$\partial_t u = D\Delta u + f(u),$$

where $\Delta$ is the Laplacian, $D$ is a positive diffusion matrix,
$D + D^T > 0$, and $f(u)$ denotes the reaction kinetics. They
have been extensively studied as a prototype of pattern 
formation, motivated largely by Turing’s observation 
that simple reaction and diffusion may explain many of 
the complex chemical and biological patterns that we 
see, possibly even patterns such as those in figure 1. 
Somewhat simpler (because it is scalar) is the Swift- 
Hohenberg equation, 

$$\partial_t u = -(\Delta + 1)^2 u + \mu u - u^3,$$

mimicking instabilities in Rayleigh-Benard convection
or Turing-pattern formation. When considered on $x \in \mathbb{R}^n$,
there always exists a trivial translation-invariant
solution $u(t,x) \equiv 0$. For $\mu < 0$, most initial conditions
$u(0,x)$ converge to 0, while for $\mu > 0$ there are stationary,
spatially periodic solutions $u(t,x) \sim \sqrt{\mu}\cos(k x_1)$
with wave number $k \sim 1$.

A major theme of research in pattern formation is 
to describe, in this example and more generally, the 
longtime behavior of solutions based on simple coher- 
ent building blocks, such as spatially periodic patterns, 
defects, or fronts. 

2 Linear Predictions 

A much-studied scenario for pattern formation lets the 
spatiotemporally uniform state $\bar{u}$ destabilize while a
parameter $\mu$ is increased. We illustrate this scenario in
the Swift-Hohenberg equation, where we can analyze
the linearization at $u \equiv 0$, $\partial_t u = Lu = [-(\Delta + 1)^2 + \mu]u$,
using the Laplace-Fourier transform: solutions of the
form $e^{\lambda t - i k \cdot x}$ exist when

$$d(\lambda, ik) = -(1 - |k|^2)^2 + \mu - \lambda = 0,$$

an equation usually referred to as the dispersion relation.

Figure 2 The dispersion relation for
Turing instabilities (Swift-Hohenberg).

Solving for $\lambda = \lambda(k)$, one finds $\lambda \in \mathbb{R}$ and the
following regimes (see also figure 2):

$\mu < 0$, $\lambda(k) < 0$ for all $k$ (stable),

$\mu = 0$, $\lambda(k) = 0$ for $|k| = 1$ (critical),

$\mu > 0$, $\lambda(k) > 0$ for $|k| \sim 1$ (unstable).

Now consider evolving an initial condition consisting of
a superposition of exponentials $\int_{\mathbb{R}^2} \hat{u}(k) e^{i k \cdot x}\,dk$ under
the linear equation $\partial_t u = Lu$ for $\mu = 0$. Given the exponential
decay $e^{\lambda t}$, $\lambda < 0$, for wave numbers $|k| \neq 1$,
we expect that the solution is dominated by wave numbers
$|k| = 1$ for large $t$, $u(t,x) \sim \int_{|k|=1} \hat{u}(k) e^{i k \cdot x}\,dk$. For
$\mu > 0$ one still expects such wave numbers to dominate
the solution given the relatively faster growth. In fact, one
often postulates that the linearly fastest-growing modes
($|k| = 1$ for Swift-Hohenberg) will also be dominantly
observed in nonlinear systems.
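The three regimes can be read off directly from the dispersion relation $\lambda(k) = -(1 - |k|^2)^2 + \mu$. The short sketch below evaluates it and returns the band of unstable wave numbers, which follows from solving $(1 - |k|^2)^2 < \mu$; the function names are illustrative choices, not notation from the text.

```python
import numpy as np


def growth_rate(k, mu):
    """Linear growth rate lambda(k) of the mode with wave number |k| = k."""
    return -(1.0 - k**2) ** 2 + mu


def unstable_band(mu):
    """Interval (k_min, k_max) of wave numbers with lambda(k) > 0, or None if mu <= 0."""
    if mu <= 0:
        return None
    k_min = np.sqrt(max(1.0 - np.sqrt(mu), 0.0))
    k_max = np.sqrt(1.0 + np.sqrt(mu))
    return k_min, k_max


k = np.linspace(0.0, 2.0, 2001)
for mu in (-0.1, 0.0, 0.04):
    lam = growth_rate(k, mu)
    print(mu, lam.max(), k[np.argmax(lam)], unstable_band(mu))
```

For every $\mu$ the maximum of $\lambda$ sits at $|k| = 1$ and equals $\mu$, so the fastest-growing mode does not move as the parameter crosses zero; only the width of the unstable band changes, growing like $\sqrt{\mu}$.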

Beyond the Swift-Hohenberg equation we are interested
in dissipative systems, where most modes decay,
$\operatorname{Re}\lambda(k) < 0$ for $|k|$ large. We also focus on isotropic
systems, where $u(t,x)$ is a solution precisely when
$u(t, g \cdot x)$ is a solution for any $g \in E(n)$, the Euclidean
group of translations, rotations, and reflections in $\mathbb{R}^n$.
At criticality, we typically expect $\operatorname{Re}\lambda(k) = 0$ at $|k| = k_*$
for some unique $k_*$ and $\lambda(k) = i\omega_*$. One can classify
such instabilities by focusing on a simple Fourier mode,
so that at criticality, $u(t,x) \sim e^{i(\omega_* t - k_* x_1)}$ (see figure 3
for the resulting basic patterns).



Figure 3 Linear patterns with (a) $\omega_* = 0$, $k_* \neq 0$ (spatial
pattern), (b) $\omega_* \neq 0$, $k_* = 0$ (temporal pattern), and
(c) $\omega_*, k_* \neq 0$ (spatiotemporally periodic).


Linear predictions are, however, notoriously ambiguous.
At $\mu = 0$ there is a plethora of bounded solutions
formed by arbitrary superposition of critical modes.
For instance, when $\omega_*, k_* \neq 0$, we find traveling waves
$e^{i(\omega t - k \cdot x)}$ and standing waves $e^{i(\omega t - k \cdot x)} + e^{i(\omega t + k \cdot x)}$. In
the two-dimensional Turing case, summing modes $k_j$
on an equilateral triangle gives hexagonal patterns, and
choosing four $k_j$s on a square gives squares. Averaging
uniformly over all critical modes $|k| = k_*$ gives
Bessel functions, reminiscent of target patterns with
maxima on concentric circles with radii $r_j \sim j/k_*$ (see
figure 5(c)). For $\mu > 0$ there is also ambiguity in the
selected wave number. For instance, in the case $k_* = 0$,
$\omega_* \neq 0$, we may find wave trains and standing waves
of long wavelength $|k| \neq 0$ in the linear prediction.
When we try to determine which patterns and wave
numbers will actually be observed for most initial conditions,
we therefore need to take nonlinearities into
consideration.
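The ambiguity is easy to see by superposing critical modes directly. The sketch below, with illustrative function names, sums real modes $\cos(k_j \cdot x)$ over wave vectors on the critical circle: three vectors on an equilateral triangle give a pattern that is invariant under rotation by $60^\circ$ (hexagons), while two orthogonal vectors, which together with their negatives form a square, give a pattern invariant under rotation by $90^\circ$.

```python
import numpy as np


def critical_superposition(x, angles, k_star=1.0):
    """Sum of cos(k_j . x) over k_j = k_star * (cos a, sin a) for the given angles a."""
    x = np.asarray(x, dtype=float)
    u = np.zeros(x.shape[:-1])
    for a in angles:
        kj = k_star * np.array([np.cos(a), np.sin(a)])
        u += np.cos(x @ kj)
    return u


def rotate(x, theta):
    """Rotate the points x (shape (..., 2)) by the angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s], [s, c]])
    return x @ rotation.T


HEX_ANGLES = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]   # equilateral triangle of wave vectors
SQUARE_ANGLES = [0.0, np.pi / 2]                   # cos also covers -k_j: four k_j on a square

rng = np.random.default_rng(1)
points = rng.uniform(-20.0, 20.0, size=(200, 2))

hex_pattern = critical_superposition(points, HEX_ANGLES)
hex_rotated = critical_superposition(rotate(points, np.pi / 3), HEX_ANGLES)
square_pattern = critical_superposition(points, SQUARE_ANGLES)
square_rotated = critical_superposition(rotate(points, np.pi / 2), SQUARE_ANGLES)
print(np.max(np.abs(hex_pattern - hex_rotated)),
      np.max(np.abs(square_pattern - square_rotated)))
```

Both superpositions solve the same linearized equation at criticality, so the linear theory alone cannot say which of them a given system selects.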


3 Symmetry 

Given the ambiguity in the linear predictions, it is quite 
surprising to find how many systems settle into simple 
spatiotemporally periodic states, involving but a few 
of the critical linearly unstable wave vectors k. Without 
explaining why spatial periodicity is favored, one can 
analyze systems that are invariant under the Euclid- 
ean group by a priori restricting to spatially periodic 
functions. In other words, we may restrict to functions 
$u(x) = u(x + p_j)$, where the vectors $p_j$ generate a
lattice in $\mathbb{R}^n$. Going back to the striking dichotomy
between spots and stripes in figure 1, consider the
hexagonal lattice of width $L > 0$ generated by $p_1 =
L(1, 0)^T$ and $p_2 = L(\sqrt{3}/2, 1/2)^T$, which allows for both
hexagons and stripes. The linear analysis of, say, the
Swift-Hohenberg equation then needs to be restricted
to wave vectors in the dual lattice, $k \cdot p_j = 2\pi$. For
$k_* \sim 1$, the choice $L = 2\pi/k_*$ allows for six critical
($\operatorname{Re}\lambda = 0$) wave vectors.

The dynamics of systems near equilibria with finitely 
many critical eigenvalues in the linearization can in fact 
be reduced to a set of ordinary differential equations 
(ODEs). Such a reduction is exact in the vicinity of the 
trivial equilibrium, where all solutions converge to a 
finite-dimensional center manifold or escape the small
neighborhood. Both the center manifold and the vec- 
tor field on the center manifold can be computed to 
any order in the amplitude of the solutions and the 
parameter $\mu$. Typical equations on the center manifold
are

$$\begin{aligned}
\dot{A}_1 &= \mu A_1 + \delta \bar{A}_2 \bar{A}_3 - A_1\bigl(|A_1|^2 + \kappa(|A_2|^2 + |A_3|^2)\bigr) + \cdots,\\
\dot{A}_2 &= \mu A_2 + \delta \bar{A}_3 \bar{A}_1 - A_2\bigl(|A_2|^2 + \kappa(|A_3|^2 + |A_1|^2)\bigr) + \cdots,\\
\dot{A}_3 &= \mu A_3 + \delta \bar{A}_1 \bar{A}_2 - A_3\bigl(|A_3|^2 + \kappa(|A_1|^2 + |A_2|^2)\bigr) + \cdots,
\end{aligned}$$

with real parameters $\kappa$, $\delta$. The invariance under complex
rotation $A_1 \mapsto e^{i\tau} A_1$, reflections $A_1 \mapsto \bar{A}_1$, $A_2 \mapsto
\bar{A}_3$, $A_3 \mapsto \bar{A}_2$, and cyclic permutation $A_j \mapsto A_{j+1}$
is enforced by the Euclidean symmetry of the full system.
More precisely, translations and reflection in $x_1$
as well as rotations by $2\pi/6$ leave the lattice invariant;
they generate the isotropy group of the lattice
$T \rtimes D_6$. The action of this group on the critical wave vectors
gives precisely the invariances of complex rotation,
reflection, and permutation.

We always find nontrivial equilibria with maximal
isotropy, that is, roughly speaking, equilibria that are
invariant under a symmetry group that is a maximal
subgroup of $T \rtimes D_6$. In this case, these are stripes,
$A_2 = A_3 = 0$, $A_1 \in \mathbb{R}$, or hexagons, $A_1 = A_2 = A_3 \in \mathbb{R}$. Within
the reduced differential equation one can study the stability
of these equilibria and predict whether stripes
or hexagons should be observed for parameter values
near an instability. Depending on the parameters $\kappa$, $\delta$,
one calculates equilibria and determines their stability
within this reduced ODE. The fact that, typically,
either hexagon equilibria or stripe equilibria will be stable
reflects the ubiquity of stripe and spot
patterns, as shown in figure 1.
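The stripe and hexagon equilibria can be written down explicitly once the Landau equations are truncated at cubic order. The sketch below makes two assumptions beyond the text: the higher-order terms are dropped, and the quadratic term is taken to couple the complex conjugates of the other two amplitudes (the standard form for wave vectors summing to zero; on the real subspace used here the conjugation plays no role). Function names are illustrative.

```python
import numpy as np


def landau_rhs(A, mu, delta, kappa):
    """Right-hand side of the cubic truncation for A = (A1, A2, A3), cyclic in the index."""
    A = np.asarray(A, dtype=complex)
    out = np.empty(3, dtype=complex)
    for j in range(3):
        a, b, c = A[j], A[(j + 1) % 3], A[(j + 2) % 3]
        out[j] = (mu * a + delta * np.conj(b) * np.conj(c)
                  - a * (abs(a) ** 2 + kappa * (abs(b) ** 2 + abs(c) ** 2)))
    return out


def stripes(mu):
    """Stripe equilibrium A2 = A3 = 0, A1 = sqrt(mu), for mu > 0."""
    return np.array([np.sqrt(mu), 0.0, 0.0])


def hexagons(mu, delta, kappa):
    """Real roots A of mu + delta*A - (1 + 2*kappa)*A**2 = 0, with A1 = A2 = A3 = A."""
    disc = delta**2 + 4 * (1 + 2 * kappa) * mu
    if disc < 0:
        return []
    return [(delta + sign * np.sqrt(disc)) / (2 * (1 + 2 * kappa)) for sign in (1, -1)]


mu, delta, kappa = 0.1, 0.5, 2.0   # hypothetical parameter values
stripe_residual = np.max(np.abs(landau_rhs(stripes(mu), mu, delta, kappa)))
hexagon_residuals = [np.max(np.abs(landau_rhs([a, a, a], mu, delta, kappa)))
                     for a in hexagons(mu, delta, kappa)]
print(stripe_residual, hexagon_residuals)
```

Deciding which of these equilibria is stable requires the eigenvalues of the Jacobian of `landau_rhs` at each of them, which is the calculation the text refers to.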

As a second example, consider spatiotemporal instabilities,
$\omega_*, k_* \neq 0$, $x \in \mathbb{R}$. Restriction to periodic
functions yields reduced coupled-amplitude equations
for left- and right-traveling waves,

$$\begin{aligned}
\dot{A}_+ &= (\mu + i\omega_*)A_+ + A_+\bigl(|A_+|^2 + \kappa|A_-|^2\bigr) + \cdots,\\
\dot{A}_- &= (\mu + i\omega_*)A_- + A_-\bigl(|A_-|^2 + \kappa|A_+|^2\bigr) + \cdots,
\end{aligned}$$

with complex parameter $\kappa$. Spatial translations act as
complex rotations $A_\pm \mapsto e^{\pm i\tau} A_\pm$, reflections act as
$A_\pm \mapsto A_\mp$. Maximal isotropy roughly corresponds to
traveling waves $A_- = 0$, $A_+ \sim e^{i\omega t}$, or standing waves
$A_+ = A_-$. Again, reduction combined with a bifurcation
and symmetry analysis gives universal predictions for
the competition between standing and traveling waves.

The reduced equations on the center manifold are 
often referred to as Landau equations, which describe 
dynamics of dominant modes. Fourier modes that are 
compatible with the lattice but decay exponentially for 
the linearized problem can be shown to follow the tem- 
poral evolution of neutral modes precisely. More pre- 
cisely, all small solutions shadow solutions on the cen- 
ter manifold with exponentially decaying error. In geo- 
metric terms, this follows from the fact that the phase 
space is foliated by a strong stable fibration of the 
center manifold. 

While the approach outlined here can discriminate 
between systems that favor hexagons over stripes, or 
traveling waves over standing ones, it cannot predict 
wave numbers since the analysis is a priori restricted to 
a set of functions with prescribed period. On the other 
hand, a center manifold analysis cannot be performed 
directly for the system posed on an unbounded (or very 
large) domain since neutral linear Fourier modes are 
not (well) separated from decaying modes. 

We will look at three pattern-selection mechanisms 
below. First, periodic patterns such as hexagons may 
be unstable with respect to perturbations of the ini- 
tial conditions. Our analysis in this section considered 
perturbations with the same period. However, a pat- 
tern might well be stable with respect to such coperi- 
odic perturbations but unstable with respect to other 
ones, e.g., localized perturbations, a phenomenon usu- 
ally referred to as a sideband instability. Such sideband- 
unstable patterns will typically not be observed in large 
systems, thereby restricting the set of wave numbers 
present in large systems. Second, initial conditions may 
evolve into periodic patterns outside of small areas in 
the domain where defects form. Such defects can have 
a significant influence on the wave numbers observed 
in the system. And finally, patterns are often created
through spatial growth processes. In mathematical
terms, patterns in the wake of invasion fronts often
show distinguished, selected wavelengths. 

Before we address these different mechanisms, we 
will briefly review amplitude equations, which, beyond 
the Landau equations, approximately describe, in a sim- 
plified and universal fashion, the evolution of systems 
near the onset of instability. 

4 Modulation Equations 

Center-manifold equations describe the long-term be- 
havior of small-amplitude solutions exactly, for spa- 
tially periodic patterns. In most cases, leading-order 
approximations can be derived using scalings of amplitude
and time. In the one-dimensional Swift-Hohenberg
equation, the complex amplitude $a(t)$ of the Fourier
mode $e^{ix}$ solves an equation of the form $\partial_t a = \mu a -
3a|a|^2 + \cdots$. Scaling $a = \mu^{1/2}A$, $T = \mu t$ for $\mu > 0$,
we find $\partial_T A = A - 3A|A|^2 + O(\mu)$. Equivalently, we
could substitute an ansatz $\mu^{1/2}A(\mu t)e^{ix}$ together with
the complex conjugate (c.c.) into the Swift-Hohenberg
equation and find $\partial_T A = A - 3A|A|^2$ as a compatibility
condition at order $\mu^{3/2}$ in the expansion. This latter
method can be generalized, allowing slow spatial variations
of the amplitude, $u(t,x) = \mu^{1/2}A(\mu t, \mu^{1/2}x)e^{ix} +
\text{c.c.}$ The scaling in $x$ is induced by the quadratic tangency
in the linear relation, where $\lambda = -4(k - 1)^2 +
O((k - 1)^3)$. An expansion now gives the compatibility
condition

$$\partial_T A = 4\partial_{XX}A + A - 3A|A|^2.$$

This equation is known as the Ginzburg-Landau
equation [III.14]. It is a modulation equation, as it
describes spatial modulations of the amplitude of crit- 
ical modes. While in the case of periodic boundary 
conditions amplitude equations are the leading-order 
approximation to an ODE that describes the long-term 
dynamics of small solutions exactly, modulation equa- 
tions give approximations to long-term dynamics, at 
best. Such approximation properties typically rely on 
some type of stability in the problem, following the 
mantra of “consistency + stability => convergence.” It 
is not known if there exists an exact reduced descrip- 
tion of the long-term dynamics in terms of a single par- 
tial differential equation, which would coincide with the 
modulation equation at leading order. 
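The leading-order amplitude prediction can be compared with a direct simulation. The amplitude equation $\partial_T A = A - 3A|A|^2$ has the equilibrium $|A| = 1/\sqrt{3}$, so the pattern $u = \mu^{1/2}A e^{ix} + \text{c.c.}$ should have maximum close to $2\sqrt{\mu/3}$ for small $\mu$. The sketch below integrates the one-dimensional Swift-Hohenberg equation on a $2\pi$-periodic domain; the discretization (pseudo-spectral in space, semi-implicit Euler in time, linear part implicit) and all numerical parameters are assumptions made for illustration, not a method taken from the text.

```python
import numpy as np


def swift_hohenberg_steady_state(mu, n=64, dt=0.1, steps=5000):
    """Integrate u_t = -(d_xx + 1)^2 u + mu*u - u^3 on [0, 2*pi) to a steady state."""
    x = 2 * np.pi * np.arange(n) / n
    k = np.fft.fftfreq(n, d=1.0 / n)        # integer wave numbers on the 2*pi-periodic domain
    lin = mu - (1.0 - k**2) ** 2            # Fourier symbol of the linear operator
    u = 0.01 * np.cos(x)                    # small seed in the critical mode
    for _ in range(steps):
        nonlinear_hat = np.fft.fft(-u**3)   # nonlinear term treated explicitly
        u_hat = (np.fft.fft(u) + dt * nonlinear_hat) / (1.0 - dt * lin)
        u = np.real(np.fft.ifft(u_hat))
    return x, u


mu = 0.05
x, u = swift_hohenberg_steady_state(mu)
simulated_amplitude = np.max(np.abs(u))
predicted_amplitude = 2.0 * np.sqrt(mu / 3.0)
print(simulated_amplitude, predicted_amplitude)
```

The two values agree closely for this $\mu$, and the remaining difference comes from higher harmonics that the leading-order expansion neglects. The periodic box fixes the wave number at $k = 1$, so this experiment tests only the Landau amplitude and says nothing about wave number selection.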

For Hopf bifurcations, $\omega_* \neq 0$, $k_* = 0$, one substitutes
$u(t,x) = \mu^{1/2}A(\mu t, \mu^{1/2}x)e^{i\omega_* t} + \text{c.c.}$ and finds
the complex Ginzburg-Landau equation

$$\partial_T A = (1 + i\alpha)\Delta_X A + A - (1 + i\beta)A|A|^2,$$

where the coefficient $\beta$ is responsible for frequency
detuning of oscillations depending on the amplitude
(nonlinear dispersion), and the coefficient $\alpha$ measures
frequency dependence on wave number (linear
dispersion).

When $k_* \neq 0$, in space dimension $n > 1$, this
approach is limited by the fact that there is a continuum
of critical modes $|k| = k_*$, while amplitudes $A_j$
can capture only bands near distinct wave numbers
$k_j$. One can include modes with neighboring orientations,
$u(t,x) = A(\mu t, \mu^{1/2}x, \mu^{1/4}y)e^{ik_* x} + \text{c.c.}$, but the
resulting Newell-Whitehead-Segel equation

$$\partial_T A = -(\partial_x - i\partial_y^2)^2 A + A - A|A|^2$$

poses several analytical challenges.

Both Landau equations on the center manifold and 
modulation equations can also be interpreted as uni- 
versal normal forms near the onset of instability, thus 
explaining universality of patterns across the sciences 
to some extent. In both cases one eliminates fast spa- 
tiotemporal dependence via some effective averaging 
procedure. For the Landau equations, averaging can be 
more systematically understood in terms of normal- 
form transformations that simplify the equations. In 
particular, temporal oscillations can be exploited to 
eliminate coefficients in the Taylor jet of the vector 
field through polynomial coordinate changes, effec- 
tively averaging the vector field over the fast oscilla- 
tions. For modulation equations, this procedure is less 
systematic, reminiscent of homogenization and effec- 
tive medium theories. In that regard, modulation equa- 
tions not only simplify the analysis but also provide 
approximations that allow for effective simulations. 

5 Stability 

Most of our discussion so far has been motivated by the 
presence of a trivial, spatiotemporally uniform state 
that loses stability as a parameter $\mu$ is increased. While
this state still exists for $\mu > 0$, it would not be observed
experimentally since small perturbations would grow
exponentially and drive the system toward a different
state. Restricting ourselves to periodic functions, we
found ODEs that show that small periodic perturbations
will result in spatially periodic, stable patterns.
In this analysis, however, stability is understood only
with respect to spatially coperiodic perturbations. More
realistically, we should ask for stability against spatially
random or, at least, spatially localized perturbations. In
the Swift-Hohenberg equation we would study initial
conditions $u_0(x) = u_{\mathrm{per}}(x) + v(x)$, with $v(x)$ small,
localized, and $u_{\mathrm{per}}(x)$ a stationary, spatially periodic
pattern. To leading order, $v(x)$ satisfies the linearized
equation at $u_{\mathrm{per}}$:

$$\partial_t v = -(\partial_{xx} + 1)^2 v + \mu v - 3u_{\mathrm{per}}^2(x)\,v.$$

Figure 4 (a) The Bloch dispersion relation before and after
sideband instability. (b) The space-time diagram of Eckhaus
coarsening. Here and throughout, time is plotted vertically.

The operator on the right-hand side possesses spa- 
tially periodic coefficients, and its properties can be 
expressed in terms of Fourier-Bloch eigenfunctions, 
$e^{i\gamma x} v_{\mathrm{per}}(x)$, replacing the Fourier analysis near the
spatially homogeneous steady state. One finds eigenvalues
$\lambda_j(\gamma)$ that measure the temporal evolution
of quasiperiodic perturbations to the periodic state.
Translation of the pattern $u_{\mathrm{per}}$ induces a neutral
response; technically, $v_{\mathrm{per}}(x) = \partial_x u_{\mathrm{per}}(x)$ and $\gamma = 0$
correspond to a zero eigenvalue, sometimes referred
to as a Goldstone mode. Varying $\gamma$, we can study
modulations of this translational mode and possible 
instabilities. 

The calculations are conceptually simpler in the 
modulation equation, where periodic patterns can be 
reduced to spatially constant states, such as A(X) = 1 
in the Ginzburg-Landau equation, so that the Bloch 
wave analysis reduces to Fourier analysis. One finds, 
for fixed $\mu$, a band of wave numbers $k$ where periodic
patterns are stable. Patterns outside of this band are
sideband unstable and typically cannot be observed in
large domains. The simplest type of instability is the
one-dimensional Eckhaus instability of stationary patterns
that arise in Turing instabilities $k_* \neq 0$, $\omega_* = 0$
(see figure 4).
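For the Ginzburg-Landau equation $\partial_T A = 4\partial_{XX}A + A - 3A|A|^2$, the stable band can be computed by hand or checked numerically. The sketch below is a standard calculation under stated assumptions, not a quotation from the text: it takes the plane wave $A = r e^{iqX}$ with $3r^2 = 1 - 4q^2$, perturbs it as $A = (r + w)e^{iqX}$ with $w = u + iv \propto e^{i\ell X}$, and finds the largest eigenvalue of the resulting $2 \times 2$ linearization over a range of sideband wave numbers $\ell$. The function name and the grid of $\ell$ values are illustrative.

```python
import numpy as np


def max_sideband_growth(q, d=4.0, c=3.0, ells=None):
    """Largest growth rate of perturbations e^{i l X} of the plane wave r e^{i q X}
    in A_T = d*A_XX + A - c*A*|A|^2 (d = 4 and c = 3 match the equation in the text)."""
    r2 = (1.0 - d * q**2) / c
    if r2 <= 0:
        raise ValueError("no plane wave exists for this wave number")
    if ells is None:
        ells = np.linspace(0.01, 2.0, 400)
    worst = -np.inf
    for ell in ells:
        # Linearization for (u, v), the real and imaginary parts of w; the matrix is Hermitian.
        symbol = np.array([[-d * ell**2 - 2 * c * r2, -2j * d * q * ell],
                           [2j * d * q * ell, -d * ell**2]])
        worst = max(worst, np.linalg.eigvalsh(symbol).max())
    return worst


for q in (0.0, 0.2, 0.35):
    print(q, max_sideband_growth(q))
```

In this scaling the long-wavelength expansion of the determinant gives instability exactly when $4q^2 > 1/3$, that is, $|q| > 1/\sqrt{12} \approx 0.29$, while plane waves exist for $|q| < 1/2$. The output is consistent with this: growth rates are negative at $q = 0$ and $q = 0.2$ and positive at $q = 0.35$.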

Phenomenologically, perturbations of unstable pat- 
terns grow and change the wave number by temporar- 
ily introducing defects into the pattern. In two space 
dimensions, rotational modes induce zigzag and skew-varicose
instabilities. Generally, in the parameter plane
spanned by wave number $k$ and system parameter $\mu$,
stable patterns often occupy a bounded region commonly
referred to as the Busse balloon. More dramatic
instabilities occur in Hopf bifurcations, when $\alpha\beta <
-1$: the band of wave numbers corresponding to stable
spatiotemporal patterns vanishes, and
extended systems appear to sustain complex dynamics.

6 Defects 

We care about imperfections or defects in periodic pat- 
terns not only because they naturally arise in experi- 
ments and simulations but also because they can play 
a crucial role in selecting wave numbers and wave vec- 
tors. Prominent examples are spiral waves in oscilla- 
tory media, which act as effective wave sources and 
select wave numbers and wave vectors in large parts of 
the domain. Also, interfaces between patches of stripes 
with different orientations tend to select wave num- 
bers. In a similar vein, boundary conditions can select 
the orientation of convection rolls in Rayleigh-Bénard
convection: typically, rolls orient themselves perpen- 
dicularly to the boundary, but heated boundaries allow 
for a parallel alignment. 

When referring to defects we usually imply that the 
deviation from perfect spatiotemporally periodic struc- 
tures is in some sense localized, and the temporal 
behavior is coherent, e.g., periodic in an appropriate 
coordinate frame. Such solutions can sometimes be 
found explicitly in amplitude equations; the Nozaki- 
Bekki holes in the complex Ginzburg-Landau equations 
are a prominent example. On the other hand, one would 
like to approach existence, stability, or even interac- 
tion of defects in a mathematically rigorous yet system- 
atic fashion, similar to the treatment of spatially peri- 
odic patterns. We have already mentioned that center- 
manifold reductions are not available once we give up 
the restriction to spatial periodicity. We can, however, 
restrict to temporal periodicity, or even stationary solu- 
tions, possibly propagating or rotating with a fixed 
speed. Such solutions are amenable to an approach as 
systematic and rigorous as for spatially periodic struc- 
tures: one interchanges the roles of space and time 
and studies spatial dynamics. We can, for instance, 
find small stationary solutions to a reaction-diffusion 
system 

$$\partial_t u = D\,\partial_{xx}u + f(u;p), \qquad u \in \mathbb{R}^N,$$

by looking for small bounded solutions to the ODE

$$u_x = v, \qquad v_x = -D^{-1}f(u;p).$$

Figure 5 (a) Spirals, (b) grain boundaries, and (c) a target
pattern in the Swift-Hohenberg equation.
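A minimal sketch of the spatial-dynamics viewpoint, assuming the scalar example $D = 1$, $f(u) = u - u^3$ (chosen here for illustration; it is not a small-amplitude solution): the stationary front $u(x) = \tanh(x/\sqrt{2})$ is a bounded orbit of the ODE $u_x = v$, $v_x = -f(u)$ that connects the equilibria $(u,v) = (\pm 1, 0)$, and the quantity $H = v^2/2 + u^2/2 - u^4/4$ is constant along it. All function names are hypothetical.

```python
import math

def f(u):
    """Assumed nonlinearity f(u) = u - u^3 (illustrative choice)."""
    return u - u ** 3

def front(x):
    """Stationary front u(x) = tanh(x / sqrt(2)) together with v = u_x."""
    u = math.tanh(x / math.sqrt(2.0))
    v = (1.0 - u * u) / math.sqrt(2.0)
    return u, v

def ode_residual(x, h=1e-4):
    """Residual of v_x = -f(u) along the front; v_x from a central difference."""
    _, v_plus = front(x + h)
    _, v_minus = front(x - h)
    u, _ = front(x)
    return (v_plus - v_minus) / (2.0 * h) + f(u)

def energy(x):
    """H = v^2/2 + F(u) with F'(u) = f(u); conserved by u_x = v, v_x = -f(u)."""
    u, v = front(x)
    return 0.5 * v * v + 0.5 * u * u - 0.25 * u ** 4
```

In this example the front is a heteroclinic orbit in the $(u,v)$ phase plane, which is the kind of object that the text goes on to relate to defects and coherent structures.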



Figure 6 Dislocations and concave and convex
disclinations in the Swift-Hohenberg equation.


An x-independent equilibrium of the reaction-diffu- 
sion system corresponds to an equilibrium of this ODE; 
a spatially periodic pattern corresponds to a periodic 
orbit. Close to a Turing instability, center-manifold 
analysis and normal-form transformations reveal a 
universal description of stationary patterns: 

$$A_x = \mathrm{i}kA + B + \cdots,$$

$$B_x = \mathrm{i}kB - pA + A|A|^2 + \cdots,$$

which at leading order recovers stationary solutions to
the Ginzburg-Landau equation after passage to a corotating
frame $(A,B) \mapsto e^{\mathrm{i}kx}(A,B)$ and scaling. Similar
constructions can find traveling waves $u(x - ct)$ and
time-periodic traveling waves $u(x - ct, \omega t)$, $u(\xi,\tau) = u(\xi,\tau + 2\pi)$.
In the latter case, the equation

$$u_\xi = v, \qquad v_\xi = -D^{-1}\bigl(f(u;p) + cv - \omega\,\partial_\tau u\bigr)$$

is an ill-posed degenerate elliptic partial differential 
equation. Nevertheless, center-manifold reduction and 
normal-form techniques can be used here, too, to derive 
universal ODEs for the shape of coherent defects. Even 
more generally, we can use these methods to study 
coherent structures for $x = (x_1,\dots,x_n) \in \mathbb{R}^n$, provided
that we impose periodic boundary conditions on
$x_2,\dots,x_n$ while studying spatial dynamics in the $x_1$-coordinate.
In this category one finds grain boundaries
and certain types of dislocations. Somewhat more sub- 
tle reductions based on dynamics in the radial coor- 
dinate have been used to study point defects such as 
radially symmetric target patterns or spiral waves (see 
figure 5). 

While these methods have been very successful in 
establishing existence and studying linearized stabil- 
ity through the eigenvalue problem in many examples, 
they fail to give a complete description in higher space 
dimensions: properties of dislocations and disclina- 
tions are not understood in this detailed fashion (see 
figure 6). 

The key analogy in this spatial-dynamics setting re- 
lates defects and coherent structures to homoclinic and 


heteroclinic orbits, connecting equilibria or periodic 
orbits. Pushing this analogy further, we can reinterpret 
dynamical properties of heteroclinic orbits as proper- 
ties of defects. One could now systematically reinter- 
pret results in the literature and thereby effectively con- 
struct a dictionary that translates between dynamical- 
systems terminology in the spatial-dynamics descrip- 
tion and properties of defects. 

In oscillatory media ($\omega \neq 0$), defects can be classified
according to dimensions of unstable and stable man- 
ifolds that intersect along a heteroclinic orbit in the 
ill-posed spatial-dynamics description. The codimen- 
sion of the intersection can be translated into group 
velocities in the far field. Defects where group veloci- 
ties point away from the defect (sources) correspond 
to codimension-two heteroclinic orbits. Sinks, where 
group velocities point toward the defect, correspond 
to transverse intersections. 

An illustrative example is a small localized inhomogeneity
$\varepsilon g(x)$ in a system that otherwise sustains a
spatially homogeneous temporal oscillation. One finds
that for $\varepsilon > 0$, say, the inhomogeneity typically acts as
a wave source from which phase waves propagate into
the medium, while for $\varepsilon < 0$ waves travel toward the
inhomogeneity with speed $1/|x|$ for large distances $x$
from the inhomogeneity (see figure 7).

7 Fronts 

The ability to describe in detail the fate of small random 
perturbations of an unstable homogeneous equilibrium 
in general pattern-forming systems such as the Swift- 
Hohenberg equation seems elusive. Wave numbers can 
vary continuously across the physical domain, while 
embedded defects move and undergo slow coarsening 
dynamics. 

A more tractable situation arises when initial pertur- 
bations are spatially localized and patterns emerge in 
the wake of fronts as the instability spreads spatially. 
Similar in spirit to the fastest-growing-mode analysis 
that we described above, but quite different in the 
details, one tries to predict patterns that arise in the 




wake of an invasion front using the linearization first. 
For a linear system, the location of the leading edge 
of an invasion front can be determined by testing for 
pointwise stability in comoving frames: in a steady coor- 
dinate frame, one typically sees exponential growth in a 
finite window of observation, whereas in a frame mov- 
ing with large speed, perturbations decay within such a 
finite window. The smallest of all speeds c that outrun 
perturbations in this sense gives us the linear spread- 
ing speed. An observer traveling with the linear spread- 
ing speed will see a marginally stable state in a finite 
window of observation. 

We can determine pointwise stability, that is, decay in
a finite window of observation, using the Laplace-Fourier
transform in a refined way as follows. The Laplace
transform reduces the evolution problem $\partial_t u = \mathcal{L}u$
to a study of the resolvent $(\lambda - \mathcal{L})^{-1}$, which in turn
can be analyzed using the Fourier transform. The key
difference between pointwise and overall (say, $L^2$) stability
analysis is that the resolvent is applied to spatially
localized initial conditions. In fact, the resolvent
can be represented as a convolution with a Green function
$G_\lambda(x)$ that is readily computable via the Fourier
transform. Boundedness of the resolvent as an operator
typically requires integrability of the convolution
kernel. Convolving with localized initial conditions, we
can relax this condition, however, and merely require
pointwise (fixed-$x$) analyticity of $G_\lambda(x)$ in $\lambda$. Equivalently,
we are allowed to shift the Fourier contour that
is used to calculate the convolution kernel off the real
$k$-axis into the complex plane. Obstructions to shifting
this contour (or, equivalently, pointwise singularities
of the convolution kernel) typically occur at branch
poles. These can be found by analyzing the complex
dispersion relation $d(\lambda,\nu) = 0$, which is obtained from
the ansatz $u \sim e^{\lambda t + \nu x}$. Rather than finding simple roots
for given Fourier mode $\nu = \mathrm{i}k$, one allows $\nu$ to be
complex and looks for double roots:

$$d(\lambda,\nu) = 0, \qquad \partial_\nu d(\lambda,\nu) = 0.$$

Such double roots (with an additional pinching condition)
typically give singularities of the pointwise Green
function and determine pointwise stability; we have
pointwise instability if and only if such a double root
is located in $\operatorname{Re}\lambda > 0$.
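As a worked illustration (an example chosen here, not taken from the text), consider the linear equation $\partial_t u = \partial_{xx}u + u$. In a frame moving with speed $c$ it becomes $\partial_t u = \partial_{xx}u + c\,\partial_x u + u$, and the double-root criterion can be evaluated by hand:

```latex
\begin{aligned}
d(\lambda,\nu) &= \nu^2 + c\,\nu + 1 - \lambda = 0, \\
\partial_\nu d(\lambda,\nu) &= 2\nu + c = 0
  \quad\Longrightarrow\quad
  \nu = -\tfrac{c}{2}, \qquad \lambda = 1 - \tfrac{c^2}{4}.
\end{aligned}
```

In this example the double root lies in $\operatorname{Re}\lambda > 0$ for $|c| < 2$ and reaches $\lambda = 0$ at $c = 2$, so the smallest speed that outruns the perturbation is $2$, with spatial decay rate $\nu = -1$ at the leading edge.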

In summary, pointwise stability is determined by
pinched double roots of the dispersion relation. Marginal
stability in a frame moving with the linear spreading
speed $c$ implies that such a double root is located
on the imaginary axis, $\lambda = \mathrm{i}\omega$. This linear oscillation
with frequency $\omega$ in a frame moving with speed $c$ typically
selects a resonant nonlinear pattern $u(\omega_0 t - kx)$
such that $\omega = \omega_0 - kc$.

Figure 7 (a) Two sources and a sink and (b) an invasion
front creating an unstable pattern, followed by a turbulent
state, both in an oscillatory system $\omega_* \neq 0$, $k_* = 0$.

For nonlinear systems there is ample experimental 
and numerical evidence that the linear predictions are 
often accurate, although convergence is usually slow,
$O(t^{-1})$. Such convergence results have been established
mathematically only for order-preserving systems
and in particular for scalar reaction-diffusion systems.
For pattern-forming systems, which intrinsically
violate order preservation, existence and stability of 
invasion fronts have been established near Turing-like 
small-amplitude instabilities, in particular in the Swift- 
Hohenberg equation and the Couette-Taylor problem. 
In the complex Ginzburg-Landau equation, pattern- 
forming fronts can be found as traveling waves, finding 
connecting orbits in a three-dimensional ODE. Interest- 
ingly, the linear wave number predicted by the double- 
root criterion is typically nonzero, so invasion fronts 
create traveling waves in their wake (see figure 7). It is 
worth contrasting this with the linear fastest-growing- 
mode analysis based on the Fourier transform, which 
predicts spatially homogeneous oscillations. Since such 
wave trains can be sideband unstable, the state in the 
wake of the primary invasion front is subject to a sec- 
ondary invasion, where a turbulent state typically takes 
over. 

Fronts also play an important role in biological
growth and crystal growth, as well as in phase-separation
processes. Model problems include the Cahn-Hilliard
equation [III.5], phase-field systems, and the
Keller-Segel model. Fronts are known to exist in only a
few cases. As for the situation in oscillatory media (fig- 
ure 7), primary invasion fronts create unstable patterns 
that are invaded by secondary fronts (see figure 8). 

Figure 8 Space-time plots of two-stage front invasion in
the Keller-Segel model, with merging versus ripening in the
secondary front.

Invasion fronts also provide key building blocks
in spatial-growth or deposition processes. We can
often model pattern growth via systems in which an
instability parameter $p$ is increased spatiotemporally
in the form $p \sim -\tanh(x - st)$, so that instability is
triggered externally in the region $x < st$. Liesegang’s
recurrent precipitation experiment falls into this cat- 
egory, as we will now explain. An outer electrolyte A 
diffuses into a gel that is saturated with an inner elec- 
trolyte B (see figure 9). Both react, and the reaction 
product accumulates until a supersaturation threshold 
is reached and precipitation is initiated. The concentra- 
tion of AB acts as an effective parameter that increases 
in the wake of the reaction-diffusion front. The main 
pattern-forming mechanism is the precipitation in the 
wake of the front. Intuitively speaking, precipitation is 
initiated only once the concentration locally exceeds a 
supersaturation threshold. Since the solute, providing 
the input to the precipitation process, can diffuse, it 
will be depleted in a neighborhood of the region where 
the process has been initiated, so that the supersatura- 
tion threshold will next be reached at a finite distance 
from where the first precipitate nucleated. More math- 
ematically, precipitation can be understood as a simple 
conversion of solute s into precipitate p, 

$$\partial_t s = \Delta s - f(s,p),$$

$$\partial_t p = \kappa\,\Delta p + f(s,p),$$

with conversion rate $f$ and a small diffusion rate $\kappa \ll 1$
of the precipitate. The product AB feeds as a source
term into the equation for $s$, effectively increasing the
value of $s$ until instability is triggered. Such instabilities
turn out to be pattern-forming sideband instabilities,
which in turn generate rhythmic oscillations in
the wake of the A-B reaction front. Wave numbers in 
the wake of such triggered fronts are therefore a cen- 
tral ingredient in the prediction of wave numbers in 
Liesegang patterns. 

To summarize, our excursion into pattern formation
in the wake of fronts points toward a promising
route to a more systematic understanding of wave
number and pattern selection, but many mathematical
challenges lie ahead of us.

Figure 9 Space-time plots in numerical simulations
and experimental Liesegang patterns.

8 More Patterns 

The point of view taken here emphasizes universality, 
mostly from the point of view of local bifurcations. 
Many exciting phenomena can be observed in reaction- 
diffusion systems in somewhat opposite parameter 
regimes when chemical species react and diffuse on dis- 
parate scales. Prototypical examples are the Gray-Scott 
and Gierer-Meinhardt equations, but one could also 
consider the FitzHugh-Nagumo, Hodgkin-Huxley, or 
Field-Noyes models for excitable and oscillatory media. 
Singular perturbation methods (both geometric methods
inspired by dynamical systems and matched asymptotic
methods) have helped us to gain tremendous
insight into the complexity of spike and front
dynamics in such systems. The basic building blocks 
here are scalar reaction-diffusion equations, where a 
variety of tools allows for quite explicit characteriza- 
tion of solutions and eigenvalue problems. The cou- 
pling between fast and slow components changes the 
interaction between spikes and fronts so that com- 
plex arrangements of fronts and spikes that would be 
unstable in scalar equations may form stable patterns 
here. 

Beyond the pattern-forming instabilities that we have 
discussed in this article, more complex phenomena 
arise when conserved quantities interact with pattern- 
forming mechanisms. Examples include closed reac- 
tion-diffusion systems, fluid instabilities with neu- 
tral mean flow modes, and phase separation prob- 
lems modeled by Cahn-Hilliard or phase-field equa- 
tions. Conserved quantities can also be generated by 
symmetry through Goldstone modes, so descriptions
of flame front instabilities or sideband instabilities
usually involve coupling to conservation laws.

In a different direction, even simple systems such 
as the Swift-Hohenberg equation or the complex Ginz- 
burg-Landau equation support patterns that escape 
our simplistic scheme based on periodic structures and 
embedded defects. Examples are quasicrystal patterns 
involving nonresonant spatial wave vectors and turbu- 
lent states, in which coherence in patterns is visible 
only after taking temporal averages. 

Last but not least, patterns and coherent structures 
arise in spatially extended Hamiltonian systems, such 
as water-wave problems or nonlinear Schrödinger equations.
Universal phenomena such as solitons, plane
waves, and sideband instabilities abound, and partial 
descriptions of small-amplitude dynamics via universal 
modulation equations are possible.

IV.28 Fluid Dynamics

H. K. Moffatt


1 Introduction 

Fluid dynamics is a subject that engages the attention 
of mathematicians, physicists, engineers, meteorolo- 
gists, oceanographers, geophysicists, astrophysicists, 
and, increasingly, biologists in almost equal measure. 
For mathematicians, the subject is the source of a wide 
range of problems involving both linear and nonlinear 
partial differential equations (PDEs). These equations 
arise from real-world natural phenomena and therefore 
provide serious motivation for exploring questions of 
existence, uniqueness, and stability of solutions. The 


nonlinear equations frequently involve one or more 
small parameters, allowing the development of pertur- 
bation techniques in their solution; thus, for example, 
singular perturbation theory is a branch of mathemat- 
ics that finds its origin in the boundary-layer theory 
of fluid dynamics developed in the first half of the 
twentieth century, the modern study of chaos (since 
the 1960s) finds its origin in the (Lorenz) equations 
describing thermal convection in the atmosphere, and 
so on. 

The word fluid covers liquids (such as water, oil, 
honey, blood, and liquid metals, to name but a few), 
gases (such as air, hydrogen, helium, carbon diox- 
ide, and methane), plasma (i.e., fully ionized gas at 
extremely high temperatures), and exotic fluids such as 
liquid helium or Bose-Einstein condensates, which exist 
only at temperatures near absolute zero. Even conven- 
tional “solids” can behave like fluids if observed over a 
very long timescale; for example, the rock-like medium 
of the Earth’s mantle flows slowly on a timescale of 
millions of years, and it is this that is responsible for 
the movement of the tectonic plates that gives rise to 
volcanic activity, earthquakes, and continental drift. 

Fluid dynamics starts with the continuum approx- 
imation, whereby the fluid is regarded as a medium 
whose state can be expressed in terms of properties 
that are continuous functions of position $\mathbf{x} = (x,y,z)$
and time $t$. Chief among these properties are the density
field $\rho(\mathbf{x},t)$ and the velocity field $\mathbf{u}(\mathbf{x},t)$ within the
fluid domain. Thus the motion of individual molecules 
is ignored, and only properties that are averaged over 
at least millions of molecules are considered. 

Fluid dynamics covers a vast range of phenomena on 
all length scales, from microns ($\sim 10^{-6}$ m) in biology
and nanoscale fluid dynamics to kiloparsecs ($\sim 10^{21}$ m)
in the fluid dynamics of the interstellar medium. At all
these scales, the motion of the fluid medium is governed
by the Navier-Stokes equations [III.23], which
are essentially derived from the principles of mass conservation
and (Newtonian) momentum balance. In general,
these equations must be coupled with thermodynamic
equations of state and, when plasmas are considered,
with Maxwell’s equations for the electromagnetic
field. In this article we shall, for simplicity, focus
on the idealization of a nonconducting, incompressible
fluid of constant density $\rho$, for which the Navier-Stokes
(NS) equations become

$$\rho\left(\frac{\partial\mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}\right) = -\nabla p + \mu\nabla^2\mathbf{u} + \mathbf{f}$$

and

$$\nabla\cdot\mathbf{u} = 0,$$

where $p(\mathbf{x},t)$ is the pressure field, $\mu$ is the viscosity
of the fluid, and $\mathbf{f}(\mathbf{x},t)$ is any external force per unit
volume (e.g., gravity) that acts upon the fluid.

The incompressibility condition calls for special com- 
ment. In adopting this idealization, sound waves are 
filtered from the more general equations of fluid 
dynamics that take density variations into account. 
The approximation is generally valid provided the fluid 
velocities considered are small compared with the 
speed of sound in the fluid ($\sim 340$ m s$^{-1}$ in air and
$\sim 1480$ m s$^{-1}$ in water, at normal temperatures and
pressures). Obviously, therefore, the incompressibility 
condition cannot be adopted if problems of transonic 
or supersonic flight are considered. For such prob- 
lems, the coupling of fluid motion with thermodynamic 
effects cannot be ignored. 

In the rest of the article we first consider the kine- 
matics of flow governed by this incompressibility con- 
dition alone before turning to the dynamics of flow, 
with particular reference to the limits of large and small 
viscosity. 

2 Kinematics of Flow

2.1 Streamlines and Particle Paths 

Consider first the case of two-dimensional flows for
which $\mathbf{u} = (u(x,y,t), v(x,y,t), 0)$. The incompressibility
condition becomes $\partial u/\partial x + \partial v/\partial y = 0$, and it
follows that there exists a stream function $\psi(x,y,t)$
such that

$$u = \partial\psi/\partial y, \qquad v = -\partial\psi/\partial x.$$

Since $\mathbf{u}\cdot\nabla\psi = 0$, it follows that $\mathbf{u}$ is everywhere parallel
to the curves $\psi = \text{const.}$, which are therefore appropriately
described as the streamlines of the flow. Note that
$\psi$ has physical dimension $L^2/T$ (length$^2$/time).
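The relation between a stream function and an incompressible velocity field can be checked numerically. The sketch below assumes the example $\psi = \sin x \sin y$ (a cellular flow chosen here for illustration) and uses central differences; the function names are hypothetical.

```python
import math

def psi(x, y):
    """Assumed example stream function psi = sin(x) sin(y)."""
    return math.sin(x) * math.sin(y)

def velocity(x, y, h=1e-4):
    """u = d(psi)/dy, v = -d(psi)/dx, both by central differences."""
    u = (psi(x, y + h) - psi(x, y - h)) / (2.0 * h)
    v = -(psi(x + h, y) - psi(x - h, y)) / (2.0 * h)
    return u, v

def divergence(x, y, h=1e-3):
    """du/dx + dv/dy by central differences of the velocity field."""
    u_plus, _ = velocity(x + h, y)
    u_minus, _ = velocity(x - h, y)
    _, v_plus = velocity(x, y + h)
    _, v_minus = velocity(x, y - h)
    return (u_plus - u_minus) / (2.0 * h) + (v_plus - v_minus) / (2.0 * h)
```

The divergence vanishes up to discretization and rounding error for any smooth choice of `psi`, because the two mixed second derivatives of $\psi$ cancel.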

A steady flow is one for which the velocity field 
is independent of $t$, i.e., $\mathbf{u} = \mathbf{u}(\mathbf{x})$ or, in the two-dimensional
case, $\psi = \psi(x,y)$. In a steady flow, fluid
particles follow the streamlines, which do not change 
with time. In a two-dimensional steady flow, initially 
adjacent particles on neighboring streamlines generally 
separate linearly in time due to the velocity gradient 
normal to the streamlines. If we consider a small patch 
of dye carried with the flow, then every pair of parti- 
cles within the patch separates linearly in time, so the 
whole patch is similarly stretched by the flow. 


The situation is very different in an unsteady flow.
Now $\psi = \psi(x,y,t)$ with $\partial\psi/\partial t \neq 0$, and the streamline
pattern changes with time. The path $\mathbf{x}(t)$ of a fluid
particle released from a point $\mathbf{x}_0$ at time $t = 0$ is now
determined by the dynamical system

$$\frac{dx}{dt} = \frac{\partial\psi}{\partial y}, \qquad \frac{dy}{dt} = -\frac{\partial\psi}{\partial x},$$

with initial condition $\mathbf{x}(0) = \mathbf{x}_0$. This is a second-order
Hamiltonian system, in which the Hamiltonian is
just the stream function $\psi$. Initially adjacent particles
can now separate exponentially in time, a symptom of 
chaotic behavior. This obviously has important implica- 
tions for the rate of stirring of any dynamically passive 
scalar contaminant in the flow. 
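A minimal sketch of particle tracking in this Hamiltonian system, assuming the illustrative stream function $\psi = \sin x \sin y + \epsilon \sin t \cos x \cos y$ (a hypothetical example, not taken from the text) and the classical fourth-order Runge-Kutta scheme:

```python
import math

def psi(x, y, t, eps):
    """Assumed stream function: steady cells plus a time-periodic perturbation."""
    return math.sin(x) * math.sin(y) + eps * math.sin(t) * math.cos(x) * math.cos(y)

def velocity(x, y, t, eps):
    """(dx/dt, dy/dt) = (d(psi)/dy, -d(psi)/dx), differentiated by hand."""
    u = math.sin(x) * math.cos(y) - eps * math.sin(t) * math.cos(x) * math.sin(y)
    v = -math.cos(x) * math.sin(y) + eps * math.sin(t) * math.sin(x) * math.cos(y)
    return u, v

def particle_path(x0, y0, t_end, eps, dt=0.01):
    """Integrate the particle path from (x0, y0) at t = 0 up to t = t_end."""
    x, y, t = x0, y0, 0.0
    for _ in range(int(round(t_end / dt))):
        k1 = velocity(x, y, t, eps)
        k2 = velocity(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1], t + 0.5 * dt, eps)
        k3 = velocity(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1], t + 0.5 * dt, eps)
        k4 = velocity(x + dt * k3[0], y + dt * k3[1], t + dt, eps)
        x += dt * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        y += dt * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
        t += dt
    return x, y
```

With `eps = 0` the computed path stays on a streamline, so $\psi$ is conserved up to integration error. For `eps` different from zero the Hamiltonian depends on time, and running the integrator from pairs of nearby starting points is one way to probe how quickly they separate.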

In a fully three-dimensional flow, even in the steady 
case the streamlines (and therefore the particle paths) 
can diverge exponentially, a behavior conducive to the 
rapid dispersion of any passive contaminant. 

The flow $\mathbf{u}(\mathbf{x},t)$ (which is assumed to be smooth) determines
a time-dependent mapping $\mathbf{X} \mapsto \mathbf{x} = \mathbf{x}(\mathbf{X},t)$
of the flow domain onto itself. Under the incompressibility
condition, this mapping is volume-preserving.
Since the volume element is given by $d^3x = |J|\,d^3X$,
where $J$ is the Jacobian,

$$J = \frac{\partial(x_1,x_2,x_3)}{\partial(X_1,X_2,X_3)},$$

it follows that in this case $|J| = 1$. At time $t = 0$,
the mapping is the identity $\mathbf{x} = \mathbf{X}$, with $J = 1$, so by
continuity, $J = +1$ for all $t > 0$. The mapping is a
volume-preserving diffeomorphism for all finite $t$.

2.2 Rate of Strain and Vorticity 

In the neighborhood of any point (which may be chosen
to be the origin $\mathbf{x} = 0$), the velocity field may be
expanded, provided it is sufficiently smooth, as a Taylor
series:

$$u_i(\mathbf{x}) = u_{0i} + c_{ij}x_j + O(|\mathbf{x}|^2),$$

where $c_{ij} = \partial u_i/\partial x_j|_{\mathbf{x}=0}$. This tensor may be split
into a sum of symmetric and antisymmetric parts. The
symmetric part,

$$e_{ij} = \frac{1}{2}\left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right),$$

is the rate-of-strain tensor. Note that $e_{ii} = 0$, by virtue
of incompressibility. Referred to its principal axes,
this symmetric tensor is diagonalized so that $e_{ij}x_j = (\alpha x, \beta y, \gamma z)$,
where $(\alpha,\beta,\gamma)$ (with $\alpha + \beta + \gamma = 0$) are
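the principal rates of strain.

The antisymmetric part of $c_{ij}$ is

$$\omega_{ij} = \frac{1}{2}\left(\frac{\partial u_i}{\partial x_j} - \frac{\partial u_j}{\partial x_i}\right) = -\frac{1}{2}\epsilon_{ijk}\omega_k,$$

where $\boldsymbol{\omega} = \nabla\times\mathbf{u}$ and $\omega_{ij}x_j = \frac{1}{2}(\boldsymbol{\omega}\times\mathbf{x})_i$, a “rigid-body”
rotation with angular velocity $\frac{1}{2}\boldsymbol{\omega}$. The pseudovector
$\boldsymbol{\omega} = \nabla\times\mathbf{u}$, which can of course be defined at any
$\mathbf{x}$, is the vorticity of the flow and is of the greatest
importance in the dynamical theory that follows.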

In two dimensions, $\boldsymbol{\omega} = (0,0,\omega)$ with $\omega = -\nabla^2\psi$.
In this case, the linear flow $c_{ij}x_j$ may be expressed in
the form $\alpha(x,-y) + \frac{1}{2}\omega(-y,x)$, with corresponding
stream function $\psi = \alpha xy - \frac{1}{4}\omega(x^2 + y^2)$. The streamlines
$\psi = \text{const.}$ are elliptic or hyperbolic according to
whether $|2\alpha/\omega| < 1$ or $|2\alpha/\omega| > 1$. In the special cases
$2\alpha = \pm\omega$, the flow is a pure shear flow, with rectilinear
streamlines parallel to $x = \pm y$, respectively.
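The classification can be read off from the velocity gradient of the linear flow $(u,v) = (\alpha x - \tfrac{1}{2}\omega y,\ \tfrac{1}{2}\omega x - \alpha y)$, whose eigenvalues are $\pm\sqrt{\alpha^2 - \omega^2/4}$: real eigenvalues give hyperbolic streamlines, purely imaginary ones give closed elliptic streamlines. A small sketch (the function name is hypothetical):

```python
def classify_linear_flow(alpha, omega):
    """Classify the streamlines of psi = alpha*x*y - omega*(x^2 + y^2)/4.

    The velocity gradient [[alpha, -omega/2], [omega/2, -alpha]] has
    eigenvalues +/- sqrt(alpha^2 - omega^2/4).
    """
    disc = alpha * alpha - 0.25 * omega * omega
    if disc > 0.0:
        return "hyperbolic"   # strain dominates: |2 alpha / omega| > 1
    if disc < 0.0:
        return "elliptic"     # rotation dominates: |2 alpha / omega| < 1
    return "pure shear"       # 2 alpha = +/- omega (no flow if both vanish)
```

For example, pure strain (`omega = 0`) is hyperbolic and pure rotation (`alpha = 0`) is elliptic, with circular streamlines.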


3 The Navier-Stokes Equations 

3.1 Stress Tensor and Pressure 


The momentum balance equation may be written in a
very fundamental form due to Cauchy:

$$\rho\,\frac{Du_i}{Dt} = \frac{\partial\sigma_{ij}}{\partial x_j},$$

where $\sigma_{ij}$ is the stress tensor within the fluid, which
may include the stress associated with any conservative
force field of the form $\mathbf{f} = -\nabla V$ that acts on the fluid
(e.g., gravity with $V = -\rho\,\mathbf{g}\cdot\mathbf{x}$).

A Newtonian fluid, which is here assumed to be
incompressible, is one in which $\sigma_{ij}$ is related to the
rate-of-strain tensor $e_{ij}$ through the linear isotropic
equation $\sigma_{ij} = -p\,\delta_{ij} + 2\mu e_{ij}$, where $p = -\frac{1}{3}\sigma_{ii}$ is
the pressure in the fluid. Substitution into the Cauchy
equation above leads immediately to the Navier-Stokes
equation, now in the form

$$\frac{D\mathbf{u}}{Dt} \equiv \frac{\partial\mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu\nabla^2\mathbf{u},$$

where $\nu = \mu/\rho$ is the kinematic viscosity of the fluid.
Note that $\nu$ has physical dimension $L^2/T$ (like $\psi$ in
section 2.1).
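The substitution step can be spelled out as follows, assuming constant viscosity $\mu$ and using the incompressibility condition $\partial u_j/\partial x_j = 0$:

```latex
\frac{\partial\sigma_{ij}}{\partial x_j}
  = -\frac{\partial p}{\partial x_i}
    + \mu\,\frac{\partial}{\partial x_j}
      \left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right)
  = -\frac{\partial p}{\partial x_i} + \mu\,\nabla^2 u_i
    + \mu\,\frac{\partial}{\partial x_i}\left(\frac{\partial u_j}{\partial x_j}\right)
  = -\frac{\partial p}{\partial x_i} + \mu\,\nabla^2 u_i .
```

Dividing the Cauchy equation by $\rho$ then gives the form quoted above, with $\nu = \mu/\rho$.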


3.2 The Reynolds Number 

Suppose that a flow is characterized by a length scale 
L (usually associated with the boundary geometry) and 
a velocity scale U (e.g., the maximum velocity in the 
fluid or on its boundary). Then, in order of magnitude, 
$|D\mathbf{u}/Dt| \sim U^2/L$ and $|\nabla^2\mathbf{u}| \sim U/L^2$. Hence,

$$\frac{|D\mathbf{u}/Dt|}{\nu|\nabla^2\mathbf{u}|} \sim \frac{UL}{\nu}.$$

The dimensionless number $\mathrm{Re} = UL/\nu$ is the Reynolds
number of the flow. If $\mathrm{Re} \ll 1$, then viscous forces
dominate over inertia forces, which are negligible in a
first approximation. If $\mathrm{Re} \gg 1$, then inertia forces are
dominant; however, as we shall see later, viscous effects 
always remain important near fluid boundaries, no mat- 
ter how large Re may be; this is where boundary-layer 
theory must be invoked. 
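To get a feel for the two regimes, the sketch below evaluates $\mathrm{Re} = UL/\nu$ for two rough scenarios. The kinematic viscosities are approximate room-temperature values assumed here, and the velocity and length scales are illustrative guesses rather than data from the text.

```python
# Approximate kinematic viscosities near room temperature (assumed values, m^2/s).
NU_WATER = 1.0e-6
NU_AIR = 1.5e-5

def reynolds_number(U, L, nu):
    """Re = U L / nu, with U in m/s, L in m, and nu in m^2/s."""
    if nu <= 0.0:
        raise ValueError("kinematic viscosity must be positive")
    return U * L / nu

# Illustrative scales: a swimming microorganism in water and a car in air.
re_microbe = reynolds_number(1.0e-4, 1.0e-5, NU_WATER)
re_car = reynolds_number(30.0, 4.0, NU_AIR)
```

With these assumed scales the first case falls deep in the viscous regime ($\mathrm{Re} \ll 1$) and the second deep in the inertial regime ($\mathrm{Re} \gg 1$).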


3.3 The Vorticity Equation 


It is obviously possible to eliminate the pressure field 
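by taking the curl of the NS equation. Using the vector
identity $\mathbf{u}\cdot\nabla\mathbf{u} = \nabla(\frac{1}{2}\mathbf{u}^2) - \mathbf{u}\times\boldsymbol{\omega}$, this yields the
vorticity equation

$$\frac{\partial\boldsymbol{\omega}}{\partial t} = \nabla\times(\mathbf{u}\times\boldsymbol{\omega}) + \nu\nabla^2\boldsymbol{\omega}.$$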


The first term on the right-hand side describes trans- 
port of the vorticity field by the velocity, while the sec- 
ond describes diffusion of vorticity relative to the fluid. 
This interpretation means that it is often helpful to 
focus on vorticity rather than velocity in the analysis 
of particular problems. 
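The elimination of the pressure can be written out in two lines, assuming constant $\rho$ and $\nu$ and no nonconservative body force. The curl of every gradient vanishes, and the curl commutes with $\partial/\partial t$ and $\nabla^2$:

```latex
\begin{aligned}
\frac{\partial\mathbf{u}}{\partial t}
  + \nabla\!\left(\tfrac{1}{2}\mathbf{u}^2\right)
  - \mathbf{u}\times\boldsymbol{\omega}
  &= -\frac{1}{\rho}\nabla p + \nu\nabla^2\mathbf{u}, \\
\frac{\partial\boldsymbol{\omega}}{\partial t}
  - \nabla\times(\mathbf{u}\times\boldsymbol{\omega})
  &= \nu\nabla^2\boldsymbol{\omega}.
\end{aligned}
```

Since $\nabla\cdot\mathbf{u} = 0$ and $\nabla\cdot\boldsymbol{\omega} = 0$, the transport term can equivalently be written as $\nabla\times(\mathbf{u}\times\boldsymbol{\omega}) = (\boldsymbol{\omega}\cdot\nabla)\mathbf{u} - (\mathbf{u}\cdot\nabla)\boldsymbol{\omega}$.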

For two-dimensional flow we have seen that $\boldsymbol{\omega} = (0,0,-\nabla^2\psi)$.
The vorticity equation in this case reduces
to a nonlinear equation for $\psi$:

$$\frac{\partial}{\partial t}(\nabla^2\psi) - \frac{\partial(\psi,\nabla^2\psi)}{\partial(x,y)} = \nu\nabla^4\psi,$$

where $\nabla^4 = (\nabla^2)^2$, the biharmonic operator.


3.4 Difficulties with the NS Equation


The nonlinearity of the NS equation represented by
the term $\mathbf{u}\cdot\nabla\mathbf{u}$ presents a major difficulty for fundamental
theory. This is not the only difficulty, however.
The viscous term $\nu\nabla^2\mathbf{u}$ implies viscous dissipation
of kinetic energy (to heat), and associated irreversibility
in time. Furthermore, the influence of the
pressure-gradient term is nonlocal. To see this, take
the divergence of the equation, giving $\nabla^2 p = -s(\mathbf{x},t)$,
where $s = \rho\nabla\cdot[(\mathbf{u}\cdot\nabla)\mathbf{u}]$. The solution of this Poisson
equation (in three dimensions) is

$$p(\mathbf{x},t) = \frac{1}{4\pi}\int\frac{s(\mathbf{x}',t)}{|\mathbf{x}-\mathbf{x}'|}\,dV',$$

plus possible boundary contributions. Thus, $p(\mathbf{x},t)$
is influenced by values of $s(\mathbf{x}',t)$, and therefore of
$\mathbf{u}(\mathbf{x}',t)$, at all points $\mathbf{x}'$ in the fluid, and the convergence
of the integral for large $|\mathbf{x}-\mathbf{x}'|$ is slow.

This combination of difficulties in relation to the NS 
equations presents an enormous challenge; even the
basic problem of proving regularity for all $t > 0$ of solutions
of the NS equations evolving from smooth initial
conditions remains unsolved at the present time. This 
is one of the Clay Mathematics Institute’s Millennium 
Prize Problems. 


4 Some Exact NS Solutions 

Exact solutions of the NS equations fall into two cate- 
gories: trivial flows, for which the inertia forces either 
vanish identically (as for steady rectilinear flow) or 
are exactly compensated by a pressure gradient, and 
self-similar flows, whose structure is to some extent 
dictated by dimensional considerations. 


4.1 Trivial Flows 

Chief among these are 

Poiseuille flow, driven by a constant pressure gradient
$\mathbf{G} = -(\partial p/\partial x)\,\mathbf{i} = G\,\mathbf{i}$, where $\mathbf{i} = (1,0,0)$, in a two-dimensional
channel between rigid walls $y = \pm b$, for
which the velocity field is $\mathbf{u} = (G/2\mu)(b^2 - y^2)\,\mathbf{i}$; and

Couette flow, driven by motion of the boundaries $y = \pm b$
with velocities $\pm V\mathbf{i}$, for which the velocity field is
$\mathbf{u} = (Vy/b)\,\mathbf{i}$.


These flows serve as prototypes when questions of 
stability arise. 
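These two profiles are easy to check numerically. For steady rectilinear flow the $x$-momentum equation reduces to $\mu\,d^2u/dy^2 = -G$, and the sketch below verifies this for the plane Poiseuille profile, checks the no-slip conditions, and integrates the profile to obtain the volume flux per unit width, $2Gb^3/(3\mu)$. The function names are hypothetical.

```python
def poiseuille_velocity(y, G, mu, b):
    """Plane Poiseuille profile u(y) = (G / 2 mu) (b^2 - y^2)."""
    return G * (b * b - y * y) / (2.0 * mu)

def couette_velocity(y, V, b):
    """Plane Couette profile u(y) = V y / b."""
    return V * y / b

def viscous_residual(y, G, mu, b, h=1e-2):
    """Residual of mu u'' + G = 0, with u'' from a second central difference."""
    u = lambda s: poiseuille_velocity(s, G, mu, b)
    return mu * (u(y + h) - 2.0 * u(y) + u(y - h)) / (h * h) + G

def poiseuille_flux(G, mu, b, n=100):
    """Volume flux per unit width by composite Simpson's rule (n even)."""
    h = 2.0 * b / n
    total = poiseuille_velocity(-b, G, mu, b) + poiseuille_velocity(b, G, mu, b)
    for j in range(1, n):
        weight = 4.0 if j % 2 else 2.0
        total += weight * poiseuille_velocity(-b + j * h, G, mu, b)
    return total * h / 3.0
```

Simpson's rule is exact for the quadratic profile, so the computed flux agrees with the closed form up to rounding error.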

Similar flows exist for fluid contained in the annular
region between two concentric cylinders $a < r < b$. The
Poiseuille flow driven by pressure gradient $G$ parallel to
the axis has velocity profile

$$u(r) = \frac{G}{4\mu}\left\{(b^2 - r^2) - (b^2 - a^2)\frac{\log(b/r)}{\log(b/a)}\right\}.$$

The Couette flow corresponding to rotation of the cylinders
with angular velocities $\Omega_1$, $\Omega_2$ has circulating
velocity (around the axis) $v(r) = Ar + B/r$, where $A$ and
$B$ are determined from the boundary conditions $v(a) = \Omega_1 a$,
$v(b) = \Omega_2 b$. These two flows may be linearly
superposed, giving a flow with helical streamlines.

Slightly less trivial is the Burgers vortex for which
the vorticity distribution is in the $z$-direction and has
the Gaussian form

$$\omega(r) = \frac{\Gamma\gamma}{4\pi\nu}\exp\left(-\frac{\gamma r^2}{4\nu}\right).$$

Here, $\Gamma$ ($= 2\pi\int_0^\infty \omega(r)\,r\,dr$) is the total strength of the
vortex, and $\gamma$ ($>0$) is the rate-of-strain that must be
imposed to keep the vortex steady against the erosive
effect of viscous diffusion.
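The normalization of the Gaussian profile can be confirmed by numerical quadrature. The sketch below integrates $2\pi\,\omega(r)\,r$ with composite Simpson's rule out to twelve core radii $\sqrt{4\nu/\gamma}$, a cutoff chosen here because the remaining tail is negligible; the function names and parameter values are illustrative assumptions.

```python
import math

def burgers_vorticity(r, Gamma, gamma, nu):
    """Gaussian vorticity profile of the Burgers vortex."""
    return Gamma * gamma / (4.0 * math.pi * nu) * math.exp(-gamma * r * r / (4.0 * nu))

def total_circulation(Gamma, gamma, nu, n=4000):
    """2 pi * integral of omega(r) r dr over 0 < r < 12 core radii (Simpson, n even)."""
    core = math.sqrt(4.0 * nu / gamma)  # radius scale set by strain versus diffusion
    r_max = 12.0 * core                 # the tail beyond this is negligible
    h = r_max / n
    g = lambda r: burgers_vorticity(r, Gamma, gamma, nu) * r
    total = g(0.0) + g(r_max)
    for j in range(1, n):
        total += (4.0 if j % 2 else 2.0) * g(j * h)
    return 2.0 * math.pi * total * h / 3.0
```

The computed value reproduces the prescribed $\Gamma$, and the core radius $\sqrt{4\nu/\gamma}$ shows how stronger strain confines the vortex more tightly.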


4.2 Self-Similar Flows 


It will be sufficient to illustrate this type of flow by
a simple example: Jeffery-Hamel flow. Let $(r,\theta)$ be
plane polar coordinates, and suppose that fluid is contained
between two planes $\theta = \pm\alpha$ and is extracted
by a line sink $Q$ at the origin; then, if $u$ is the radial
velocity, $\int_{-\alpha}^{\alpha} u\,r\,d\theta = -Q$ for all $r$. Here, like $\nu$, $Q$ has
dimension $L^2/T$, and we may define a Reynolds number
$\mathrm{Re} = Q/\nu$. Furthermore, on dimensional grounds,
the stream function for the resulting flow must take
the form $\psi = Qf(\theta)$, where $f$ is dimensionless. (The
velocity $u = r^{-1}\partial\psi/\partial\theta$ is then purely radial.) Substitution
into the equation for the stream function in plane
polar coordinates,

$$\frac{\partial}{\partial t}(\nabla^2\psi) - \frac{1}{r}\frac{\partial(\psi,\nabla^2\psi)}{\partial(r,\theta)} = \nu\nabla^4\psi,$$

gives an ordinary differential equation (ODE) for $f(\theta)$. With
$\nabla^2\psi = Qf''/r^2$, the Jacobian equals $2Q^2f'f''/r^3$ and
$\nabla^4\psi = Q(f'''' + 4f'')/r^4$, so that

$$f'''' + 4f'' + 2\,\mathrm{Re}\,f'f'' = 0,$$

which may be integrated three times to give the velocity
profile $r^{-1}f'(\theta)$. What is important here is that, from
dimensional considerations alone, as described above,
the nonlinear PDE for $\psi$ has been reduced to an ODE,
still nonlinear but nevertheless relatively easy to solve.

A second example is that of two-dimensional flow
toward a stagnation point on a plane boundary $y = 0$.
The flow far from the boundary is the uniform strain
flow for which $\psi \sim \alpha xy$, and conditions of impermeability
and no-slip, $\psi = \partial\psi/\partial y = 0$, are imposed
on $y = 0$. Here, dimensional analysis implies that
$\psi = x(\nu\alpha)^{1/2}f(\eta)$, where $\eta = y(\alpha/\nu)^{1/2}$, and again
the PDE for $\psi$ reduces to an ODE for $f(\eta)$, which can
be easily solved numerically.
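A minimal sketch of such a numerical solution, assuming the standard reduced form of the stagnation-point problem, $f''' + ff'' - f'^2 + 1 = 0$ with $f(0) = f'(0) = 0$ and $f'(\eta) \to 1$ as $\eta \to \infty$ (this ODE is not written out in the text). The code uses a shooting method: it integrates from the wall with a trial value of $f''(0)$ by fourth-order Runge-Kutta and bisects on whether $f'$ overshoots or undershoots $1$. The cutoff `eta_max`, the bracket, and all names are choices made here.

```python
def rhs(state):
    """Right-hand side for (f, f', f'') of f''' = f'^2 - 1 - f f''."""
    f, fp, fpp = state
    return (fp, fpp, fp * fp - 1.0 - f * fpp)

def shoot(s, eta_max=6.0, h=0.01):
    """Integrate from eta = 0 with f''(0) = s; return f' - 1 where integration stops."""
    state = (0.0, 0.0, s)
    for _ in range(int(round(eta_max / h))):
        k1 = rhs(state)
        k2 = rhs(tuple(state[i] + 0.5 * h * k1[i] for i in range(3)))
        k3 = rhs(tuple(state[i] + 0.5 * h * k2[i] for i in range(3)))
        k4 = rhs(tuple(state[i] + h * k3[i] for i in range(3)))
        state = tuple(
            state[i] + h * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0
            for i in range(3)
        )
        if state[1] > 2.0 or state[1] < 0.0:
            break  # f' has clearly left the neighborhood of the target value 1
    return state[1] - 1.0

def wall_shear(lo=0.5, hi=2.0, iterations=50):
    """Bisect on f''(0) so that f' approaches 1 far from the wall."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if shoot(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

Under these assumptions `wall_shear()` returns approximately 1.2326, the dimensionless wall shear $f''(0)$ of the classical stagnation-point solution.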

Further examples are 


• the axisymmetric flow due to a concentrated point 
force F applied at the origin, for which the veloc- 
ity field is everywhere inversely proportional to 
distance from the origin (the Squire-Landau jet ); 
and 

• the von Kármán flow due to the differential rotation
of two parallel discs about their axes.


In the latter case, the flow between the discs has helical
streamlines. This flow has recently provided the
basis for the VKS (von Kármán sodium) experiment
demonstrating dynamo action due to flow driven by 
counterrotating propellers.

5 Stokes Flow

When $\mathrm{Re} \ll 1$, inertial forces are negligible, and the NS
equations simplify to the quasistatic Stokes equations

$$\frac{\partial\sigma_{ij}}{\partial x_j} = 0, \qquad \mu\nabla^2\mathbf{u} = \nabla p, \qquad \nabla\cdot\mathbf{u} = 0.$$

5.1 Some General Results 

Certain results concerning Stokes flows as governed by 
these equations can be proved in a straightforward way. 
First, the equations are now linear, so that if any two
solutions are known, then any linear combination of
them is also a solution. Second, if a function $\mathbf{U}(\mathbf{x})$ is
defined on a closed surface $S$ bounding a volume $V$ of
fluid, satisfying the condition $\int_S \mathbf{U}(\mathbf{x})\cdot\mathbf{n}\,dS = 0$, then
the solution of the Stokes equations in $V$ corresponding
to the boundary condition $\mathbf{u}(\mathbf{x}) = \mathbf{U}(\mathbf{x})$ on $S$ exists
and is unique. Moreover, the solution $\mathbf{v}(\mathbf{x})$ corresponding
to the boundary condition $\mathbf{v}(\mathbf{x}) = -\mathbf{U}(\mathbf{x})$ is simply
$\mathbf{v}(\mathbf{x}) = -\mathbf{u}(\mathbf{x})$, i.e., reversal of the boundary condition
reverses the flow everywhere. Thus, for example, a
small spherical patch of dye that is distorted by a flow
into a long thin filament can be restored to its original
spherical shape (apart from the effect of molecular
diffusion) by time-reversal of the motion of the fluid
boundary.

Next, there is a minimum dissipation theorem relating
to the class $C$ of kinematically possible flows in a closed
volume $V$ (i.e., flows satisfying merely $\nabla\cdot\mathbf{u} = 0$ in $V$
and the boundary condition $\mathbf{u}(\mathbf{x}) = \mathbf{U}(\mathbf{x})$ on $S$). Let
$\Phi = \int_V \sigma_{ij}e_{ij}\,dV$ be the rate of dissipation of kinetic
energy of any such flow. Then the unique Stokes flow
$\mathbf{u}(\mathbf{x})$ satisfying this boundary condition minimizes $\Phi$
within the class $C$. This result clearly does not extend
to solutions of the NS equations (since these are within
the class $C$ and have greater $\Phi$ than that of the unique
Stokes flow).

Finally, there is a reciprocity theorem that finds
important application in the following subsection: let
$(u_i^{(1)}, \sigma_{ij}^{(1)})$ and $(u_i^{(2)}, \sigma_{ij}^{(2)})$ be the velocity and stress
fields corresponding to two Stokes flows with different
boundary conditions $u_i^{(1)} = U_i^{(1)}$, $u_i^{(2)} = U_i^{(2)}$ on the
surface $S$; then

$$\int_S \sigma_{ij}^{(1)}U_j^{(2)}n_i\,dS = \int_S \sigma_{ij}^{(2)}U_j^{(1)}n_i\,dS.$$

5.2 Flow Due to the Motion of a Particle 

From a historic viewpoint, the most important problem
in this low-Reynolds-number regime is that solved by
Stokes himself in 1851: the flow due to the motion of
a rigid sphere through a viscous fluid. Here, we briefly
consider the principles governing the flow due to the
motion (translation plus rotation) of a rigid particle of
arbitrary geometry, the fluid being assumed to be at
rest at infinity. Let $a = (3V/4\pi)^{1/3}$, where $V$ is the volume
of the particle. Attention is naturally focused on
the force $\mathbf{F}$ and torque $\mathbf{G}$ acting on the particle. The
instantaneous motion of the particle is determined by
the velocity $\mathbf{U}$ of its center of volume and its angular
velocity $\boldsymbol{\Omega}$, and the linearity of the Stokes equations
implies linear relations between $\{\mathbf{F}, \mathbf{G}\}$ and $\{\mathbf{U}, \boldsymbol{\Omega}\}$ of
the form

$$F_i = -\mu(aA_{ij}U_j + a^2B_{ij}\Omega_j),$$

$$G_i = -\mu(a^2C_{ij}U_j + a^3D_{ij}\Omega_j),$$

where, by virtue of the above reciprocity theorem,

$$A_{ij} = A_{ji}, \qquad D_{ij} = D_{ji}, \qquad C_{ij} = B_{ji}.$$

These dimensionless tensor coefficients are determined
solely by the shape of the particle. Any symmetry
imposes further constraints. If the particle is mirror
symmetric (invariant under reflection with respect to
its center of volume), then the (pseudo)tensor $C_{ij}$ must
vanish. If the particle has at least the symmetry of a
cube (i.e., is invariant under the group of rotations of a
cube), then $A_{ij}$ and $D_{ij}$ must be isotropic: $A_{ij} = \alpha\delta_{ij}$,
$D_{ij} = \beta\delta_{ij}$. The case of a sphere is classic; in this case,
as shown by Stokes, $\alpha = 6\pi$, $\beta = 8\pi$.
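The linear force and torque relations can be evaluated directly for the sphere. The following is a minimal sketch, assuming the sphere values $A_{ij} = 6\pi\delta_{ij}$, $D_{ij} = 8\pi\delta_{ij}$ and $B_{ij} = C_{ij} = 0$ (the coupling tensors vanish because a sphere is mirror symmetric); the function name and the numerical values of viscosity, radius, and velocities are illustrative assumptions, not taken from the text.

```python
import math


def sphere_force_and_torque(mu, a, U, Omega):
    """Force F and torque G on a rigid sphere of radius a in Stokes flow.

    Evaluates F_i = -mu (a A_ij U_j + a^2 B_ij Omega_j) and
    G_i = -mu (a^2 C_ij U_j + a^3 D_ij Omega_j) with the sphere values
    A_ij = 6 pi delta_ij, D_ij = 8 pi delta_ij, B_ij = C_ij = 0.
    """
    alpha = 6.0 * math.pi
    beta = 8.0 * math.pi
    F = tuple(-mu * a * alpha * Uj for Uj in U)
    G = tuple(-mu * a**3 * beta * Oj for Oj in Omega)
    return F, G


# Illustrative (assumed) values: a sphere of radius one micron in a
# water-like fluid, translating along x and spinning about z.
mu = 1.0e-3   # dynamic viscosity in Pa s (assumed)
a = 1.0e-6    # sphere radius in m (assumed)
F, G = sphere_force_and_torque(mu, a, U=(1.0e-5, 0.0, 0.0), Omega=(0.0, 0.0, 10.0))
```

The drag opposes the translation and the torque opposes the rotation, and neither motion induces the other; a helical particle would need nonzero $B_{ij}$ and $C_{ij}$.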

Particles of helical shape are obviously not mirror 
symmetric, and for these, $C_{ij} \neq 0$. This means that such
a particle freely sedimenting through a fluid must expe- 
rience a torque and will therefore also rotate. Equally, 
if such a particle rotates (through some internal mech- 
anism), then it will experience a force causing it also 
to translate. Microscopic organisms can propel them- 
selves through a viscous environment by adopting a 
swimming strategy that exploits this phenomenon. 

6 Inviscid Vorticity Dynamics 

In the formal Euler limit $\mathrm{Re} = \infty$, it is tempting to simply
set $\nu = 0$ in the NS equation and in the vorticity
equation (section 3.3). Note, however, that, since $\nu$ multiplies
the highest space derivative in these equations,
the order of the equations is reduced in so doing, and
it is not possible to satisfy the no-slip condition at rigid
fluid boundaries in this Euler limit. We shall consider
boundary effects in section 7; for the moment, we consider
the evolution of an isolated blob of vorticity far
from any fluid boundary. The vortex ring as visualized
by convected smoke is a well-known prototype. Vortex
rings occur in a wide variety of circumstances, e.g., in
the gas emitted impulsively in volcanic eruptions, or in
the ocean, created by dolphins in the course of their
underwater antics.

6.1 Helmholtz’s Laws of Vortex Motion 

The study of vorticity goes back to Hermann von Helm- 
holtz (1858) and the subsequent work of P. G. Tait and 
Lord Kelvin that this stimulated. Helmholtz recognized 
that vortex lines are transported with the fluid and that 
vortex tubes have constant strength in the course of 
this motion. This means that, if the flow is such as to 
stretch a vortex tube, then its cross section decreases 
but the flux of vorticity along the tube (equivalent to the 
circulation around it) is constant. Helmholtz also main- 
tained that vortex lines must either be closed curves or 
end on fluid boundaries, but it is now known that this is 
incorrect; even for a localized blob of vorticity without 
any obvious symmetry, a vortex line will in general wind 
indefinitely, rather like a ball of wool, but in a chaotic 
manner. 

6.2 Conservation of Helicity 

The fact that vortex lines move like material lines
within the fluid implies that any topological structure
of the vorticity field is conserved for all time. For example,
as recognized by Kelvin in 1867, if a vortex tube is
knotted on itself, then this knot persists for all time. Let
$S$ be any closed surface moving with the fluid (a Lagrangian
surface) on which $\boldsymbol{\omega}\cdot\mathbf{n} = 0$, a condition that persists
for all $t$. Then for each such surface containing a
volume $V$, we may define the helicity within $V$,

$$\mathcal{H} = \int_V \mathbf{u}\cdot\boldsymbol{\omega}\,dV.$$


6.3 The Biot-Savart Law 


The relation inverse to $\boldsymbol{\omega} = \nabla\times\mathbf{u}$, $\nabla\cdot\mathbf{u} = 0$, is given
by the Biot-Savart law

$$\mathbf{u}(\mathbf{x},t) = \frac{1}{4\pi}\int \frac{\boldsymbol{\omega}(\mathbf{x}',t)\times(\mathbf{x}-\mathbf{x}')}{|\mathbf{x}-\mathbf{x}'|^3}\,dV',$$

which shows how the velocity at any point may be
instantaneously obtained from the vorticity distribution,
a purely kinematic result. Knowing $\mathbf{u}(\mathbf{x},t)$, Eulerian
fluid dynamics is now completely contained in the
statement that the vorticity field $\boldsymbol{\omega}(\mathbf{x},t)$ is convected
by the velocity field $\mathbf{u}(\mathbf{x},t)$ that is induced in this way.
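As an illustration of the Biot-Savart law, the sketch below evaluates the velocity induced by a thin circular line vortex, for which the volume integral reduces to a line integral with $\boldsymbol{\omega}\,dV'$ replaced by $\Gamma\,d\mathbf{l}'$. This thin-filament reduction, the function name, and the midpoint-rule discretization are assumptions made for the example; on the axis of the ring the result can be compared with the closed form $\Gamma R^2/[2(R^2+z^2)^{3/2}]$.

```python
import math


def ring_induced_velocity(x, gamma=1.0, radius=1.0, n=400):
    """Velocity at point x induced by a circular line vortex.

    Discretizes u(x) = (gamma / 4 pi) * loop integral of
    dl' x (x - x') / |x - x'|^3 with the midpoint rule. The ring lies in
    the plane z = 0, is centered on the origin, and is traversed
    counterclockwise when viewed from z > 0. The point x must not lie
    on the ring itself, where the integrand is singular.
    """
    u = [0.0, 0.0, 0.0]
    dtheta = 2.0 * math.pi / n
    for j in range(n):
        th = (j + 0.5) * dtheta
        xp = (radius * math.cos(th), radius * math.sin(th), 0.0)
        dl = (-radius * math.sin(th) * dtheta, radius * math.cos(th) * dtheta, 0.0)
        r = (x[0] - xp[0], x[1] - xp[1], x[2] - xp[2])
        r3 = (r[0] ** 2 + r[1] ** 2 + r[2] ** 2) ** 1.5
        # Accumulate the cross product dl x r, divided by |r|^3.
        u[0] += (dl[1] * r[2] - dl[2] * r[1]) / r3
        u[1] += (dl[2] * r[0] - dl[0] * r[2]) / r3
        u[2] += (dl[0] * r[1] - dl[1] * r[0]) / r3
    c = gamma / (4.0 * math.pi)
    return tuple(c * ui for ui in u)


u_center = ring_induced_velocity((0.0, 0.0, 0.0))
u_axis = ring_induced_velocity((0.0, 0.0, 1.0))
```

The induced velocity on the axis is directed along the axis, which is why an isolated ring propels itself in that direction.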


6.4 Vortex Ring Propagation 

This consideration enabled Kelvin to calculate the
velocity of propagation of a vortex ring of circulation $\Gamma$
in the form

$$V = \frac{\Gamma}{4\pi R}\left(\log\frac{8R}{a} - \frac{1}{4}\right),$$

where $R$ is the radius of the ring and $a$ ($\ll R$) is the
radius of its cross section.

More generally, the velocity of a thin curved vortex
filament with parametric equation $\mathbf{x} = \mathbf{X}(s,t)$ is
frequently assumed to be given by the local induction
approximation in the compact normalized form
$\mathbf{X}_t = \mathbf{X}_s\times\mathbf{X}_{ss}$.


7 The Aerodynamics of Flight 

Mastery of the aerodynamics of flight was arguably the 
greatest engineering accomplishment of the twentieth 
century. This mastery required an understanding of the 
role played by viscosity in the immediate neighborhood 
of an aircraft wing— in other words, of boundary-layer 
theory, or what is known at a more sophisticated level 
as the theory of matched asymptotic expansions. 

7.1 The External Irrotational Flow 


The helicity $\mathcal{H} = \int_V \mathbf{u}\cdot\boldsymbol{\omega}\,dV$ within $V$ (section 6.2) is constant, as may
be proved directly from the Euler equations, and it is
known that this integral provides a measure of the linkage
of vortex lines within $V$. For example, if the vorticity
distribution consists of two linked vortex tubes
with (constant) circulations $\kappa_1$ and $\kappa_2$, and if $V$ contains
just these two linked tubes, then $\mathcal{H} = \pm 2n\kappa_1\kappa_2$, where
$n$ is the Gauss linking number of the linkage, and the
sign is chosen according to whether the linkage is right-
or left-handed. This result provides a bridge between
the Euler equations of inviscid fluid mechanics and a
fundamental concept of topology.
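The Gauss linking number that appears in the helicity of two linked tubes can be computed numerically from the classical double line integral $n = \frac{1}{4\pi}\oint\oint \frac{(\mathbf{r}_1-\mathbf{r}_2)\cdot(d\mathbf{r}_1\times d\mathbf{r}_2)}{|\mathbf{r}_1-\mathbf{r}_2|^3}$ over the two tube axes. The sketch below is one way to do this, assuming the curves are smooth, closed, and parametrized on $[0,1)$; the function names, the two test circles, and the finite-difference tangents are illustrative choices, not part of the text. The sign of the result depends on the orientation chosen for each curve.

```python
import math


def gauss_linking_number(curve1, curve2, n=200):
    """Approximate the Gauss linking integral of two closed curves.

    curve1 and curve2 map a parameter t in [0, 1) to a point (x, y, z)
    and are assumed periodic in t. Tangents are estimated by central
    differences; the double integral uses the periodic rectangle rule.
    """
    h = 1.0 / n
    eps = 1.0e-5

    def samples(curve):
        out = []
        for i in range(n):
            t = i * h
            p = curve(t)
            fwd = curve(t + eps)
            bwd = curve(t - eps)
            d = tuple((f - b) / (2.0 * eps) for f, b in zip(fwd, bwd))
            out.append((p, d))
        return out

    s1 = samples(curve1)
    s2 = samples(curve2)
    total = 0.0
    for p1, d1 in s1:
        for p2, d2 in s2:
            r = (p1[0] - p2[0], p1[1] - p2[1], p1[2] - p2[2])
            cross = (d1[1] * d2[2] - d1[2] * d2[1],
                     d1[2] * d2[0] - d1[0] * d2[2],
                     d1[0] * d2[1] - d1[1] * d2[0])
            dist3 = (r[0] ** 2 + r[1] ** 2 + r[2] ** 2) ** 1.5
            total += (r[0] * cross[0] + r[1] * cross[1] + r[2] * cross[2]) / dist3
    return total * h * h / (4.0 * math.pi)


def circle_xy(t):
    """Unit circle in the plane z = 0, centered on the origin."""
    ang = 2.0 * math.pi * t
    return (math.cos(ang), math.sin(ang), 0.0)


def circle_xz(t, x0=1.0):
    """Unit circle in the plane y = 0, centered on (x0, 0, 0)."""
    ang = 2.0 * math.pi * t
    return (x0 + math.cos(ang), 0.0, math.sin(ang))


# With x0 = 1 the second circle threads the first once (a Hopf link).
n_linked = gauss_linking_number(circle_xy, circle_xz)
```

Moving the second circle well away from the first (for example, `x0 = 5`) unlinks the pair, and the integral then evaluates to approximately zero.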


It is convenient to adopt a frame of reference fixed in
the aircraft; in this frame, the fluid velocity at infinity
is uniform, $(-U, 0, 0)$, where $U$ is the speed of flight.
In a steady two-dimensional flow, the vorticity equation
reduces when $\nu = 0$ to $D\omega/Dt = 0$, so that vorticity
is constant on streamlines. The vorticity is zero on
every streamline that comes from far upstream; thus
$\nabla\times\mathbf{u} = 0$; i.e., the flow is irrotational in the Euler
limit. The pressure in a steady irrotational flow is found
by integrating Euler’s equation, giving Bernoulli’s theorem
$p = p_0 - \tfrac{1}{2}\rho\mathbf{u}^2$, where $p_0$ is a constant reference
pressure. Thus, where $|\mathbf{u}|$ is high, $p$ is low, and vice
versa. Aerodynamic lift on the wings results from lower
air speed (and thus higher pressure) on the lower surface
of the wings, a matter of some comfort for jet-age
travelers.

The classical theory of irrotational flow past a circular
cylinder allows for a circulation $\kappa$, the flow due
to an apparent point vortex at the center of the circle.
Given an airfoil cross section with a sharp trailing
edge, the exterior region can be conformally mapped
to the exterior of a circle. This provides a solution for
the irrotational flow past the airfoil, still including the
arbitrary circulation $\kappa$. There are two stagnation points
(where $\mathbf{u} = 0$) on the airfoil surface: one on the lower
side and one on the upper side. The position of these
stagnation points varies with $\kappa$, and there is a unique
value of $\kappa = \kappa_c$ that moves the upper stagnation point
exactly to the trailing edge; for this value, the airflow
leaves the trailing edge smoothly. The Kutta-Joukowski
hypothesis asserts that the critical value $\kappa_c$ is actually
realized in practice. The lift per unit span of aircraft
wing is then evaluated as $L = \rho U\kappa_c$. In order of magnitude,
$\kappa_c \sim Ua\sin\alpha$, where $a$ is the streamwise extent
of the wing and $\alpha$ is the angle of attack (see figure 1);
so $L \sim \rho aU^2\sin\alpha$, independent of the value of $\nu$. The
lift on the whole wing is therefore of order $\rho U^2$ times the area
of the wing projected on the flight direction.

If any lower value of $\kappa$ is chosen, then air flows
round the sharp trailing edge T, at which point there
is a singularity of air velocity. The vorticity generated
by viscosity between T and the stagnation point S on
the upper side of the wing is convected downstream
into the wake, with a compensating increase in $\kappa$; this
process persists until the Kutta-Joukowski condition is
indeed satisfied (the stagnation point then suppressing
the singularity).

It remains to explain why, if the lift force on which
flight depends is independent of the very small value
of $\nu$ in air, the fact that $\nu \neq 0$ is nevertheless essential
for this lift to exist at all. This paradoxical role of weak
viscosity forces us to focus attention on the immediate
vicinity of the wing surface, i.e., the boundary layer.

7.2 The Boundary Layer 

Let $O$ be any point on the wing surface, and let $Oxy$ be a
locally Cartesian coordinate system, with $Ox$ tangential
to the boundary and $Oy$ normal to it. The tangential
velocity increases from zero on the boundary to $U(x)$
(determined from irrotational theory) in the external
stream. In this thin layer, $\nabla^2\psi \approx \partial^2\psi/\partial y^2 = \psi_{yy}$, so
the steady vorticity equation simplifies to

$$-\frac{\partial(\psi,\psi_{yy})}{\partial(x,y)} = \nu\,\psi_{yyyy}.$$

This integrates with respect to $y$ to give

$$\psi_y\psi_{xy} - \psi_x\psi_{yy} = G(x) + \nu\psi_{yyy},$$

or, in terms of the tangential and normal velocity
components $u = \psi_y$, $v = -\psi_x$,

$$uu_x + vu_y = G(x) + \nu u_{yy}.$$

Here, $G(x)$ is the “constant of integration,” which may
be identified with the “$-\partial p/\partial x$” of the NS equation; it
turns out, therefore, that the pressure is independent
of the normal coordinate $y$; in fact, since $u \sim U(x)$ for
large $y$, we must have $G(x) = U\,dU/dx$. The external
pressure is impressed on the boundary layer. If $G(x) > 0$,
the pressure gradient tends to accelerate the flow and
is described as favorable, while if $G(x) < 0$, it tends to
retard the flow and is described as adverse.

At this stage, dimensional analysis may be used to
great effect. First note that $\nu$ may be scaled out of
the above equation through defining $\hat{y} = y/\nu^{1/2}$, $\hat\psi =
\psi/\nu^{1/2}$, and note that $y = O(\nu^{1/2})$ when $\hat{y} = O(1)$. We
now ask under what circumstances a similarity solution
of the form

$$\psi = (\nu xU(x))^{1/2}f(\eta), \quad\text{with}\quad \eta = y\left(\frac{U(x)}{\nu x}\right)^{1/2},$$

is possible. It turns out that this requires that $U(x) =
Ax^m$ (so $G(x) = mA^2x^{2m-1}$), for some constants $A$, $m$,
and that $f(\eta)$ then satisfies the equation

$$f''' + \tfrac{1}{2}(m+1)ff'' + m(1 - f'^2) = 0,$$

with boundary conditions $f(0) = f'(0) = 0$, $f'(\infty) = 1$,
the last of these coming from the required matching
to the external flow. For the particular value $m = 0$,
the equation is known as the Blasius equation, and it
describes the boundary layer on a flat plate aligned
with the stream, with zero pressure gradient. More
generally, it is known as the Falkner-Skan equation.
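The Falkner-Skan boundary-value problem can be solved numerically by shooting on the unknown wall value $f''(0)$. The sketch below is one possible approach, not the method of the text: it integrates the equation with a hand-written fourth-order Runge-Kutta step and bisects on $f''(0)$, classifying a trial as too large if $f'$ rises above 1 and too small if $f''$ turns negative first. The function names, the step size, the truncation of the domain at $\eta = 10$, and the bracket $0 \le f''(0) \le 2$ are all assumptions, and the classification rule is only relied on here for $m = 0$ (Blasius) and $m = 1$ (the stagnation-point flow, for which $U = Ax$).

```python
def falkner_skan_rhs(state, m):
    """First-order form of f''' + (m+1)/2 f f'' + m (1 - f'^2) = 0."""
    f, fp, fpp = state
    return (fp, fpp, -0.5 * (m + 1.0) * f * fpp - m * (1.0 - fp * fp))


def rk4_step(state, h, m):
    """Advance (f, f', f'') by one classical Runge-Kutta step of size h."""
    k1 = falkner_skan_rhs(state, m)
    k2 = falkner_skan_rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k1)), m)
    k3 = falkner_skan_rhs(tuple(s + 0.5 * h * k for s, k in zip(state, k2)), m)
    k4 = falkner_skan_rhs(tuple(s + h * k for s, k in zip(state, k3)), m)
    return tuple(s + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def overshoots(fpp0, m, eta_max=10.0, h=0.01):
    """Return True if the trial value f''(0) = fpp0 drives f' above 1."""
    state = (0.0, 0.0, fpp0)   # f(0) = f'(0) = 0 on the wall
    for _ in range(int(round(eta_max / h))):
        state = rk4_step(state, h, m)
        if state[1] > 1.0:     # f' exceeds the free-stream value
            return True
        if state[2] < 0.0:     # f' has started to fall below 1
            return False
    return False


def wall_shear(m, lo=0.0, hi=2.0, iterations=50):
    """Bisect on f''(0) so that f' approaches 1 far from the wall."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if overshoots(mid, m):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


blasius_shear = wall_shear(0.0)      # flat plate, m = 0
stagnation_shear = wall_shear(1.0)   # stagnation-point flow, m = 1
```

The two results can be compared with the commonly quoted values $f''(0) \approx 0.332$ for the Blasius profile and $f''(0) \approx 1.233$ for stagnation-point flow.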

A well-behaved solution of the Falkner-Skan equation
is one for which $0 < f' < 1$ and $f'' > 0$ for all
positive $\eta$, i.e., one for which the tangential velocity
rises smoothly from zero on the boundary $y = 0$ to
its asymptotic value in the external stream. It is known
that such a solution exists for all positive $m$, and even
for mildly negative $m$ in the range $m > -0.09$. However,
no well-behaved solution exists for $m < -0.09$,
and the only solutions in this range of adverse pressure
gradient exhibit reversed flow ($f' < 0$) near the
wall. This suggests that a boundary layer cannot survive
against a strong adverse pressure gradient.
Figure 1 Cross section of an aircraft wing (airfoil), and streamline of the irrotational flow relative to the wing in steady flight
at angle of incidence $\alpha$. Additional circulation associated with viscous shedding of vorticity moves the stagnation point S
into coincidence with the trailing edge T, causing the flow to leave the trailing edge smoothly.


This conclusion is supported both by numerical solution
of the full PDE for $\psi$ and by observation of flow
past bluff (rather than streamlined) bodies, for which 
the boundary layer is seen to separate from the bound- 
ary, creating a substantial wake region in which the flow 
recirculates with nonzero vorticity. 

7.3 A Comment on Flow Separation 

Boundary-layer separation at high Reynolds number is 
one of the most difficult aspects of fluid mechanics, 
and this has led to theoretical investigations of great 
sophistication. Separation is responsible for the aero- 
dynamic phenomenon of stall with consequential loss 
of lift and increase of drag when an aircraft’s angle of 
attack (the inclination of the wings to the oncoming 
stream) increases beyond a critical value. Control of 
separation is therefore crucial to maintenance of stable 
flight. 

We may note, however, that separation occurs even in
low-Reynolds-number situations. Consider, for example,
an oncoming shear flow $(\alpha y, 0, 0)$ over a rigid
boundary $y = 0$ that takes a sudden Gaussian dip in
the neighborhood of $x = 0$:

$$y = -y_0e^{-(x/\delta)^2},$$

where $y_0 \gg \delta$. The Reynolds number can be taken to
be $\mathrm{Re} = \alpha\delta^2/\nu$. When $\mathrm{Re} \ll 1$, this flow separates at
some point $x = x_c = O(-\delta)$, in the sense that a streamline
detaches from the boundary at this point, with
reversed flow near the boundary immediately beyond
it. This type of low-Reynolds-number separation is
well documented. Now consider what happens as the
Reynolds number is continuously increased by reducing
$\nu$ while keeping the geometry and the upstream
velocity constant. The separation will undoubtedly persist,
although the precise location of the separation
point $x_c$ may be expected to change slightly with
increasing Re.

An entirely different consideration, namely that of 
flow instability, also arises with an increase in Re. We 
now turn to this important branch of fluid dynamics. 

8 Flow Instability 

In consideration of the instabilities to which steady
flows may be subject, one may distinguish between fast
instabilities, i.e., those that are of purely inertial origin
and have growth rates that do not depend on viscosity,
and slow instabilities, which are essentially of viscous
origin and have growth rates that therefore tend to zero
as $\nu \to 0$, or equivalently as $\mathrm{Re} = UL/\nu \to \infty$. Examples
of fast instabilities are the Rayleigh-Taylor instability
that occurs when a heavy layer of fluid lies over a
lighter layer, the centrifugal instability (leading to Taylor
vortices) that occurs in a fluid undergoing differential
rotation when the circulation about the axis of rotation
decreases with radius, and the Kelvin-Helmholtz
instability that occurs in any region of rapid shearing
of a fluid. The best-known example of a slow instability
is the instability of pressure-driven Poiseuille flow
between parallel planes, which is associated with subtle
effects of viscosity in critical layers near the boundaries.
The dynamo instability of magnetic fields in electrically
conducting fluids is usually also diffusive in origin
(through magnetic diffusivity rather than viscosity)
and may therefore also be classed as a slow instability
(although exotic examples of fast dynamo instability
have also been identified).

8.1 The Kelvin-Helmholtz Instability 

This may be idealized as the instability of a vortex
sheet located on the plane $y = 0$ in an inviscid fluid
of infinite extent. The basic velocity field is then taken
to be $\mathbf{U} = (\mp\tfrac{1}{2}U, 0, 0)$ for $y > 0$ or $y < 0$, respectively.
We suppose that the vortex sheet is slightly
deformed to $y = \eta(x,t) = \hat\eta(t)e^{ikx}$, where the real
part is understood. The associated perturbation velocity
field is irrotational, and it is given by $\mathbf{u} = \nabla\phi_{1,2}$
for $y > \eta$ and $y < \eta$, respectively. Incompressibility
implies that $\nabla^2\phi_{1,2} = 0$, and the requirement that the
disturbance decays at $y = \pm\infty$ then leads to $\phi_{1,2} =
\Phi_{1,2}(t)\exp(\mp ky + ikx)$. The sheet moves with the fluid,
according to Helmholtz; moreover, pressure is continuous
across it. These two conditions, when linearized
in the small perturbation quantities, lead to the amplitude
equation $d^2\hat\eta/dt^2 = \tfrac{1}{4}k^2U^2\hat\eta$. The vortex sheet is
therefore subject to an instability $\hat\eta \sim e^{\sigma t}$, where $\sigma$
is given by the dispersion relation $\sigma = \tfrac{1}{2}|kU|$. The physical
interpretation of this instability is that the surface
vorticity on the sheet tends to concentrate at the inflection
points where $\eta_x > 0$, $\eta_{xx} = 0$, and the induced
velocity then tends to amplify the perturbation of the
sheet.
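The step from the two interface conditions to the amplitude equation can be written out as follows. This is a sketch under stated assumptions: the basic flow is taken as $(\mp\tfrac{1}{2}U, 0, 0)$ above and below the sheet, $k > 0$, the density $\rho$ is the same on both sides, and $\hat\eta$, $\Phi_1$, $\Phi_2$ denote the time-dependent amplitudes of the sheet displacement and of the two velocity potentials.

```latex
% Kinematic condition (the sheet moves with the fluid on each side), at y = 0:
\frac{d\hat\eta}{dt} - \tfrac{1}{2}ikU\,\hat\eta = -k\,\Phi_1 ,
\qquad
\frac{d\hat\eta}{dt} + \tfrac{1}{2}ikU\,\hat\eta = +k\,\Phi_2 .

% Continuity of pressure, from the linearized unsteady Bernoulli relation
% p = -\rho\,(\partial_t \mp \tfrac{1}{2}U\,\partial_x)\,\phi_{1,2}:
\frac{d\Phi_1}{dt} - \tfrac{1}{2}ikU\,\Phi_1
  = \frac{d\Phi_2}{dt} + \tfrac{1}{2}ikU\,\Phi_2 .

% Eliminating \Phi_1 and \Phi_2:
-\Bigl(\frac{d}{dt} - \tfrac{1}{2}ikU\Bigr)^{2}\hat\eta
  = \Bigl(\frac{d}{dt} + \tfrac{1}{2}ikU\Bigr)^{2}\hat\eta
\;\Longrightarrow\;
\frac{d^{2}\hat\eta}{dt^{2}} = \tfrac{1}{4}k^{2}U^{2}\,\hat\eta ,
\qquad
\sigma = \tfrac{1}{2}\,|kU| .
```

The cross terms proportional to $d\hat\eta/dt$ cancel, which is why the growth is purely exponential rather than oscillatory.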

A notable feature of this dispersion relation is that as 
the wavelength $\lambda = 2\pi/k$ of the disturbance decreases
to zero, its growth rate $\sigma$ increases without limit.
This instability is therefore quite dramatic. As the dis- 
turbance grows, nonlinear effects inevitably become 
important. In the nonlinear regime, the vorticity con- 
tinues to concentrate at the upward-sloping inflection 
points, the distribution of vorticity on the sheet becom- 
ing cusp-shaped at a finite time, beyond which it is 
believed that spiral windup of the vortex sheet around 
each such cuspidal singularity occurs. 

The same type of Kelvin-Helmholtz instability occurs 
for any parallel flow having an inflectional velocity
profile; it is ubiquitous in nature and is one of the main
mechanisms of transition from laminar to turbulent 
flow. 

8.2 Transient Instability 

This is a very different type of instability, but it is also
a key ingredient in the process of transition to turbulence
for many flows. It is well illustrated with reference
to plane Couette flow $\mathbf{U} = (\alpha y, 0, 0)$ for $|y| < b$,
a flow that is known to be stable to conventional disturbances
of normal-mode type. Here, we consider the
central region of such a flow and neglect the influence
of the boundaries. In this region the flow is subject to
disturbances of the form

$$\mathbf{u} = \mathbf{A}(t)e^{i\mathbf{k}(t)\cdot\mathbf{x}}, \qquad \mathbf{A}(t)\cdot\mathbf{k}(t) = 0,$$

where $\mathbf{k}(t) = (k_1, k_2 - \alpha k_1t, k_3)$. This type of disturbance,
whose wave vector $\mathbf{k}(t)$ is itself sheared by
the flow, is called a Kelvin mode. Substitution into the
NS equation and elimination of the pressure leads to
an exact solution for the amplitude $\mathbf{A}(t)$. This solution
reveals that, when $k_1^2 \ll k_2^2 + k_3^2$ and viscous
effects are negligible, the component $A_1(t)$ increases
linearly (rather than exponentially) for a long time. This
instability may be attributed to the term $\mathbf{u}\cdot\nabla\mathbf{U} =
u_2(dU/dy)$ in the NS equation, which corresponds
to persistent transport in the y-direction of the x-component
of mean momentum. This is the essence
of transient, as opposed to normal-mode, instability.
Transient instability is responsible for the emergence
of streamwise vortices in shear flows. These vortices
can themselves become prone to secondary instabilities,
almost inevitably leading to transition to turbulence.
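The linear growth is easiest to see in the limiting case of a streamwise-independent disturbance. The following worked example assumes $k_1 = 0$ and neglects viscosity; $P(t)$ is an assumed notation for the amplitude of the perturbation pressure divided by density, and $\mathbf{e}_x$ for the unit vector in the x-direction.

```latex
% Linearized inviscid momentum equation about U = (\alpha y, 0, 0):
\frac{\partial\mathbf{u}}{\partial t}
  + \alpha y\,\frac{\partial\mathbf{u}}{\partial x}
  + \alpha u_2\,\mathbf{e}_x = -\nabla p .

% For a Kelvin mode the first two terms combine to (dA/dt) e^{i k(t).x},
% because d\mathbf{k}/dt = (0, -\alpha k_1, 0), so that
\frac{d\mathbf{A}}{dt} + \alpha A_2\,\mathbf{e}_x = -\,i\,\mathbf{k}\,P(t).

% With k_1 = 0 the wave vector is constant; taking the scalar product
% with k and using k . A = 0 gives |k|^2 P = 0, hence P = 0 and
\frac{dA_2}{dt} = \frac{dA_3}{dt} = 0, \qquad
\frac{dA_1}{dt} = -\alpha A_2
\;\Longrightarrow\;
A_1(t) = A_1(0) - \alpha A_2(0)\,t .
```

A steady cross-stream velocity $A_2$ therefore keeps feeding the streamwise component, which grows linearly in time until viscosity or nonlinearity intervenes.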

8.3 Centrifugal Instability 

Consider now the circular Couette flow between rotating
cylinders, for which the velocity field takes the
form $\mathbf{u} = (0, v(r), 0)$ in cylindrical polar coordinates
$(r, \theta, z)$. The radial pressure gradient required to balance
the centripetal force is $dp/dr = \rho v^2/r$. For the
moment, neglect viscous effects. Suppose that a ring
of fluid of radius $r$ expands to radius $r_1 = r + \delta r$,
its angular momentum remaining constant; its velocity
then becomes $v_1 = rv(r)/r_1$. The centripetal force
acting on the ring will now be $\rho v_1^2/r_1 = \rho(rv/r_1)^2/r_1$, and if this
is greater than the local restoring pressure gradient
$\rho v(r_1)^2/r_1$, the ring will continue to expand. The condition
for instability is then $|rv(r)| > |r_1v(r_1)|$ or, equivalently,

$$\frac{d}{dr}|rv(r)| < 0.$$

Thus, if the angular momentum (or, equivalently, the
circulation) decreases outward, then the flow is prone
to centrifugal instability (the Rayleigh criterion).

Such a flow is realized between coaxial cylinders,
when the inner cylinder (of radius $a$) rotates about its
axis with angular velocity $\Omega$ while the outer cylinder is
stationary. If the gap $d$ between the cylinders is small
compared with $a$, then, as shown by G. I. Taylor in 1923,
when due account is taken of viscous effects, the criterion
for instability to axisymmetric disturbances of the
above kind becomes, to good approximation,

$$\frac{ad^3\Omega^2}{\nu^2} > 1708.$$

The dimensionless number $\mathrm{Ta} = ad^3\Omega^2/\nu^2$ is known
as the Taylor number. The instability when $\mathrm{Ta} > 1708$
manifests itself as a sequence of axisymmetric Taylor
vortices of nearly square cross section and alternating
sign of azimuthal vorticity.

There is a close analogy between centrifugal instability
and the convective instability of a horizontal layer of
fluid of depth $d$, heated from below (Rayleigh-Bénard
convection). Here, the analogue of the Taylor number
is the Rayleigh number $\mathrm{Ra} = g\beta d^3\Delta T/(\nu\kappa)$, where $\kappa$
is the thermal diffusivity of the fluid, $\beta$ its coefficient
of thermal expansion, and $\Delta T$ the temperature difference
between the bottom of the layer and the top.
The analogy is so close that the critical Rayleigh number
for instability is also $\mathrm{Ra} = 1708$. For Ra just a
little greater than 1708, a steady pattern of convection
is established: either rolls with horizontal axes or
convective cells with hexagonal planform, the choice
being influenced by weak nonlinearity and other subtle
effects.
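The two stability thresholds can be checked for a given apparatus with a few lines of arithmetic. The sketch below uses the definitions of Ta and Ra quoted in the text and the critical value 1708; the function names and all numerical fluid properties and dimensions are illustrative assumptions (roughly water-like), not data from the text.

```python
CRITICAL_VALUE = 1708.0  # threshold quoted in the text for both Ta and Ra


def taylor_number(a, d, omega, nu):
    """Ta = a d^3 Omega^2 / nu^2 (narrow gap, outer cylinder at rest)."""
    return a * d**3 * omega**2 / nu**2


def rayleigh_number(g, beta, d, delta_T, nu, kappa):
    """Ra = g beta d^3 DeltaT / (nu kappa) for a layer heated from below."""
    return g * beta * d**3 * delta_T / (nu * kappa)


def critical_rotation_rate(a, d, nu):
    """Angular velocity of the inner cylinder at which Ta reaches 1708."""
    return (CRITICAL_VALUE * nu**2 / (a * d**3)) ** 0.5


# Assumed Taylor-Couette apparatus: inner radius 0.1 m, gap 0.01 m,
# kinematic viscosity 1e-6 m^2/s.
omega_c = critical_rotation_rate(a=0.1, d=0.01, nu=1.0e-6)

# Assumed convection layer: depth 5 mm, temperature difference 1 K,
# with nominal values of beta, nu, and kappa.
ra = rayleigh_number(g=9.81, beta=2.0e-4, d=0.005, delta_T=1.0,
                     nu=1.0e-6, kappa=1.4e-7)
convecting = ra > CRITICAL_VALUE
```

Because Ra scales with $d^3$, halving the layer depth in this example would require an eightfold larger temperature difference to reach the same threshold.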
IV.29 Magnetohydrodynamics

David W. Hughes 


1 Introduction 

Magnetohydrodynamics (MHD) is the study of the 
motion of an electrically conducting fluid or plasma 
in the presence of a magnetic field. It is employed 
principally in relation to astrophysical and geophysical
magnetic fields, but there are also several important
terrestrial and industrial applications. On the grand 
scale, magnetic fields are both ubiquitous and dynami- 
cally significant; determining a theoretical understand- 
ing of their behavior is therefore crucial to under- 
standing the dynamics of star formation, the interstel- 
lar medium, accretion disks, stellar atmospheres, and 
stellar and planetary interiors. Figure 1 shows a high- 
resolution image taken from space of the solar atmo- 
sphere, revealing the scale and complexity of the mag- 
netic field. The most ambitious terrestrial application 
of MHD lies in the quest to harness energy from nuclear 
fusion; controlling the confinement of fusion plasma in 
tokamaks through the imposition of strong magnetic 
fields poses a formidable theoretical and experimen- 
tal challenge. In a rather different industrial direction, 
MHD seeks to explain the complex turbulent motion 
that arises via the regulation, by magnetic fields, of the 
flow of liquid metals in casting and refining operations. 

The origins of MHD might be traced to Larmor’s 
short, but highly influential, 1919 paper entitled “How 
could a rotating body such as the sun become a mag- 
net?” in which it was postulated that the swirling 
motions inside stars could maintain a magnetic field. 
This was followed by the early theoretical development 
of the subject in the 1930s and 1940s by pioneers such 
as Alfvén, Cowling, Elsasser, and Hartmann. A century
earlier, in what might be regarded as a precursor to 
MHD, Faraday had investigated the role of a moving 
fluid conductor by attempting to measure the poten- 
tial difference induced by the Thames flowing in the 
Earth’s magnetic field. 

The governing equations of MHD are derived from 
combining the ideas of fluid dynamics with those of 
electromagnetism. In spirit, though, it is closer to the 
former. The “fluid” under consideration in MHD may 
be truly a fluid, such as the liquid iron in the Earth’s 
outer core, or, as in astrophysical contexts, it may be an 
ionized gas or plasma. To a very good approximation, 
collision-dominated plasmas, as found in stellar inte- 
riors, can be treated as a single fluid, and their behav- 
ior is therefore well described within MHD. However, in 
other contexts, e.g., stellar atmospheres, a more com- 
plex (multifluid) plasma description may sometimes be 
required. 

Magnetohydrodynamics extends fluid dynamics in
two significant ways: by the addition of a new equation,
the magnetic induction equation, and by the incorporation
of the magnetic force, the Lorentz force, into the
MOMENTUM (NAVIER-STOKES) EQUATION [III.23 §2]. The
induction equation describes the influence of the velocity
on the magnetic field; conversely, the Lorentz force
describes the back reaction of the magnetic field on the
velocity. Thus, the field and flow are linked in a complex,
nonlinear fashion. MHD therefore contains all of
the complexities and subtleties of (nonmagnetic) fluid
dynamics together with some unambiguously magnetic
novel phenomena.

Figure 1 An extreme ultraviolet image of the sun’s atmosphere,
taken by the Solar Dynamics Observatory. Bright
regions are locations of intense magnetic activity. The
magnetic field emerges from the solar interior, forming
sunspots on the surface and arched structures (coronal
loops) in the solar atmosphere.

2 The Magnetic Induction Equation 

Magnetohydrodynamics is a nonrelativistic theory, valid
for fluid velocities much less than the speed of light;
its starting point is, therefore, the set of “pre-Maxwell”
equations of electromagnetism, in which the displacement
current is neglected. The electric field $\mathbf{E}$, magnetic
field $\mathbf{B}$, and electric current $\mathbf{J}$ (real three-dimensional
vectors) then satisfy Ampère’s law, Faraday’s law, and
the divergence-free condition on the magnetic field. In
MKS units, these are expressed as

$$\nabla\times\mathbf{B} = \mu_0\mathbf{J}, \qquad \frac{\partial\mathbf{B}}{\partial t} = -\nabla\times\mathbf{E}, \qquad \nabla\cdot\mathbf{B} = 0,$$

where $\mu_0$ is the magnetic permeability. To obtain a
closed system of equations, one more relation (Ohm’s
law) is needed, linking $\mathbf{E}$, $\mathbf{B}$, and $\mathbf{J}$. In MHD it is customary
to adopt the simplest form of Ohm’s law, namely,
$\mathbf{J}' = \sigma\mathbf{E}'$, where the fields $\mathbf{J}'$ and $\mathbf{E}'$ are measured in
a frame of reference moving with a fluid element and
where $\sigma$ is the electrical conductivity. Relative to a fixed
reference frame, this becomes

$$\mathbf{J} = \sigma(\mathbf{E} + \mathbf{u}\times\mathbf{B}),$$

where $\mathbf{u}$ is the fluid velocity. (Additional plasma processes,
such as an electron pressure gradient or the
Hall effect, which are not included in classical MHD,
can be incorporated via a generalized version of Ohm’s
law.) Eliminating $\mathbf{E}$ and $\mathbf{J}$ and, for simplicity, assuming
that the conductivity is uniform leads to the magnetic
induction equation:

$$\frac{\partial\mathbf{B}}{\partial t} = \nabla\times(\mathbf{u}\times\mathbf{B}) + \eta\nabla^2\mathbf{B}, \qquad (1)$$

where $\eta = 1/(\sigma\mu_0)$ is the magnetic diffusivity. In MHD
primacy is afforded to the magnetic field; the electric
field and current are secondary, but, if needed, they can
be evaluated from a knowledge of $\mathbf{B}$. Similarly, Gauss’s
law relating the electric field to the electric charge plays
no explicit role in the above arguments; it simply provides
a means of determining the charge once $\mathbf{E}$ has
been determined.

The induction equation describes the evolution of
the magnetic field subject to advection and diffusion. A
measure of the relative magnitude of these two terms is
given by the magnetic Reynolds number $\mathrm{Rm} = UL/\eta$,
where $U$ and $L$ are representative velocity and length
scales. In astrophysical bodies, characterized by vast
length scales and high conductivities, Rm is invariably
large and often extremely so (in the solar convection
zone, for example, $\mathrm{Rm} \sim 10^9$ at depth and $\mathrm{Rm} \sim 10^6$
close to the surface; in the interstellar medium, $\mathrm{Rm} \approx
10^{17}$). On the other hand, for most industrial flows of
liquid metals, Rm is small ($< 10^{-2}$).

On expanding the curl of the vector product, and
using $\nabla\cdot\mathbf{B} = 0$, equation (1) can alternatively be written
as

$$\frac{D\mathbf{B}}{Dt} = \mathbf{B}\cdot\nabla\mathbf{u} - (\nabla\cdot\mathbf{u})\mathbf{B} + \eta\nabla^2\mathbf{B},$$

where $D/Dt = \partial/\partial t + \mathbf{u}\cdot\nabla$ denotes the Lagrangian
or material derivative “following the fluid.” The
three terms on the right-hand side denote, respectively,
contributions to magnetic field evolution due to field
line stretching, the compressibility of the fluid, and
magnetic diffusion.
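For reference, the expansion used in passing from the induction equation (1) to its material-derivative form is the standard vector identity for the curl of a cross product; the only input beyond that identity is the solenoidal condition on the magnetic field.

```latex
\nabla\times(\mathbf{u}\times\mathbf{B})
  = (\mathbf{B}\cdot\nabla)\mathbf{u}
  - (\mathbf{u}\cdot\nabla)\mathbf{B}
  + \mathbf{u}\,(\nabla\cdot\mathbf{B})
  - \mathbf{B}\,(\nabla\cdot\mathbf{u}) .

% With \nabla\cdot\mathbf{B} = 0, and moving the advection term to the left:
\frac{\partial\mathbf{B}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{B}
  = (\mathbf{B}\cdot\nabla)\mathbf{u}
  - (\nabla\cdot\mathbf{u})\,\mathbf{B}
  + \eta\,\nabla^{2}\mathbf{B} .
```

For an incompressible flow the middle term on the right vanishes, leaving only field line stretching and diffusion.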

It is often helpful to decompose the (solenoidal)
magnetic field into poloidal and toroidal components:

$$\mathbf{B} = \mathbf{B}_P + \mathbf{B}_T = \nabla\times\nabla\times(P\mathbf{r}) + \nabla\times(T\mathbf{r}),$$

where $P$ and $T$ are scalar functions of position and $\mathbf{r}$
is the position vector. For axisymmetric fields, $\mathbf{B}_T$ is
azimuthal and $\mathbf{B}_P$ meridional.

3 Perfectly Conducting Fluids 

Motivated by the fact that $\mathrm{Rm} \gg 1$ in astrophysics,
it is instructive to consider the idealized dynamics of
a perfectly conducting fluid (often referred to as ideal
MHD), for which Rm is formally infinite and, from (1),
$\mathbf{B}$ is described simply by

$$\frac{\partial\mathbf{B}}{\partial t} = \nabla\times(\mathbf{u}\times\mathbf{B}) \qquad (2)$$

(though of course one always needs to exercise caution
when dropping the term with the highest derivative).
Here, we consider two important properties of the
magnetic field in a perfectly conducting fluid.

3.1 Flux Freezing 

Equation (2) for the magnetic field B is formally iden- 
tical to the vorticity equation of an inviscid fluid and, 
as such, we may seek analogs of the Helmholtz vortex
theorems [IV.28 §6.1]. The proofs go through unaltered,
since they do not rely on the additional constraint
that the vorticity is the curl of the velocity. Thus,
for any material surface $S_m$ moving with the fluid it can
be shown that

$$\frac{D}{Dt} \int_{S_m} \mathbf{B} \cdot \mathbf{n} \, dS = 0,$$

where n is the normal to the surface. In other words, 
the flux through a surface moving with the fluid is 
conserved. It is then straightforward to show that two 
fluid elements initially on the same magnetic field line 
will remain on that line for all subsequent times. This 
is Alfvén’s famous result that the magnetic field lines
move with the fluid; this is often referred to as the field
lines being frozen into the fluid. One of the earliest
results of MHD, Ferraro’s 1937 law of isorotation, is a
particular case of flux freezing. It states that if the angular
velocity $\Omega$ of a flow is constant on lines of poloidal
magnetic field (i.e., $\mathbf{B}_P \cdot \nabla\Omega = 0$), then there is no tendency
to generate toroidal field through the pulling
out of poloidal field. Conversely, if $\mathbf{B}_P \cdot \nabla\Omega \neq 0$, then
toroidal field is “wound up” from a poloidal component 
in a differentially rotating flow, an important consid- 
eration given that most astrophysical bodies support 
large-scale shearing motions. 

A related result comes from combining the induction
equation with the equation for the conservation
of mass to give

$$\frac{D}{Dt}\left(\frac{\mathbf{B}}{\rho}\right) = \frac{\mathbf{B}}{\rho} \cdot \nabla \mathbf{u}.$$

Thus $\mathbf{B}/\rho$ satisfies the same equation as a material line
element and, consequently, will increase or decrease
in magnitude in direct proportion to the stretching or
compression of such an element. One important consequence
is the amplification of the magnetic field during
the gravitational collapse of astronomical bodies.
When this is particularly dramatic, as in the formation
of a neutron star, in which the radius can decrease by
a factor of $O(10^{5})$, it leads to an immense ($O(10^{10})$)
increase in the magnetic field strength. 
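
The quoted scaling can be checked with a back-of-the-envelope sketch. It assumes an isotropic collapse that conserves both mass and the magnetic flux through a material surface, so that the field strength scales as the inverse square of the radius; the function name and the isotropy assumption are illustrative and not part of the original argument.

```python
def field_amplification(radius_ratio: float) -> float:
    """Return B_final / B_initial for an isotropic, flux-conserving collapse.

    Flux freezing keeps B * R**2 constant for a material surface of
    radius R, so shrinking R by radius_ratio (= R_initial / R_final)
    raises B by radius_ratio**2. Isotropy is an assumption of this sketch.
    """
    if radius_ratio <= 0:
        raise ValueError("radius_ratio must be positive")
    return radius_ratio ** 2


if __name__ == "__main__":
    # A decrease in radius by a factor of order 1e5, as quoted in the text.
    print(f"amplification ~ {field_amplification(1e5):.1e}")
```

A radius decrease of order $10^{5}$ thus gives an amplification of order $10^{10}$, in line with the estimate in the text.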


3.2 Magnetic Helicity 

Since the magnetic field B is solenoidal, it can be
expressed as $\mathbf{B} = \nabla \times \mathbf{A}$, where A is the magnetic vector
potential. Then, by analogy with the helicity of
a fluid flow [IV.28 §6.2], the magnetic helicity of a
volume V bounded by a magnetic surface (i.e., one on
which $\mathbf{B} \cdot \mathbf{n} = 0$) is defined as

$$\mathcal{H} = \int_V \mathbf{A} \cdot \mathbf{B} \, dV;$$

$\mathcal{H}$ is invariant under gauge transformations $\mathbf{A} \to \mathbf{A} + \nabla\phi$.
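
The gauge invariance of the magnetic helicity $\mathcal{H}$ can be seen in one line. The sketch below assumes that the gauge function $\phi$ is single valued on $V$ (as it is for a simply connected volume) and uses only $\nabla \cdot \mathbf{B} = 0$ and the condition $\mathbf{B} \cdot \mathbf{n} = 0$ on the bounding surface $\partial V$:

```latex
\begin{aligned}
\Delta\mathcal{H}
  = \int_V \nabla\phi \cdot \mathbf{B} \, dV
  &= \int_V \nabla \cdot (\phi \mathbf{B}) \, dV
     - \int_V \phi \, \nabla \cdot \mathbf{B} \, dV \\
  &= \oint_{\partial V} \phi \, \mathbf{B} \cdot \mathbf{n} \, dS - 0
   = 0 .
\end{aligned}
```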

The magnetic helicity provides a measure of the link- 
age of magnetic flux tubes. In a perfectly conducting 
fluid, in which field lines are frozen to the fluid, we 
would therefore expect this topological property to be 
preserved for all times. This can indeed be proved 
straightforwardly from a combination of the induc- 
tion equation and the equation for the conservation 
of mass. With finite diffusivity, $\mathcal{H}$ is no longer conserved.
It has, though, been conjectured that $\mathcal{H}$ is,
nonetheless, “approximately conserved” in high-Rm
fluids, and that in decaying MHD turbulence, magnetic 
energy will decay, whereas magnetic helicity will, to a 
first approximation, remain constant. 


4 Kinematic Considerations 

The interaction between the two processes of advec- 
tion and diffusion, encapsulated by the induction equa- 
tion, is brought out in the following simple important 
examples, in which we shall suppose that the flow is 
prescribed. 

4.1 Stagnation Point Flow 

The evolution of a unidirectional magnetic field,
$\mathbf{B} = (0, B(x, t), 0)$, say, in the two-dimensional stagnation
point flow $\mathbf{u} = \lambda(-x, y, 0)$ is governed by the equation

$$\frac{\partial B}{\partial t} - \lambda x \frac{\partial B}{\partial x} = \lambda B + \eta \frac{\partial^2 B}{\partial x^2}.$$


Magnetic field is brought in toward x = 0 by the 
flow, but in a region of strong field gradients, it is 
then subject to diffusion. A steady state is attained in 
which these processes balance. Two solutions are of 
particular interest, providing prototypical examples of 
fundamental astrophysical processes. 

For a magnetic field of one sign,

$$B(x) = B_{\max} \exp(-\lambda x^2/2\eta),$$

affording the simplest model of the concentration of
field into a flux rope. If the total magnetic flux is
finite, and equal to $B_0 L$, say, then it accumulates into a
rope of width $O(Rm^{-1/2} L)$ and maximum field strength
$O(Rm^{1/2} B_0)$.

For a field that changes sign, vanishing at $x = 0$,

$$B(x) = \frac{E}{\sqrt{\eta\lambda}} \exp(-x^2\lambda/2\eta) \int_0^{x\sqrt{\lambda/\eta}} \exp(s^2/2) \, ds,$$

where E is the (constant) electric field in the z-direction. 
This provides a simple description of a current sheet 
formed by the annihilation of magnetic fields of oppo- 
site sign. It is an important ingredient in the more com- 
plex process of magnetic reconnection, in which mag- 
netic energy is released rapidly through a change in the 
topology of the field. 
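
As a quick numerical check of the flux-rope solution, the sketch below evaluates the residual of the steady balance between advection, stretching, and diffusion for the Gaussian profile, and recovers the $Rm^{1/2}$ scaling of the peak field. It assumes the stagnation point flow has strain rate `lam`, the diffusivity is `eta`, and that Rm is defined as `lam * L**2 / eta`; these names and that definition of Rm are assumptions made for illustration.

```python
import math


def flux_rope(x, lam, eta, b_max=1.0):
    """Gaussian steady state B(x) = b_max * exp(-lam * x**2 / (2 * eta))."""
    return b_max * math.exp(-lam * x * x / (2.0 * eta))


def steady_residual(x, lam, eta, h=1e-4):
    """Residual of lam*x*B' + lam*B + eta*B'' = 0, using central differences."""
    b0 = flux_rope(x, lam, eta)
    bp = flux_rope(x + h, lam, eta)
    bm = flux_rope(x - h, lam, eta)
    first = (bp - bm) / (2.0 * h)
    second = (bp - 2.0 * b0 + bm) / (h * h)
    return lam * x * first + lam * b0 + eta * second


def peak_field(b0, length, lam, eta):
    """Peak field of a rope carrying total flux b0 * length.

    The Gaussian integrates to b_max * sqrt(2 * pi * eta / lam), so
    b_max = b0 * sqrt(Rm / (2 * pi)) with Rm = lam * length**2 / eta.
    """
    return b0 * length * math.sqrt(lam / (2.0 * math.pi * eta))


if __name__ == "__main__":
    print(steady_residual(0.05, lam=1.0, eta=0.01))  # close to zero
    print(peak_field(1.0, 1.0, lam=100.0, eta=1.0))  # about sqrt(100 / (2*pi))
```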


4.2 Flux Expulsion 


In a swirling flow, a magnetic field is distorted and 
amplified; its transverse scale of variation decreases, 
leading to diffusion becoming important and the field 
being annihilated, or expelled from the flow. An ex- 
tremely complicated version of this process takes 
place, for example, in the turbulent convection cells 
observed at the solar surface. The underlying physics, 
and in particular the timescale of expulsion, can, 
though, be understood via the idealized example of a 
single laminar eddy, with a field in the plane of the flow. 
Consider the flow $\mathbf{u} = r\Omega(r)\mathbf{e}_\theta$ in plane polar coordinates,
with a field initially uniform, of strength $B_0$, and
parallel to the direction $\theta = 0$. Expressing the magnetic
field as $\mathbf{B} = \nabla \times (A\mathbf{e}_z)$ and using the Fourier decomposition
$A = B_0 \operatorname{Im}(f(r, t)e^{i\theta})$ gives the following equation
for $f$:

$$\frac{\partial f}{\partial t} + i\Omega(r) f = \eta \left(\frac{\partial^2}{\partial r^2} + \frac{1}{r}\frac{\partial}{\partial r} - \frac{1}{r^2}\right) f,$$

with $f(r, 0) = r$. This has the asymptotic solution

$$f \sim r \exp(-i\Omega t - \eta\Omega'^2 t^3/3);$$

significantly, flux expulsion occurs on an $O(Rm^{1/3})$
timescale, as first proposed by Weiss in 1966, much 
faster than the O(Rm) diffusive (or Ohmic) timescale. 
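
The one-third power can be read off from the decay factor in the asymptotic solution, in which the diffusivity multiplies the cube of time. The sketch below takes the expulsion time to be the time at which that exponent reaches unity; this e-folding criterion and the function name are assumptions for illustration.

```python
def expulsion_time(eta: float, omega_prime: float) -> float:
    """Time at which eta * omega_prime**2 * t**3 / 3 equals one."""
    if eta <= 0 or omega_prime == 0:
        raise ValueError("need eta > 0 and a nonzero shear omega_prime")
    return (3.0 / (eta * omega_prime ** 2)) ** (1.0 / 3.0)


if __name__ == "__main__":
    # Reducing eta by a factor of 1000 (raising Rm a thousandfold)
    # lengthens the expulsion time only tenfold.
    ratio = expulsion_time(1e-6, 1.0) / expulsion_time(1e-3, 1.0)
    print(ratio)
```

By contrast, the purely diffusive timescale would grow by the full factor of 1000 under the same change.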

5 The Lorentz Force 

The induction equation (1) describes the evolution of 
the magnetic field B under the influence of a fluid veloc- 
ity u. However, except when very weak, the magnetic 
field is not passive, but itself exerts a force on the veloc- 
ity, to be incorporated into the Navier-Stokes equation. 
This is the Lorentz force, defined by 

$$\mathbf{J} \times \mathbf{B} = \frac{1}{\mu_0} (\nabla \times \mathbf{B}) \times \mathbf{B}, \tag{3}$$

after substituting for the current from Ampère’s law.
(Under the assumptions of MHD, the electrostatic force 
is negligible.) MHD is thus described by a coupled non- 
linear system of partial differential equations. Note 
that the Lorentz force has no influence on the motions 
along magnetic field lines; these must result from other 
forces such as gravity or pressure gradients. 

5.1 Magnetic Pressure and Tension 

The Lorentz force (3) may be decomposed as

$$\mathbf{J} \times \mathbf{B} = -\nabla\left(\frac{B^2}{2\mu_0}\right) + \frac{1}{\mu_0} \mathbf{B} \cdot \nabla \mathbf{B}, \tag{4}$$

where $B$ denotes the magnitude of the magnetic field $\mathbf{B}$.
The first term represents the gradient of magnetic pressure,
$p_m = B^2/2\mu_0$. The total pressure is then the sum
of the gas and magnetic pressures; the ratio of these,
$p/p_m$, is known as the plasma-$\beta$. In gases, the ramifications
of the magnetic pressure can be extremely important.
An isolated tube of magnetic flux, in total pressure
balance with a nonmagnetic atmosphere, will necessar- 
ily have a reduced gas pressure. If the tube is also at the 
same temperature as its surroundings, then, from the 
perfect gas law, it will have a lower density. It will there- 
fore rise, a phenomenon known as magnetic buoyancy. 
As first proposed by Parker, this mechanism is believed 
to be instrumental in bringing up magnetic flux tubes 
from the interior of the sun to the solar surface and on 
into the solar atmosphere (see figure 1). 
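
The magnetic buoyancy argument can be made concrete with a short sketch in SI units. It assumes a perfect gas, an isothermal tube in total pressure balance with a field-free exterior, and illustrative sample values; the function names are hypothetical.

```python
import math

MU_0 = 4.0e-7 * math.pi  # vacuum permeability in SI units (H/m)


def magnetic_pressure(b: float) -> float:
    """Magnetic pressure B**2 / (2 * mu_0) in pascals, for B in tesla."""
    return b * b / (2.0 * MU_0)


def plasma_beta(gas_pressure: float, b: float) -> float:
    """Ratio of gas pressure to magnetic pressure."""
    return gas_pressure / magnetic_pressure(b)


def tube_density_ratio(external_pressure: float, b: float) -> float:
    """Density inside the tube divided by density outside.

    Total pressure balance gives p_in + p_m = p_ext; at equal temperature
    the perfect gas law then gives rho_in / rho_ext = p_in / p_ext.
    """
    p_inside = external_pressure - magnetic_pressure(b)
    if p_inside <= 0:
        raise ValueError("field too strong for pressure balance")
    return p_inside / external_pressure


if __name__ == "__main__":
    # Sample values chosen only for illustration.
    print(magnetic_pressure(0.1))
    print(tube_density_ratio(1.0e5, 0.1))
```

A density ratio below one means the tube is lighter than its surroundings and therefore rises.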

Writing $\mathbf{B} = B\hat{\mathbf{s}}$, in terms of the unit vector $\hat{\mathbf{s}}$ along
the field, the second term in (4) can be decomposed as

$$\frac{B}{\mu_0} \frac{d}{ds}(B\hat{\mathbf{s}}) = \frac{B}{\mu_0} \frac{dB}{ds} \hat{\mathbf{s}} + \frac{B^2}{\mu_0} \frac{d\hat{\mathbf{s}}}{ds} = \frac{d}{ds}\left(\frac{B^2}{2\mu_0}\right) \hat{\mathbf{s}} + \frac{B^2}{\mu_0} \frac{\mathbf{n}}{R_c},$$

where n is the principal normal to the magnetic field
line and $R_c$ is its radius of curvature. The first term
exactly opposes the component of the magnetic pres- 
sure along field lines. The second represents the mag- 
netic tension, showing that field lines, when bent, pos- 
sess an elastic restoring force. 

5.2 Force-Free Fields 

In low-$\beta$ plasmas, such as that of the solar corona, the
magnetic field is dominant. Static or, more accurately,
quasistatic magnetic structures can then be modeled as
possessing magnetic fields that are force free, i.e., the
Lorentz force vanishes. From (3) this implies that B and
J are parallel and hence that

$$\nabla \times \mathbf{B} = \lambda \mathbf{B}, \tag{5}$$

where $\lambda$ is a scalar, which may depend on position. Vector
fields satisfying (5) are known as Beltrami fields. It
follows that $\mathbf{B} \cdot \nabla\lambda = 0$, namely, that $\lambda$ cannot vary along
an individual field line. The simplest case is when $\lambda$ is a
constant; in a cylindrically symmetric system, for example,
this leads to a magnetic field with the following
components, expressed as Bessel functions:

$$B_r = 0, \quad B_\theta = B_0 J_1(\lambda r), \quad B_z = B_0 J_0(\lambda r).$$


5.3 Hartmann Flow


The influence of the magnetic tension is illustrated by
the flow of an electrically conducting fluid along a channel
with a transverse magnetic field. Flows such as this,
which were first studied theoretically and experimentally
by Hartmann in the 1930s, have received considerable
attention owing to their importance in liquid-metal
MHD. Suppose that an incompressible viscous fluid, of
density $\rho$ and kinematic viscosity $\nu$, is driven by a uniform
pressure gradient in the x-direction between two
parallel planes at $y = \pm d$ with an imposed uniform
magnetic field $B_0\hat{\mathbf{y}}$. The flow $u(y)\hat{\mathbf{x}}$ vanishes on the
bounding planes and is fastest in the center of the channel;
it thus pulls out the imposed field to generate a
component of field $b(y)\hat{\mathbf{x}}$. The field lines then possess
a magnetic tension, which acts to resist the motion. A
steady state is possible in which the stretching of the
field lines and their “slippage” due to diffusion are in
balance. The nondimensional steady-state momentum
and induction equations can be written concisely as

$$Ha \frac{db}{dy} + \frac{d^2 u}{dy^2} = -1, \qquad Ha \frac{du}{dy} + \frac{d^2 b}{dy^2} = 0,$$

where the Hartmann number $Ha = d B_0 \sqrt{\sigma/\rho\nu}$. These
are to be solved subject to the no-slip boundary
condition ($u = 0$ at $y = \pm 1$) and a magnetic boundary
condition

$$\pm\frac{db}{dy} + \frac{1}{c} b = 0 \quad \text{at } y = \pm 1,$$

where $c = 0$ corresponds to electrically insulating walls
and $c \to \infty$ to perfectly conducting boundaries. The
resulting velocity is given by

$$u = \frac{1}{Ha} \left(\frac{1 + c}{c\,Ha + \tanh Ha}\right) \left(1 - \frac{\cosh(Ha\, y)}{\cosh Ha}\right).$$

For small Ha, the classical parabolic hydrodynamic
profile is recovered. For $Ha \gg 1$, the velocity is strongly
reduced; furthermore, the profile becomes flat across
the bulk of the channel, with exponential boundary
layers of width $O(Ha^{-1})$.
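
The limiting behavior of the Hartmann velocity profile can be explored with a short sketch. It evaluates the closed-form nondimensional velocity for a channel occupying $|y| \le 1$, with Hartmann number `ha` and finite wall parameter `c` (with `c = 0` for insulating walls); the function name is hypothetical and the perfectly conducting limit is not handled.

```python
import math


def hartmann_velocity(y: float, ha: float, c: float = 0.0) -> float:
    """Nondimensional Hartmann velocity u(y) for |y| <= 1, ha > 0, finite c >= 0."""
    if ha <= 0 or c < 0:
        raise ValueError("need ha > 0 and c >= 0")
    prefactor = (1.0 + c) / (ha * (c * ha + math.tanh(ha)))
    return prefactor * (1.0 - math.cosh(ha * y) / math.cosh(ha))


if __name__ == "__main__":
    # Small Ha: close to the parabolic profile (1 - y**2) / 2.
    print(hartmann_velocity(0.0, 0.01), hartmann_velocity(0.5, 0.01))
    # Large Ha: reduced amplitude and a flat core.
    print(hartmann_velocity(0.0, 50.0), hartmann_velocity(0.5, 50.0))
```

For small `ha` the centerline value approaches one half, as for the parabolic profile, whereas for large `ha` the core values at `y = 0` and `y = 0.5` become indistinguishable.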


6 Magnetohydrodynamic Waves 

The interplay in MHD between the velocity and the mag- 
netic field is brought out most vividly through the sup- 
port of a variety of wave motions. Such waves, with 
an extended range of spatial and temporal scales, are 
revealed in spectacular movies of the magnetic field in 
the solar atmosphere. 

Most striking is the occurrence of what are known
as Alfvén waves. These can be analyzed in the simplest
system of ideal incompressible MHD. Consider
linear plane-wave perturbations, varying as
$\exp(i(\mathbf{k} \cdot \mathbf{x} - \omega t))$, to a homogeneous equilibrium state with a
uniform magnetic field $\mathbf{B}_0$. Combining the momentum
and induction equations yields the dispersion relation

$$\omega^2 = k_\parallel^2 V_A^2,$$

where the Alfvén velocity $V_A = B_0/\sqrt{\mu_0\rho}$, and $k_\parallel$ is the
component of the wave vector parallel to the imposed
field. Alfvén waves, which are transverse, are driven
solely by the magnetic tension and are therefore, in 
some sense, analogous to waves on a stretched string. 

Compressible atmospheres support two additional
modes, known as magnetoacoustic waves, with dispersion
relation

$$\omega^4 - \omega^2 k^2 (V_S^2 + V_A^2) + k^2 k_\parallel^2 V_S^2 V_A^2 = 0,$$

where the speed of sound $V_S = \sqrt{\gamma p/\rho}$. The two possible
solutions for the modulus of the phase speed $|\omega/k|$
are designated as fast and slow modes. The properties
of the phase and group velocities for the various MHD
waves are demonstrated most clearly by a polar diagram,
as shown in figure 2. The most significant differences
are revealed by the group velocities: whereas for
the fast mode, energy is propagated in all directions,
for the slow mode, energy propagates in only a small
range of angles around the direction of the imposed
field and for the Alfvén waves, energy is conveyed, at
speed $V_A$, only in the direction of $\mathbf{B}_0$.

Figure 2 (a) Phase velocities and (b) group velocities for
the various MHD waves with an imposed magnetic field
as shown. Solid lines denote the fast and slow waves; the
Alfvén waves are denoted by the dashed lines in (a) and the
solid circles in (b). The ratio of Alfvén to sound speeds (or
the inverse ratio; the picture is unaltered) is 0.8.
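
The phase speeds of the three wave types can be tabulated directly from the two dispersion relations. The sketch below treats the magnetoacoustic relation as a quadratic in the squared phase speed, with the parallel wave number written as the total wave number times the cosine of the angle `theta` between the wave vector and the imposed field; the function name is hypothetical.

```python
import math


def mhd_phase_speeds(theta: float, v_alfven: float, v_sound: float):
    """Return (slow, alfven, fast) phase speeds |omega / k| at angle theta."""
    cos2 = math.cos(theta) ** 2
    total = v_alfven ** 2 + v_sound ** 2
    # Discriminant of the quadratic in (omega / k)**2; clamp rounding noise.
    disc_sq = total ** 2 - 4.0 * v_alfven ** 2 * v_sound ** 2 * cos2
    disc = math.sqrt(max(disc_sq, 0.0))
    slow = math.sqrt(max(0.5 * (total - disc), 0.0))
    fast = math.sqrt(0.5 * (total + disc))
    alfven = v_alfven * abs(math.cos(theta))
    return slow, alfven, fast


if __name__ == "__main__":
    # Ratio of Alfven to sound speed of 0.8, as in figure 2.
    for degrees in (0, 30, 60, 90):
        print(degrees, mhd_phase_speeds(math.radians(degrees), 0.8, 1.0))
```

Along the field the slow and fast speeds reduce to the smaller and larger of the Alfvén and sound speeds; perpendicular to the field only the fast mode propagates.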

7 Dynamos 

Dynamo theory — the study of the maintenance of mag- 
netic fields — constitutes one of the most mathemati- 
cally intriguing and physically significant areas of MHD. 
Put starkly, there are just two possibilities for the 
occurrence of any magnetic field observed in an astro- 
physical body: either it is a slowly decaying “fossil 
field,” trapped in the body at its formation, or, alterna- 
tively, it is maintained by the inductive motions of the 
plasma within the body. The latter process is known 
as a magnetohydrodynamic dynamo. For the Earth 
the issue is clear-cut. Paleomagnetism reveals that the
Earth’s field has persisted for $O(10^{9})$ years, whereas
its decay time ($L^2/\eta$, where $L$ is the length scale of the
conducting outer core and $\eta$ its magnetic diffusivity) is
several orders of magnitude shorter: $O(10^{5})$ years. The
geomagnetic field cannot therefore be simply a decay- 
ing fossil field; it must be maintained by some sort of 
dynamo resulting from the flows of the liquid metal in 
the outer core. The case for dynamo action in the sun 
(and, by extension, similar stars) is rather different. The 
Ohmic decay time for the solar field is long, $O(10^{9})$
years, comparable with the lifetime of the sun itself, 
and so on these grounds alone one cannot rule out a 
primordial field explanation. However, the solar mag- 
netic field exhibits variations on very short timescales, 
with the entire field reversing every eleven years or so, 
and it is extremely difficult to reconcile this behavior 
with that of a slowly decaying relic field. 


A full description of any natural MHD dynamo can 
be obtained only through a solution of all the govern- 
ing equations. That said, many of the most important 
aspects of dynamo theory, and many of its subtleties, 
can be captured via consideration of the induction 
equation alone. 

7.1 The Kinematic Dynamo Problem 

The kinematic dynamo problem asks whether it is pos- 
sible to find a velocity u(x, t) such that the magnetic 
field B, governed solely by the induction equation, does 
not decay. More formally, a flow u is said to act as a 
kinematic dynamo if the magnetic energy,

$$M(t) = \frac{1}{2\mu_0} \int_{\text{all space}} \mathbf{B}^2 \, dV,$$

does not tend to zero as $t \to \infty$. For the simplest case of
steady flows u(x), the magnetic field varies as $e^{pt}$; the
induction equation reduces to an eigenvalue problem 
for p, with dynamo action if there exists an eigenvalue 
with Re(p) > 0. 

Although easily stated, the kinematic dynamo prob- 
lem is rather more demanding to solve. Indeed, whereas 
it is possible to prove rigorously a number of anti- 
dynamo theorems, proving the positive result is less 
straightforward. 

7.2 Antidynamo Theorems 

Antidynamo theorems address either the magnetic 
field or the velocity. Those that address the former, 
which are geometric in nature, reveal how certain types 
of field cannot be generated by dynamo action. Those 
that address the latter, which are concerned with either 
the flow geometry or, alternatively, some property of 
the flow, such as its amplitude or stretching capac- 
ity, demonstrate that specific classes of velocity cannot 
succeed as dynamos. 

The most important result concerning the magnetic 
field is Cowling's theorem of 1933, which states that 
a steady axisymmetric field cannot be maintained by 
dynamo action. Cowling argued that such a magnetic 
field must have a closed contour on which the poloidal 
field $\mathbf{B}_P$ vanishes and around which the $\mathbf{B}_P$-lines are
closed (an “O-type” neutral point); consideration of 
Ohm’s law then shows that the induction effect cannot 
overcome diffusion in the neighborhood of the neutral 
point. It is important, however, to point out that non- 
axisymmetric magnetic fields can be generated from 
axisymmetric flows. 
In terms of the geometrical properties of the velocity,
the most revealing antidynamo theorem is due to Zeldovich,
who proved that a planar two-dimensional flow,
$\mathbf{u} = (u(x, y, z, t), v(x, y, z, t), 0)$, say, is incapable of
dynamo action. The proof proceeds in two identical
steps. First, the z-component of the induction equation,

$$\frac{DB_z}{Dt} = \eta\nabla^2 B_z, \tag{6}$$

reveals that there is no source term for $B_z$; it therefore
follows that $B_z \to 0$. Hence, only the field in the xy-
plane need be considered. On writing $\mathbf{B} = \nabla \times (A\hat{\mathbf{z}})$, A
then also satisfies equation (6), from which it follows 
that B tends to zero. Thus, stretching of the field is not 
in itself sufficient for dynamo action; lifting the field 
out of the plane and folding it, constructively, is an 
essential factor. 

Bounds on flow properties necessary for dynamo 
action may be obtained through consideration of the 
evolution of the magnetic energy. Suppose that an 
incompressible conducting fluid is contained in a vol- 
ume V, external to which is an insulator. The scalar 
product of equation (1) with $\mathbf{B}/\mu_0$, after using the
divergence theorem and the boundary conditions, gives

$$\frac{d}{dt} \int_V \frac{\mathbf{B}^2}{2\mu_0} \, dV = \int_V \mathbf{J} \cdot (\mathbf{u} \times \mathbf{B}) \, dV - \frac{1}{\sigma} \int_V \mathbf{J}^2 \, dV.$$

The second term on the right-hand side can be bounded
in terms of the magnetic energy using calculus of variations.
The first can be bounded in terms of either the
maximum velocity or the maximum of the rate-of-strain
tensor. If V is a sphere, then the former method reveals
that dynamo action requires that $Rm > \pi$, where Rm
is based on the maximum fluid velocity; the latter gives
the bound $Rm > \pi^2$, where the magnetic Reynolds
number Rm is now defined (unconventionally) using 
the maximum of the rate-of-strain tensor. 

Putting everything together, antidynamo theorems 
reveal the essence of at least some of what is needed for 
a successful dynamo; complexity (in the field and flow) 
and sufficiently high Rm are vital ingredients. Conse- 
quently, most demonstrations of dynamo action are the 
result of numerical (i.e., computational) solution of the 
induction equation for prescribed velocities. 

7.3 Mean-Field MHD 

One of the great advances in dynamo theory, pioneered 
by Steenbeck, Krause, and Rädler in the 1960s, was to
seek a solution not for the magnetic field itself but for
its mean value $\langle\mathbf{B}\rangle$, where the averaging operator $\langle\cdot\rangle$
obeys the Reynolds rules; for example, this could be an 


ensemble average or, more appropriately for isolated 
bodies such as the Earth, an azimuthal average. 

Averaging equation (1), and supposing for the moment
that there is no mean flow, gives

$$\frac{\partial \langle\mathbf{B}\rangle}{\partial t} = \nabla \times \langle \mathbf{u} \times \mathbf{b} \rangle + \eta \nabla^2 \langle\mathbf{B}\rangle,$$

where b is the fluctuating magnetic field. The term
$\boldsymbol{\mathcal{E}} = \langle \mathbf{u} \times \mathbf{b} \rangle$, which lies at the very heart of the theory
and represents the essential difference between the
averaged and unaveraged induction equations, denotes
a mean electromotive force resulting from correlations
between the fluctuating velocity and the fluctuating
field. The system is closed through an argument relating
b, and hence $\boldsymbol{\mathcal{E}}$, linearly to $\langle\mathbf{B}\rangle$, and then postulating
an expansion

$$\mathcal{E}_i = \alpha_{ij} \langle B \rangle_j + \beta_{ijk} \frac{\partial \langle B \rangle_j}{\partial x_k} + \cdots. \tag{7}$$

In a kinematic theory, the tensors $\alpha_{ij}$ and $\beta_{ijk}$ depend
only on the statistical properties of the flow and on
Rm. The first term (the “$\alpha$-effect” of mean-field MHD)
provides a possible source of dynamo action; the second
generally represents an additional, turbulent diffusion.
From (7) it is immediately possible to make a very
strong statement. Since $\boldsymbol{\mathcal{E}}$ is a polar vector, whereas
$\langle\mathbf{B}\rangle$ is axial, $\alpha_{ij}$ must be a pseudotensor, i.e., a tensor
that changes sign under a transformation from
right-handed to left-handed coordinates. It therefore
vanishes if the flow, on average, is reflectionally symmetric.
Thus, for mean-field dynamo action through
the $\alpha$-effect, the flow must possess “handedness,” the
simplest measure of this being the flow helicity
$\langle \mathbf{u} \cdot \nabla \times \mathbf{u} \rangle$. Astrophysical flows, influenced by rotation, will
typically be helical. 

A mean-field dynamo may be thought of as a means 
of maintaining the “dynamo loop” between poloidal 
and toroidal fields. Large-scale differential rotation is 
a generic feature of astrophysical bodies and provides 
a natural means of generating toroidal magnetic field 
by stretching the poloidal component (often referred to 
as the $\Omega$-effect). Closing the loop by the regeneration of
poloidal field from toroidal, which is the less physically
transparent step, is then accomplished by the $\alpha$-effect.
The combination of these processes is known as an
$\alpha\Omega$-dynamo. Alternatively, both elements of the cycle can
be achieved through the $\alpha$-effect, without the need for
differential rotation (an $\alpha^2$-dynamo).

A model flow that has proved to be extremely influ- 
ential in the development of dynamo theory is that 
introduced by G. O. Roberts in 1972, defined by

$$\mathbf{u} = \left(\frac{\partial\psi}{\partial y}, -\frac{\partial\psi}{\partial x}, \psi\right), \quad \text{with } \psi = \cos x + \cos y. \tag{8}$$

The flow is maximally helical (u parallel to $\nabla \times \mathbf{u}$). For
z-independent flows, such as (8), the induction equation
supports solutions of the form $\mathbf{B}(x, y, t)e^{ikz}$; for
any given wave number k, the problem then becomes
simpler, involving only two, rather than three, spatial 
dimensions. In this geometry, a mean-field dynamo is 
one in which the field has a long z wavelength (small k), 
with averages taken over the xy-plane. 
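
The maximal helicity of the Roberts flow can be confirmed numerically. The sketch below assumes the stream function is the sum of cos x and cos y, builds the velocity from it, and compares the velocity with a central-difference estimate of its curl; the function names and the step size are illustrative choices.

```python
import math


def roberts_velocity(x: float, y: float):
    """u = (dpsi/dy, -dpsi/dx, psi) with psi = cos x + cos y."""
    return (-math.sin(y), math.sin(x), math.cos(x) + math.cos(y))


def roberts_curl(x: float, y: float, h: float = 1e-5):
    """Central-difference curl of the z-independent flow at (x, y)."""
    u_xp = roberts_velocity(x + h, y)
    u_xm = roberts_velocity(x - h, y)
    u_yp = roberts_velocity(x, y + h)
    u_ym = roberts_velocity(x, y - h)
    d_dx = [(p - m) / (2.0 * h) for p, m in zip(u_xp, u_xm)]
    d_dy = [(p - m) / (2.0 * h) for p, m in zip(u_yp, u_ym)]
    # With d/dz = 0, curl u = (dw/dy, -dw/dx, dv/dx - du/dy).
    return (d_dy[2], -d_dx[2], d_dx[1] - d_dy[0])


if __name__ == "__main__":
    point = (0.3, 1.1)
    print(roberts_velocity(*point))
    print(roberts_curl(*point))  # agrees with the velocity to rounding error
```

For this stream function the curl equals the velocity itself, so the flow is a Beltrami field with unit proportionality factor.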

An exciting experimental challenge, taken up by a 
number of groups worldwide, is to construct a labora- 
tory MHD dynamo. One of the successes has been the 
Karlsruhe dynamo, in which liquid sodium is pumped 
through an array of helical pipes, thereby producing a 
flow similar to that conceived by Roberts. 

7.4 Fast Dynamos 

As discussed earlier, the magnetic diffusion timescale 
for astrophysical bodies is extremely long; it is thus of 
interest to ask whether dynamo action can proceed on 
a shorter, diffusion-independent timescale. The math- 
ematical abstraction of the astrophysical problem, in 
which Rm is large but finite, is to investigate kinematic 
dynamo action in the limit as $Rm \to \infty$. To fix ideas, consider
a steady flow u(x) for which the magnetic field
assumes the form $\mathbf{B}(\mathbf{x})e^{pt}$. A flow will act as a dynamo
if $\operatorname{Re}(p) > 0$ for some value of Rm; it is said to act as a
fast dynamo if

$$\lim_{Rm \to \infty} \operatorname{Re}(p) > 0;$$

otherwise, it is said to be slow. 

An extremely important result is the derivation of an 
upper bound for fast-dynamo growth, the growth rate 
for steady or time-periodic flows being bounded above 
by the topological entropy of the flow, $h_{\mathrm{top}}$. This is a
measure of the complexity of the flow and is closely
related to $h_{\mathrm{line}}$, the rate of stretching of material lines
(for two-dimensional flows, they are identical; for three-
dimensional flows, $h_{\mathrm{line}} \le h_{\mathrm{top}}$). Since nonchaotic flows
have zero topological entropy, a simple consequence 
is the following powerful anti-fast-dynamo theorem: 
fast-dynamo action is not possible in an integrable flow. 
Steady, two-dimensional flows, such as the Roberts 
flow (8), are therefore guaranteed to act (at best) only as 
slow dynamos. That said, it is of interest to see to what 
extent the Roberts dynamo fails to be fast; the answer 
turns out to be “not by much.” In a powerful asymptotic
analysis, Soward proved that the growth rate as
$Rm \to \infty$ is given by

$$p \sim \frac{\ln(\ln Rm)}{\ln Rm},$$

with its “slowness” attributed to the long time spent by 
fluid elements in the neighborhood of the stagnation 
point. 
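
To see how slowly this growth rate decays, the leading-order expression can simply be tabulated. The sketch below evaluates only the asymptotic form, ignoring any order-one prefactor or correction terms, which is an assumption of the illustration; the function name is hypothetical.

```python
import math


def soward_growth_rate(rm: float) -> float:
    """Leading-order growth rate ln(ln Rm) / ln Rm for the Roberts flow."""
    if rm <= math.e:
        raise ValueError("asymptotic form requires Rm well above e")
    log_rm = math.log(rm)
    return math.log(log_rm) / log_rm


if __name__ == "__main__":
    for exponent in (3, 6, 9, 12):
        rm = 10.0 ** exponent
        print(f"Rm = 1e{exponent}: p ~ {soward_growth_rate(rm):.3f}")
```

Raising Rm by nine orders of magnitude, from $10^{3}$ to $10^{12}$, reduces this estimate only by a factor of a little over two.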

Given the difficulties, in general, in providing rig- 
orous demonstrations of dynamo action, it is per- 
haps not surprising that these are exacerbated for the 
more restrictive class of fast dynamos. The numerical 
approach is to consider dynamos at increasing values 
of Rm, in the hope of reaching an asymptotic regime 
in which the growth rate is positive and ceases to 
vary with Rm. For three-dimensional flows, this has 
so far proved inconclusive; the requisite computational 
resources increase with Rm (in order to resolve finer- 
scale structures), and no plausible Rm-independent 
regime has yet been attained, even with today’s com- 
puting facilities. Instead, the most convincing examples 
of fast-dynamo action have come from time-dependent 
modifications of the Roberts flow, allowing exploration 
of the dynamo up to $Rm = O(10^{6})$, with the time
dependence circumventing the anti-fast-dynamo theo- 
rem. Although clearly not a proof, the numerical evi- 
dence is strong that such flows can act as fast dynamos. 

8 Instabilities 

Magnetic fields can play a significant role in modi- 
fying classical hydrodynamic instabilities driven, for 
example, by shear flows or by convection. They can, 
though, also be the agent of instability. Here we con- 
sider two such examples, each extremely important in 
an astrophysical context. 

8.1 Magnetic Buoyancy 

In section 5.1 we discussed how magnetic pressure can 
cause the rise of isolated tubes of magnetic flux, a man- 
ifestation of a lack of equilibrium. But magnetic buoy- 
ancy can also act as an instability mechanism, one that 
is believed to be responsible for facilitating the breakup 
of a large-scale field in the solar interior into the flux 
tubes that subsequently rise to the surface. 

Consider a static equilibrium atmosphere with a 
homogenous horizontal magnetic field whose strength 
varies with height z. For motions that do not bend the 
magnetic field lines (interchange modes), the criterion 
for instability can be derived from a fluid parcel argu- 
ment, assuming that a displaced parcel conserves its 
mass, magnetic flux, and specific entropy. It is then 
straightforward to show that a horizontal field $B(z)\hat{\mathbf{x}}$
is unstable to interchange disturbances if

$$-\frac{g V_A^2}{V_S^2} \frac{d}{dz} \ln\left(\frac{B}{\rho}\right) > \frac{g}{\gamma} \frac{d}{dz} \ln\left(\frac{p}{\rho^\gamma}\right) = N^2,$$

where N is the buoyancy frequency. Significantly, instability
can occur even in convectively stable atmospheres
($N^2 > 0$) provided that $B/\rho$ falls off sufficiently
rapidly with height. Somewhat surprisingly,
three-dimensional modes, despite having to do work
against magnetic tension, are more readily destabilized,
requiring a sufficiently negative gradient only
of $B$ (rather than $B/\rho$).
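
The interchange criterion lends itself to a simple evaluation. The sketch below assumes the criterion takes the form in which a sufficiently rapid decrease of the field strength per unit density with height must outweigh the stabilizing stratification measured by the squared buoyancy frequency; the function name, argument names, and sample numbers are all illustrative.

```python
def interchange_unstable(g: float, v_alfven: float, v_sound: float,
                         dln_b_over_rho_dz: float, n_squared: float) -> bool:
    """Return True if the interchange instability criterion is met.

    dln_b_over_rho_dz is the vertical derivative of ln(B / rho);
    n_squared is the squared buoyancy frequency of the atmosphere.
    """
    if v_sound <= 0:
        raise ValueError("sound speed must be positive")
    destabilizing = -g * (v_alfven / v_sound) ** 2 * dln_b_over_rho_dz
    return destabilizing > n_squared


if __name__ == "__main__":
    # Convectively stable atmosphere (n_squared > 0) in nondimensional units.
    print(interchange_unstable(1.0, 1.0, 1.0, -2.0, 1.0))  # steep decrease
    print(interchange_unstable(1.0, 1.0, 1.0, -0.5, 1.0))  # gentle decrease
```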


8.2 Magnetorotational Instability 

A fundamental and long-standing problem in astro- 
physics concerns the mechanism by which matter 
accreting onto a massive central object (such as a 
neutron star or black hole) loses its angular momen- 
tum. Some sort of turbulent process is needed in 
order to transport angular momentum outward as the 
mass moves inward. From a purely hydrodynamic viewpoint,
a Keplerian flow, with angular velocity depending
on radius as $\Omega \sim r^{-3/2}$, is stable according to
Rayleigh’s criterion [IV.28 §8.3], and thus there is
no obvious route to hydrodynamic turbulence. But the
picture is changed dramatically by the effect of a magnetic
field, the significant astrophysical consequences
of which were first realized by Balbus and Hawley in
the 1990s. A new instability, the magnetorotational
instability, can then ensue provided that the angular
velocity decreases with radius, a less stringent crite- 
rion than that for hydrodynamic flows, for which a 
decrease of angular momentum is necessary. The non- 
linear development of this instability may thus pro- 
vide the turbulence required for angular momentum 
transport. 
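
The difference between the two criteria is easy to see for power-law rotation profiles. The sketch below assumes an angular velocity proportional to the radius raised to the power minus `q` (so that the Keplerian case is `q = 1.5`) and applies the two conditions stated above: hydrodynamic instability requires the specific angular momentum to decrease outward, whereas the magnetorotational instability requires only the angular velocity to decrease outward. The function name and the power-law form are assumptions for illustration.

```python
def rotation_stability(q: float):
    """Return (rayleigh_unstable, mri_unstable) for Omega ~ r**(-q).

    Specific angular momentum r**2 * Omega ~ r**(2 - q) decreases
    outward only if q > 2; Omega itself decreases outward if q > 0.
    """
    rayleigh_unstable = q > 2.0
    mri_unstable = q > 0.0
    return rayleigh_unstable, mri_unstable


if __name__ == "__main__":
    print(rotation_stability(1.5))  # Keplerian profile
    print(rotation_stability(2.5))  # angular momentum decreasing outward
```

For the Keplerian profile the first entry is false and the second true: the flow is hydrodynamically stable yet open to the magnetorotational instability.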

It is of great physical interest to ask why a swirling
flow that is hydrodynamically stable can be destabilized
by the inclusion of a magnetic field. The crucial
factor is that the field can provide a means of relaxing
the angular momentum constraint that underpins
the hydrodynamic result. As such, it turns out that
the underlying mechanism can then be explained by
mechanical, rather than fluid mechanical, arguments.
Suppose, as an analogy for fluid elements connected
by a magnetic field line, one considers two spacecraft
($m_1$ and $m_2$) orbiting a central body, at different radii
($r_1 < r_2$) and joined by a weak elastic tether. Since
the angular velocity is a decreasing function of radius,
the spacecraft at radius $r_1$ will move ahead, stretching
the tether. In so doing, angular momentum is removed
from $m_1$ and transferred to $m_2$. The spacecraft then
readjust to orbits compatible with their new angular
momenta; $m_1$ moves inward, $m_2$ outward. Since
$d\Omega/dr < 0$, the difference in the angular velocity is
increased; a fortiori the process is repeated, leading
to an exponential separation of the spacecraft in time.
Note that this argument fails if the tether is sufficiently
strong, since in this case it acts to keep the spacecraft
together. Instability therefore occurs for weak elastic
coupling or, in the MHD context, for weak magnetic
fields.

Figure 3 Computational simulation of the breakup of a
magnetic layer, resulting from the nonlinear development
of magnetic buoyancy instabilities; the magnetic field is
pointing into the page. Such a process is of importance
in triggering the escape of magnetic field from the solar
interior.


9 Current State of Play 

Fundamental questions remain unanswered in all as- 
pects of MHD; it thus remains an extremely active 
and exciting research area. The major theoretical dif- 
ficulties, as with (nonmagnetic) hydrodynamics, arise 
from two directions: the extreme values of the param- 
eters of interest and the inherent nonlinearity of the 
coupled equations (though it would be a mistake to 
think that everything is understood in, for example, the 
(kinematic) fast-dynamo problem). 

Ever-increasing computing power has allowed inves- 
tigation of the full MHD equations at moderately high 
values ($< 10^{4}$) of the fluid and magnetic Reynolds numbers
(as shown, for example, in figure 3), which has
in turn stimulated new theoretical understanding, par- 
ticularly of nonlinear phenomena. A direct computa- 
tional solution, i.e., of astrophysical or industrial flows 
at the true parameter values, is not feasible; nor, even 
if the current rate of increase in computing power 
is maintained, will it be so in the foreseeable future. 
Our understanding of MHD will therefore progress 
through the interaction of theoretical, computational, 
and experimental approaches. 
IV.30 Earth System Dynamics

Emily Shuckburgh 


1 Introduction 

Mathematics is fundamental to our understanding of 
the Earth’s weather and climate. Over the last 200 years 
or so, a mathematical description of the evolution of 
the Earth system has been developed that allows pre- 
dictions to be made concerning the weather and cli- 
mate. This takes account of the fact that the atmo- 
sphere and oceans are thin films of fluid on the spher- 
ical Earth under the influence of (i) heating by solar 
radiation, (ii) gravity, and (iii) the Earth’s rotation. The 
mathematician Lewis Fry Richardson first proposed a 
numerical scheme for forecasting the weather in 1922, 
and this paved the way for modern numerical models 
of the weather and climate. 

The atmosphere varies on a range of timescales, from 
less than an hour for an individual cloud to a week or 
so for a weather system. The timescales of variabil- 
ity in the ocean are typically longer: the ocean sur- 
face layer, which is directly influenced by the atmo- 
sphere, exhibits variability on diurnal, seasonal, and 
interannual timescales, but the ocean interior varies 
significantly only on decadal to centennial and longer 
timescales. The different components of the Earth sys- 
tem (the atmosphere, oceans, land, and ice) are closely 
coupled. For example, roughly half of the carbon diox- 
ide that is released into the atmosphere by human 
activities each year is taken up by the oceans and the 
land. Changes in atmospheric temperature impact the 
ice, which in turn influences the evolution of both the
atmosphere and the ocean. Furthermore, changes in
sea-surface temperature can directly affect the atmo- 
sphere and its weather systems. Each of these interac- 
tions and feedbacks is more or less relevant on differ- 
ent timescales. The climate is usually taken to mean 
the state of the Earth system over years to decades or 
longer. It is sometimes defined more precisely as the 
probability distribution of the variable weather, tradi- 
tionally taken over a thirty-year period. While aspects 
of the weather are sensitive to initial conditions, as 
famously demonstrated by Edward Lorenz in his seminal
work on chaos theory (see the Lorenz equations
[III.20]), the statistics of the weather that define the 
climate do not exhibit such sensitivity. 

2 The Temperature of the Earth 

What determines the temperature of the Earth's surface
is a question that has long fascinated mathematicians.
In 1827 the mathematician Joseph Fourier wrote an
article on the subject, noting that he considered it to
be one of the most important and most difficult
questions in all of natural philosophy.

2.1 Solar Forcing 

The atmosphere is continually bombarded by solar 
photons at infrared, visible, and ultraviolet wave- 
lengths. It is necessary to consider the passage through 
the atmosphere of this incoming solar radiation to 
determine the temperature of the surface of the Earth. 
The air in the atmosphere is a mixture of different 
gases: nitrogen (N$_2$) and oxygen (O$_2$) are the largest by
volume, but other gases including carbon dioxide (CO$_2$),
water vapor (H$_2$O), methane (CH$_4$), nitrous oxide (N$_2$O),
and ozone (O$_3$) play significant roles in influencing the
passage of photons. Atmospheric water vapor is partic- 
ularly important in this context. The amount of water 
vapor is variable (typically about 0.5% by volume), being 
strongly dependent on the temperature, and it is pri- 
marily a consequence of evaporation from the ocean 
(which covers some 70% of the surface of the Earth). 

Some incoming solar photons are scattered back 
to space by atmospheric gases or reflected back to 
space by clouds or the Earth's surface (with snow and 
ice reflecting considerably more than darker surfaces); 
some are absorbed by atmospheric gases or clouds, 
leading to heating of parts of the atmosphere; and 
some reach the Earth's surface and heat it. Atmospheric 
gases, clouds, and the Earth’s surface also emit and 
absorb infrared photons, leading to further heat transfer
between one region and another, or loss of heat to
space. 

The amount of energy that the Earth receives from 
the sun has varied over geological history, but at 
present the incident solar flux, or power per unit area, 
of solar energy is $F = 1370$ W m$^{-2}$.

Given that the cross-sectional area of the Earth intercepting
the solar energy flux is $\pi a^2$, where $a$ is the
Earth’s radius (mean value $a = 6370$ km), the total solar
energy received per unit time is $F\pi a^2 = 1.74 \times 10^{17}$ W.
As noted above, not all of this radiation is absorbed
by the Earth; a significant fraction is directly reflected.
The ratio of the reflected to incident solar energy is
the albedo, $\alpha$. Under present conditions of cloudiness,
snow, and ice cover, an average of about 30% of the
incoming solar radiation is reflected back to space without
being absorbed, i.e., $\alpha = 0.3$. The surface of the
Earth has an area $4\pi a^2$. That means that per unit area,

final incoming power $= \tfrac{1}{4}(1 - \alpha)F$,

which is approximately 240 W m$^{-2}$.

In physics, a black body is an idealized object that 
absorbs all radiation that falls on it. Because no light is 
reflected or transmitted, the object appears black when 
it is cold. However, a black body emits a temperature- 
dependent spectrum of light. As noted above, the Earth 
reflects much of the radiation that is incident upon it, 
but for simplicity let us initially assume that it emits 
radiation in the same temperature-dependent way as a 
black body. In this case the emitted radiation is given by 
the Stefan-Boltzmann law, which states that the power 
emitted per unit area of a black body at absolute temperature
$T$ is $\sigma T^4$, where $\sigma$ is the Stefan-Boltzmann
constant ($\sigma = 5.67 \times 10^{-8}$ W m$^{-2}$ K$^{-4}$). This power is
emitted in all directions from the surface of the Earth,
so that per unit area,

final outgoing power $= \sigma T_{bb}^4$

if the Earth has a uniform surface temperature $T_{bb}$. This
defines the emission temperature under the assump- 
tion that the Earth acts as a black body, which is the 
temperature one would infer by looking back at the 
Earth from space if a black body curve was fitted to 
the measured spectrum of outgoing radiation. 

Assuming that the Earth is in thermal equilibrium, 
the incoming and outgoing power must balance. There- 
fore, by equating the equations for the final incoming 
power and the final outgoing power, 

$$T_{bb} = \left(\frac{(1 - \alpha)F}{4\sigma}\right)^{1/4}.$$


[Figure 1 image not reproduced; diagram labels: Space, Ground.]

Figure 1 A simple model of the greenhouse effect. The
atmosphere is taken to be a shallow layer at temperature $T_a$
and the ground a black body at temperature $T_g$. The various
solar and thermal fluxes are shown (see text for details).


Substituting standard values for $\alpha$, $F$, and $\sigma$ gives
$T_{bb} \approx 255$ K. This value is of the correct order of magnitude
but is more than 30 K lower than the observed
mean surface temperature of $T_e \approx 288$ K. The simplest
possible model of the climate system has therefore cap- 
tured some of the key aspects, but it must have some 
important missing ingredients. 
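
The energy balance above can be checked numerically. The following is a minimal sketch, assuming only the values quoted in the text for the solar flux, albedo, and Stefan-Boltzmann constant; the function and variable names are illustrative and are not taken from the source.

```python
# Zero-dimensional energy balance: absorbed solar power per unit area
# equals the power emitted per unit area by a black body.
SOLAR_FLUX = 1370.0          # F, incident solar flux in W m^-2 (from the text)
ALBEDO = 0.3                 # alpha, planetary albedo (from the text)
STEFAN_BOLTZMANN = 5.67e-8   # sigma, in W m^-2 K^-4 (from the text)


def incoming_power_per_area(solar_flux, albedo):
    """Absorbed solar power averaged over the sphere: (1 - alpha) * F / 4."""
    return 0.25 * (1.0 - albedo) * solar_flux


def emission_temperature(solar_flux, albedo, sigma=STEFAN_BOLTZMANN):
    """Solve sigma * T**4 = (1 - alpha) * F / 4 for T."""
    return (incoming_power_per_area(solar_flux, albedo) / sigma) ** 0.25


incoming = incoming_power_per_area(SOLAR_FLUX, ALBEDO)
t_bb = emission_temperature(SOLAR_FLUX, ALBEDO)
print(f"final incoming power: {incoming:.1f} W m^-2")
print(f"emission temperature T_bb: {t_bb:.1f} K")
```

Changing `ALBEDO` or `SOLAR_FLUX` in this sketch gives a quick feel for how sensitive the emission temperature is to each parameter.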

2.2 The Greenhouse Effect 

To refine the calculation of the temperature of the sur- 
face of the Earth, the influence of atmospheric con- 
stituents in emitting, absorbing, and scattering radia- 
tion needs to be taken into consideration. This can be 
achieved by assuming that the system has a layer of 
atmosphere of uniform temperature $T_a$ that transmits
a fraction $\tau_{sw}$ of incident solar (shortwave) radiation
and a fraction $\tau_{lw}$ of any incident terrestrial (longwave)
radiation while absorbing the remainder (see figure 1).

As noted above, the final incoming solar power per
unit area at the top of the atmosphere is $F_s = \tfrac{1}{4}(1 - \alpha)F$.
Under the revised model, a proportion $\tau_{sw} F_s$ reaches
the ground and the remainder is absorbed by the atmosphere.
Let us assume that the ground has a temperature
$T_g$ and that it emits as a black body. This gives an
upward flux of $F_g = \sigma T_g^4$, of which a proportion $\tau_{lw} F_g$
reaches the top of the atmosphere, with the remainder
being absorbed by the atmosphere. The atmosphere is
not a black body, and instead, it emits radiation following
Kirchhoff's law, such that the emitted flux is
$F_a = (1 - \tau_{lw})\sigma T_a^4$ both upward and downward, where
$T_a$ is the temperature of the atmosphere.

Assuming that the system is in equilibrium, these
fluxes must balance. At the top of the atmosphere
we have $F_s = F_a + \tau_{lw} F_g$, and at the ground we have
$F_g = F_a + \tau_{sw} F_s$. By eliminating $F_a$, and using the
Stefan-Boltzmann law for $T_g$, we find that

$$T_g = \left(\frac{1 + \tau_{sw}}{1 + \tau_{lw}}\,\frac{(1 - \alpha)F}{4\sigma}\right)^{1/4} = \left(\frac{1 + \tau_{sw}}{1 + \tau_{lw}}\right)^{1/4} T_{bb}.$$


Substituting reasonable values for the Earth’s present-day
atmosphere ($\tau_{sw} = 0.9$ and $\tau_{lw} = 0.2$) gives a surface
temperature of $T_g \approx 286$ K, which is much closer
to the observed value of $T_e \approx 288$ K. Including an
atmosphere that allows greater transmission of shortwave
radiation than longwave radiation has had the
influence of increasing the surface temperature. This
is known as the greenhouse effect. The temperature of
the atmosphere $T_a$ under this model is

$$T_a = \left(\frac{1 - \tau_{lw}\tau_{sw}}{1 - \tau_{lw}^2}\right)^{1/4} T_{bb},$$

which gives $T_a \approx 245$ K.
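
The one-layer model can be evaluated directly. The sketch below assumes the parameter values quoted in the text and uses illustrative function names of our own choosing; it also confirms that the two flux balances (top of the atmosphere and ground) are satisfied by the resulting temperatures.

```python
# One-layer greenhouse model: ground and atmosphere temperatures, plus a
# consistency check of the two flux balances stated in the text.
SOLAR_FLUX = 1370.0          # F in W m^-2 (from the text)
ALBEDO = 0.3                 # alpha (from the text)
STEFAN_BOLTZMANN = 5.67e-8   # sigma in W m^-2 K^-4 (from the text)
TAU_SW = 0.9                 # shortwave transmission (from the text)
TAU_LW = 0.2                 # longwave transmission (from the text)


def black_body_temperature():
    """Emission temperature of the atmosphere-free model."""
    return (0.25 * (1.0 - ALBEDO) * SOLAR_FLUX / STEFAN_BOLTZMANN) ** 0.25


def ground_temperature(tau_sw, tau_lw, t_bb):
    """T_g = ((1 + tau_sw) / (1 + tau_lw))**(1/4) * T_bb."""
    return ((1.0 + tau_sw) / (1.0 + tau_lw)) ** 0.25 * t_bb


def atmosphere_temperature(tau_sw, tau_lw, t_bb):
    """T_a = ((1 - tau_lw * tau_sw) / (1 - tau_lw**2))**(1/4) * T_bb."""
    return ((1.0 - tau_lw * tau_sw) / (1.0 - tau_lw ** 2)) ** 0.25 * t_bb


t_bb = black_body_temperature()
t_g = ground_temperature(TAU_SW, TAU_LW, t_bb)
t_a = atmosphere_temperature(TAU_SW, TAU_LW, t_bb)

# Fluxes implied by these temperatures.
f_s = 0.25 * (1.0 - ALBEDO) * SOLAR_FLUX
f_g = STEFAN_BOLTZMANN * t_g ** 4
f_a = (1.0 - TAU_LW) * STEFAN_BOLTZMANN * t_a ** 4

# Residuals of the two balances; both should vanish up to rounding.
top_residual = f_s - (f_a + TAU_LW * f_g)
ground_residual = f_g - (f_a + TAU_SW * f_s)

print(f"T_g = {t_g:.1f} K, T_a = {t_a:.1f} K")
print(f"balance residuals: {top_residual:.2e}, {ground_residual:.2e} W m^-2")
```

Setting both transmission values to 1 in this sketch recovers the atmosphere-free result, $T_g = T_{bb}$.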

The expression for $T_g$ indicates that the temperature
of the surface of the Earth can change as a result of
changes to the solar flux ($F$), changes to the albedo
($\alpha$), and/or changes to the transmission of radiation
through the atmosphere ($\tau$). This provides a basic
explanation for the cycle of ice ages that has been 
observed to occur on Earth every hundred thousand 
years or so. Over long timescales, changes to the orbit 
of the Earth around the sun occur that are associated 
with the Milankovitch cycles. Resulting changes to the 
annual global mean $F$ are small, but changes to the
seasonal/latitudinal pattern of solar radiation reaching
the Earth lead to the ice sheets in the Northern Hemisphere
shrinking or growing, changing $\alpha$. This then
modifies the temperature $T_g$, which further changes
the ice sheets. Additionally, changes in $T_g$ can result
in changes that alter the balance of exchange of carbon
dioxide between the atmosphere and the land/ocean,
which changes the transmission value $\tau$, modifying the
temperature still further. 


3 Atmospheric Properties 

We now turn our attention to the vertical variation in 
temperature in the atmosphere. This is influenced by 
radiative processes and by convection. 


3.1 Radiation 


To refine the model of the transfer of radiation through 
the atmosphere further it is necessary to consider 
the atmospheric properties in more detail. To a good 
approximation, the atmosphere as a whole behaves as 
a simple ideal gas, with each mole of gas obeying the 
law $pV = R_g T$, where $p$ is the pressure, $V$ is the volume
of one mole, $R_g$ is the universal gas constant, and
$T$ is the absolute temperature. If $M$ is the mass of one
mole, the density is $\rho = M/V$, and the ideal gas law
may be written as $p/\rho = RT$, where $R = R_g/M$ is the
gas constant per unit mass. The value of $R$ depends on
the composition of the sample of air. For dry air it is
$R = 287$ J kg$^{-1}$ K$^{-1}$.

Each portion of the atmosphere is approximately in 
what is known as hydrostatic balance (usually valid 
on scales greater than a few kilometers). This means 
that the weight of the portion of atmosphere is sup- 
ported by the difference in pressure between the lower 
and upper surfaces, and that the following relationship
between density $\rho$ and pressure $p$ holds to a good
approximation:

$$g\rho = -\frac{\partial p}{\partial z}, \tag{1}$$

where $g = 9.81$ m s$^{-2}$ is the gravitational acceleration
and $z$ is the height above the ground. The ideal gas law
can be used to replace $\rho$ in this equation by $p/RT$.

The temperature in the lowest section of the atmo- 
sphere, known as the troposphere, decreases with alti- 
tude, from the surface value of about 288 K to about 
217 K at its upper limit (about 10-15 km altitude). 
In general, the temperature does not vary greatly in 
the atmosphere. The mass-weighted mean temperature 
is approximately 255 K, and over the first 100 km 
in altitude the temperature varies by no more than 
about ±15%. Approximating the atmospheric temperature
by $T \approx T_0 = \text{const}$, and using the ideal gas law,
(1) can be integrated to obtain $p = p_0 e^{-gz/RT_0}$, where
$p_0$ is the pressure at $z = 0$, which is approximately
1000 hPa. Therefore, at 5 km altitude the pressure is 
about 500 hPa and at 10 km it is about 250 hPa. A simi- 
lar expression can be derived for the density. Therefore, 
gravity tends to produce a density stratification in the 
atmosphere, which means that the atmosphere can be 
considered from a dynamical perspective as a stratified 
fluid on a rotating sphere. 
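
The isothermal pressure profile is easy to tabulate. The sketch below assumes $T_0 = 255$ K (the mass-weighted mean temperature quoted above) and $p_0 = 1000$ hPa; the helper names are illustrative only.

```python
import math

# Isothermal hydrostatic atmosphere: p(z) = p0 * exp(-g z / (R T0)).
G = 9.81          # gravitational acceleration in m s^-2 (from the text)
R_DRY = 287.0     # gas constant for dry air in J kg^-1 K^-1 (from the text)
T0 = 255.0        # assumed constant temperature in K (mean value from the text)
P0_HPA = 1000.0   # surface pressure in hPa (from the text)


def scale_height(t0=T0):
    """e-folding height R * T0 / g of the pressure profile, in metres."""
    return R_DRY * t0 / G


def pressure_hpa(z_m, t0=T0, p0=P0_HPA):
    """Pressure in hPa at height z_m (metres) for an isothermal atmosphere."""
    return p0 * math.exp(-z_m / scale_height(t0))


print(f"scale height: {scale_height() / 1000.0:.2f} km")
for z_km in (0, 5, 10):
    print(f"z = {z_km:2d} km: p = {pressure_hpa(1000.0 * z_km):.0f} hPa")
```

The computed pressures at 5 km and 10 km come out close to the rounded figures of 500 hPa and 250 hPa given in the text.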

As discussed above, certain gases in the atmosphere, 
known as greenhouse gases, act to absorb radiation 
of certain wavelengths. At the surface, the relatively 
high temperature and pressure mean that these gases
absorb radiation in broad bands around specific wave- 
lengths. These bands are made up of many individ- 
ual spectral lines. The individual lines are broadened 
by collisions (pressure broadening), and this broad- 
ening reduces in width with altitude as the pressure 
decreases. If we consider the atmosphere to be made up 
of many thin vertical layers, as radiation emitted from 
the Earth’s surface moves up layer by layer through the 
troposphere, some is stopped in each layer. Each layer 
then emits radiation back toward the ground and up to 
higher layers. However, due to the reduction in width of 
the absorption lines with altitude, a level can be reached 
at which the radiation is able to escape to space. In addi- 
tion, because the amount of gas between a given alti- 
tude and space decreases with increasing altitude, even 
the line centers are more able to emit directly to space, 
with increasing altitude. Adding more greenhouse gas 
molecules means that the upper layers will absorb more 
radiation and the altitude of the layer at which the radi- 
ation escapes to space increases, and hence its temper- 
ature decreases. Since colder layers do not radiate heat 
as well, all the layers from this height to the surface 
must warm to restore the incoming/outgoing radiation 
balance. 

It has become standard to assess the importance of 
a factor (such as an increase in a greenhouse gas) as 
a potential climate change mechanism in terms of its 
radiative forcing, $\Delta F$. This is a measure of the influence
the factor has in altering the balance of incoming
and outgoing energy, and it is defined as the change
in net irradiance (i.e., the difference between incoming
and outgoing radiation) measured at the tropopause
(the upper boundary of the troposphere). For carbon
dioxide the radiative forcing is given to a good approximation
by the simple algebraic expression $\Delta F = 5.35 \ln(C/C_0)$ W m$^{-2}$,
where $C$ is the concentration of carbon
dioxide and $C_0$ is a reference preindustrial value, taken
to be 278 parts per million. For a doubling of carbon
dioxide values above preindustrial values, this gives a
radiative forcing of approximately 3.7 W m$^{-2}$.

The climate sensitivity, $\lambda$, is defined to be the coefficient
of proportionality between the radiative forcing,
$\Delta F$, and $\Delta T$, the associated change in equilibrium
surface temperature that occurs over multicentury
timescales, i.e., $\Delta T = \lambda \Delta F$. For the very simplest
climate model, we found that the emitted radiation per
unit area was $F = \sigma T_{bb}^4$, from which we inferred a value
of $T_{bb} \approx 255$ K. Approximating $\Delta T$ by $\Delta T \approx \Delta F\, dT/dF$,
the climate sensitivity in the absence of feedbacks is
then given by $\lambda = (4\sigma T_{bb}^3)^{-1}$, or 0.26 K/(W m$^{-2}$). Using
this to estimate the temperature increase at equilibrium
due to a doubling of carbon dioxide (often called
the equilibrium climate sensitivity) gives $\Delta T \approx 1$ K.
In reality, feedbacks in the system (e.g., changes to
the albedo and the water vapor content of the atmosphere)
will influence the temperature change. Taking
into account the feedbacks, the Intergovernmental
Panel on Climate Change concluded in its Fifth
Assessment Report that this equilibrium climate sensitivity
probably lies in the range 1.5-4.5 K and that it is
extremely unlikely to be less than 1 K and very unlikely
to be greater than 6 K.$^1$
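
The no-feedback estimate can be reproduced in a few lines. This is a sketch under the assumptions already stated in the text (the logarithmic forcing formula, $C_0 = 278$ ppm, and $T_{bb} = 255$ K); the function names are our own.

```python
import math

STEFAN_BOLTZMANN = 5.67e-8   # sigma in W m^-2 K^-4 (from the text)
C0_PPM = 278.0               # preindustrial CO2 concentration (from the text)
T_BB = 255.0                 # emission temperature in K (from the text)


def co2_radiative_forcing(c_ppm, c0_ppm=C0_PPM):
    """Radiative forcing Delta F = 5.35 * ln(C / C0), in W m^-2."""
    return 5.35 * math.log(c_ppm / c0_ppm)


def no_feedback_sensitivity(t_bb=T_BB, sigma=STEFAN_BOLTZMANN):
    """lambda = 1 / (4 sigma T_bb^3), obtained from F = sigma T^4, in K per W m^-2."""
    return 1.0 / (4.0 * sigma * t_bb ** 3)


forcing_2x = co2_radiative_forcing(2.0 * C0_PPM)
sensitivity = no_feedback_sensitivity()
warming_2x = sensitivity * forcing_2x

print(f"forcing for doubled CO2: {forcing_2x:.2f} W m^-2")
print(f"no-feedback sensitivity: {sensitivity:.3f} K per W m^-2")
print(f"no-feedback warming for doubled CO2: {warming_2x:.2f} K")
```

Because the forcing is logarithmic in concentration, each further doubling adds the same increment of forcing in this approximation.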


3.2 Convection 


In radiative equilibrium, the surface is warmer than 
the overlying atmosphere. This state is unstable to 
convective motions. 

The first law of thermodynamics states that the
increase in internal energy of a system $\delta U$ is equal to
the heat supplied plus the work done on the system.
This can be written as $\delta U = T\delta S - p\delta V$, where $T$ is
the temperature, $V$ is the volume, and $S$ is the entropy
of the system. If $\delta Q$ is the amount of heat absorbed
by the system, $\delta S = \delta Q/T$. The specific heat capacity,
$c$, is the measure of the heat energy required to
increase the temperature of a unit quantity of air by
one unit. The specific heats of substances are typically
measured under constant pressure ($c_p$). However, fluids
may instead be measured at constant volume ($c_v$). Measurements
under constant pressure produce greater
values than those at constant volume because work
must be performed in the former. For an ideal gas,
$c_p = c_v + R$ (for dry air, $c_p = 1005$ J kg$^{-1}$ K$^{-1}$). Considering
a unit mass of an ideal gas, for which $V = 1/\rho$,
it can be shown that $U = c_v T$, and hence


$$\delta S = \frac{\delta Q}{T} = c_p \frac{\delta T}{T} - R \frac{\delta p}{p}. \tag{2}$$


An adiabatic process is one in which heat is neither lost
nor gained, so $\delta S = 0$. In this case, (2) can be integrated
to give $\theta = T(p_0/p)^{R/c_p}$, if $T = \theta$ when $p = p_0$. The
quantity $\theta$, the potential temperature, is the temperature
a portion of air would have if, starting from temperature
$T$ and pressure $p$, it were compressed until its
pressure equaled $p_0$.
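
As an illustration of the definition, the sketch below evaluates the potential temperature for dry air. The reference pressure of 1000 hPa and the sample parcel (250 K at 500 hPa) are assumptions chosen for illustration, and the function name is hypothetical.

```python
R_DRY = 287.0      # gas constant for dry air in J kg^-1 K^-1 (from the text)
CP_DRY = 1005.0    # specific heat at constant pressure in J kg^-1 K^-1 (from the text)
P_REF_HPA = 1000.0 # assumed reference pressure p0 in hPa


def potential_temperature(temp_k, pressure_hpa, p_ref_hpa=P_REF_HPA):
    """theta = T * (p0 / p)**(R / c_p) for dry air."""
    return temp_k * (p_ref_hpa / pressure_hpa) ** (R_DRY / CP_DRY)


# Illustrative parcel: 250 K at 500 hPa (assumed values, not from the text).
theta_mid = potential_temperature(250.0, 500.0)
print(f"theta for T = 250 K at p = 500 hPa: {theta_mid:.1f} K")
```

A parcel that is already at the reference pressure has $\theta = T$, which provides a simple sanity check on the function.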


1. Intergovernmental Panel on Climate Change reports are available 
from www.ipcc.ch. 
From (2), the first law of thermodynamics can be
written as

$$\frac{dT}{dt} = \frac{RT}{c_p p}\frac{dp}{dt} + \frac{1}{c_p}\frac{dQ}{dt}.$$

The first term on the right-hand side represents the rate 
of change of temperature due to adiabatic expansion or 
compression. In a typical weather system outside the 
tropics, air parcels in the middle troposphere undergo 
vertical displacements on the order of 100 hPa day$^{-1}$.
Assuming that $T \approx 250$ K, the resulting adiabatic temperature
change is about 15 K day$^{-1}$. The second term
on the right-hand side is the diabatic heat sources 
and sinks, which include absorption of solar radiation, 
absorption and emittance of longwave radiation, and 
latent heat release. In general, in the troposphere the 
sum of diabatic terms is much smaller than the sum 
of adiabatic ones, with the net radiative contribution 
being small and the latent heat terms being compara- 
ble in magnitude with the adiabatic term only in small 
regions. For an adiabatically rising parcel of air, from 
(2) the change in temperature is given by 


$$\left(\frac{dT}{dz}\right)_{\text{parcel}} = \frac{RT}{c_p p}\left(\frac{dp}{dz}\right)_{\text{parcel}} = -\frac{g}{c_p} = -\Gamma_a$$

if the atmosphere is in hydrostatic balance (see (1)).
Here, $\Gamma_a \approx 10$ K km$^{-1}$ is known as the adiabatic lapse
rate.
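
Both numerical estimates in this passage can be reproduced from the constants already given. In the sketch below the mid-tropospheric pressure of 500 hPa is our assumption (the text specifies only $T \approx 250$ K and a displacement rate of 100 hPa per day), and the function names are illustrative.

```python
G = 9.81         # gravitational acceleration in m s^-2 (from the text)
R_DRY = 287.0    # gas constant for dry air in J kg^-1 K^-1 (from the text)
CP_DRY = 1005.0  # specific heat at constant pressure in J kg^-1 K^-1 (from the text)


def dry_adiabatic_lapse_rate():
    """Gamma_a = g / c_p, in K per metre."""
    return G / CP_DRY


def adiabatic_temperature_tendency(temp_k, pressure, dp_dt):
    """First term of dT/dt: (R T / (c_p p)) * dp/dt.

    pressure and dp_dt may be in any consistent pressure unit; the result
    is in K per the time unit of dp_dt.
    """
    return R_DRY * temp_k / (CP_DRY * pressure) * dp_dt


gamma_a_per_km = 1000.0 * dry_adiabatic_lapse_rate()
# Assumed mid-troposphere state: T = 250 K, p = 500 hPa, 100 hPa per day.
tendency = adiabatic_temperature_tendency(250.0, 500.0, 100.0)

print(f"dry adiabatic lapse rate: {gamma_a_per_km:.2f} K per km")
print(f"adiabatic temperature change: {tendency:.1f} K per day")
```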

The observed decrease of temperature with height, 
the lapse rate $\Gamma = -dT/dz$, generally differs from the
adiabatic lapse rate. If the observed background temperature
falls more rapidly with height than the adiabatic
lapse rate, i.e., $\Gamma > \Gamma_a$, then an adiabatically rising
parcel will be warmer than its surroundings and
will continue to rise. In this case, the atmosphere is
unstable. On the other hand, if $\Gamma < \Gamma_a$ then the atmosphere
is stable. In general, the atmosphere is stable
to this dry convection, but it can be unstable in hot, 
arid regions such as deserts. Convection carries heat 
up and thus reduces the background lapse rate until 
the dry adiabatic lapse rate is reached. 

Considering the buoyancy forces on a parcel that
has been raised to a height $\delta z$ above its equilibrium
position and applying Newton’s second law gives

$$\rho \frac{d^2}{dt^2}\delta z = -g\,\delta\rho,$$

where $\delta\rho$ is the difference between the parcel density
and that of the environment. The pressures inside and
outside the parcel are the same, and so using the ideal
gas law this can be rewritten after some manipulation
as

$$\frac{d^2}{dt^2}\delta z + N^2 \delta z = 0, \qquad N^2 = \frac{g}{T}(\Gamma_a - \Gamma).$$


For $N^2 < 0$, $\Gamma > \Gamma_a$, the solutions are exponential in
time, and the atmosphere is unstable. For $N^2 > 0$,
$\Gamma < \Gamma_a$, the motion is an oscillation with frequency $N$,
and the atmosphere is stable. The quantity $N^2$ is a useful
measure of atmospheric stratification and can be
written in terms of the potential temperature:

$$N^2 = \frac{g}{\theta}\frac{d\theta}{dz}.$$

A region of the atmosphere is therefore statically stable
if $\theta$ increases with height ($d\theta/dz > 0$), and it is
statically unstable if it decreases with height.
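
The stability criterion can be made concrete with a short calculation. The sketch below assumes the form $N^2 = (g/T)(\Gamma_a - \Gamma)$ with $\Gamma_a = g/c_p$; the background lapse rates of 6.5 and 12 K per km and the temperature of 250 K are illustrative values chosen by us, not taken from the text.

```python
import math

G = 9.81         # gravitational acceleration in m s^-2 (from the text)
CP_DRY = 1005.0  # specific heat at constant pressure in J kg^-1 K^-1 (from the text)


def buoyancy_frequency_squared(temp_k, lapse_rate_k_per_m):
    """N^2 = (g / T) * (Gamma_a - Gamma), with Gamma_a = g / c_p, in s^-2."""
    gamma_a = G / CP_DRY
    return G / temp_k * (gamma_a - lapse_rate_k_per_m)


def is_statically_stable(temp_k, lapse_rate_k_per_m):
    """True when N^2 > 0, i.e. displaced parcels oscillate rather than run away."""
    return buoyancy_frequency_squared(temp_k, lapse_rate_k_per_m) > 0.0


# Illustrative background states (assumed): 6.5 K/km and 12 K/km at 250 K.
n2_stable = buoyancy_frequency_squared(250.0, 6.5e-3)
n2_unstable = buoyancy_frequency_squared(250.0, 12.0e-3)

period_minutes = 2.0 * math.pi / math.sqrt(n2_stable) / 60.0
print(f"N^2 (6.5 K/km): {n2_stable:.2e} s^-2, period {period_minutes:.1f} min")
print(f"N^2 (12 K/km):  {n2_unstable:.2e} s^-2 (unstable)")
```

For the stable case the oscillation period is of the order of ten minutes, while the superadiabatic case gives a negative $N^2$ and exponentially growing displacements.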

Water vapor in the atmosphere plays an important 
role in the dynamics of the troposphere since latent 
heating and cooling can transfer heat from one location 
to another and because it influences convection. As a 
moist air parcel rises adiabatically, p falls, so T falls, 
the water vapor condenses, and latent heat is released, 
and hence the moist adiabatic lapse rate is less than for 
dry air (and thus is more easily exceeded). Convective 
processes in the atmosphere strongly influence the ver- 
tical temperature structure in the troposphere. Simple 
one-dimensional radiative equilibrium calculations pre- 
dict that the temperature decreases sharply with height 
at the lower boundary, implying convective instability. 
Calculations including both radiative and convective 
effects — adjusting the temperature gradient to neutral 
stability where necessary, and taking into account the 
effects of moisture — predict a less sharp decline in tem- 
perature through the troposphere, in agreement with 
observations. 

For descriptive purposes, the atmosphere can be 
divided into layers, defined by alternating negative and 
positive vertical temperature gradients. In the tropo- 
spheric layer from the ground up to about 10-15 km, 
the temperature decreases with height and the tem- 
perature structure is strongly influenced by convective 
processes. The temperature then increases with height 
to about 50 km in the stratosphere, where the tempera- 
ture structure is determined predominantly by radia- 
tive processes. After this the temperature decreases 
again through the mesosphere. 


4 Oceanic Properties 

As noted above, the oceans are a key component of the 
coupled climate system. To understand their role, it is 
necessary to detail their properties and how they are 
forced. 

The oceans are stratified by density, with the dens- 
est water near the seafloor and the least dense water 
near the surface. The density depends on temperature,
salinity, and pressure in a complex and nonlinear
way, but temperature typically influences density
more than salinity in the parameter range of the open
ocean. In discussions of ocean dynamics, the buoyancy
$b = -g(\rho - \rho_*)/\rho$ is often used, where $\rho$ is the density
of a parcel of water and $\rho_*$ is the density of the
background. Thus, if $\rho < \rho_*$ the parcel will be positively
buoyant and will rise. Since the density does not
vary greatly in the ocean (by only a few percent), the
buoyancy can be written as $b = -g(\sigma - \sigma_*)/\rho_0$, where
$\sigma = \rho - \rho_0$ and $\rho_0 = 1000$ kg m$^{-3}$. The temperature
and salinity, and hence density, vary little with depth 
over the surface layer of the ocean, typically 50-100 m, 
known as the mixed layer. Below this is a layer, called 
the thermocline, where the vertical gradients of temper- 
ature and density are greatest; this varies in depth from 
about 100 m to about 600 m. The waters of the thermo- 
cline are warmer and saltier than the deep ocean below, 
which is known as the abyss. 
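
To see how little is lost by replacing $\rho$ with the constant $\rho_0$ in the buoyancy, the two forms given above can be compared directly. The parcel and background densities below are hypothetical seawater values chosen for illustration, and the function names are our own.

```python
G = 9.81        # gravitational acceleration in m s^-2
RHO_0 = 1000.0  # reference density in kg m^-3 (from the text)


def buoyancy_exact(rho, rho_background):
    """b = -g (rho - rho_*) / rho."""
    return -G * (rho - rho_background) / rho


def buoyancy_approx(rho, rho_background, rho_0=RHO_0):
    """b = -g (sigma - sigma_*) / rho_0, with sigma = rho - rho_0."""
    sigma = rho - rho_0
    sigma_background = rho_background - rho_0
    return -G * (sigma - sigma_background) / rho_0


# Hypothetical values: a parcel slightly lighter than its surroundings.
rho_parcel, rho_env = 1026.0, 1027.0
b_exact = buoyancy_exact(rho_parcel, rho_env)
b_approx = buoyancy_approx(rho_parcel, rho_env)
relative_difference = abs(b_approx - b_exact) / abs(b_exact)

print(f"exact buoyancy:  {b_exact:.4e} m s^-2")
print(f"approx buoyancy: {b_approx:.4e} m s^-2")
print(f"relative difference: {100.0 * relative_difference:.1f}%")
```

The two forms differ by the factor $\rho/\rho_0$, so the relative error is of the same few-percent size as the density variation itself.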

The forcing of the oceans is rather different to that 
of the atmosphere. A significant fraction of solar radia- 
tion passes through the atmosphere to heat the Earth’s 
surface and drive atmospheric convection from below, 
whereas in the ocean, convection is driven by buoy- 
ancy loss from above as the ocean exchanges heat 
and freshwater at the surface (including through brine 
rejection in sea-ice formation). The heat flux at the 
ocean surface has four components: sensible heat flux 
(which depends on the wind speed and air/sea temper- 
ature difference), latent heat flux (from evaporation/ 
precipitation), incoming shortwave radiation from the 
sun, and longwave radiation from the atmosphere and 
ocean. The net freshwater flux is given by evapora- 
tion minus precipitation, including the influences of 
river runoff and ice-formation processes. Winds blow- 
ing over the ocean surface exert a stress on it and 
directly drive ocean circulations, particularly in the 
upper kilometer or so. The wind stress is typically 
parametrized by $\tau_{wind} = \rho_a c_D u_{10}^2$, where $\rho_a$ is the density
of air, $u_{10}$ is the wind speed at 10 m, and $c_D$ is a
drag coefficient (a function of wind speed, atmospheric 
stability, and sea state). Below the surface, the winds, 
the flow over seafloor topography, the tides, and other 
processes indirectly influence the circulation. 
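
A quick evaluation of the bulk formula shows the quadratic dependence on wind speed. The air density of 1.2 kg m$^{-3}$ and the constant drag coefficient of $1.3 \times 10^{-3}$ used below are illustrative assumptions; as the text notes, the drag coefficient in practice varies with wind speed, stability, and sea state.

```python
RHO_AIR = 1.2              # assumed air density in kg m^-3
DRAG_COEFFICIENT = 1.3e-3  # assumed constant drag coefficient (dimensionless)


def wind_stress(u10, rho_air=RHO_AIR, drag_coefficient=DRAG_COEFFICIENT):
    """Bulk wind stress tau = rho_a * c_D * u10**2, in N m^-2."""
    return rho_air * drag_coefficient * u10 ** 2


for speed in (5.0, 10.0, 20.0):
    print(f"u10 = {speed:4.1f} m/s: tau = {wind_stress(speed):.3f} N m^-2")
```

Doubling the 10 m wind speed quadruples the stress in this sketch, since the drag coefficient is held fixed.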

5 Dynamics of the Atmosphere and the Oceans

The mathematics of fluid dynamics governs the motion 
of the atmosphere and the oceans. This framework
can be used to understand key features of the Earth’s
weather and climate and to predict future change. 
Density stratification and the Earth's rotation provide 
strong constraints that organize the fluid flows. 


5.1 Rotating Stratified Fluids 


In studies of fluid dynamics it is useful to describe the 
evolution of a parcel of fluid as it follows the flow. This 
rate of change of a quantity is given by the Lagrangian 
derivative, $D/Dt$, which is defined by

$$\frac{D}{Dt} = \frac{\partial}{\partial t} + \mathbf{u} \cdot \nabla,$$

where $\mathbf{u}$ is the velocity of the flow. When the wind blows
or the ocean currents flow, they carry properties, such
as heat and pollutants, with them. This is described by
the term $\mathbf{u} \cdot \nabla$, which represents advection. There are
five key variables relevant to the equations of motion
for fluid flow: the velocity $\mathbf{u} = (u, v, w)$, the pressure
$p$, and the temperature $T$. Correspondingly, there are
five independent equations resulting from Newton’s 
second law, conservation of mass, and the first law of 
thermodynamics. 

Newton’s second law states that in an inertial frame, 


$$\frac{D\mathbf{u}}{Dt} = -\frac{1}{\rho}\nabla p + \mathbf{g}^* + \mathcal{F},$$

where $-(1/\rho)\nabla p$ is the pressure gradient force of relevance
for fluids, $\mathbf{g}^*$ is the gravitational force, and $\mathcal{F}$
is the sum of the frictional forces, all per unit mass.
To represent the weather and climate, it is natural to 
describe the flow seen from the perspective of some- 
one on the surface of the Earth, and thus we need to 
consider the motion in the rotating frame of the Earth. 
The angular rotation of the frame is given by the vector
$\boldsymbol{\Omega}$ pointing in the direction of the axis of rotation, with
magnitude equal to the Earth’s angular rate of rotation
$\Omega = 7.27 \times 10^{-5}$ s$^{-1}$ (one revolution per day). Newton’s
second law as described above holds in an inertial
frame of reference. When it is translated into the
rotating frame of reference, additional terms are introduced
that are specific to that frame. The flow velocities
in the rotating and inertial frames are related by
$\mathbf{u}_{\text{inertial}} = \mathbf{u}_{\text{rotating}} + \boldsymbol{\Omega} \times \mathbf{r}$, and the Lagrangian derivative
is given by

$$\left(\frac{D\mathbf{u}_{\text{in}}}{Dt}\right)_{\text{in}} = \left(\frac{D\mathbf{u}_{\text{rot}}}{Dt}\right)_{\text{rot}} + 2\boldsymbol{\Omega} \times \mathbf{u}_{\text{rot}} + \boldsymbol{\Omega} \times (\boldsymbol{\Omega} \times \mathbf{r}).$$


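The additional terms are therefore $2\boldsymbol{\Omega} \times \mathbf{u}_{\text{rot}}$, the Coriolis
acceleration, and $\boldsymbol{\Omega} \times (\boldsymbol{\Omega} \times \mathbf{r})$, the centrifugal
acceleration. When these terms are moved to the force side of
Newton’s second law they change sign. It is convenient to
combine the centrifugal force with the gravitational force
in one term $\mathbf{g} = -g\hat{\mathbf{z}} = \mathbf{g}^* - \boldsymbol{\Omega} \times (\boldsymbol{\Omega} \times \mathbf{r})$,
where $\hat{\mathbf{z}}$ represents a unit vector parallel
to the local vertical. The gravity, $g$, defined in
this way is the gravity measured in the rotating frame,
$g = 9.81$ m s$^{-2}$.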

The thinness of the atmosphere/ocean enables a 
local Cartesian coordinate system, which neglects the 
Earth’s curvature, to be used for many problems. Taking
the unit vectors $\hat{\mathbf{x}}$, $\hat{\mathbf{y}}$, and $\hat{\mathbf{z}}$ to be eastward (zonal),
northward (meridional), and upward, respectively, the
rotation vector can be written in this coordinate basis
as $\boldsymbol{\Omega} = (0, \Omega \cos\varphi, \Omega \sin\varphi)$ for latitude $\varphi$. The order
of magnitude of zonal velocities in the atmosphere is
$|u| \sim 10$ m s$^{-1}$ (less in the ocean), so $\Omega u \ll g$. In
addition, in both the atmosphere and the ocean, vertical
velocities $w$, typically $< 10^{-1}$ m s$^{-1}$, are much
smaller than horizontal velocities. Hence, $2\boldsymbol{\Omega} \times \mathbf{u}$ can
be approximated by $f\hat{\mathbf{z}} \times \mathbf{u}$, where $f = 2\Omega \sin\varphi$ is the
Coriolis parameter. (In many instances it proves useful
to approximate the Coriolis parameter by its constant
value at a particular latitude $f = f_0 = 2\Omega \sin\varphi_0$
or by its Taylor expansion for small latitudinal departures
from a particular latitude $f = f_0 + \beta y$, where
$\beta = \partial f/\partial y$; the latter is known as the $\beta$-plane approximation.)
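
The size of the Coriolis parameter and the accuracy of the $\beta$-plane expansion can be checked numerically. The sketch below assumes that the northward distance is measured as $y = a(\varphi - \varphi_0)$, so that $\beta = 2\Omega\cos\varphi_0 / a$; this relation and the function names are our assumptions, while $\Omega$ and $a$ are the values quoted in the text.

```python
import math

OMEGA = 7.27e-5        # Earth's rotation rate in s^-1 (from the text)
EARTH_RADIUS = 6.37e6  # Earth's mean radius a in m (from the text)


def coriolis_parameter(lat_deg):
    """f = 2 * Omega * sin(latitude), in s^-1."""
    return 2.0 * OMEGA * math.sin(math.radians(lat_deg))


def beta_parameter(lat_deg):
    """beta = df/dy = 2 * Omega * cos(latitude) / a, assuming y = a * (phi - phi0)."""
    return 2.0 * OMEGA * math.cos(math.radians(lat_deg)) / EARTH_RADIUS


def beta_plane_f(lat0_deg, y_m):
    """First-order Taylor expansion f0 + beta * y about latitude lat0_deg."""
    return coriolis_parameter(lat0_deg) + beta_parameter(lat0_deg) * y_m


# Compare the expansion with the exact value 100 km north of 45 degrees.
y = 1.0e5
lat_exact = 45.0 + math.degrees(y / EARTH_RADIUS)
f_exact = coriolis_parameter(lat_exact)
f_beta = beta_plane_f(45.0, y)
relative_error = abs(f_beta - f_exact) / f_exact

print(f"f at 45 degrees:    {coriolis_parameter(45.0):.3e} s^-1")
print(f"beta at 45 degrees: {beta_parameter(45.0):.3e} m^-1 s^-1")
print(f"beta-plane relative error 100 km north: {relative_error:.1e}")
```

The parameter $f$ vanishes at the equator, which is why the text notes that its influence is small in the tropics.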

With these various conventions and approximations, 
Newton’s second law in a rotating frame becomes 

$$\frac{D\mathbf{u}}{Dt} + \frac{1}{\rho}\nabla p - \mathbf{g} + f\hat{\mathbf{z}} \times \mathbf{u} = \mathcal{F}. \tag{3}$$


For some purposes it is useful to write the gravity term
as the gradient of a potential function $\Phi$, known as
the geopotential, $\nabla\Phi = -\mathbf{g}$. The geopotential $\Phi(z)$ at
a height $z$ is the work required to raise a unit mass to
height $z$ from mean sea level, $\Phi = \int_0^z g\,dz$. The remaining
two equations are the conservation of mass (the
mass of a fixed volume can change only if $\rho$ changes,
and this requires a mass flux into the volume),

$$\frac{D\rho}{Dt} + \rho \nabla \cdot \mathbf{u} = 0, \tag{4}$$

and the first law of thermodynamics (see (2)), 


$$\frac{DQ}{Dt} = c_p \frac{DT}{Dt} - \frac{1}{\rho}\frac{Dp}{Dt}. \tag{5}$$

Here, $DQ/Dt$ is the diabatic heating rate per unit
mass, which in the atmosphere is mostly due to latent 
heating/cooling (condensation/evaporation) and radia- 
tive heating/cooling. In the ocean, analogous equations 
hold for temperature and salinity. 

The equations of motion we have just derived are a 
simplified form of the equations that are at the heart 
of weather and climate models. Such models solve
numerically discretized versions of the equations of
motion, and computational constraints mean that there 
is a limit to the scale of motion that can be directly 
resolved. In the atmosphere, large-scale motions such
as weather systems ($\sim$1000 km) are well captured, but
smaller-scale motions such as convective storm systems
($\sim$1-100 km) generally need to be parametrized,
i.e., represented approximately in terms of the larger- 
scale resolved variables. Similarly, in the ocean small- 
scale processes are parametrized (indeed, the scales 
of motion are typically ten times smaller in the ocean 
than in the atmosphere, making the problem even more 
challenging). 

The forcing of an atmosphere-only model may in- 
clude specified solar input, radiatively active gases, 
sea-surface temperature, and sea ice. What is and is 
not included in the model depends on whether the 
processes are important over the timescale for which 
the model is being used to project (hours to weeks 
for weather models, decades to centuries for climate 
models). For climate projections, coupled models 
[IV.16 §5] are usually employed. In these, separate models
of the atmosphere, ocean, ice, land, and some chemical
cycles are linked together in such a way that changes
in one may influence another. 

5.2 Circulation of the Atmosphere and Ocean 

Tropical regions receive more incoming solar radiation 
than polar regions because the solar beam is concen- 
trated over a smaller area due to the spherical curva- 
ture of the Earth. If the Earth were not rotating, the 
atmospheric circulation would be driven by the pole- 
to-equator temperature difference, with warm air ris- 
ing in the tropical regions and sinking in the polar 
regions. On the rotating Earth, as air moves away from 
the equator in the upper troposphere, it gains an east- 
ward (westerly) velocity component from the Coriolis 
effect, as we will describe mathematically below. In the 
tropics the Coriolis parameter f is small, but elsewhere 
it has a significant influence. The overturning circula- 
tion in the atmosphere is therefore confined to the trop- 
ics: if it extended all the way to the poles, the west- 
erly component arising from the Coriolis effect would 
become infinite. Moist air rises near the equator and 
dry air descends in the subtropical desert regions at 
about 20-30°. This overturning circulation is known 
as the Hadley circulation. During the course of a year, 
the pattern of solar forcing migrates: north in northern 
summer, south in southern summer. The entire Hadley 
circulation shifts seasonally following the solar forcing.

Figure 2 Temperature (thin gray contours, interval 5 K)
and westerly winds (thick black contours, interval 5 m s⁻¹;
zero wind, dotted line; easterly winds, dashed line) in the
troposphere. All quantities have been averaged in longitude.
Values are typical of the Northern Hemisphere winter.
Approximate altitude is also indicated.

In the upper troposphere, at the poleward extent of 
the Hadley circulation (about 30°), are the jet streams 
of strong westerly flow (see figure 2). They are strongest 
in winter, with average speeds of around 30 m s⁻¹. The
equatorward return flow at the surface is weak. The 
influence of friction together with the Coriolis effect 
results in the northeasterly trade winds (i.e., winds 
originating from the northeast) in the Northern Hemi- 
sphere and the southeasterly trade winds in the South- 
ern Hemisphere, as we will describe mathematically 
below. 

The westerly flow in the jet streams is hydrodynami- 
cally unstable and can spontaneously break down into 
vortical structures known as eddies, which manifest 
themselves as traveling weather systems. Eddies play 
a vital role in transporting heat, moisture, and chem- 
ical species in the latitude/height plane. Observations 
indicate that in the annual mean the tropical regions 
emit less radiation back to space than they receive and 
that the polar regions emit more radiation than they 
receive. This implies that there must be a transport of 
energy from the equator to the pole that takes place in
the atmosphere and/or the ocean. In the atmosphere, 
heat is transported poleward through the tropics by 
the Hadley circulation; at higher latitudes, eddies are 
mainly responsible for the heat transport. 

In the ocean there is an overturning circulation 
that encompasses a system of surface and deep cur- 
rents running through all basins. This circulation trans- 
ports heat— and also salt, carbon, nutrients, and other 
substances— around the globe and connects the sur- 
face ocean and atmosphere with the huge reservoirs 
of the deep ocean. As such, it is of critical importance
to the global climate system. We have discussed
above the requirement for poleward heat transport in
the atmosphere and/or ocean to explain the observed
incoming/outgoing radiation profiles. Detailed calcula- 
tions indicate that the atmosphere is responsible for 
the bulk of the transport in the middle and high lat- 
itudes, but the ocean makes up a considerable frac- 
tion, particularly in the tropics. Heat is transported 
poleward by the ocean in the overturning circulation if 
waters moving poleward are compensated by equator- 
ward flow at colder temperatures. In the Atlantic heat 
transport is northward everywhere, but in the Pacific 
the heat transport is directed poleward in both hemi- 
spheres, while the Indian Ocean provides a poleward 
transport in the Southern Hemisphere. The net heat 
transport is poleward in each hemisphere. 

The surface ocean currents are dominated by closed 
circulation patterns known as gyres. In the Northern 
Hemisphere there are gyres in the subtropics of the 
Pacific and Atlantic, with eastward flow at midlatitudes 
and westward flow at the equator. The current speed 
in the interior of these gyres is <10 cm s⁻¹, but at
the western edge there are strong poleward currents
(the Kuroshio in the Pacific and the Gulf Stream in
the Atlantic) with speeds >100 cm s⁻¹. Ocean surface
waters are only dense enough to sink down to
the deep abyss at a few key locations: particularly in 
the northern North Atlantic and around Antarctica. 
Deep ocean convection occurs only in these cold high- 
latitude regions, where the internal stratification is 
small and surface density can increase through direct 
cooling/evaporation or brine rejection in sea-ice for- 
mation. Dense water formed in the North Atlantic 
flows south as a deep western boundary current and 
eventually enters the Southern Ocean, where it mixes 
with other water masses. Ultimately, the deep water is 
brought up to the surface by vertical mixing (tides and 
winds are the primary sources of energy for this) and 
by the overturning circulation in the Southern Ocean. 
Hydrodynamic instabilities associated with the major 
currents can generate eddies, which are ubiquitous 
in the ocean. As in the atmosphere, these eddies are 
responsible for heat transport and they play a cen- 
tral role in driving the overturning circulation in the 
Southern Ocean. 

6 Dynamical Processes 

To understand the behavior of different dynamical pro- 
cesses in the atmosphere and the oceans, it is useful 
to consider the relative magnitudes of the different 
terms in the equation of motion (3). For this purpose
we introduce the Rossby number Ro, which is the ratio
of acceleration terms to Coriolis terms. If $U$ is a typical
velocity scale and $L$ is a typical length scale, $\mathrm{Ro} = U/fL$.
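
As a quick numerical illustration of this definition, the following minimal Python sketch evaluates Ro for the midlatitude atmospheric and oceanic scales quoted in section 6.2. The function name `rossby_number` and the variable names are illustrative choices, not notation from the article.

```python
def rossby_number(U, f, L):
    """Rossby number Ro = U / (f L).

    U: velocity scale (m/s), f: Coriolis parameter (1/s), L: length scale (m).
    """
    return U / (f * L)


# Order-of-magnitude midlatitude scales quoted in section 6.2.
ro_atmosphere = rossby_number(U=10.0, f=1e-4, L=1e6)  # weather systems
ro_ocean = rossby_number(U=0.1, f=1e-4, L=1e6)        # ocean gyres

print(f"Ro (atmosphere) ~ {ro_atmosphere:.1e}")
print(f"Ro (ocean)      ~ {ro_ocean:.1e}")
```

Both values are much smaller than one, which is the regime in which the acceleration terms can be neglected relative to the Coriolis terms.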


6.1 Ocean Surface Waves 


For ocean waves crashing onshore, the influence of the 
Earth’s rotation is small and so Ro is large and the Cori- 
olis terms in (3) can be neglected. The wave motion 
results when the water surface is displaced above its 
equilibrium level and gravity acts to pull it downward. 
An oscillatory motion results as the water overshoots 
its equilibrium position in both vertical directions and 
is restored by pressure from the surrounding water 
mass in one direction and by gravity in the other. 

For a fluid with uniform density $\rho_0$, a free surface at
$z = 0$, and a bottom at $z = -H$, the equilibrium solution
has zero velocity and the equilibrium pressure $p^*$
is given by integrating the hydrostatic balance equation
(1): $p^*(z) = -g\rho z$, where $\rho$ takes the value $\rho_0$ in
the fluid and zero above it. A perturbation to this system
can be defined such that the perturbed position of
the free surface is given by $z = \eta(x, y, t)$ and the perturbed
pressure is given by $p = p^* + p'$. Neglecting the
Coriolis and frictional terms, (3) gives

$$\frac{D\mathbf{u}}{Dt} = -\frac{1}{\rho}\nabla p'.$$


Taking the $x$, $y$, and $z$ derivatives of the components
of this equation and using the continuity equation (4)
results in Laplace's equation [III.18] for $p'$, $\nabla^2 p' = 0$.
The relevant boundary conditions are that

(i) there is no normal flow at the bottom ($w = 0$ at
$z = -H$), from which it follows that $\partial p'/\partial z$ must
vanish at $z = -H$ by considering the $z$ component
of (3);

(ii) a particle in the free surface $z = \eta$ will remain in
it, i.e., $D(z - \eta)/Dt = 0$, which gives $w = \partial\eta/\partial t$ at
$z = \eta$ for small perturbations; and

(iii) the pressure must vanish at the free surface, i.e.,
$p' = \rho g\eta$ at $z = \eta$.


The two conditions at $z = \eta$ can also be applied at $z = 0$
to good approximation.

A wavelike solution to this system can be sought that
is assumed to correspond to a displacement $\eta$ of the
same form, where $\hat{\eta}$ is the amplitude, $\mathbf{k} = (k, l)$ is the
wave number, and $\omega$ is the frequency. This takes the
form

$$\eta = \mathrm{Re}\,\hat{\eta}\exp(i(kx + ly - \omega t)).$$

From Laplace's equation, $\partial^2 p'/\partial z^2 - \kappa^2 p' = 0$, where
$\kappa^2 = k^2 + l^2$. Applying the first and third boundary
conditions, the solution is

$$p' = \frac{\rho g\hat{\eta}\cos(kx + ly - \omega t)\cosh\kappa(z + H)}{\cosh\kappa H}.$$

Using the $z$-component of (3) it can be shown that the
second boundary condition can be met, and the solution
made consistent with the assumed form for $\eta$, only if

$$\omega^2 = g\kappa\tanh\kappa H.$$

This dispersion relation determines the frequency $\omega$ of
waves of a given wave number and hence also the phase
speed $c = \omega/\kappa$.

In the shallow-water or longwave limit when $\kappa H \ll 1$,
$\tanh\kappa H \to \kappa H$ and hence $c = \sqrt{gH}$. This means that
all long waves travel at the same speed. Earthquakes
on the sea floor can excite tsunamis [V.19]. These are
long waves with wavelengths up to hundreds of kilometers
on an ocean that is at most a few kilometers
deep. A tsunami therefore propagates at speed
$\sqrt{gH} \approx 200$ m s⁻¹ without dispersion, allowing its energy to
be maintained as it crosses a vast expanse of ocean.
The tsunami slows as it approaches the shore and the
water depth shallows, and its wavelength $\lambda = 2\pi c/\omega$
decreases ($\omega$ is constant). For the same energy density,
the amplitude increases until nonlinear effects become
important and the wave breaks.
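
The dispersion relation and its long-wave limit can be checked numerically. The sketch below is only an illustration: the ocean depth of 4 km, the wavelength of 200 km, and the near-shore depth of 10 m are assumed values chosen to be consistent with the orders of magnitude quoted above, and the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration (m/s^2)


def wave_frequency(kappa, depth):
    """Frequency from the dispersion relation omega^2 = g*kappa*tanh(kappa*H)."""
    return math.sqrt(G * kappa * math.tanh(kappa * depth))


def phase_speed(kappa, depth):
    """Phase speed c = omega / kappa."""
    return wave_frequency(kappa, depth) / kappa


def shallow_water_speed(depth):
    """Long-wave limit c = sqrt(g*H), valid when kappa*H << 1."""
    return math.sqrt(G * depth)


depth = 4000.0      # assumed open-ocean depth (m)
wavelength = 200e3  # assumed tsunami wavelength (m)
kappa = 2 * math.pi / wavelength

c_full = phase_speed(kappa, depth)       # full dispersion relation
c_long = shallow_water_speed(depth)      # long-wave approximation
c_shore = shallow_water_speed(10.0)      # assumed near-shore depth of 10 m

print(f"kappa*H = {kappa * depth:.3f}")
print(f"c (full) = {c_full:.1f} m/s, c (long-wave) = {c_long:.1f} m/s")
print(f"c near shore = {c_shore:.1f} m/s")
```

With these assumed values $\kappa H$ is about 0.13, so the long-wave speed differs from the full dispersion relation by well under one percent, and the speed drops sharply as the depth shallows.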


6.2 Midlatitude Weather Systems and Ocean Gyres 


The midlatitudes are the regions between the tropics
and the polar regions. In these regions the typical velocity
scales and length scales of weather systems are
$U \sim 10$ m s⁻¹ and $L \sim 10^6$ m, and $f \sim 10^{-4}$ s⁻¹;
hence $\mathrm{Ro} \sim 0.1$. In the ocean, in midlatitude gyres, the
typical scales are $U \sim 0.1$ m s⁻¹ and $L \sim 10^6$ m, so
$\mathrm{Ro} \sim 10^{-3}$. Therefore, in both cases, because Ro is small
we can neglect the acceleration terms in (3) in favor of
the Coriolis terms. In addition, away from boundaries
the friction is negligible, so in the horizontal we have

$$f\hat{\mathbf{z}} \times \mathbf{u} + \frac{1}{\rho}\nabla p = 0.$$


This approximation is known as geostrophic balance. 
It is a balance between the Coriolis force and the hori- 
zontal pressure gradient force and is used to define the 
geostrophic velocity $\mathbf{u}_g$ given by

$$\mathbf{u}_g = (u_g, v_g) = \left(-\frac{1}{\rho f}\frac{\partial p}{\partial y},\ \frac{1}{\rho f}\frac{\partial p}{\partial x}\right). \qquad (6)$$

Figure 3 Counterclockwise (cyclonic) geostrophic flow
around a low-pressure center in the Northern Hemisphere.
The effect of the Coriolis force deflecting the flow is balanced
by the horizontal component of the pressure gradient
force, directed from high to low pressure.

Considering the vertical direction in (3), if friction $\mathcal{F}_z$
and vertical acceleration $Dw/Dt$ are small (as is generally
true for large-scale atmospheric and oceanic
systems), then this reduces to the equation (1) for 
hydrostatic balance introduced earlier. 

Geostrophically balanced flow is normal to the pres- 
sure gradient, i.e., along contours of constant pres- 
sure. In the Northern (Southern) Hemisphere, motion 
is therefore counterclockwise (clockwise) around the 
center of low-pressure systems (see figure 3). Note also 
that the speed depends on the magnitude of the pres- 
sure gradient: it is stronger when the isobars are closer. 
When the flow swirls counterclockwise in the Northern 
Hemisphere or clockwise in the Southern Hemisphere, 
it is called cyclonic flow (a hurricane is an example of 
this); the opposite direction is called anticyclonic flow. 
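
To make the direction and magnitude of geostrophic flow concrete, the following Python sketch evaluates the geostrophic velocity from a horizontal pressure gradient. The air density of 1.2 kg m⁻³ and the pressure gradient of 1 hPa per 100 km are assumed illustrative values, and `geostrophic_velocity` is a hypothetical helper name.

```python
def geostrophic_velocity(dp_dx, dp_dy, rho, f):
    """Geostrophic velocity (u_g, v_g) from horizontal pressure gradients.

    u_g = -(1 / (rho f)) dp/dy,  v_g = (1 / (rho f)) dp/dx.
    dp_dx, dp_dy in Pa/m, rho in kg/m^3, f in 1/s.
    """
    u_g = -dp_dy / (rho * f)
    v_g = dp_dx / (rho * f)
    return u_g, v_g


RHO_AIR = 1.2            # assumed near-surface air density (kg/m^3)
DP_DY = -100.0 / 100e3   # assumed: pressure falls by 1 hPa per 100 km northward

# Northern Hemisphere (f > 0): low pressure to the north gives westerly flow.
u_north, v_north = geostrophic_velocity(0.0, DP_DY, RHO_AIR, f=1e-4)

# Southern Hemisphere (f < 0): the same gradient gives easterly flow.
u_south, v_south = geostrophic_velocity(0.0, DP_DY, RHO_AIR, f=-1e-4)

print(f"Northern Hemisphere: u_g = {u_north:.1f} m/s")
print(f"Southern Hemisphere: u_g = {u_south:.1f} m/s")
```

Doubling the pressure gradient doubles the speed, in line with the statement that the flow is stronger where the isobars are closer together.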

Geostrophic balance can be used to explain many fea- 
tures that are observed in atmosphere and ocean flows. 
In the atmosphere, because of the pole-to-equator tem- 
perature gradient, there is a horizontal pressure gradi- 
ent above the surface in the troposphere going from 
warm tropical latitudes to cold polar latitudes. This 
is in geostrophic balance with the Coriolis force asso- 
ciated with the westerly flow of the midlatitude jet 
streams (see figure 2). Instabilities of the jet streams 
form eddies consisting of cyclonic geostrophic flow 
around low-pressure systems (these are midlatitude 
weather systems). In the ocean the sea surface is higher 
in the warm subtropical gyre of the North Atlantic 
than it is further north in the cool subpolar gyre, 
resulting in a pressure gradient force directed north- 
ward to geostrophically balance the southward Cori- 
olis force associated with the eastward-flowing Gulf 
Stream. Instabilities of the current generate eddies that 
are manifested as geostrophic flow around anomalies 
in sea-surface height. 


Further analysis highlights the fact that rotation provides
strong constraints on the flow. In the case where
$\rho$ and $f$ are constant, by taking the horizontal derivatives
of the geostrophic velocity it can be shown that it
is horizontally nondivergent. Taking the vertical derivative
and using the hydrostatic balance equation gives
$(\partial u_g/\partial z, \partial v_g/\partial z) = 0$. The equation for the conservation
of mass (4) can then be applied to show that
$\partial w_g/\partial z = 0$. Hence, the geostrophic velocity does
not vary in the vertical and is two dimensional. Under
slightly more general conditions, namely for a slow,
steady, frictionless flow of a barotropic (i.e., density
depends only on pressure, so $\rho = \rho(p)$), incompressible
fluid, it can be shown that the horizontal and vertical
components of the velocity cannot vary in the direction
of the rotation vector $\boldsymbol{\Omega}$, and hence the flow is two
dimensional. This is known as the Taylor-Proudman
theorem. It means that vertical columns of fluid remain
vertical (they cannot be tilted or stretched).

In general, however, the density in the atmosphere
and the oceans does vary on pressure surfaces, as it
depends on temperature, for example. In this case the
fluid is said to be baroclinic. If the density can be
written as $\rho = \rho_0 + \sigma$, where $\rho_0$ is a constant reference
density (usually taken to be 1000 kg m⁻³ for the
ocean) and $\sigma \ll \rho_0$ (this is generally the case in the
ocean), then replacing $\rho$ by $\rho_0$ in the denominator of
the geostrophic velocity (6), taking $\partial/\partial z$, and making
use of the equation for hydrostatic balance (1) gives

$$\left(\frac{\partial u_g}{\partial z},\ \frac{\partial v_g}{\partial z}\right) = \frac{g}{\rho_0 f}\left(\frac{\partial\sigma}{\partial y},\ -\frac{\partial\sigma}{\partial x}\right).$$

For the compressible atmosphere with larger density
variations, it is often useful to consider the relevant
equations with pressure as the vertical coordinate. This
can be done through the introduction of a log-pressure
coordinate defined as $z = -H\ln(p/p_0)$. Here, $p_0$ is a
reference pressure (usually taken to be 1000 hPa) and
the quantity $H = RT_0/g$, known as the scale height, is
the height over which the pressure falls by a factor of
$e$. If $T_0 = 250$ K (a typical value for the troposphere)
then $H \approx 7.3$ km. From (3), the horizontal momentum
equation in log-pressure coordinates in the case where
friction can be neglected can be written as

$$\frac{D\mathbf{u}}{Dt} + f\hat{\mathbf{z}} \times \mathbf{u} + \nabla\Phi = 0,$$

where $D/Dt = \partial/\partial t + \mathbf{u}\cdot\nabla_h + w\,\partial/\partial z$ with $\nabla_h$ representing
the horizontal gradient components and where $\Phi$ is
the geopotential. Additionally, the hydrostatic equation
can be written as


$$\frac{\partial\Phi}{\partial z} = \frac{RT}{H}.$$

These equations can be used to show that

$$f\frac{\partial u_g}{\partial z} = -\frac{R}{H}\frac{\partial T}{\partial y}, \qquad f\frac{\partial v_g}{\partial z} = \frac{R}{H}\frac{\partial T}{\partial x}.$$

In both the atmosphere and the ocean the vertical gradient
in the geostrophic velocities is therefore related
to the horizontal density gradient. This is known as the
thermal wind relationship. As discussed before, there
is a pole-to-equator temperature gradient in the atmosphere
($f^{-1}\,\partial T/\partial y < 0$ in both hemispheres), which
implies $\partial u_g/\partial z > 0$ and, hence, that with increasing
height the winds become more westerly in both hemispheres.
Consistent with this, strong jet streams are
observed in midlatitudes of both hemispheres in the
upper troposphere (see figure 2).

6.3 Ekman Layers and Wind-Driven Ocean 
Circulation 

Where frictional effects become important, such as 
close to boundaries, geostrophic balance no longer 
holds. For small Ro numbers, as found in the midlat- 
itude atmosphere and oceans, the acceleration terms 
in (3) can again be neglected in favor of the Coriolis 
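terms, and hence the horizontal velocity $\mathbf{u}$ is given by

$$f\hat{\mathbf{z}} \times \mathbf{u} + \frac{1}{\rho}\nabla p = \boldsymbol{\mathcal{F}}. \qquad (7)$$

The result is that friction introduces an ageostrophic
component of flow (high to low pressure), $\mathbf{u} = \mathbf{u}_g + \mathbf{u}_{ag}$.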
This effect is important in the lower 1 km or so of the 
atmosphere and the upper 100 m or so of the ocean. 
The ageostrophic component explains, for example, 
the meridional (north-south) component of the trade 
winds. 

The geostrophic flow is horizontally nondivergent 
(except on planetary scales), but the ageostrophic flow 
is not. Winds that deviate ageostrophically toward low- 
pressure systems near the surface are convergent, and 
through mass continuity there must be an associated 
vertical velocity away from the surface. This is known 
as Ekman pumping. In the atmosphere Ekman pump- 
ing produces ascent, cooling, clouds, and sometimes 
rain in low-pressure systems. In the ocean it is a key 
component of the circulation in gyres. 

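The wind stress at the ocean surface gives rise to a
force per unit mass on a slab of ocean of

$$\boldsymbol{\mathcal{F}}_{\mathrm{wind}} = \frac{1}{\rho_0}\frac{\partial\boldsymbol{\tau}_{\mathrm{wind}}}{\partial z},$$

where $\rho_0$ is the density of the slab. This directly drives
ocean circulations close to the surface in the Ekman
layer. At the surface, $z = 0$, the stress is $\boldsymbol{\tau}(0) = \boldsymbol{\tau}_{\mathrm{wind}}$,
and this decays over the depth $\delta \sim$ 10-100 m
of the Ekman layer, so $\boldsymbol{\tau}(-\delta) = 0$. The ageostrophic
component of motion, $\mathbf{u}_{ag}$, is obtained by substituting
the force arising from the wind into (7), giving
$f\hat{\mathbf{z}} \times \mathbf{u}_{ag} = (\rho_0)^{-1}\,\partial\boldsymbol{\tau}/\partial z$. By integrating this equation
over the depth of the Ekman layer, it can be shown that
the lateral mass transport over the layer is given by

$$\mathbf{M}_{\mathrm{Ek}} = \int_{-\delta}^{0}\rho_0\mathbf{u}_{ag}\,dz = \frac{\boldsymbol{\tau}_{\mathrm{wind}} \times \hat{\mathbf{z}}}{f}. \qquad (8)$$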
The mass transport in the Ekman layer is therefore 
directed to the right (left) of the wind in the Northern 
(Southern) Hemisphere. Further analysis indicates that, 
in the Northern Hemisphere (directions are reversed in 
the Southern Hemisphere), (i) the horizontal currents 
at the surface are directed at 45° to the right of the 
surface wind and (ii) the currents spiral in an anti- 
cyclonic (clockwise) direction with depth through the 
ocean, decaying exponentially in magnitude away from 
the surface. Similar Ekman spirals exist at the bottom of 
the ocean and the atmosphere, but the direction of the 
flow is opposite. Winds at the surface are therefore 45° 
to the left of the winds in the lower troposphere above 
the planetary boundary layer, which means that the cur- 
rents at the sea surface are nearly in the direction of the 
lower tropospheric winds. 
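
The direction of the Ekman mass transport in (8) can be verified with a short Python sketch. The wind stress of 0.1 N m⁻² and the Coriolis parameter of magnitude 10⁻⁴ s⁻¹ are assumed illustrative values; `ekman_transport` is a hypothetical helper name.

```python
def ekman_transport(tau_x, tau_y, f):
    """Ekman mass transport M_Ek = (tau x z_hat) / f.

    For tau = (tau_x, tau_y, 0) and z_hat = (0, 0, 1), the cross product
    tau x z_hat equals (tau_y, -tau_x, 0).
    Returns (M_x, M_y) in kg m^-1 s^-1 for stress in N/m^2 and f in 1/s.
    """
    return tau_y / f, -tau_x / f


TAU_X = 0.1  # assumed eastward wind stress (N/m^2)

# Northern Hemisphere (f > 0): transport is southward, to the right of the wind.
m_x_north, m_y_north = ekman_transport(TAU_X, 0.0, f=1e-4)

# Southern Hemisphere (f < 0): transport is northward, to the left of the wind.
m_x_south, m_y_south = ekman_transport(TAU_X, 0.0, f=-1e-4)

print(f"Northern Hemisphere: M_Ek = ({m_x_north:.0f}, {m_y_north:.0f}) kg/(m s)")
print(f"Southern Hemisphere: M_Ek = ({m_x_south:.0f}, {m_y_south:.0f}) kg/(m s)")
```

Note that the transport depends only on the surface stress and $f$, not on the depth $\delta$ of the layer or on the detailed stress profile within it.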

In the anticyclonic subtropical gyres, Ekman trans- 
port results in the flow converging horizontally toward 
the center of the gyre. Mass conservation then implies 
downwelling through Ekman pumping (see figure 4). 
In the cyclonic subpolar gyres there is divergence and 
Ekman suction. In the incompressible ocean, the vertical
velocity $w$ is given by $\nabla_h\cdot\mathbf{u}_{ag} + \partial w/\partial z = 0$, since the
geostrophic flow is nondivergent. The vertical velocity
at the surface is zero, and so integrating this equation
over the Ekman layer gives a vertical velocity at the base
of the Ekman layer $w_{\mathrm{Ek}}$ of

$$w_{\mathrm{Ek}} = \frac{1}{\rho_0}\nabla_h\cdot\mathbf{M}_{\mathrm{Ek}} = \frac{1}{\rho_0}\hat{\mathbf{z}}\cdot\nabla \times \left(\frac{\boldsymbol{\tau}_{\mathrm{wind}}}{f}\right).$$

The Ekman pumping velocity $w_{\mathrm{Ek}}$ therefore depends on
the curl of $\boldsymbol{\tau}_{\mathrm{wind}}/f$, which is largely set by variations in
the wind stress.

In the ocean interior, away from the surface or continental
boundaries, friction is negligible, and to a good
approximation, the flow is in geostrophic balance. However,
the flow does respond to the pattern of vertical
velocities imposed by the Ekman layer above. The
horizontal divergence of geostrophic flow is associated
with vertical stretching of water columns in the interior
because $\nabla_h\cdot\mathbf{u}_g + \partial w/\partial z = 0$ for an incompressible flow.

Figure 4 Schematic indicating the Ekman and Sverdrup
transports (solid lines) and Ekman pumping (dashed lines)
associated with wind-driven ocean gyres.


From (6), the horizontal divergence of the geostrophic
velocity is $\nabla_h\cdot\mathbf{u}_g = -(\beta/f)v_g$, where $\beta = df/dy$.
Hence, the vertical and meridional (i.e., northward) currents
are related by $\beta v_g = f\,\partial w/\partial z$. There is therefore
expected to be an equatorward (poleward) component
to the horizontal velocity where $w_{\mathrm{Ek}} < 0$ ($w_{\mathrm{Ek}} > 0$). This
means that in the subtropical gyres the interior flow 
away from boundaries is equatorward (see figure 4). 
The interior flow must be consistent with the sense of 
the wind circulation, and so on the western side of the 
ocean basin, where frictional effects at the continental 
boundary mean geostrophy breaks down, there is a nar- 
row return flow. In the subtropical gyres the wind puts 
anticyclonic vorticity into the ocean, which is removed 
by friction at the boundary as the flow returns poleward 
on the western side. 

The full depth-integrated flow $V$ can be obtained
by considering the generic equation for the meridional
velocity $v$, obtained from the incompressibility
condition together with (7):

$$\beta v = f\frac{\partial w}{\partial z} + \frac{1}{\rho_0}\frac{\partial}{\partial z}\left(\frac{\partial\tau_y}{\partial x} - \frac{\partial\tau_x}{\partial y}\right).$$

Integrating this from the bottom of the ocean ($z = -D$,
$w = 0$, $\boldsymbol{\tau} = 0$) to the surface ($z = 0$, $w = 0$, $\boldsymbol{\tau} = \boldsymbol{\tau}_{\mathrm{wind}}$)
gives

$$\beta V = \frac{1}{\rho_0}\hat{\mathbf{z}}\cdot\nabla \times \boldsymbol{\tau}_{\mathrm{wind}}.$$

Known as Sverdrup balance, this relates the depth-integrated
flow to the curl of the wind stress. It
dictates the sense of the motion in the subpolar and
subtropical gyres. In the Southern Ocean, which encir- 
cles Antarctica, at vertical levels where no topogra- 
phy exists to support east-west pressure gradients, 
there can be no mean meridional geostrophic flow and 
therefore the above Sverdrup approximation does not 
apply. 
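
A rough numerical reading of Sverdrup balance is sketched below in Python. All input values are assumptions chosen for illustration only: a zonal wind stress that increases northward by 0.2 N m⁻² over 2000 km (as between trade winds and midlatitude westerlies), $\beta = 2 \times 10^{-11}$ m⁻¹ s⁻¹, and the reference density of 1000 kg m⁻³ mentioned earlier. The function name is hypothetical.

```python
def sverdrup_transport(curl_tau, beta, rho0=1000.0):
    """Depth-integrated meridional flow V from beta * V = curl_z(tau) / rho0.

    curl_tau: vertical component of the wind-stress curl (N/m^3)
    beta: df/dy (1/(m s)); rho0: reference density (kg/m^3)
    Returns V in m^2/s (positive northward).
    """
    return curl_tau / (rho0 * beta)


BETA = 2e-11                  # assumed midlatitude value of df/dy (1/(m s))
DTAU_X_DY = 0.2 / 2000e3      # assumed northward increase of zonal stress (N/m^3)

# With tau = (tau_x(y), 0), the vertical component of the curl is -d(tau_x)/dy.
curl_subtropical = -DTAU_X_DY

V_subtropical = sverdrup_transport(curl_subtropical, BETA)
print(f"V = {V_subtropical:.1f} m^2/s")
```

The negative (anticyclonic) wind-stress curl gives a negative $V$, that is, an equatorward interior flow in a Northern Hemisphere subtropical gyre, consistent with the discussion of figure 4.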


6.4 Quasigeostrophic Flow and Baroclinic 
Instability 


For large-scale, low-frequency flows with small Ro
number, the velocity can be split into a geostrophic
part and an ageostrophic part, $\mathbf{u} = \mathbf{u}_g + \mathbf{u}_{ag}$, with
$|\mathbf{u}_{ag}|/|\mathbf{u}_g| \sim O(\mathrm{Ro})$. In this case the momentum can
be approximated by the geostrophic value, and the rate
of change of momentum or temperature following horizontal
motion can be approximated by the rate of
change following the geostrophic flow. Thus,
$D\mathbf{u}/Dt \approx D_g\mathbf{u}_g/Dt = \partial\mathbf{u}_g/\partial t + \mathbf{u}_g\cdot\nabla\mathbf{u}_g$, and the remaining terms
of the momentum equation in log-pressure coordinates
give

$$\frac{D_g\mathbf{u}_g}{Dt} = -f_0\hat{\mathbf{z}} \times \mathbf{u}_{ag} - \beta y\hat{\mathbf{z}} \times \mathbf{u}_g,$$

where the $\beta$-plane approximation has been used.

The geostrophic velocity is nondivergent, and so
there exists a stream function $\psi$ such that
$\mathbf{u}_g = \hat{\mathbf{z}} \times \nabla\psi$. Comparison with the geopotential indicates
$\psi = (\Phi - \Phi^*)/f_0$, with $\Phi^*(z)$ being a suitable reference
geopotential profile. The continuity equation (4)
in log-pressure coordinates becomes

$$\frac{\partial u_{ag}}{\partial x} + \frac{\partial v_{ag}}{\partial y} + \frac{1}{\rho^*}\frac{\partial(\rho^* w)}{\partial z} = 0,$$

where $\rho^*(z) = \rho_0 e^{-z/H}$. The potential temperature
$\theta$ is related to the temperature by $\theta = T\Xi$, where
$\Xi = \exp(Rz/Hc_p)$. Using the hydrostatic balance
equation, it can be written as a reference profile,
$\theta^*(z) = (H/R)\Xi\,\partial\Phi^*/\partial z$, plus a disturbance,
$\theta' = (H/R)f_0\Xi\,\partial\psi/\partial z$. The thermodynamic equation can
then be approximated by

$$\frac{D_g\theta'}{Dt} + w\frac{\partial\theta^*}{\partial z} = \frac{1}{c_p}\frac{DQ}{Dt}\exp(Rz/Hc_p).$$


Together, these form the equations for quasigeostrophic
flow. When friction and diabatic heating are neglected,
these equations can be used to demonstrate
that the quantity $q$, the quasigeostrophic potential
vorticity given by

$$q = f_0 + \beta y + \nabla^2\psi + \frac{1}{\rho^*}\frac{\partial}{\partial z}\left(\rho^*\frac{f_0^2}{N^2}\frac{\partial\psi}{\partial z}\right),$$

is conserved following the flow: $D_g q/Dt = 0$. Here, $N$ is
a buoyancy frequency given by

$$N^2(z) = \frac{R}{H}\exp\left(-\frac{Rz}{Hc_p}\right)\frac{\partial\theta^*}{\partial z}.$$


Baroclinic instability is responsible for the development
of eddies in the atmosphere and ocean. As noted
earlier, such eddies play a central role in global heat
transport. The instability is a common feature of flows
in the atmosphere and the oceans because a rotating
fluid adjusts to be in geostrophic balance, rather than
rest, and in this configuration the fluid has potential 
energy that is available for conversion to other forms 
by a redistribution of mass. For example, the vertical 
shear of the westerly flow in the midlatitude jet streams 
is in thermal wind balance with a horizontal temper- 
ature gradient, and this provides available potential 
energy for baroclinic instability. However, having avail- 
able potential energy in the fluid is not sufficient for 
instability since rotation tends to inhibit the release of 
this potential energy. 

A highly simplified description of the instability pro- 
cess can be obtained by considering the evolution 
of a parcel of air in an idealized background state 
consistent with the typical mean state of the lower 
atmosphere, i.e., one with sloping surfaces of constant 
potential temperature in the height-latitude plane (see 
figure 5). If a parcel of air moves upward in a wedge 
between a sloping surface of constant potential tem- 
perature and the horizontal, and is replaced by a simi- 
lar parcel moving downward (as indicated in figure 5), 
the warm air parcel (light gray) is then surrounded by 
denser air and will continue to rise, and vice versa for 
the cold air parcel (dark gray). The potential energy 
is reduced since the heavier parcel has moved lower 
and the lighter parcel has moved higher. The potential 
energy released can be converted to kinetic energy of 
eddying motions. This baroclinic instability process is 
associated with a poleward and upward transport of 
heat. 

Analysis of idealized flows can provide an indication 
of the typical properties of baroclinic instability. The 
classic example, known as the Eady problem, considers 
the simplest possible model that satisfies the necessary
conditions for instability. The density is treated as
a constant except where it is coupled with gravity in
the vertical momentum equation (this is known as the
Boussinesq approximation), and both the Coriolis
parameter $f$ and the buoyancy frequency $N$ are
assumed to be constant. The model is set up with rigid lids
at $z = 0$ and $z = H$. The flow $\mathbf{u}$ is considered to be
a background mean field $\bar{\mathbf{u}} = (\bar{u}, 0, 0)$ with a constant
vertical shear, $\partial\bar{u}/\partial z = \Lambda$, plus a disturbance, $\mathbf{u}'$.

Figure 5 Schematic indicating parcel trajectories relative
to sloping surfaces of constant potential temperature (gray
lines) within a wedge of instability (filled region) for a
baroclinically unstable disturbance in a rotating frame.

With these conditions, $\partial\bar{q}/\partial y = 0$, which would
imply that the flow was stable if it were not for the
presence of the upper boundary. Linearizing the quasigeostrophic
potential vorticity equation gives

$$\left(\frac{\partial}{\partial t} + \bar{u}\frac{\partial}{\partial x}\right)q' = 0.$$

The linearized form of the thermodynamic equation
gives

$$\left(\frac{\partial}{\partial t} + \bar{u}\frac{\partial}{\partial x}\right)\frac{\partial\psi'}{\partial z} - \frac{\partial\psi'}{\partial x}\frac{d\bar{u}}{dz} + w'\frac{N^2}{f_0} = 0.$$

Vertical shear at the upper and lower boundaries
implies a temperature gradient in the $y$-direction (i.e.,
in the cross-flow direction) through thermal wind balance.
The advection of temperature at the upper and
lower boundaries must therefore be taken into account.
First considering the lower boundary only, with $\bar{u} = 0$
the equations become

$$\nabla^2\psi' + \frac{f_0^2}{N^2}\frac{\partial^2\psi'}{\partial z^2} = 0$$

and

$$\frac{\partial}{\partial t}\left(\frac{\partial\psi'}{\partial z}\right) - \Lambda\frac{\partial\psi'}{\partial x} = 0.$$


Looking for wavelike solutions of the form

$$\psi' = \mathrm{Re}\,\hat{\psi}(z)\exp(i(kx + ly + mz - \omega t))$$

leads to a dispersion relation $\omega = k\Pi/(l^2 + k^2)^{1/2}$,
where $\Pi = \Lambda f_0/N$. This gives an eastward phase speed
$c = \omega/k$. A poleward displacement of an air parcel on
the lower boundary will induce a warm anomaly, which
will be associated with a cyclonic circulation. A neighboring
equatorward displacement will induce a cold
anomaly and an anticyclonic circulation. Consideration
of the induced circulation pattern shows that the temperature
anomaly will propagate to the east. Considering
the upper boundary only, $\bar{u} = \Lambda H$ and this leads
to a dispersion relation $\omega = k\Lambda H - k\Pi/(l^2 + k^2)^{1/2}$.
This gives a westward phase speed relative to the flow.
These edge waves by themselves do not transport heat
and do not release any energy from the system. The
process of baroclinic instability in the Eady model relies
on the presence of the edge waves and, crucially, their
interaction.

For the full problem, one can find solutions to the linearized
quasigeostrophic potential vorticity equation
of the same wavelike form, with
$\hat{\psi}(z) = A\cosh\kappa z + B\sinh\kappa z$, where $\kappa^2 = (l^2 + k^2)N^2/f_0^2$ and the boundary
conditions define $A$ and $B$. In this case the dispersion
relation is

$$\omega = k\Lambda H\left(\tfrac{1}{2} \pm \alpha^{1/2}\right), \qquad \alpha = \frac{1}{4} - \frac{\coth\kappa H}{\kappa H} + \frac{1}{\kappa^2H^2}.$$

If $\alpha < 0$ the flow is unstable. Instability therefore
requires $(l^2 + k^2) < k_c^2 \approx 5.76/L_R^2$, where $k_c$ is a critical
wave number and $L_R = NH/f_0$ is the Rossby radius. So,
there are only certain horizontal scales of waves that
may grow exponentially, and the scale of these waves
will depend on the rotation rate, the depth of the layer,
and the static stability.

For a given zonal wave number $k$, the most unstable
growing mode is that for which the meridional
wave number $l$ is zero. The wave number for maximum
growth is $k \approx 1.61/L_R$, corresponding to a wavelength
of $L_{\max} = 2\pi/k \approx 3.9L_R$ and a growth rate
$\sigma \approx 0.31f_0\Lambda/N$. Applying typical order-of-magnitude
values for the atmosphere ($H \sim 10$ km, $U \sim 10$ m s⁻¹,
and $N \sim 10^{-2}$ s⁻¹, with $\Lambda \sim U/H$) gives
$L_{\max} \approx 4000$ km and $\sigma \approx 0.26$ day⁻¹.
For the ocean, $H \sim 1$ km, $U \sim 0.1$ m s⁻¹,
and $N \sim 10^{-2}$ s⁻¹, giving $L_{\max} \approx 400$ km and
$\sigma \approx 0.026$ day⁻¹. The atmospheric values are broadly consistent
with the observed spatial scales and growth rates of
midlatitude weather systems. In the ocean, the simple
scenario on which these values are based is not quantitatively
applicable, but the values give a qualitative
sense of the scale and growth rate of instabilities in the
ocean relative to the atmosphere.
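
The order-of-magnitude estimates above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the shear is estimated as $\Lambda \sim U/H$ and takes $f_0 = 10^{-4}$ s⁻¹ (the midlatitude value quoted in section 6.2); the function name `eady_scales` is hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def eady_scales(H, U, N, f0=1e-4):
    """Most unstable Eady wavelength and growth rate.

    Uses L_R = N H / f0, L_max ~ 3.9 L_R, and sigma ~ 0.31 f0 Lambda / N,
    with the shear estimated as Lambda ~ U / H (an assumption).
    Returns (L_R in m, L_max in m, growth rate in 1/day).
    """
    shear = U / H
    rossby_radius = N * H / f0
    wavelength = 3.9 * rossby_radius
    growth_rate = 0.31 * f0 * shear / N
    return rossby_radius, wavelength, growth_rate * SECONDS_PER_DAY


atm_radius, atm_wavelength, atm_growth = eady_scales(H=10e3, U=10.0, N=1e-2)
ocn_radius, ocn_wavelength, ocn_growth = eady_scales(H=1e3, U=0.1, N=1e-2)

print(f"Atmosphere: L_max ~ {atm_wavelength / 1e3:.0f} km, "
      f"sigma ~ {atm_growth:.2f} per day")
print(f"Ocean:      L_max ~ {ocn_wavelength / 1e3:.0f} km, "
      f"sigma ~ {ocn_growth:.3f} per day")
```

With these inputs the oceanic wavelength and growth rate both come out ten times smaller than the atmospheric ones, matching the ratio of the values quoted in the text.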

6.5 Rossby and Kelvin Waves 

Wave-like motions frequently occur in the atmosphere 
and oceans. One important class of waves is known as 
Rossby waves. The Rossby wave is a potential vorticity- 
conserving motion that owes its existence to an isen- 
tropic gradient of potential vorticity. 


In midlatitudes, taking the $\beta$-plane approximation
and considering a small-amplitude disturbance to a
uniform zonal background flow $\mathbf{u} = (U, 0, 0)$, one finds
a wavelike solution to the quasigeostrophic equations
with a dispersion relation

$$\omega_{\mathrm{Rossby}} = kU - \frac{\beta k}{k^2 + l^2 + f_0^2m^2/N^2},$$

where $N$ has been assumed to be constant for simplicity.
The zonal phase speed of the waves $c = \omega/k$ always
satisfies $U - c > 0$, i.e., the wave crests and troughs
move westward with respect to the background flow.
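
The westward propagation relative to the background flow can be illustrated with a short Python sketch of this dispersion relation. The parameter values (a 6000 km zonal wavelength, $l = m = 0$, $U = 10$ m s⁻¹, $\beta = 1.6 \times 10^{-11}$ m⁻¹ s⁻¹) are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def rossby_frequency(k, l, m, U, beta, f0=1e-4, N=1e-2):
    """Quasigeostrophic Rossby wave frequency on a beta-plane.

    omega = k U - beta k / (k^2 + l^2 + f0^2 m^2 / N^2)
    """
    return k * U - beta * k / (k**2 + l**2 + (f0 * m / N) ** 2)


def zonal_phase_speed(k, l, m, U, beta, f0=1e-4, N=1e-2):
    """Zonal phase speed c = omega / k."""
    return rossby_frequency(k, l, m, U, beta, f0, N) / k


U = 10.0                    # assumed background westerly flow (m/s)
BETA = 1.6e-11              # assumed midlatitude beta (1/(m s))
k = 2 * math.pi / 6000e3    # assumed zonal wavelength of 6000 km

c = zonal_phase_speed(k, l=0.0, m=0.0, U=U, beta=BETA)
print(f"c = {c:.2f} m/s, U - c = {U - c:.2f} m/s")
```

For this barotropic case $U - c = \beta/k^2$, which is positive for any wave number, so the crests always drift westward relative to the flow; whether they move westward relative to the ground depends on the size of $U$.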

The Coriolis parameter is much smaller in the tropics
than in the extratropics, and consequently the equatorial
$\beta$-plane approximation, in which $f \approx \beta y$ (where
$\beta = 2\Omega/a$, with $a$ the radius of the Earth), $\sin\phi \approx \phi$,
and $\cos\phi \approx 1$, is used to explore the
dynamics. Eastward- and westward-propagating disturbances
that are trapped about the equator (i.e., they
decay away from the equatorial region) are possible
solutions. Nondispersive waves that propagate eastward
with phase speed $c_{\mathrm{Kelvin}} = \sqrt{gH}$ (where $H$ is
an equivalent depth) are known as equatorial Kelvin
waves. Typical phase speeds in the atmosphere are
$c_{\mathrm{Kelvin}} \sim$ 20-80 m s⁻¹, and in the ocean
$c_{\mathrm{Kelvin}} \sim$ 0.5-3 m s⁻¹. Another class of possible solutions is
equatorial Rossby waves, whose dispersion relation is

$$\omega_{\mathrm{eqRossby}} = -\frac{\beta k}{k^2 + (2n + 1)\beta/c_{\mathrm{Kelvin}}},$$

where $n$ is a positive integer. For very long waves
(as the zonal wave number approaches zero), the
nondispersive phase speed is approximately
$c_{\mathrm{eqRossby}} = -c_{\mathrm{Kelvin}}/(2n + 1)$. Hence, these equatorial Rossby waves
move in the opposite direction to the Kelvin waves (i.e.,
they propagate westward) and at reduced speed. For
$n = 1$ the speed is about a third that of a Kelvin wave,
meaning it would take approximately six months to
cross the Pacific Ocean basin.
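
The crossing-time estimate can be checked with the following Python sketch. The Kelvin wave speed of 2.8 m s⁻¹ (within the oceanic range quoted above), the basin width of 15 000 km, and $\beta = 2.3 \times 10^{-11}$ m⁻¹ s⁻¹ are assumed values, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86400.0


def equatorial_rossby_frequency(k, n, beta, c_kelvin):
    """omega = -beta k / (k^2 + (2n + 1) beta / c_kelvin)."""
    return -beta * k / (k**2 + (2 * n + 1) * beta / c_kelvin)


def long_wave_rossby_speed(n, c_kelvin):
    """Long-wave limit of omega / k: c = -c_kelvin / (2n + 1)."""
    return -c_kelvin / (2 * n + 1)


def crossing_time_days(width, speed):
    """Time for a wave moving at |speed| to cross a basin of given width."""
    return width / abs(speed) / SECONDS_PER_DAY


C_KELVIN = 2.8        # assumed oceanic Kelvin wave speed (m/s)
BETA = 2.3e-11        # assumed equatorial beta (1/(m s))
BASIN_WIDTH = 15e6    # assumed width of the equatorial Pacific (m)

c_rossby = long_wave_rossby_speed(1, C_KELVIN)
kelvin_days = crossing_time_days(BASIN_WIDTH, C_KELVIN)
rossby_days = crossing_time_days(BASIN_WIDTH, c_rossby)

print(f"Kelvin wave: {kelvin_days:.0f} days eastward")
print(f"n = 1 Rossby wave: {rossby_days:.0f} days westward")
```

Under these assumptions the Kelvin wave crosses the basin in roughly two months and the $n = 1$ Rossby wave in roughly six, three times longer, in line with the estimate in the text.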

7 Ocean-Atmosphere Coupling 

The dynamics of the ocean and atmosphere in the trop- 
ics are highly coupled. On interannual timescales, the 
upper ocean responds to the past history of the wind 
stress, and the atmospheric circulation is largely deter- 
mined by the distribution of sea-surface temperatures 
(SSTs). 

The trade winds that converge on the equator sup- 
ply water vapor to maintain convection. The convec- 
tive heating produces large-scale midtropospheric tem- 
perature perturbations and associated surface and 
upper-level pressure perturbations, which maintain the
low-level flow. The zonal mean of the vertical mass flux
associated with this intertropical convergence zone 
constitutes the upward mass flux of the mean Hadley 
circulation. There are strong longitudinal variations 
associated with variations in the tropical SSTs mainly 
due to the effects of the wind-driven ocean currents. 
Overturning cells in the atmosphere along the equa- 
tor are associated with diabatic heating over equatorial 
Africa, Central and South America, and Indonesia. The 
dominant cell is known as the Walker circulation and 
is associated with low surface pressure in the western 
Pacific and high surface pressure in the eastern Pacific, 
resulting in a pressure gradient that drives mean sur- 
face easterlies (the Coriolis force is negligible in this 
region). The easterlies provide a moisture source for 
the convection in the western Pacific in addition to that 
provided by the high evaporation rates caused by the 
warm SSTs there. The atmospheric circulation is closed 
by descent over the cooler water to the east. 

Given a westward wind stress $\tau_x$ across the Pacific,
assumed to be independent of $x$, equation (7) can
be used to show that $\beta y v = -\rho_0^{-1}\,\partial\tau_x/\partial z$, assuming
the response is also independent of longitude and
using the equatorial $\beta$-plane approximation. A westward
wind stress across the Pacific therefore gives
rise to poleward flows either side of the equator in 
the oceanic Ekman layer, which by continuity drive 
upwelling near the equator. In addition, since the Pacific 
is bounded to the east and west, the westward wind 
stress results in the ocean thermocline being deeper in 
the west than the east. The cold deep water therefore 
upwells close to the surface in the east, cooling the SSTs 
there, whereas in the west the cold water does not reach 
the surface and the SSTs remain warm. 

The upwelled region is associated with a geostrophic
current in the direction of the winds, since in the limit
$y \to 0$, (7) gives $\beta u = -\rho_0^{-1}\,\partial^2 p/\partial y^2$ by l’Hôpital’s rule.
The deepening of the thermocline causes the sea sur- 
face to be higher in the west, assuming that flow below 
the thermocline is weak. There is therefore an eastward 
pressure gradient along the equator in the ocean sur- 
face layers to a depth of a few hundred meters. Away 
from the equator, below the surface, this is balanced by 
an equatorward geostrophic flow. At the equator, where 
$f = 0$, there is a subsurface current directly down
the pressure gradient (i.e., to the east): the Equatorial 
Counter Current. 
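
The l’Hôpital step used for the surface current can be spelled out as follows. This is a sketch under two assumptions that are not stated explicitly in the text: that the meridional component of equation (7) reduces to geostrophic balance, and that the pressure field is symmetric about the equator, so that the meridional pressure gradient vanishes at y = 0.

```latex
\begin{align*}
\beta y\,u &= -\frac{1}{\rho_0}\frac{\partial p}{\partial y}
  && \text{(geostrophic balance with } f \approx \beta y\text{)},\\
\beta u\big|_{y=0} &= \lim_{y\to 0}\left(-\frac{1}{\rho_0}\,
  \frac{\partial p/\partial y}{y}\right)
  = -\frac{1}{\rho_0}\,\frac{\partial^2 p}{\partial y^2}\bigg|_{y=0}
  && \text{(numerator and denominator both vanish at } y = 0\text{)}.
\end{align*}
```

A pressure minimum on the equator, for which the second derivative is positive, therefore gives u < 0, i.e., a westward current in the direction of the wind.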

The east-west pressure gradient across the Pacific
undergoes irregular interannual variations with a period
in the approximate range 2-7 years. This oscillation
in pressure, and its associated patterns of wind, temperature,
and precipitation, is called the Southern Oscillation,
an index of which (the Southern Oscillation Index
(SOI)) can be obtained by considering the pressure difference
between Tahiti in the central Pacific and Darwin,
Australia, in the western Pacific (see figure 6). The
negative phase of the SOI represents below-normal sea
level pressure (labeled “L” in the figure) at Tahiti and
above-normal sea level pressure (labeled “H”) at Darwin,
and vice versa for the positive phase. SSTs in
the eastern Pacific are negatively correlated with the
SOI, i.e., the phase with warm SSTs (known as El Niño)
coincides with a negative SOI, and vice versa for the
phase with cold SSTs (known as La Niña). The entire
coupled atmosphere-ocean response is known as the
El Niño-Southern Oscillation (ENSO).

Figure 6 Schematic of typical atmosphere/ocean conditions
during (a) El Niño (negative SOI) and (b) La Niña (positive
SOI) conditions (see text for details).

During an El Niño event, the region of warm SSTs
is shifted eastward from the Indonesian region and
with it the region of greatest convection and the associated
atmospheric circulation pattern (see figure 6). The
resulting adjustment of the Walker circulation leads to
a weakening of the easterly trade winds, reinforcing the
eastward shift of the warm SSTs. The sea-surface slope
diminishes, raising sea levels in the east Pacific while
lowering those in the west. Ekman-driven upwelling in
the ocean reduces, allowing SSTs to increase. The ocean
adjusts over the entire basin to a local anomaly in the 
forcing in the western Pacific through the excitation 
of internal waves in the upper ocean. We can explain 
the subsequent evolution using the properties of ocean 
waves derived in the previous section. An equatorial 
Kelvin wave propagates rapidly to the east, reinforcing 
the initial warm SST anomaly in a positive feedback, 
and equatorial Rossby waves propagate slowly (with
group velocity about a third that of the Kelvin wave) to the
west. As the Kelvin wave propagates east it deepens 
the thermocline, relaxing the basin-wide slope. When 
it hits the coast on the eastern side (after about two 
months), its energy feeds westward Rossby waves and 
poleward coastal Kelvin waves. On the western side, 
when the Rossby waves hit the coast, some energy feeds 
an eastward-propagating Kelvin wave, which raises the 
thermocline back toward its original location, reducing 
the initial SST anomaly and providing a negative feed- 
back. The propagation times of the waves mean that the 
negative feedback is lagged, resulting in a simple model 
of a delayed oscillator, which provides an explanation 
for the observed ENSO periodicity. 
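
The role of the lag can be illustrated numerically. The sketch below integrates one commonly used nondimensional form of a delayed oscillator, dT/dt = T - T^3 - alpha T(t - delay); this particular equation, the parameter values, and the function names are assumptions made for illustration and are not taken from the text. Here T stands for a scaled SST anomaly, the linear term for the local positive feedback, the cubic term for saturation, and the lagged term for the negative feedback delivered by the reflected waves.

```python
def delayed_oscillator(alpha=0.75, delay=6.0, t_end=200.0, dt=0.01, initial=0.1):
    """Forward-Euler integration of dT/dt = T - T**3 - alpha * T(t - delay).

    T is a nondimensional SST anomaly; the history for t <= 0 is held at
    the constant value `initial`. Returns the values at t = 0, dt, 2*dt, ...
    """
    n_delay = int(round(delay / dt))
    n_steps = int(round(t_end / dt))
    T = [initial] * (n_delay + 1)          # samples for t = -delay, ..., 0
    for _ in range(n_steps):
        now = T[-1]
        lagged = T[-1 - n_delay]           # value one delay time earlier
        T.append(now + dt * (now - now**3 - alpha * lagged))
    return T[n_delay:]


def count_sign_changes(series):
    """Number of times consecutive samples have opposite signs."""
    return sum(1 for a, b in zip(series, series[1:]) if a * b < 0)


with_feedback = delayed_oscillator()               # lagged negative feedback on
without_feedback = delayed_oscillator(alpha=0.0)   # feedback switched off
print("sign changes with delayed feedback:", count_sign_changes(with_feedback))
print("final state without feedback:", round(without_feedback[-1], 3))
```

Without the lagged term the anomaly simply grows and saturates. With it, the anomaly repeatedly changes sign, which is the qualitative behavior the delayed-oscillator picture is meant to capture.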

In the coupled system the ocean forces the atmo- 
spheric circulation (through the response to changed 
boundary conditions associated with the El Niño SST
fluctuations) and the atmosphere forces the oceanic 
behavior (through the response to changed wind stress 
distribution associated with the Southern Oscillation). 

8 Outlook 

Mathematics has played, and continues to play, a cen- 
tral role in developing an understanding of the Earth 
system. It has provided insight into many of the fun- 
damental processes that make up the weather and 
climate, and crucially it has also provided the nec- 
essary framework underpinning the numerical mod- 
els used in future prediction. Some of the key areas 
of research in which mathematicians have a vital role 
include developing the tools that allow observational 
data to be assimilated into forecast models, applying 
concepts from dynamical systems theory to the eval- 
uation of past climate and modeling of future vari- 
ability, exploiting a range of mathematical techniques 
to develop novel model parameterizations for subgrid 
processes, and using advanced statistical techniques 
for data analysis.

IV.31 Effective Medium Theories

Ross C. McPhedran 

1 Introduction 

Effective medium theories arise in a variety of forms 
and in a wide variety of contexts in which one wants 
to model properties of structured media in a simpli- 
fied way. The basic idea is to take into account the 
structure of the heterogeneous medium by calculating 
an equivalent homogeneous “effective medium” and to 
use the equivalent medium in further calculations. For 
example, one might be considering an optical material
composed of two different components, the optical
properties of each of which are known, and one might
want to put the two components together in a structured
material that behaves like a homogeneous material
with properties differing from those of each component.
Several of the effective medium theories that
have been devised are of use in this particular example,
which is one of the first technological applications
of such theories. (In fact, for several thousand years metals
have been put into glass melts in ways designed to give 
particular optical coloration effects.) 

A wide class of effective medium treatments has
proved useful in the study of the transport properties
of composite materials. The basic governing relation
is Laplace’s equation, and the solution is expressed in
terms of a distribution of a field quantity, with its associated
flux. Using the language of one such property,
electrical conductivity, the field quantity is the electric
field and its flux is the current. The materials constituting
the composite are specified by their conductivity,
and the effective medium equations give the effective
conductivity of the composite. The basic equations are

$$\mathbf{j}(\mathbf{x}) = \sigma(\mathbf{x})\,\mathbf{e}(\mathbf{x}), \qquad \nabla\cdot\mathbf{j} = 0, \qquad \nabla\times\mathbf{e} = 0. \tag{1}$$

These vector fields may be real or complex, depending
on the particular transport problem under consideration.
If the electric field $\mathbf{e}$ is written as the gradient of
a potential function, the last equation in (1) is satisfied
identically and the second reduces to Laplace’s equation
in any region where $\sigma$ is constant.

Table 1 gives seven instances of this problem, with 
the equivalent terms for conductivity being given in the 
last column. 

The seven examples of transport coefficients are all 
mathematically equivalent. However, different practi- 
cal considerations result in particular effective medium 
approaches being better suited to some contexts than 
others. In particular, in optical applications of compos- 
ite media involving metals mixed in with dielectrics, the 
electric permittivity of the metal is complex, as is the 
effective permittivity of the metal-dielectric composite. 
This may make it more difficult to establish an accurate 
effective permittivity formula in this case. 

2 Conditions for Effective Medium Theories 

Effective medium theories require the structured medium
being modeled to satisfy a number of conditions.
The structured
medium has to be divided into regions of the com- 
ponent materials that contain sufficient numbers of 
atoms or molecules for bulk properties to be used for 
those regions. It must also include a sufficiently large 
number of regions to permit construction of effective 
medium formulas that do not need to take into account 
finite-size effects. Typically, the construction of equiv- 
alent homogeneous regions will occur on a scale much 
smaller than the size of the sample of the structured 
medium but much larger than that of the regions of its 
component parts or of its separated parts. (An impor- 
tant class of composite media has a background con- 
tinuous phase, sometimes called the matrix phase, into 
which separated components of other materials are 
inserted.) 

The construction of appropriate effective medium 
formulas for composite structures is called homoge- 
nization. The process of homogenization has to take 
into account the geometry of the composite: whether it 
is two dimensional or three dimensional; whether it is 
isotropic or anisotropic; whether the different materials
of which it is composed have similar spatial distributions,
or whether there is a matrix phase into which
the regions of other materials are inserted. The requirements
on the effective medium formula in the problem
under investigation also need to be understood. If
a high-accuracy model is needed, then precise details 
of the geometry of the system will generally need to be 
incorporated into the theory. On the other hand, it will
sometimes be enough to have upper and lower bounds 
on the transport coefficient, in which case much more 
general and simpler procedures will be adequate. 


3 Effective Medium Formulas of Maxwell Garnett Type


The early history of formulas based on the effective 
electric permittivity of systems of particles character- 
ized by the dipole moment they develop when placed 
in an external field is complicated. It spans the period 
from 1837 (when Faraday proposed a model for dielec- 
tric materials based on metallic spheres placed in an 
insulator) to 1904 (when J. C. Maxwell Garnett used for- 
mulas of this type in design studies on colored glass). 
Essentially similar formulas are given various names, 
depending on the first investigator to use them in a 
particular context, and the approximations inherent 
in them are sometimes glossed over. One example of 
the complexity of this history is that the formula put 
forward by Maxwell Garnett was essentially equivalent 
to one developed previously by his godfather, J. C. 
Maxwell, so the formula is often, justifiably, written 
Maxwell-Garnett. 

What is generally called the Clausius-Mossotti formula
gives the effective electric permittivity ($\varepsilon_{\text{eff}}$) of a
set of polarizable inclusions placed in a three-dimensional
background material of electric permittivity $\varepsilon_b$:

$$\varepsilon_{\text{eff}} = \varepsilon_b\left(1 + \frac{N\alpha}{1 - N\alpha/3}\right),$$

where $N$ is the number of inclusions per unit volume,
and $\alpha$ is their polarizability. To arrive at the Maxwell
Garnett formula, we replace $\alpha$ by the polarizability of
a sphere of electric permittivity $\varepsilon_p$ and radius $a$:

$$\alpha = 4\pi a^3\left(\frac{\varepsilon_p - \varepsilon_b}{\varepsilon_p + 2\varepsilon_b}\right).$$

Introducing the volume fraction $f = 4\pi N a^3/3$, the
Maxwell Garnett result is

$$\varepsilon_{\text{eff}} = \varepsilon_b\left[1 + \frac{3f(\varepsilon_p - \varepsilon_b)}{3\varepsilon_b + (1 - f)(\varepsilon_p - \varepsilon_b)}\right]. \tag{2}$$

Table 1 Equivalent physical transport problems governed by Laplace's equation.

Problem                j                          e                                     σ
Electrical conduction  Electrical current j       Electric field e                      Electric conductivity σ
Dielectrics            Displacement field d       Electric field e                      Electric permittivity ε
Magnetism              Magnetic induction b       Magnetic field h                      Magnetic permeability μ
Thermal conduction     Heat current q             Temperature gradient -∇T              Thermal conductivity κ
Diffusion              Particle current           Concentration gradient -∇c            Diffusivity D
Flow in porous media   Weighted velocity          Pressure gradient ∇P                  Fluid permeability k
Antiplane elasticity   Stress vector (τ₁₃, τ₂₃)   Vertical displacement gradient ∇u₃    Shear matrix μ

The Maxwell Garnett formula is asymmetric in the
variables $\varepsilon_p$ and $\varepsilon_b$, since the former corresponds to
isolated particles and the latter to the continuous background
phase. It is also a purely dipole formula, in that
field expansions near the particles are limited to single
dipole terms. As the volume fraction increases, the
isolated particles come closer to each other, interact
more strongly, and the dipole approximation becomes
increasingly inaccurate. A consequence of it being
a dipole formula is that the Maxwell Garnett equation
has a single pole, located where the following equation
for the permittivity ratio between particles and
background material is satisfied:

$$\frac{\varepsilon_p}{\varepsilon_b} = -\frac{2 + f}{1 - f}. \tag{3}$$

This equation specifies what is termed the plasmon resonance
and gives the condition for the optical effects
of the particles to be particularly strong; it is never satisfied
exactly by physical systems, but it can be approximately
satisfied. The other interesting case is when
$\varepsilon_{\text{eff}} = 0$; this occurs when

$$\frac{\varepsilon_p}{\varepsilon_b} = -\frac{2(1 - f)}{1 + 2f}.$$

In this case, the presence of the particles cannot be
detected by measurements of $\varepsilon_{\text{eff}}$. The plasmon resonance
occurs when the permittivity ratio is real, negative,
and below $-2$, while the zero of $\varepsilon_{\text{eff}}$ for the Maxwell
Garnett equation occurs when the permittivity ratio lies
between $-2$ and $0$.
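
The locations of the pole and the zero can be checked numerically. The sketch below evaluates the three-dimensional Maxwell Garnett expression, in which the host permittivity is multiplied by one plus 3f(eps_p - eps_b) divided by 3 eps_b + (1 - f)(eps_p - eps_b). The host permittivity of 2.25 is an assumed, purely illustrative value, and the function names are hypothetical.

```python
def maxwell_garnett(eps_b, eps_p, f):
    """Effective permittivity of spheres (eps_p, volume fraction f) in a host eps_b."""
    delta = eps_p - eps_b
    return eps_b * (1 + 3 * f * delta / (3 * eps_b + (1 - f) * delta))


def pole_ratio(f):
    """Permittivity ratio eps_p/eps_b at which the denominator vanishes."""
    return -(2 + f) / (1 - f)


def zero_ratio(f):
    """Permittivity ratio eps_p/eps_b at which the effective permittivity is zero."""
    return -2 * (1 - f) / (1 + 2 * f)


f = 0.10
eps_b = 2.25  # assumed host permittivity, for illustration only

# A small imaginary part keeps the evaluation just off the pole.
near_pole = maxwell_garnett(eps_b, (pole_ratio(f) + 0.01j) * eps_b, f)
at_zero = maxwell_garnett(eps_b, zero_ratio(f) * eps_b, f)

print("pole ratio:", round(pole_ratio(f), 2))
print("zero ratio:", round(zero_ratio(f), 2))
print("|eps_eff| near the pole:", abs(near_pole))
print("|eps_eff| at the zero:", abs(at_zero))
```

For f = 0.10 the pole ratio lies just below -2 and the zero ratio lies between -2 and 0, in line with the statement above; both move toward -2 as f tends to zero.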

As an example of the optical application of the
Maxwell Garnett formula, in figure 1 we show the effective
electric permittivity, as a function of wavelength,
of spherical silver particles in a silica matrix, with the
silver occupying a small volume fraction (10%) of the
composite. The strong permittivity resonance of the
composite would occur for an ideal metal when $\varepsilon_p/\varepsilon_b =
-2.33$ for $f = 0.10$; for the composite shown, the resonant
condition (3) is most closely satisfied when the
wavelength is around 0.40 μm, so the composite would
have a strong reflectance in the violet spectral region.
The closest approximation to $\varepsilon_{\text{eff}} = 0$ occurs just to the
short-wavelength side of this peak.



Figure 1 The real (solid) and imaginary (dashed) parts of
$\varepsilon_{\text{eff}}$, as a function of wavelength, for silver spheres occupying
10% by volume of a composite, in a background material
of silica. The electric permittivity of silver is a strong function
of the wavelength, and experimental data is used both
for its permittivity and for that of silica.


Formulas can be developed that take into account 
the effect of higher-order multipoles, not just dipoles, 
assuming, for example, a specific regular arrangement 
of spherical particles. They generally use a method due 
to Lord Rayleigh published in 1892. Such formulas take 
into account the arrangement of the particles using lat- 
tice sums, one for each relevant spherical harmonic 
term. They also have one resonance for each multipole 
term, with the resonances tending to cluster around the 
permittivity ratio of -2. 

The two-dimensional case of arrays of cylindrical par- 
ticles placed in a background material is also of impor- 
tance, particularly in the study of photonic crystals and 
metamaterials. The Maxwell Garnett formula based on
circular cylinders that is equivalent to equation (2) is

$$\varepsilon_{\text{eff}}(\varepsilon_b, \varepsilon_p; f) = \varepsilon_b\left[\frac{\varepsilon_b + \varepsilon_p - f(\varepsilon_b - \varepsilon_p)}{\varepsilon_b + \varepsilon_p + f(\varepsilon_b - \varepsilon_p)}\right].$$

This exactly satisfies an important duality relationship
due to J. B. Keller, true for cylinders of arbitrary cross
section:

$$\varepsilon_{\text{eff}}(\varepsilon_b, \varepsilon_p; f)\,\varepsilon_{\text{eff}}(\varepsilon_p, \varepsilon_b; f) = \varepsilon_b\varepsilon_p.$$

The duality relationship does not have an equivalent in
three dimensions. It has important consequences for 
the underlying structure of two-dimensional transport 
problems, since it provides a link between zeros and 
poles of the effective transport coefficient. 
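
The duality relationship, and the link it creates between poles and zeros, can be verified directly for the circular-cylinder formula, in which the host permittivity is multiplied by the ratio of eps_b + eps_p - f(eps_b - eps_p) to eps_b + eps_p + f(eps_b - eps_p). The following is a minimal sketch; the permittivity values are arbitrary illustrative choices and the function name is hypothetical.

```python
def maxwell_garnett_2d(eps_b, eps_p, f):
    """Maxwell Garnett formula for circular cylinders (eps_p) in a host (eps_b)."""
    numerator = eps_b + eps_p - f * (eps_b - eps_p)
    denominator = eps_b + eps_p + f * (eps_b - eps_p)
    return eps_b * numerator / denominator


# Illustrative (assumed) values: a lossy metal-like inclusion in a dielectric host.
eps_b, eps_p, f = 2.1 + 0.0j, -15.0 + 0.5j, 0.25

direct = maxwell_garnett_2d(eps_b, eps_p, f)
swapped = maxwell_garnett_2d(eps_p, eps_b, f)   # the two materials interchanged
print("duality defect:", abs(direct * swapped - eps_b * eps_p))

# The pole of the direct formula sits at eps_p/eps_b = -(1 + f)/(1 - f);
# at the same ratio the formula with the materials interchanged has a zero.
ratio_at_pole = -(1 + f) / (1 - f)
print("interchanged formula at the pole ratio:",
      abs(maxwell_garnett_2d(ratio_at_pole * 1.0, 1.0, f)))
```

Both printed quantities vanish to rounding error, the second because the numerator of the interchanged formula coincides with the denominator of the direct one.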

It should be noted that the Maxwell Garnett formula
(2) is exact for a specific geometry, given by the Hashin-Shtrikman
construction. The idea is to calculate the
value ($\varepsilon_0$) of a background permittivity such that, if
the electric field in this medium is uniform, a sphere
of core permittivity $\varepsilon_p$ and shell permittivity $\varepsilon_b$ can be
inserted into the medium of permittivity $\varepsilon_0$ without disturbing
the uniform field there. The core and shell radii
of the coated sphere must be such that the volume of
the core divided by that of the shell is $f/(1 - f)$. The
value so calculated for $\varepsilon_0$ turns out to be exactly that
given by (2). One can then continue to insert copies of
such coated spheres repeatedly into the background,
scaling them down when necessary so that all the volume
corresponding to the material of permittivity $\varepsilon_0$
eventually disappears. All stages of the construction
preserve the Maxwell Garnett (or Hashin-Shtrikman)
effective electric permittivity.

4 Effective Medium Formulas of Bruggeman Type

As we have remarked, the Maxwell Garnett formula, 
and related expressions, take one material in a com- 
posite to extend infinitely while the second material 
takes the form of separated inclusions. In 1935 Brugge- 
man constructed an alternative form of theory that 
placed both materials on the same footing, i.e., both 
were assumed to have the same topology. The resul- 
tant formula is called the Bruggeman symmetric effec- 
tive medium formula; it may also be called the coherent 
potential approximation, after a random alloy theory 
from solid-state physics. 

We consider an aggregate structure composed of
grains filling all space. The grains are of type 1 (volume
fraction $f_1$, electric permittivity $\varepsilon_1$) or type 2 (volume
fraction $f_2 = 1 - f_1$, electric permittivity $\varepsilon_2$). The
effective permittivity of the composite will be denoted
$\varepsilon_B$. To obtain a formula for it, we pick a representative
sample of grains occupying a small volume fraction $\delta$
of the granular composite. The grains in the sample
are assumed to be well separated from each other, and
they must be chosen so that the sample has the correct
volume fractions of each type.

The derivation uses the self-consistency assumption
that the effective permittivity of the composite remains
equal to $\varepsilon_B$ to first order in $\delta$ when we replace the
medium surrounding the representative grains by a
homogeneous effective medium with electric permittivity
$\varepsilon_B$. After this replacement has been made, we
treat the representative grains as a dilute suspension
of spherical grains embedded in the medium specified
by $\varepsilon_B$. Correct to first order, the effective permittivity
of the suspension is

$$\varepsilon_B\left[1 + \delta f_1\,\frac{3(\varepsilon_1 - \varepsilon_B)}{\varepsilon_1 + 2\varepsilon_B} + \delta f_2\,\frac{3(\varepsilon_2 - \varepsilon_B)}{\varepsilon_2 + 2\varepsilon_B}\right],$$

giving the equation

$$f_1\,\frac{3(\varepsilon_1 - \varepsilon_B)}{\varepsilon_1 + 2\varepsilon_B} + f_2\,\frac{3(\varepsilon_2 - \varepsilon_B)}{\varepsilon_2 + 2\varepsilon_B} = 0.$$


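The result is a formula that is symmetric in the properties
of the two types of grains:

$$\varepsilon_B = \tfrac{1}{4}\left[\gamma \pm (\gamma^2 + 8\varepsilon_1\varepsilon_2)^{1/2}\right], \qquad \gamma = (3f_1 - 1)\varepsilon_1 + (3f_2 - 1)\varepsilon_2. \tag{4}$$

The choice of the plus or minus sign in (4) should be
made so that the imaginary part of $\varepsilon_B$ has the same sign
as the common sign of the imaginary parts of $\varepsilon_1$ and $\varepsilon_2$.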

The Bruggeman expression (4) has several features
that distinguish it from the Maxwell Garnett type of
theory. As remarked, it has no preferred topology for
the two media filling the composite. Rather than a resonant
peak, it has a branch cut, arising when the switch
is made between the two alternatives in (4). It also has
a percolation threshold, which is manifest if we take
$\varepsilon_2 = 0$ and $\varepsilon_1$ real. Imposing the criterion that $\varepsilon_B$ be
nonnegative if $\varepsilon_1 > 0$, the result for this particular case
is

$$\varepsilon_B = \begin{cases} (3f_1 - 1)\varepsilon_1/2 & \text{when } f_1 > \tfrac{1}{3},\\ 0 & \text{when } f_1 < \tfrac{1}{3}. \end{cases}$$

This shows that in this case there is a percolation
threshold at $f_1 = \tfrac{1}{3}$.
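
The root selection and the percolation threshold can be explored with a few lines of code. The sketch below solves the quadratic for the Bruggeman permittivity using the combination (3 f1 - 1) eps1 + (3 f2 - 1) eps2 that follows from clearing denominators in the self-consistency equation. The sign rule implemented here assumes the convention that absorbing media have nonnegative imaginary part; that convention, the sample permittivities, and the function names are assumptions made for illustration.

```python
import cmath


def bruggeman(eps1, eps2, f1):
    """Symmetric Bruggeman effective permittivity for spherical grains."""
    f2 = 1.0 - f1
    gamma = (3 * f1 - 1) * eps1 + (3 * f2 - 1) * eps2
    root = cmath.sqrt(gamma * gamma + 8 * eps1 * eps2)
    candidates = ((gamma + root) / 4, (gamma - root) / 4)
    # Assumed sign rule: prefer Im >= 0 (absorbing media with Im(eps) >= 0);
    # for ties (purely real input) take the larger real part.
    return max(candidates, key=lambda e: (e.imag, e.real))


def residual(eps_b, eps1, eps2, f1):
    """Left-hand side of the self-consistency equation; zero at a solution."""
    return (f1 * 3 * (eps1 - eps_b) / (eps1 + 2 * eps_b)
            + (1 - f1) * 3 * (eps2 - eps_b) / (eps2 + 2 * eps_b))


# Percolation case: grains of type 2 have zero permittivity, type 1 has eps1 = 1.
for f1 in (0.2, 0.5, 1.0):
    print(f1, bruggeman(1.0, 0.0, f1).real)

# A lossy metal-like phase in a dielectric (illustrative values only).
eps_metal, eps_dielectric = -10.0 + 1.0j, 2.25
eps_mix = bruggeman(eps_metal, eps_dielectric, 0.3)
print(eps_mix, abs(residual(eps_mix, eps_metal, eps_dielectric, 0.3)))
```

In the percolation case the result is zero below the threshold at one third and grows linearly above it; in the lossy case the selected root has positive imaginary part and satisfies the self-consistency equation to rounding error.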

In figures 2 and 3 we show the results of the symmet- 
ric Bruggeman theory for the silver-silica composite of 
figure 1, but for two volume fractions: one below the 
percolation threshold and the other above it. The per- 
colation threshold is not sharply defined in the case 
of complex electric permittivities, but there are clear 
changes that occur as the volume fraction of silver 
increases. For small metal volume fractions, the grains 
do not get close enough together to create the sort 
of current paths needed to form a strong imaginary 
part of the effective permittivity. Above the percola- 
tion threshold, the imaginary part gradually strength- 
ens, dominating the real part above a volume fraction 
of 60%.

Figure 2 The real (solid) and imaginary (dashed) parts of
$\varepsilon_{\text{eff}}$, according to the Bruggeman formula, as a function
of wavelength for silver occupying 30% by volume of a
composite, with 70% of the composite volume being silica.
Experimental data is used both for the permittivity of silver
and for that of silica.



Figure 3 As for figure 2, but now with a silver volume
fraction of 60%, above the percolation threshold.


5 The Bergman-Milton Bounds 


Bounds on the effective transport coefficients of com- 
posite materials are very useful in checking theoretical 
results and deducing information from experimental 
measurements. They can be particularly useful if they 
are sharp, i.e., there is at least one physical geometry
for which the bounds are attained. In the following brief
discussion of bounds we will use the notation $\sigma$ for the
conductivity problem, with $\sigma_{\text{eff}}$ being the (real-valued)
effective conductivity of a composite made of materials
with two real conductivities $\sigma_1$ and $\sigma_2$ and volume fractions
$f_1$ and $f_2$. We will use the notation $\varepsilon_{\text{eff}}$ for bounds
appropriate to the case where $\varepsilon_1$ and $\varepsilon_2$ are allowed to
be complex.

Taking $\sigma_1 > \sigma_2$, the Hashin-Shtrikman lower bound
for three-dimensional composites is

$$\sigma_{\text{eff}} \geq \sigma_2\left[1 + \frac{3f_1(\sigma_1 - \sigma_2)}{3\sigma_2 + f_2(\sigma_1 - \sigma_2)}\right]. \tag{5}$$

This bound is sharp, being attained by an assemblage
of coated spheres with phase 1 as the core and phase 2
as the coating. The corresponding upper bound is

$$\sigma_{\text{eff}} \leq \sigma_1\left[1 - \frac{3f_2(\sigma_1 - \sigma_2)}{3\sigma_1 - f_1(\sigma_1 - \sigma_2)}\right]. \tag{6}$$

This bound is attained by an assemblage of coated
spheres with phase 2 as the core and phase 1 as the
coating.
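
As a quick numerical illustration, the two Hashin-Shtrikman bounds can be compared with the elementary harmonic and arithmetic means of the conductivities, which they must lie between. This is a minimal sketch for real conductivities with sigma1 > sigma2 > 0; the sample values and the function name are illustrative assumptions.

```python
def hashin_shtrikman_bounds(sigma1, sigma2, f1):
    """Lower and upper Hashin-Shtrikman bounds for a 3D two-phase composite.

    Requires real conductivities with sigma1 > sigma2 > 0; f1 is the
    volume fraction of phase 1 and f2 = 1 - f1 that of phase 2.
    """
    if not sigma1 > sigma2 > 0:
        raise ValueError("expected sigma1 > sigma2 > 0")
    f2 = 1.0 - f1
    diff = sigma1 - sigma2
    lower = sigma2 * (1 + 3 * f1 * diff / (3 * sigma2 + f2 * diff))
    upper = sigma1 * (1 - 3 * f2 * diff / (3 * sigma1 - f1 * diff))
    return lower, upper


sigma1, sigma2, f1 = 10.0, 1.0, 0.3   # illustrative values
lower, upper = hashin_shtrikman_bounds(sigma1, sigma2, f1)
harmonic = 1.0 / (f1 / sigma1 + (1 - f1) / sigma2)
arithmetic = f1 * sigma1 + (1 - f1) * sigma2

print(f"harmonic mean   {harmonic:.3f}")
print(f"HS lower bound  {lower:.3f}")
print(f"HS upper bound  {upper:.3f}")
print(f"arithmetic mean {arithmetic:.3f}")
```

The four printed values increase from top to bottom, and the two bounds coincide with the pure-phase conductivities when f1 is 0 or 1.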

For the complex case, independent work by Bergman
and Milton enabled the derivation of tight bounds on
the effective permittivity $\varepsilon_{\text{eff}}$ of two-phase composites
with known volume fractions. Instead of inequalities
such as those in (5) and (6), the complex value of $\varepsilon_{\text{eff}}$
is constrained to lie in a specific region in the complex
plane. The region is defined by straight lines and arcs of
circles. If volume fractions are unknown, the region is
defined to lie inside the area bounded by a straight line
(the arithmetic mean of $\varepsilon_1$ and $\varepsilon_2$ as $f_1$ and $f_2 = 1 - f_1$
vary) and a circle (defined by the harmonic average, i.e.,
the reciprocal of the arithmetic mean of the quantities
$1/\varepsilon_1$ and $1/\varepsilon_2$ as $f_1$ and $f_2 = 1 - f_1$ vary). Knowing the
volume fractions $f_1$ and $f_2$, one can constrain $\varepsilon_{\text{eff}}$ to lie
in a region between two arcs of circles defined by common
endpoints and one extra point for each. The common
endpoints correspond to the points on the outer
region boundaries for the specified values of $f_1$ and $f_2$.
The extra points correspond to the Hashin-Shtrikman
coated-sphere assemblages with phase 1 and phase 2
as the coating. If it is known that the composite is
isotropic, as well as having the specified volume fractions,
then a still-smaller region in the complex plane
can be defined.
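
The two outer boundary curves for unknown volume fractions are easy to generate. The sketch below traces the volume-weighted arithmetic and harmonic means and checks that the harmonic-mean points lie on the circle through the two component permittivities and the origin (the image of a straight line in the plane of reciprocal permittivity). The permittivity values and the function names are illustrative assumptions.

```python
def arithmetic_mean(eps1, eps2, f1):
    """Volume-weighted arithmetic mean: a point on the straight-line boundary."""
    return f1 * eps1 + (1 - f1) * eps2


def harmonic_mean(eps1, eps2, f1):
    """Volume-weighted harmonic mean: a point on the circular-arc boundary."""
    return 1.0 / (f1 / eps1 + (1 - f1) / eps2)


def circle_through_origin(a, b):
    """Center and radius of the circle through 0, a and b (complex numbers).

    Undefined when a/b is real, since 0, a and b are then collinear.
    """
    center = (a * b * (a.conjugate() - b.conjugate())
              / (a.conjugate() * b - a * b.conjugate()))
    return center, abs(center)


eps1, eps2 = -10.0 + 1.0j, 2.25 + 0.0j   # illustrative metal-like and dielectric phases
center, radius = circle_through_origin(eps1, eps2)

for f1 in (0.0, 0.25, 0.5, 0.75, 1.0):
    line_point = arithmetic_mean(eps1, eps2, f1)
    arc_point = harmonic_mean(eps1, eps2, f1)
    print(f1, line_point, arc_point, abs(abs(arc_point - center) - radius))
```

The last printed column stays at rounding-error level for every volume fraction, and the two curves meet at the pure-phase values when f1 is 0 or 1.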

This construction of bounds can be viewed as a recur- 
sive process. At each stage one has a region in the com- 
plex plane. As an extra piece of information about the 
composite is specified, two points on the boundary of 
the old region consistent with the new information are 
selected, and a smaller region is constructed with these 
new points on its boundary. It has been shown that as 
more and more information is specified, the bounds 
converge to a specific point: the exact complex per- 
mittivity for the (by now) uniquely specified composite 
material. The area in the complex plane between the 
bounds shrinks rapidly as more information is added 
if $\varepsilon_1$ and $\varepsilon_2$ are well away from the negative real axis.
However, if their ratio is close to the negative real axis, 
the area shrinks much more slowly as information is 
added. 

The Bergman-Milton bounds can also be used in 
the inverse problem for composites: given measured 
data on effective transport properties of a composite
medium, what can be said about the medium’s
structure? In particular, one would like to be able to
infer geometrical information, principally the volume 
fractions of the phases making up the composite. An 
important example of such an inverse problem is in 
geophysics and resource extraction, where it is of great 
value to be able to infer, say, the volume fraction of oil 
in a fluid-permeated sandstone structure using bore- 
hole electrical tomography. Another example in envi- 
ronmental science concerns the estimation of the ratio 
of brine to pure ice in sea ice [V.17]; this is of value 
when trying to deduce the melting-freezing history of 
sea ice over several seasons.

IV.32 Mechanics of Solids

L. B. Freund 


1 Introduction 

The builders of ancient machines and structures must 
have been concerned with the strength and durability 
of their creations. Presumably, the standards applied 
had evolved over time through trial and error. 

Early evidence of systematic scientific study of the 
strength of materials appears in the work of Leonardo 
da Vinci (1452-1519), who experimented with circu- 
lar rods of various diameters in order to learn about 
the features that influence tensile breaking strength. A
more systematic study of rods in tension, as well as
beams in bending, was undertaken by Galileo (1564- 
1642). Among his conclusions were that the tensile 
breaking force of a uniform rod does not depend on 
its length but does depend proportionally on its cross- 
sectional area and that rupture of a cantilever beam 
loaded by a weight at its free end usually initiates on 
the top side of the beam near the support. The idea of 
the force applied to a solid object being related in a 
characteristic way to the deformation induced by that 
force was first stated for the linear elastic response of 
a watch spring by Hooke (1635-1703) in the form of a 
Latin anagram that, when unscrambled, reads “As the 
extension, so the force.” 

With the publication of the Principia by Newton 
(1642-1727), an approach emerged whereby the study 
of physical phenomena began with a statement of phys- 
ical principles, with a view toward deducing the behav- 
ior of the physical world on the basis of those princi- 
ples by mathematical means. Leonhard Euler (1707-83) 
and Daniel Bernoulli (1700-1782) developed the theory 
of elastic beams in bending that bears their names; 
Coulomb (1736-1806) put that theory into the form of 
a one-dimensional structural theory that is in common
use today. Eventually, it was the great post-Renaissance 
engineer and mathematician Cauchy (1789-1857) who 
unified what was known at the time about stress, strain, 
and linear elastic material behavior to devise a full 
three-dimensional theory of the mechanics of a solid 
continuum. In the 250 years since, a long list of types of
material behavior has evolved: theories for describing
geometrically nonlinear deformation have been developed,
material inertial effects have been incorporated
into the quantitative description of the response of 
a solid to various types of applied loads, and spe- 
cial theories have been developed for solid configu- 
rations that are very thin in one or two dimensions, 
as for beams, plates, or shells. These contributions 
have been accompanied by the development of an 
array of mathematical and computational methodolo- 
gies for solving boundary-value problems based on par- 
tial differential equations believed to describe these 
phenomena. 

During the twentieth century, the mechanics of solids 
grew into an applied science that now underlies much 
of the practice of mechanical engineering and civil 
engineering. It is also widely applied within mate- 
rials science, planetary geophysics, and biophysics, 
for example. The area includes a large and active
experimental component, as well as a major computational
component.

1.1 A Solid Material 

The identification of matter filling some portion of 
space as a solid is based on the recognition of a con- 
stitutive property of that material; namely, any mate- 
rial that can resist imposed forces of a measurable 
magnitude that tend to shear the material, without evi- 
dence of ongoing deformation, is a solid. Otherwise, 
the material is a fluid. The separation of materials into 
these categories is not perfect, in that some materials 
behave as either fluid or solid depending on the rate of 
deformation and/or on the temperature. 

A solid body is some solid matter filling a region 
of space. The constituent material has certain char- 
acteristics relevant to mechanical phenomena: mass 
per unit volume, resistance to deformation, and ulti- 
mate strength, for example. Furthermore, any piece 
of the solid that is removed from the whole exhibits 
these same material characteristics, and these proper- 
ties remain unaltered as the solid is repeatedly divided 
into indefinitely smaller pieces. Although this contin- 
uum point of view overlooks the discrete small-scale 
structure of materials, it facilitates modeling and analy- 
sis at a scale insensitive to that structure. For example, 
suppose that a small volume of a material, say v(p), 
bounded by a closed surface surrounding the particu- 
lar point p in the solid has total mass m(v). Then the 
mass density at point p is 

$$\rho(p) = \lim_{v \to 0} \frac{m(v)}{v(p)}. \tag{1}$$

In general, this property of indefinite divisibility makes 
it possible to define both material properties and 
mechanical fields as continuous functions of position 
throughout a solid. The identification of the differen- 
tial equations governing these fields, together with the 
solution of these equations subject to boundary con- 
ditions representing external influences, is the essence 
of solid mechanics. 

1.2 A Conceptual Map 

Before launching into the details of the subject, a con- 
ceptual map of solid mechanics is provided as an 
aid to understanding the relationships among various 
ideas that are central to the subject (see figure 1). 
Examination of the subject begins with the consider- 
ation of three fundamental ideas. The first of these is 
labeled displacement, a topic that encompasses rules
for locating material particles in the Euclidean space in
which we live and for representing changes in the positions
of these particles. The next fundamental idea is
that of material strain, which is focused on the way
in which the deformation of a material is described
quantitatively in terms of the change in the length
of material line elements that join adjacent material
particles or the change in angle between two material
line elements emanating from the same material
particle.

Figure 1 A conceptual map of solid mechanics.

The third principal idea, stress, brings mechanical 
action into the picture. Generally, the concept of stress 
provides the basis for describing the mechanical force 
exerted by the material on one side of an interior sur- 
face on the material on the other side of that surface. 

At the displacement-strain intersection, determining 
strain for any prescribed displacement distribution and 
the more challenging issue of determining displace- 
ment for a specified strain distribution are addressed. 
The stress-strain intersection is where characterization 
of the mechanical behavior of materials enters; many 
types of stress-strain relationships can be postulated, 
and experiment is always the final arbiter of ideas in 
this domain. Finally, the physical principle of conser- 
vation of momentum comes into play at the stress- 
displacement intersection. The concepts incorporated 
in these three intersection zones are the essential ingre- 
dients in the solution of any mathematical problem in 
the mechanics of solids, as suggested by their intersection
at the center of the conceptual map. The next task
is to describe the means by which these concepts are
given useful mathematical forms. 

2 Fundamental Concepts 

As it deforms, a solid body occupies a continuous 
sequence of configurations. One of these configura- 
tions is adopted as a reference configuration for pur- 
poses of description; this is usually the configuration 
at the start of a deformation process of interest but the 
choice is arbitrary. 

The positions of material points, changes in the positions
of material points, and points in space are represented
by vectors. For this purpose, a set of orthonormal
basis vectors $\mathbf{e}_k$, $k = 1, 2, 3$, is introduced.

Any mathematical quantity represented by a symbol 
in boldface denotes a vector or tensor, with the pre- 
cise meaning implied by context. As an aid in calcula- 
tion, index notation is commonly adopted for compo- 
nents of vectors and tensors. The implied range of any 
index is 1-3, unless a note to the contrary is included; 
repeated indices in an expression indicate an inner 
product, with summation over their range implied. The 
rules governing the use of index notation are clearly 
summarized in Bower’s Applied Mechanics of Solids. 

The terms material point, material line, and material 
surface are used often in the sections that follow. It 
should be understood from the outset that these terms 
are to be interpreted literally. For example, a material 
line, once defined, always coincides with the same set 
of material points in the course of deformation. 

2.1 Displacement 

Every material point can be identified with the point in 
space with which it coincides in the reference configu- 
ration, and the position of the generic point is denoted 
by the vector 

\[ \mathbf{x} = x_1 \mathbf{e}_1 + x_2 \mathbf{e}_2 + x_3 \mathbf{e}_3 = x_k \mathbf{e}_k. \tag{2} \]

A set of basis vectors is shown in figure 2, which depicts
a simple deformation. The reference configuration of
the solid is a cube of edge length $\ell_0$, and the point at
one of the vertices of the cube is identified by position
$\mathbf{x}$. In (2), the repeated index is understood to represent
summation over the implied range of that index; this 
summation convention is followed in all subsequent 
mathematical expressions. 

The shaded object in figure 2 represents a possible
deformed configuration of the solid body that had
occupied the cubic region. The figure suggests that the
solid has been stretched in the 3-direction and has been
contracted in the 1- and 2-directions. Of particular note
is the position of the corner point that was identified
as $\mathbf{x}$ in the reference configuration; this particular point
has displaced to the position $\boldsymbol{\xi} = \xi_i \mathbf{e}_i$ in the deformed
configuration. The displacement of this point is the vector
difference between these positions. If the displacement
vector is denoted by $\mathbf{u}$, then the displacement of
the corner point is

\[ \mathbf{u} = \boldsymbol{\xi} - \mathbf{x} \iff u_i = \xi_i - x_i. \tag{3} \]

Figure 2 A deformed configuration of a solid body
that was a cube in its reference configuration.

The displacement of any point in the reference
configuration of the solid, uniquely identified by its
coordinates $x_1$, $x_2$, $x_3$ with respect to the underlying
reference frame, can be determined similarly. When
defined in this way, the quantity $\mathbf{u}(x_1, x_2, x_3) = \mathbf{u}(\mathbf{x})$ is
seen to define a continuous vector field over the reference
configuration of the solid that describes the complete
displacement field associated with a deformation.

2.2 Strain 

Commonly, the displacement field $\mathbf{u}(x_1, x_2, x_3)$ varies
from point to point throughout the reference configuration
of the solid body. Also, the length of a material line
element connecting any two adjacent material points
in the deformed configuration differs from the length
of that line element in the reference configuration.
The concept of strain is introduced to systematically 
quantify this difference. 

Suppose that we focus on a generic material point
within the solid that is identified in the reference configuration
by its position $x_i$ (not necessarily the corner
point illustrated in figure 2) and on a second material
point an infinitesimal distance away at position
$x_i + dx_i$. The length and orientation of the material
line joining these two points are defined by the vector
$dx_i$. After deformation, these points have moved to
the positions $\xi_i$ and $\xi_i + d\xi_i$, respectively. The mapping
of each line element $dx_i$ to $d\xi_i$ characterizes the
deformation at that material point.

Following deformation, the location of the endpoint 
of the infinitesimal material line of interest is 

\[ \xi_i + d\xi_i = x_i + dx_i + u_i(x_1 + dx_1, \ldots). \tag{4} \]

The displacement vector at this point is expanded in
a Taylor series about the point $x_i$, and only the first-order
terms in the increments $dx_i$ are retained. Then

\[ u_i(x_1 + dx_1, \ldots) \approx u_i(x_1, \ldots) + \partial_j u_i(x_1, \ldots)\, dx_j, \]

where $\partial_j u_i$ denotes the matrix of components of the
displacement gradient tensor in the reference configuration.
In view of (4), the deformed material line
element is

\[ d\xi_i = F_{ij}\, dx_j, \qquad F_{ij} = \delta_{ij} + \partial_j u_i, \tag{5} \]

where $\delta_{ij}$ is the identity matrix and $F_{ij}$ is the deformation
gradient. Thus, the deformation gradient provides
a complete description of the deformation of the neigh- 
borhood of a material point. Furthermore, the determi- 
nant of the deformation gradient at a point in a defor- 
mation field yields the ratio of the volume of a small 
material element at that point to the volume of that 
same material element in the reference configuration. 
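
As a minimal numerical sketch of (5), the deformation gradient can be estimated from any displacement field by finite differences, and its determinant then gives the local volume ratio. The function names, the central-difference step h, and the two test fields below are illustrative assumptions rather than anything prescribed in the text.

```python
def displacement_gradient(u, x, h=1e-6):
    """Central-difference estimate of H[i][j] = d u_i / d x_j at the point x."""
    H = [[0.0] * 3 for _ in range(3)]
    for j in range(3):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        u_plus, u_minus = u(x_plus), u(x_minus)
        for i in range(3):
            H[i][j] = (u_plus[i] - u_minus[i]) / (2.0 * h)
    return H


def deformation_gradient(u, x, h=1e-6):
    """F_ij = delta_ij + d_j u_i, following (5)."""
    H = displacement_gradient(u, x, h)
    return [[(1.0 if i == j else 0.0) + H[i][j] for j in range(3)]
            for i in range(3)]


def det3(F):
    """Determinant of a 3x3 matrix: ratio of deformed to reference volume."""
    return (F[0][0] * (F[1][1] * F[2][2] - F[1][2] * F[2][1])
            - F[0][1] * (F[1][0] * F[2][2] - F[1][2] * F[2][0])
            + F[0][2] * (F[1][0] * F[2][1] - F[1][1] * F[2][0]))


def simple_shear(x):
    """Hypothetical test field: u_1 = 0.3 x_2, u_2 = u_3 = 0."""
    return [0.3 * x[1], 0.0, 0.0]


def uniform_dilation(x):
    """Hypothetical test field: every direction stretched by 10%."""
    return [0.1 * x[0], 0.1 * x[1], 0.1 * x[2]]


point = [0.2, 0.5, 0.1]
print(det3(deformation_gradient(simple_shear, point)))      # volume ratio, shear
print(det3(deformation_gradient(uniform_dilation, point)))  # volume ratio, dilation
```

Simple shear preserves volume, so the first ratio is 1 to within rounding, while the dilation gives a ratio close to 1.1 cubed.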

Suppose for the moment that the infinitesimal material
line element represented by $dx_i$ in the reference
configuration has length $ds_0$ and direction $m_i$, with
$m_i m_i = 1$, and that $d\xi_i\, d\xi_i = ds^2$; recall that a
repeated index in an expression implies summation
over its range. Then, forming the inner product of each
side of (5)$_1$ with itself and dividing by $ds_0^2$ yields

\[ \frac{ds^2 - ds_0^2}{ds_0^2} = 2 E_{ij} m_i m_j, \tag{6} \]

\[ E_{ij} = \tfrac{1}{2}\left(\partial_j u_i + \partial_i u_j + \partial_i u_k\, \partial_j u_k\right), \tag{7} \]

where $E_{ij}$ is the symmetric matrix of components of
Lagrange strain. 
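
The following sketch evaluates (7) for a given displacement gradient and uses (6) to recover the stretch ratio of a line element. The matrix layout H[i][j] for $\partial_j u_i$, the function names, and the sample gradients are assumptions made for illustration.

```python
import math


def lagrange_strain(H):
    """E_ij from (7), where H[i][j] holds the displacement gradient d_j u_i."""
    return [[0.5 * (H[i][j] + H[j][i]
                    + sum(H[k][i] * H[k][j] for k in range(3)))
             for j in range(3)] for i in range(3)]


def stretch_ratio(E, m):
    """ds/ds0 for a unit direction m, using (6): (ds/ds0)^2 = 1 + 2 E_ij m_i m_j."""
    quad = sum(E[i][j] * m[i] * m[j] for i in range(3) for j in range(3))
    return math.sqrt(1.0 + 2.0 * quad)


# Illustrative gradient: a 50% extension along the 1-direction only.
H_stretch = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
E = lagrange_strain(H_stretch)
print(E[0][0], stretch_ratio(E, [1.0, 0.0, 0.0]))
```

For this gradient the quadratic term contributes 0.125 to $E_{11}$, and the recovered stretch ratio along the 1-direction is 1.5, as expected for a 50% extension.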


To develop some sense of the geometrical character
of $E_{ij}$, suppose that $\lambda_1 = ds/ds_0$ is the stretch ratio of a
material line element that has direction $\mathbf{m} = \mathbf{e}_1$ in the
reference configuration. Then (6) implies that

\[ \lambda_1^2 - 1 = 2 E_{11}. \tag{8} \]

Similarly, for two infinitesimal material lines emanating
from the same material point in the reference configuration,
it is possible to express the angle between
these two lines in the deformed configuration in terms
of the Lagrange strain. For example, consider the line
elements $dx_i^a = ds_0^a\, m_i^a$ and $dx_i^b = ds_0^b\, m_i^b$. For the case
when $m_i^a m_i^b = 0$, the result is

\[ \lambda_a \lambda_b \cos\left(\tfrac{1}{2}\pi - \gamma_{ab}\right) = 2 E_{ij} m_i^a m_j^b, \tag{9} \]

where $\gamma_{ab}$ is the reduction in angle between the lines,
called the shear strain, and $\lambda_a$ and $\lambda_b$ are the stretch
ratios of the line elements. For the particular case when
the a- and b-directions are the 1- and 2-directions
and when the stretch ratios are equal to 1, this result
reduces to $\sin \gamma_{12} = 2 E_{12} = 2 E_{21}$.

2.2.1 Small Strain 

Up to this point in the discussion of strain, no assump- 
tion about the deformation has been invoked, other 
than continuity of the displacement field. In particu- 
lar, the expressions for stretch and rotation are valid 
for deformations of arbitrary magnitude. These expressions
simplify considerably in cases for which the
deformation is “small” in some sense. Usually, the
strain is understood to be small if $|\partial_i u_j| \ll 1$ for
each choice of the indices i and j. If the deformation
of a solid material is small, then the strain (7) can be
approximated by the small strain matrix

\[ \varepsilon_{ij} = \tfrac{1}{2}\left(\partial_j u_i + \partial_i u_j\right). \tag{10} \]

The stretch ratio in the 1-direction is simply

\[ \frac{ds}{ds_0} = 1 + \varepsilon_{11}, \tag{11} \]

and the expression for shear strain given in (9) becomes

\[ \gamma_{ij} = 2 \varepsilon_{ij}, \qquad i \neq j. \tag{12} \]

Deformations falling outside the range of small strains 
are common, for example, when material line elements 
are stretched to several times their initial lengths in 
metal forming or when a thin compliant solid, such as 
a plastic ruler, is bent into a U-shape for which material 
line elements rotate by as much as $\pi/2$.
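
One way to see why large rotations fall outside the small-strain range is to apply both strain measures to a rigid rotation, which involves no deformation at all. The sketch below does this for a rotation about the 3-axis; the function names and the chosen angles are illustrative assumptions.

```python
import math


def small_strain(H):
    """Small strain (10): the symmetric part of the displacement gradient."""
    return [[0.5 * (H[i][j] + H[j][i]) for j in range(3)] for i in range(3)]


def lagrange_strain(H):
    """Lagrange strain (7), with H[i][j] holding d_j u_i."""
    return [[0.5 * (H[i][j] + H[j][i]
                    + sum(H[k][i] * H[k][j] for k in range(3)))
             for j in range(3)] for i in range(3)]


def rotation_gradient(theta):
    """Displacement gradient R - I for a rigid rotation by theta about the 3-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c - 1.0, -s, 0.0], [s, c - 1.0, 0.0], [0.0, 0.0, 0.0]]


for theta in (0.01, 0.1, math.pi / 2):
    H = rotation_gradient(theta)
    # Columns: angle, small-strain 11-component, Lagrange 11-component.
    print(theta, small_strain(H)[0][0], lagrange_strain(H)[0][0])
```

The Lagrange strain vanishes for every angle, whereas the small-strain component equals $\cos\theta - 1$, which is negligible for small angles but reaches $-1$ for a rotation of $\pi/2$.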

2.2.2 An Example of Deformation


Suppose that the solid body depicted as a cube of edge
length $\ell_0$ in figure 2 deforms into a rectangular parallelepiped
with edge length $\ell > \ell_0$ in the 3-direction
and with every plane section $\xi_3 = \text{const.}$ being one and
the same square shape. This is an example of a homogeneous
deformation; that is, all initially cubic portions of
the solid with edges aligned with the coordinate directions
deform into the same shape. Two consequences
of homogeneity are that the deformation gradient is
constant throughout the reference configuration and
that the displacement field is linear in $x_1$, $x_2$, and $x_3$.
Finally, rigid body motions are ruled out by requiring
that the material line initially along $x_1 = x_2 = \tfrac{1}{2}\ell_0$
remains along that line and that the surface $x_3 = 0$
remains in that plane. These features of the deformation
constrain the displacement field $\mathbf{u}(\mathbf{x})$ in the
reference configuration to be

\[ \begin{aligned}
u_1(x_1, x_2, x_3) &= (\lambda - 1)\left(x_1 - \tfrac{1}{2}\ell_0\right), \\
u_2(x_1, x_2, x_3) &= (\lambda - 1)\left(x_2 - \tfrac{1}{2}\ell_0\right), \\
u_3(x_1, x_2, x_3) &= (\lambda_3 - 1)\, x_3,
\end{aligned} \tag{13} \]

where $\lambda_3 = \ell/\ell_0$ is the imposed stretch ratio in the 3-direction
and $\lambda$ is the unknown stretch ratio in both the
1- and 2-directions.

What value of $\lambda$ ensures that the deformation is
volume-preserving? This question can be addressed by
noting that the local value of the determinant of the
deformation gradient matrix is the ratio of the volume
of a certain infinitesimal material element in the current
configuration to the volume of that same material
element in the reference configuration. In the present
case, the deformation gradient is

\[ F = \begin{pmatrix} \lambda & 0 & 0 \\ 0 & \lambda & 0 \\ 0 & 0 & \lambda_3 \end{pmatrix}. \tag{14} \]

The requirement that $\det(F_{ij}) = 1$ leads to the conclusion
that $\lambda$ has the value

\[ \lambda = \sqrt{\ell_0/\ell}. \tag{15} \]
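
The example can be checked numerically with a short sketch. It assumes the transverse stretch is the square root of the ratio of reference to deformed edge length, as the volume constraint requires, and uses hypothetical names l0 and l for those two lengths.

```python
import math


def transverse_stretch(l0, l):
    """Stretch ratio in the 1- and 2-directions for a volume-preserving stretch."""
    return math.sqrt(l0 / l)


def displacement(x, l0, l):
    """Displacement field (13) for the stretched cube of reference edge length l0."""
    lam = transverse_stretch(l0, l)
    lam3 = l / l0
    return [(lam - 1.0) * (x[0] - 0.5 * l0),
            (lam - 1.0) * (x[1] - 0.5 * l0),
            (lam3 - 1.0) * x[2]]


def volume_ratio(l0, l):
    """Determinant of the diagonal deformation gradient (14)."""
    lam = transverse_stretch(l0, l)
    return lam * lam * (l / l0)


# Illustrative case: the cube is stretched to twice its original length.
print(transverse_stretch(1.0, 2.0), volume_ratio(1.0, 2.0))
print(displacement([0.5, 0.5, 0.3], 1.0, 2.0))  # a point on the central line
```

Points on the central line have no transverse displacement, which reflects the condition used to rule out rigid body motion.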


2.3 Stress 

The concept of stress provides the basis for quantifying 
the transmission of mechanical force across a material 
surface. Suppose that a solid body is in a state of equi- 
librium under the action of applied forces. Consider 
the resulting distributed force per unit area acting on a 
smooth material surface that divides the solid into two
parts. Denote a unit vector normal to the surface at any
point by $\mathbf{n}$.

Next, imagine that the portion of the solid into which
$\mathbf{n}$ is directed is removed. The force per unit area acting
on the exposed surface is a vector-valued function of
position, say $\mathbf{t}(\mathbf{n})$, where the notation implies that the
result depends not only on the location of that point on
the surface but also on the orientation of the surface
at that point; this vector quantity is commonly called
the traction or the stress vector. By considering smooth
surfaces for which $\mathbf{n} = \mathbf{e}_k$ for $k = 1, 2, 3$ successively,
we are led to an array of nine quantities at any material
point that can be expressed collectively as a matrix, say
$\sigma_{ij}$. For example, traction on the surface with normal
vector $\mathbf{e}_1$ is written in component form as

\[ \mathbf{t}(\mathbf{e}_1) = \sigma_{11}\mathbf{e}_1 + \sigma_{12}\mathbf{e}_2 + \sigma_{13}\mathbf{e}_3. \tag{16} \]

The matrix $\sigma_{ij}$ represents the components of a tensor
$\boldsymbol{\sigma}$ at each point in the material called the stress tensor.
It relates the local normal vector directed outward
from an arbitrary material surface passing through
that point to the corresponding force per unit area
transmitted across that material surface according to

\[ t_j(\mathbf{n}) = \sigma_{ij} n_i. \tag{17} \]

In order for the angular momentum of an infinitesi- 
mal material element to be conserved, the stress matrix 
must be symmetric; that is, $\sigma_{ij} = \sigma_{ji}$. The tensor character
of stress is evident in (17), which identifies $\sigma_{ij}$ as
a linear operator that, when applied to a direction n at 
a material point, yields the local traction transmitted 
across a surface passing through that point with local 
outward normal n. 
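
A short sketch of the Cauchy relation (17) follows. It computes the traction on a plane with unit normal n and splits it into a normal component and a tangential (shear) magnitude. The uniaxial stress state, its numerical value, and the 30 degree inclination are illustrative assumptions.

```python
import math


def traction(sigma, n):
    """t_j = sigma_ij n_i, the Cauchy relation (17)."""
    return [sum(sigma[i][j] * n[i] for i in range(3)) for j in range(3)]


def normal_and_shear(sigma, n):
    """Normal component t.n and magnitude of the tangential part t - (t.n) n."""
    t = traction(sigma, n)
    t_normal = sum(t[j] * n[j] for j in range(3))
    tangential = [t[j] - t_normal * n[j] for j in range(3)]
    t_shear = math.sqrt(sum(c * c for c in tangential))
    return t_normal, t_shear


# Illustrative uniaxial state: only the 33-component is nonzero (arbitrary units).
s33 = 100.0
sigma = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, s33]]
alpha = math.radians(30.0)
n = [math.sin(alpha), 0.0, math.cos(alpha)]  # normal tilted from the 3-axis
print(normal_and_shear(sigma, n))
```

For a uniaxial state the two returned values should match the closed forms $\sigma_{33}\cos^2\alpha$ and $\sigma_{33}\sin\alpha\cos\alpha$.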

It was tacitly assumed in the foregoing discussion 
that stress is defined in a particular configuration of the 
solid body. If the configuration is the current deformed 
configuration, then stress is commonly called true 
stress or Cauchy stress. In the reference configuration, 
both the area and the orientation of any particular 
material surface element may be different, leading to 
a different but related description of stress called the 
nominal stress. This issue must be addressed directly in 
describing large deformation phenomena, but it may be
overlooked without significant error when deformation 
is small. 

The foregoing discussion has been based on transmission
of force across a material surface interior to
a solid body. The concept is also central to identifying
boundary conditions on stress in the formulation
of a boundary-value problem in solid mechanics. The
Cauchy relationship (17) can be used to infer boundary
conditions for components of stress from the known
imposed surface traction.

2.3.1 An Example of Stress and Traction

Recall that the solid block depicted in figure 2 has
undergone stretching in the 3-direction and equibiaxial
contraction in the 1- and 2-directions, so as to conserve
the volume of the material. This deformation is induced
by a traction distributed on each material plane perpendicular
to the 3-direction with a resultant force, say $P$,
as illustrated in figure 3. The stretch ratio $\lambda$ in the 1-
and 2-directions is given in terms of the stretch ratio in
the 3-direction, $\ell/\ell_0$, in (15).

Figure 3 Traction induced by applied tension
on an interior plane with normal n.

The traction $\mathbf{t}(\mathbf{e}_3)$ is necessarily distributed uniformly
over any plane perpendicular to the 3-direction.
In particular, the traction $\mathbf{t}(\mathbf{e}_3)$ has a single component
in the 3-direction of magnitude $\sigma_{33} = P\ell/\ell_0^3$, since the
current cross-sectional area is $\lambda^2 \ell_0^2 = \ell_0^3/\ell$. This is
the only nonzero component of true stress throughout
the block.

The state of stress is uniform throughout the block,
including everywhere on the interior plane with unit
normal vector $\mathbf{n} = \sin\alpha\, \mathbf{e}_1 + \cos\alpha\, \mathbf{e}_3$. The traction acting
on this material surface is provided by the Cauchy
relation (17) as

\[ \mathbf{t}(\mathbf{n}) = \sigma_{33} \cos\alpha\, \mathbf{e}_3. \tag{18} \]

The component normal to the inclined surface is the
normal stress $\mathbf{t} \cdot \mathbf{n} = \sigma_{33} \cos^2\alpha$. The component tangent
to the surface is the shear stress $\mathbf{t} - (\mathbf{t} \cdot \mathbf{n})\mathbf{n}$ with
magnitude $\sigma_{33} \sin\alpha \cos\alpha$.

3 Governing Equations

With reference to the map in figure 1, the next step is
to identify the equations comprising a boundary-value
problem in the mechanics of solids. These equations
provide a description of the response of the material
relating stress and strain, the governing physical postulate
relating stress and motion, and the compatibility
equation relating strain and displacement. For a more
formal introduction to these equations, see the article
on CONTINUUM MECHANICS [IV.26].

3.1 Material Behavior

The variety of specific equations that have been adopt- 
ed to describe the deformation of a material in re- 
sponse to applied stress is enormous and growing, 
driven by special applications, development of new 
materials, and increasingly stringent demands on pre- 
cision in the use of traditional materials. Nonetheless, 
there are some basic requirements for any proposed 
description of response to be admissible. Briefly stated, 
the principal restrictions are that response must be 
consistent with the laws of thermodynamics and that 
the response must be independent of the frame of 
reference assumed by the observer describing it. The 
simple models for material behavior included here are 
consistent with these restrictions. 

If all aspects of material response of a solid are iden- 
tical from point to point within the body, then the mate- 
rial constituting that body is said to be homogeneous. If, 
on the other hand, the response of a solid at any mate- 
rial point due to an arbitrary but fixed state of stress 
is invariant under certain rotations of the material, 
with the state of stress unchanged, then the orthogonal 
transformations relating these particular orientations 
to the original orientation are collectively termed the 
isotropy group. If the isotropy group includes all pos- 
sible rotations, then the material is said to be isotropic 
at that point; in common usage, a material is said to
be isotropic if it is isotropic at each point. Both homogeneity
and isotropy result in enormous simplification
when dealing with boundary-value problems for solid 
bodies. 

3.1.1 Linear Elastic Material 

When used to describe material behavior, the term 
elastic usually implies that the deformation due to an 
applied stress cycle is reversible, repeatable, and inde- 
pendent of the rate of application of stress. Here, we 
also take it to imply that any material line will increase/ 
decrease in length when its temperature is increased/ 
decreased. Many elastic materials undergoing small 
strain exhibit a strain response that is linear in the 
applied stress or linear in the response to a temper- 
ature change T. The dependence of strain on stress at 
a material point in an isotropic linear elastic solid can 
be expressed compactly as 

\[ \varepsilon_{ij} = \frac{1 + \nu}{E}\, \sigma_{ij} - \frac{\nu}{E}\, \sigma_{kk} \delta_{ij} + \alpha T \delta_{ij}. \tag{19} \]

The constant $E > 0$ is called Young’s modulus; it has
physical dimensions of force per unit area, and it is the
ratio of applied stress to induced strain in uniaxial tension
of any stable elastic material. The dimensionless
constant $\nu$ is called Poisson’s ratio; it is the ratio of the
contractive strain transverse to the tensile axis in uniaxial
tension to the extensional strain along the tensile
axis; it has values in the range $0 < \nu < \tfrac{1}{2}$ for homogeneous
materials, but microstructures can be devised
for which $-1 < \nu < \tfrac{1}{2}$. The constant $\alpha$ is called the coefficient
of thermal expansion; it is the extensional strain
of any material line element in the solid per degree 
increase in temperature. 
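
The roles of the three constants can be illustrated by evaluating the isotropic stress-strain relation (19) directly. The sketch below assumes the usual form of that relation, with coefficients $(1+\nu)/E$ and $\nu/E$ and a thermal term $\alpha T$ on the diagonal; the material values are illustrative, roughly steel-like numbers, and are not taken from the text.

```python
def strain_from_stress(sigma, E, nu, alpha=0.0, T=0.0):
    """Strain from (19): (1+nu)/E sigma_ij - nu/E sigma_kk delta_ij + alpha T delta_ij."""
    trace = sigma[0][0] + sigma[1][1] + sigma[2][2]
    return [[(1.0 + nu) / E * sigma[i][j]
             + ((-nu / E * trace + alpha * T) if i == j else 0.0)
             for j in range(3)] for i in range(3)]


# Illustrative values (assumptions): E in Pa, nu dimensionless, alpha per kelvin.
E, nu, alpha = 200.0e9, 0.3, 12.0e-6

# Uniaxial tension of 100 MPa along the 1-direction, no temperature change.
uniaxial = [[100.0e6, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
eps = strain_from_stress(uniaxial, E, nu)
print(eps[0][0], eps[1][1], -eps[1][1] / eps[0][0])

# Stress-free heating by 50 K gives equal extension in every direction.
zero = [[0.0] * 3 for _ in range(3)]
print(strain_from_stress(zero, E, nu, alpha, 50.0)[0][0])
```

In uniaxial tension the axial strain equals the stress divided by E, and the ratio of transverse contraction to axial extension recovers Poisson's ratio.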

Together, these three constants represent a com- 
plete description of the behavior of a homogeneous and 
isotropic elastic material. Any other material constant 
is necessarily representable in terms of all or some of 
these. For example, consider the ratio of the magnitude
of an equitriaxial stress $\sigma_{11} = \sigma_{22} = \sigma_{33} = \sigma_v$ to the
induced equitriaxial strain $\varepsilon_v$ at constant temperature.
If the terms in (19) are each contracted over the indices
i and j with $T = 0$, it follows that

\[ 3\varepsilon_v = \frac{1 - 2\nu}{E}\, 3\sigma_v, \tag{20} \]

where $3\varepsilon_v$ is the volume change to lowest order in
strain. Consequently, the bulk modulus of an isotropic
elastic material, the ratio of $\sigma_v$ to the volume change $3\varepsilon_v$,
is expressible in terms of $E$ and $\nu$ as
$E/[3(1 - 2\nu)]$. It is evident that if $\nu = \tfrac{1}{2}$, then any triaxial
state of stress induces no volume change whatsoever;
that is, the material is incompressible if $\nu = \tfrac{1}{2}$.

Now suppose that the only nonzero component of
stress is a shear component, say $\sigma_{12}$. Furthermore, suppose
that the corresponding strain component $\varepsilon_{12}$ is
expressed in terms of the actual shear strain $\gamma_{12}$ as
observed in (12). The form of (19) for this case then
shows that the elastic shear modulus is

\[ \mu = \frac{\sigma_{12}}{\gamma_{12}} = \frac{E}{2(1 + \nu)}. \tag{21} \]


3.1.2 Elastic-Ideally Plastic Response 

This terminology is applicable to the description of 
deformation of polycrystalline metals in metal forming 
or the plastic collapse of metal structures. The central 
idea underlying this type of behavior is that the range 
of stress values over which a material responds elasti- 
cally is limited. For most metals, plastic deformation is 
insensitive to mean normal stress; it is a response to 
the total stress less the mean normal stress, called the 
deviatoric stress, which is defined as 

\[ s_{ij} = \sigma_{ij} - \tfrac{1}{3}\sigma_{kk}\delta_{ij}. \tag{22} \]

The limit of elastic behavior is expressed in the form 
of a surface in stress space, usually called a yield 
surface. An example of such a surface is

\[ \sqrt{\tfrac{3}{2}\, s_{ij} s_{ij}} \;
\begin{cases}
< \sigma_y, & \text{elastic response}, \\
= \sigma_y, & \text{plastic flow}, \\
> \sigma_y, & \text{inaccessible},
\end{cases} \tag{23} \]

where $\sigma_y > 0$ is the yield stress or flow stress, usually
the magnitude of stress at which the elastic limit is
reached in a uniform bar subjected to uniaxial tension
or compression.

While the state of stress remains on the yield surface,
plastic deformation can proceed without change
in stress. This accounts for the phenomenon of plastic
collapse, whereby metal structures appear to fail catastrophically
at a constant level of load. Perhaps the simplest
description of ongoing plastic flow assumes that
the strain rate is proportional to the stress rate, so that
the response is rate independent. In general, the current
strain at a material point in a plastically deforming
material depends on the entire history of strain at that
point.
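
The yield condition can be sketched in code under the assumption that the scalar compared with the yield stress in (23) is the Mises measure, the square root of $\tfrac{3}{2} s_{ij} s_{ij}$, which reduces to the applied stress in uniaxial tension. The function names, the tolerance used to detect the yield surface, and the numerical values are illustrative assumptions.

```python
import math


def deviatoric(sigma):
    """Deviatoric stress (22): total stress less the mean normal stress."""
    mean = (sigma[0][0] + sigma[1][1] + sigma[2][2]) / 3.0
    return [[sigma[i][j] - (mean if i == j else 0.0) for j in range(3)]
            for i in range(3)]


def equivalent_stress(sigma):
    """Assumed Mises measure sqrt(3/2 s_ij s_ij)."""
    s = deviatoric(sigma)
    return math.sqrt(1.5 * sum(s[i][j] ** 2 for i in range(3) for j in range(3)))


def classify(sigma, sigma_y, tol=1e-9):
    """Locate a stress state relative to the yield surface, within a relative tolerance."""
    q = equivalent_stress(sigma)
    if q < sigma_y * (1.0 - tol):
        return "elastic response"
    if q <= sigma_y * (1.0 + tol):
        return "plastic flow"
    return "inaccessible"


sigma_y = 250.0  # illustrative yield stress, arbitrary units
tension = [[250.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
with_pressure = [[150.0, 0.0, 0.0], [0.0, -100.0, 0.0], [0.0, 0.0, -100.0]]
print(classify(tension, sigma_y), classify(with_pressure, sigma_y))
```

The second state is the first with a uniform pressure superposed, so both have the same deviatoric stress, in line with the insensitivity of plastic deformation to mean normal stress.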


3.2 Stress Equilibrium 

In general, the state of stress varies from point to point 
in a deforming solid, giving rise to local gradients in
stress components. A gradient in stress across a small
material element implies an imbalance in the force or
moment on the element, so these gradients must be
related in order to ensure equilibrium. To see how
they must be related, we begin by noting
that, in the absence of any other applied forces or
moments, the resultant of all surface tractions on the 
bounding surface S of a solid occupying volume V must 
be zero at equilibrium. When written as a surface inte- 
gral, this requirement is ideally suited for application 
of the divergence theorem, so that 

\[ 0 = \int_S \sigma_{ij} n_i \, dS = \int_V \partial_i \sigma_{ij} \, dV. \tag{24} \]

Next, we observe that not only is the entire solid sub- 
ject to this requirement but so is every part of it. It fol- 
lows that a spatially nonuniform distribution of Cauchy 
stress $\sigma_{ij}$ must satisfy the three conditions

\[ \partial_i \sigma_{ij} = 0, \qquad j = 1, 2, 3, \tag{25} \]

pointwise throughout the deformed configuration of 
the solid. This requirement is called the stress equilib- 
rium equation. Once again, if the deformation is locally 
small, this condition can be imposed in the reference 
configuration without significant error. 
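
Whether a proposed stress field satisfies the equilibrium condition can be tested numerically by estimating the divergence of the stress at a point. The sketch below uses central differences; the two polynomial stress fields are hypothetical examples chosen so that one is self-equilibrated and the other is not.

```python
def stress_divergence(sigma, x, h=1e-5):
    """Central-difference estimate of d_i sigma_ij at x, returned for j = 1, 2, 3."""
    div = [0.0, 0.0, 0.0]
    for i in range(3):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        s_plus, s_minus = sigma(x_plus), sigma(x_minus)
        for j in range(3):
            div[j] += (s_plus[i][j] - s_minus[i][j]) / (2.0 * h)
    return div


def balanced_field(x):
    """Hypothetical plane field whose in-plane gradients cancel."""
    x1, x2 = x[0], x[1]
    return [[x1 * x1, -2.0 * x1 * x2, 0.0],
            [-2.0 * x1 * x2, x2 * x2, 0.0],
            [0.0, 0.0, 0.0]]


def unbalanced_field(x):
    """Hypothetical field with an unopposed gradient of the 11-component."""
    return [[x[0], 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]


point = [0.4, -0.7, 0.2]
print(stress_divergence(balanced_field, point))
print(stress_divergence(unbalanced_field, point))
```

A nonzero result for the second field indicates that a body force or an inertia term would be needed to balance it.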

If the solid being considered were subjected to a grav- 
itational or electrostatic field, for example, then the 
left-hand side of the equilibrium equation (25) would 
include a distributed body force per unit material vol- 
ume to represent the influence of such a field. Similarly, 
if the deformation of the material occurred at a rate 
sufficient to induce inertial resistance of the material 
to motion, the right-hand side of (25) would include a 
term in the form of the local rate of change of material 
momentum per unit volume. 

3.3 Strain Compatibility 

The notion of a strain compatibility requirement arises 
from the fact that, for a spatially nonuniform deforma- 
tion, there are three independent displacement compo- 
nents at every point but six independent strain compo- 
nents. Given any distribution of the three displacement 
components, it is a straightforward matter to deter- 
mine the corresponding strain distribution; see (7) or 
(10), for example. On the other hand, given an arbi- 
trary distribution of each of six independent compo- 
nents of strain, it is not always possible to determine a 
displacement field from which that strain distribution 
can be deduced. The strain compatibility equations pro- 
vide restrictions on the strain distribution, essentially 
in the form of integrability conditions, which ensure
that three geometrically realizable components of displacement
can be determined from the six prescribed
components of strain. 
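
For small strain in two dimensions, the integrability conditions reduce to a single standard equation, which is not written out in this article: the second derivative of $\varepsilon_{11}$ with respect to $x_2$, plus the second derivative of $\varepsilon_{22}$ with respect to $x_1$, must equal twice the mixed derivative of $\varepsilon_{12}$. The sketch below evaluates the residual of that condition by finite differences; the strain fields and function names are illustrative assumptions.

```python
def second_derivative(f, x, i, j, h=1e-3):
    """Central-difference estimate of d^2 f / (dx_i dx_j) at x = [x1, x2]."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, -h)
            - shifted(-h, h) + shifted(-h, -h)) / (4.0 * h * h)


def compatibility_residual(e11, e22, e12, x, h=1e-3):
    """d2(e11)/dx2^2 + d2(e22)/dx1^2 - 2 d2(e12)/dx1 dx2; zero if compatible."""
    return (second_derivative(e11, x, 1, 1, h)
            + second_derivative(e22, x, 0, 0, h)
            - 2.0 * second_derivative(e12, x, 0, 1, h))


# Strains derived from the displacement u1 = x1 x2^2, u2 = x1^2 x2 via (10).
def e11(x): return x[1] ** 2
def e22(x): return x[0] ** 2
def e12(x): return 2.0 * x[0] * x[1]

# A strain field assigned arbitrarily, with no underlying displacement.
def zero(x): return 0.0

point = [0.3, 0.7]
print(compatibility_residual(e11, e22, e12, point))    # derived from a displacement
print(compatibility_residual(e11, zero, zero, point))  # arbitrary assignment
```

The first residual vanishes to rounding because the strains come from a displacement field, while the second does not, so no displacement field can produce that strain distribution.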

The issue of strain compatibility is of central importance
for the solution of boundary-value problems formulated
in terms of stress. When confronted with problems
such as these, the strain distribution that corresponds
to the stress in question through a constitutive
relationship must be a realizable deformation in a
three-dimensional Euclidean space.

4 Global Formulations 

In this section, methods that incorporate the local def- 
initions of stress and deformation introduced above, 
but that begin from a global physical postulate govern- 
ing behavior, are briefly introduced. 

4.1 The Principle of Virtual Work 

The principle of virtual work provides a gateway to
understanding the deformation and failure of solid
bodies from a broader perspective. Here, the subject
will be briefly introduced in terms of how it applies to
deformations under a small amount of strain, but its
applicability extends to general deformation. Suppose
that a solid body occupying the region $R$ of space, with
bounding surface $S$, is subjected to surface traction $t_i$
on part of the bounding surface $S_\sigma$ and to imposed
displacement on the remainder $S_u = S - S_\sigma$. Let $\varepsilon^*_{ij}$ be an
arbitrary distribution of compatible
strain throughout $R$ that is consistent with the condition
that the corresponding displacement field satisfies
$u^*_i = 0$ on $S_u$. The requirement

\[ \int_R \sigma_{ij} \varepsilon^*_{ij} \, dR = \int_{S_\sigma} t_k u^*_k \, dS \tag{26} \]

for every admissible $\varepsilon^*_{ij}$ and $u^*_i$ then ensures that the
stress distribution is an equilibrium distribution that is
consistent with the assigned boundary values of traction.
What makes this statement remarkable
is that the stress field and the deformation field
incorporated in the statement are completely uncoupled;
the inference is indeed independent of material
behavior.

For example, consideration of equilibrium of the 
material throughout the solid could begin with the 
requirement (26). Then, by exploiting the arbitrariness 
of the kinematic field, one is led to the conclusion that 
the stress field satisfies (25) pointwise throughout $R$
and is consistent with the imposed traction through
the Cauchy relation (17).

As a route to another result central to the study of
elasticity boundary-value problems, suppose that the
stress distribution appearing on the left-hand side in
(26) is expressed in terms of the actual strain in $R$
via Hooke’s law (19) and that $t_k$ is the actual imposed
boundary traction on $S_\sigma$. In addition, suppose that
$\varepsilon^*_{ij} = \delta\varepsilon_{ij}$, a slight perturbation from the actual strain,
and that $u^*_k = \delta u_k$, a slight perturbation from the
actual displacement on $S_\sigma$. In this case, the principle of
virtual work implies that a functional of deformation,
the potential energy

\[ \Pi = \tfrac{1}{2} \int_R \sigma_{ij} \varepsilon_{ij} \, dR - \int_{S_\sigma} t_k u_k \, dS, \tag{27} \]

is stationary under arbitrary small variations in the
deformation $\delta u_i$ that vanish on $S_u$.
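
The stationarity of the potential energy can be illustrated with a one-dimensional special case, which is an assumption made here for simplicity: a bar of length L and axial stiffness EA, fixed at one end, carrying a force P at the other, with a uniform axial strain c as the only unknown. The potential energy (27), written here as Pi, then becomes a quadratic function of c.

```python
def potential_energy(c, EA, length, end_force):
    """Strain energy EA L c^2 / 2 minus the work of the end force, P u(L), with u(L) = c L."""
    return 0.5 * EA * length * c * c - end_force * c * length


def stationary_strain(EA, end_force):
    """Strain at which the derivative of the potential energy with respect to c vanishes."""
    return end_force / EA


# Illustrative values (assumptions): EA in N, length in m, end force in N.
EA, length, P = 2.0e6, 1.5, 1.0e3
c_star = stationary_strain(EA, P)
dc = 1e-6
slope = (potential_energy(c_star + dc, EA, length, P)
         - potential_energy(c_star - dc, EA, length, P)) / (2.0 * dc)
print(c_star, slope)
```

The stationary strain is the force divided by EA, which is the equilibrium result for the bar, and the numerical slope of the energy vanishes there to within rounding.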

4.2 The Finite-Element Method 

Many computational techniques have been found to be 
useful in the field, but the one that has had the greatest 
impact is the finite-element method. The origins of the
concepts underlying the method are thought to reside
in early efforts to find approximate solutions of elliptic 
partial differential equations by enforcing these equa- 
tions in the so-called weak sense, through variants of 
the principle of virtual work. 

In the mechanics of solids, the basic idea of the finite- 
element method is to divide the domain of a continu- 
ous solid body into a finite number of subdomains, or 
elements. Within each element, particle displacement 
or some other mechanical field is assumed to be repre- 
sented by its values at a number of discrete points, or 
nodes, on the boundary of the elements, and an inter- 
polation scheme is adopted to extend the definition 
throughout each element. A model of material behavior 
is adopted, and the approximate deformation fields are 
required to abide by the principle of virtual work for 
arbitrary variations of the nodal displacements. This 
requirement generates as many equations governing 
the nodal values as there are nodes with unspecified 
values, and these are solved numerically. 
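
The steps just described can be sketched for the simplest possible case, an elastic bar fixed at one end and loaded axially. The one-dimensional setting, the two-node elements with linear interpolation, and all names and values below are assumptions chosen for illustration; the nodal equations are those obtained from the principle of virtual work for arbitrary variations of the free nodal displacements.

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def solve_bar(n_elem, length, EA, end_force, q=0.0):
    """Nodal displacements of a bar fixed at x = 0, with an end force and a
    uniform distributed axial load q per unit length."""
    n_nodes = n_elem + 1
    h = length / n_elem
    k = EA / h                      # stiffness of one two-node element
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    f = [0.0] * n_nodes
    for e in range(n_elem):         # assemble element contributions
        a, b = e, e + 1
        K[a][a] += k
        K[a][b] -= k
        K[b][a] -= k
        K[b][b] += k
        f[a] += 0.5 * q * h         # consistent nodal loads for uniform q
        f[b] += 0.5 * q * h
    f[-1] += end_force
    # Node 0 has a specified (zero) value, so only the remaining nodes are unknown.
    K_free = [row[1:] for row in K[1:]]
    u_free = gauss_solve(K_free, f[1:])
    return [0.0] + u_free


print(solve_bar(4, 1.0, 1.0, 1.0))
```

There is one equation per node with an unspecified value, as stated above. For this particular one-dimensional problem the nodal values coincide with the exact solution, which is a special property of the bar and not a general feature of the method.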

The method is ideally suited for implementation on a 
digital computer. A great deal is understood about the 
accuracy and convergence of numerical methods appli- 
cable to broad problem classes, and computational 
mechanics has assumed its place, along with analyti- 
cal mechanics and experimental mechanics, among the 
principal methodologies of the field. 



Figure 4 An elastic-ideally plastic cantilever beam. 

5 Selected Examples 

This summary is concluded with brief descriptions 
of results obtained by analyzing particular phenom- 
ena. These provide some sense of its range and, at 
the same time, introduce relatively simple results with 
broad implications for understanding the mechanics of 
solids. 

5.1 Plastic Limit Load 

The beam in figure 4 has length $L$ and square $b \times b$
cross section; it is composed of an elastic-ideally plastic
material with yield stress $\sigma_y$. Its left-hand end is
rigidly constrained, and its right-hand end is subjected 
to a force of increasing magnitude P in a direction 
transverse to the beam. The beam responds elastically 
until P reaches a level sufficient to induce plastic defor- 
mation at the section closest to the cantilevered end, 
the section that bears the largest bending moment. 
Plastic flow begins at the outermost portions of the 
section, where the elastic stress is largest in magni- 
tude. As P is increased further, plastic yielding spreads 
inward from both the top and bottom of the section 
until the two regions coalesce at the beam center plane. 
The material can then “flow” with no further increase 
in load; the prevailing load is the plastic limit load 

$P_L = \sigma_y b^3 / 4L$. (28)

It is noteworthy that the value of the limit load is inde- 
pendent of the details of the elastic deformation that 
leads to collapse. This feature makes it possible to cal-
culate $P_L$ based only on the dimensions and the know-
ledge that the cross section is fully plastic; thus, it is
possible to determine or estimate limit loads without 
reference to intervening elastic deformation. 
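
As a rough numerical illustration of (28), the following sketch compares the load at first yield with the plastic limit load for the square section. The first-yield expression $\sigma_y b^3/6L$ is the standard elastic beam-theory result and is an assumption not stated above; the function names and sample values are likewise illustrative only.

```python
def first_yield_load(sigma_y, b, L):
    """Load at which the outer fibers at the built-in end first yield.

    Assumes elastic beam theory: yield moment sigma_y * b**3 / 6 for a
    b-by-b section, and bending moment P * L at the built-in end.
    """
    return sigma_y * b**3 / (6.0 * L)


def plastic_limit_load(sigma_y, b, L):
    """Plastic limit load of equation (28): sigma_y * b**3 / (4 L)."""
    return sigma_y * b**3 / (4.0 * L)


# Illustrative values only: 250 MPa yield stress, 20 mm section, 1 m length.
sigma_y, b, L = 250e6, 0.02, 1.0
P_y = first_yield_load(sigma_y, b, L)
P_L = plastic_limit_load(sigma_y, b, L)
print(P_y, P_L, P_L / P_y)
```

Whatever dimensions are chosen, the ratio of the two loads is 3/2, a measure of the reserve capacity that remains after yielding begins.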

5.2 Stress Concentration at a Hole 

Figure 5 Tension of an elastic plate with a hole.

Figure 6 A simply supported elastic beam.

The large linearly elastic plate of uniform thickness $h$
in figure 5 contains a central hole of radius $a \gg h$ with
a traction-free edge. The plate is subjected to a uniform
remote traction of magnitude $\sigma_\infty$ along opposite edges.
At points in the plate much farther than $a$ away from
the hole, the state of stress is $\sigma_{11} = \sigma_\infty$, with other
stress components equal to zero. This stress distribu-
tion is not consistent with the condition of zero traction
on the surface of the hole and, as a result, the stress
field is perturbed in the vicinity of the hole. Analysis of
this linear elastic boundary-value problem leads to the
result that the stress adjacent to the edge of the hole at
$\theta = \frac{1}{2}\pi$ or $\frac{3}{2}\pi$ is $\sigma_{11} = 3\sigma_\infty$, with all other components
equal to zero, and at $\theta = 0$ or $\pi$ is $\sigma_{22} = -\sigma_\infty$, with
all other components equal to zero. The ratio of the
magnitude of the largest tensile stress component to
the magnitude of the applied stress is called the elastic
stress concentration factor, here equal to 3.
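
The two edge values quoted above are special cases of the hoop stress around the hole in the classical Kirsch solution, $\sigma_{\theta\theta} = \sigma_\infty (1 - 2\cos 2\theta)$ at $r = a$, with $\theta$ measured from the loading direction. That formula is not given in the text and is assumed here; the sketch below simply evaluates it.

```python
import math


def hoop_stress_at_hole(sigma_inf, theta):
    """Hoop stress on the edge of the hole (Kirsch solution, assumed).

    theta is the polar angle measured from the loading (1-) direction.
    """
    return sigma_inf * (1.0 - 2.0 * math.cos(2.0 * theta))


# Evaluate at the four points singled out in the text, for unit remote stress.
for theta in (0.0, 0.5 * math.pi, math.pi, 1.5 * math.pi):
    print(theta, hoop_stress_at_hole(1.0, theta))
```

The largest tensile value, three times the remote stress, occurs where the edge of the hole is parallel to the loading direction.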

5.3 Elastic Reciprocity

Consider the simply supported beam illustrated in fig-
ure 6. Suppose a transverse force of magnitude $P_1$ at an
arbitrary section at $x_1$ induces a transverse deflection
$\delta_2$ at a second arbitrarily selected section at $x_2$. In addi-
tion, suppose that a transverse force of magnitude $P_2$ is
applied at the section $x_2$ and that the resulting deflec-
tion at section $x_1$ is $\delta_1$. The elastic reciprocal theorem
then states that

$P_1 \delta_1 = P_2 \delta_2$ (29)

for any pair of section locations; that is, each force does
the same work through the deflection induced by the
other. An interesting corollary is that if any force $P$
applied at section $x_1$ induces the deflection $\delta$ at
section $x_2$, then application of that force $P$ at section
$x_2$ induces the same deflection $\delta$ at section $x_1$.

Figure 7 Cross sections of tubes in torsion.

Figure 8 A simply supported column under axial load.

5.4 Torsion of a Hollow Elastic Tube

A hollow tube of elastic material is a common config-
uration. It is used to transmit torque along the axis of
the tube. The torsional stiffness of this configuration,
the ratio of the torque transmitted to the relative rota-
tion of cross sections a unit distance apart, depends
strongly on details of the shape of the cross section. To
illustrate this point, consider a tube like the one in fig-
ure 7 with a continuous circular cross section of mean
radius $R$ and wall thickness $t \ll R$. The torsional stiff-
ness is $2\pi\mu R^3 t$. Then, if the cross section is cut along
the length of the tube but is otherwise unchanged, an
estimate of the torsional stiffness leads to the result
$\frac{2}{3}\pi\mu R t^3$. By conversion of the initially closed section
to an open section of the same cross-sectional area, the
torsional stiffness is reduced by a factor of $\frac{1}{3}(t/R)^2$.
If $R/t \approx 10$, then the factor is approximately 0.0033,
a dramatic reduction due only to a change from a
“closed” section to an “open” section.
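
The two stiffness estimates quoted in section 5.4 for the closed and the slit tube can be compared numerically. The sketch below is only an illustration; the function names and the sample values of the shear modulus $\mu$, the radius $R$, and the wall thickness $t$ are assumptions.

```python
import math


def closed_tube_stiffness(mu, R, t):
    """Torsional stiffness of the closed thin-walled tube: 2 pi mu R^3 t."""
    return 2.0 * math.pi * mu * R**3 * t


def slit_tube_stiffness(mu, R, t):
    """Torsional stiffness estimate for the slit tube: (2/3) pi mu R t^3."""
    return (2.0 / 3.0) * math.pi * mu * R * t**3


# Illustrative values with R/t = 10; mu cancels in the ratio.
mu, R, t = 1.0, 10.0, 1.0
reduction = slit_tube_stiffness(mu, R, t) / closed_tube_stiffness(mu, R, t)
print(reduction, (t / R) ** 2 / 3.0)
```

Both printed numbers equal $\frac{1}{3}(t/R)^2$, which is 1/300 for this choice of dimensions.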


5.5 Euler Buckling 

Suppose that the straight elastic column in figure 8, 
pinned at the left-hand end and constrained against 
deflection at the right-hand end, is subjected to an 
axial compressive load P. If the column is reasonably 
straight, it will support the load by undergoing uni- 
form compression to generate the appropriate stress 
to resist the load. But what about the stability of the 
configuration? Several criteria are available to assess 
stability. For example, it might be assumed that the 
load is never perfectly axial but, instead, has a slight 
eccentricity. In this case, we seek to identify the lowest 
value of P for which the transverse deflection will be 
indefinitely large, no matter how small the eccentricity. 
Alternatively, if the load is perfectly aligned, it might
be assumed that the column is given a slight transverse
vibration. In this case, we identify the lowest value of P
at which the vibration amplitude becomes indefinitely
large, no matter how small the initial magnitude. Yet
another criterion of stability is based on energy com-
parisons, an idea drawn from thermodynamics. In this
case, the lowest value of P is sought for which the total
energy of the system decreases when the straight col-
umn is given a slight perturbation in shape. Application
of any of these criteria leads to the Euler buckling load

$P_{\mathrm{cr}} = \pi^2 E b^4 / 12 L^2$ (30)

with simply supported end constraints.

Figure 9 A rigid sphere indenting an elastic solid.
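
A short numerical sketch of (30) follows. It also evaluates the energy criterion with the trial shape $w = x(L - x)$, a choice made here purely for illustration: the resulting estimate $12EI/L^2$ lies above the exact value, as expected of a Rayleigh-quotient bound. The moment of inertia $I = b^4/12$ assumes a square $b \times b$ section, and the sample values are hypothetical.

```python
import math


def euler_buckling_load(E, b, L):
    """Equation (30): pi^2 E I / L^2 with I = b^4 / 12."""
    return math.pi**2 * E * b**4 / (12.0 * L**2)


def energy_estimate_parabola(E, b, L):
    """Energy-based estimate with the trial deflection w = x (L - x).

    The ratio (integral of E I w''^2) / (integral of w'^2) equals
    E I * 4 L / (L^3 / 3) = 12 E I / L^2 for this trial shape.
    """
    I = b**4 / 12.0
    return 12.0 * E * I / L**2


# Illustrative values: E = 200 GPa, 20 mm square section, 1 m length.
E, b, L = 200e9, 0.02, 1.0
P_cr = euler_buckling_load(E, b, L)
P_est = energy_estimate_parabola(E, b, L)
print(P_cr, P_est, P_est / P_cr)
```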


5.6 Hertzian Elastic Contact 


Consider a smooth, nominally rigid sphere of radius R 
being pressed with force P into the plane surface of 
an isotropic elastic solid, as in figure 9. Initially, when 
P = 0, the sphere contacts the elastic solid at just one 
point on its surface. As the magnitude of P is increased, 
an area of contact between the sphere and the elastic 
solid develops; this area is circular due to symmetry 
and has radius a, say. Even though the elastic solid is 
linear in its response, the radius a does not increase in 
proportion to P. Instead, a increases in proportion to
$P^{1/3}$; the increase is sublinear because, as P increases,
the contact area also increases, resulting in an apparent
stiffening of the response. The depth of penetration $\delta$,
measured as displacement of the sphere from first con-
tact, also increases nonlinearly with increasing values
of P, with $\delta$ varying according to

$\delta = \left[ \frac{3(1 - \nu^2) P}{4 E \sqrt{R}} \right]^{2/3}$. (31)

Again, the nonlinearity in response derives from the
changing contact area, even though the material has a
linear stress-strain response. Hertz contact theory has
been remarkably useful in the field due to the combi-
nation of its relative simplicity and its broad range of
applicability.

Figure 10 Tensile loading of a cracked elastic plate.
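
The scaling of the contact radius and of the penetration depth with load can be sketched as follows. The expression used for the contact radius, $a = [3PR(1 - \nu^2)/4E]^{1/3}$, is the standard Hertz result for a rigid sphere; it is not written out in the text and is assumed here. It is consistent with (31) through $\delta = a^2/R$. The numerical values are illustrative.

```python
def contact_radius(P, R, E, nu):
    """Contact radius a = [3 P R (1 - nu^2) / (4 E)]^(1/3) (assumed form)."""
    return (3.0 * P * R * (1.0 - nu**2) / (4.0 * E)) ** (1.0 / 3.0)


def penetration_depth(P, R, E, nu):
    """Penetration depth of equation (31)."""
    return (3.0 * (1.0 - nu**2) * P / (4.0 * E * R**0.5)) ** (2.0 / 3.0)


# Illustrative values: 10 mm sphere, E = 70 GPa, nu = 0.3, loads of 10 N and 20 N.
R, E, nu = 0.01, 70e9, 0.3
a1, d1 = contact_radius(10.0, R, E, nu), penetration_depth(10.0, R, E, nu)
a2, d2 = contact_radius(20.0, R, E, nu), penetration_depth(20.0, R, E, nu)
print(a2 / a1, d2 / d1)  # ratios obtained when the load is doubled
```

Doubling the load multiplies the contact radius by $2^{1/3}$ and the depth by $2^{2/3}$, both less than 2.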

5.7 Elastic Crack 

An elastic plate is subjected to a uniform remotely 
applied tensile stress $\sigma_\infty$, as in figure 10. The plate is
uniform except for an interior line that is unable to 
transmit traction from one side to the other, that is, 
a crack. Based on the results observed above for the 
case of a plate containing a hole within an otherwise 
uniform stress field, we might ask about the nature of 
the stress concentration in this case. Recognizing that 
the region near the edge of the crack is essentially a 
$2\pi$ wedge, the configuration lends itself to separation
of variables in local polar coordinates. The dominant 
feature in such a solution shows that the stress com- 
ponents vary with position near the end of the crack 
according to 

$\sigma_{ij} \approx \frac{K}{\sqrt{2\pi r}} A_{ij}(\theta)$ as $r \to 0$, (32)

where $A_{ij}(-\theta) = A_{ij}(\theta)$ and $K$ is an amplitude called
the elastic stress intensity factor. For the configuration
considered here, $K = \sigma_\infty \sqrt{\pi a}$. The stress singular-
ity should not be viewed literally. Instead, it is known
that a stress field of this form surrounds the nonlinear 
crack edge region both in small laboratory samples and 
in large structural components. It is this feature that 
accounts for the wide use of elastic fracture mechanics 
for characterizing structural integrity. 
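
The inverse-square-root growth of the near-tip field in (32) can be illustrated with a few lines of code. The sketch evaluates only the radial factor $K/\sqrt{2\pi r}$; the angular functions $A_{ij}(\theta)$ are omitted, and the function names and sample values are assumptions.

```python
import math


def stress_intensity_factor(sigma_inf, a):
    """K = sigma_inf * sqrt(pi * a) for the configuration of figure 10."""
    return sigma_inf * math.sqrt(math.pi * a)


def near_tip_stress_scale(K, r):
    """Radial factor K / sqrt(2 pi r) of equation (32)."""
    return K / math.sqrt(2.0 * math.pi * r)


# Illustrative values: unit remote stress and crack dimension a = 2.
K = stress_intensity_factor(1.0, 2.0)
for r in (1.0, 0.25, 0.0625):
    print(r, near_tip_stress_scale(K, r))
```

Each reduction of $r$ by a factor of 4 doubles the stress scale, which is the signature of the $r^{-1/2}$ singularity.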

5.8 Tensile Instability 

Deformation of the block illustrated in figure 2 was dis-
cussed in section 2.2.2, and the state of stress driving
that deformation was considered in section 2.3.1. Sup-
pose now that the stretch ratio $\lambda_3$ in the 3-direction is
related to the Cauchy stress $\sigma_{33}$ acting on each mate-
rial plane perpendicular to the 3-direction according to
$\sigma_{33} = A(\lambda_3 - 1)^{1/n}$, where $A > 0$ is a material constant
and $n > 1$. Then, for any stretch $\lambda_3 > 1$, the total force
acting on each cross section perpendicular to the 3-
direction is $F(\lambda_3) = \sigma_{33} \ell_0^2 \lambda_2^2$. Incompressibility implies
that $\lambda_2^2 = \lambda_3^{-1}$. Under these conditions, the slope of
the resulting force versus stretch relationship becomes
zero at a stretch ratio of $\lambda_3 = n/(n - 1)$, and it is neg-
ative for larger stretch ratios. The implication is that
the load-carrying capacity of the tensile bar has been 
exhausted at that stretch ratio and, beyond that point, 
deformation can proceed with no further increase in 
load. This example illustrates the phenomenon of ten- 
sile instability, which is often associated with the onset 
of localized “necking” in a bar of ductile material under 
tension. 
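
The stationary point quoted above can be checked numerically. With incompressibility, the force is proportional to $(\lambda_3 - 1)^{1/n}/\lambda_3$; the sketch below locates the maximum of this curve by a ternary search and compares it with $n/(n - 1)$. The search interval and the function names are assumptions made for illustration.

```python
def normalized_force(stretch, n):
    """Force divided by its constant prefactor: (stretch - 1)^(1/n) / stretch."""
    return (stretch - 1.0) ** (1.0 / n) / stretch


def stretch_at_max_force(n, lo=1.0, hi=50.0, iterations=200):
    """Ternary search for the maximum of the single-peaked force curve."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if normalized_force(m1, n) < normalized_force(m2, n):
            lo = m1  # the maximum lies to the right of m1
        else:
            hi = m2  # the maximum lies to the left of m2
    return 0.5 * (lo + hi)


for n in (2.0, 4.0, 10.0):
    print(n, stretch_at_max_force(n), n / (n - 1.0))
```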
IV.33 Soft Matter

Randall D. Kamien 


1 Introduction 

What makes matter soft? It is easily deformed but,
consequently, it can easily self-assemble: sauce for the
goose... Maybe it should be called “robust.” Why is
this article called “Soft Matter,” then? Perhaps a better
term would be “la matière molle,” as used by Pierre-
Gilles de Gennes. Soft matter typically refers to sys-
tems where entropy dominates energetic effects. In the
case of hard materials, the energy scale is electronic.
Let us note some energy scales: room temperature,
around $T = 300$ K, corresponds to a thermal energy
(the product of temperature $T$ and Boltzmann’s con-
stant $k_B$ ($1.381 \times 10^{-23}$ J K$^{-1}$)) of $\frac{1}{40}$ electron volts. Com-
pared with binding energies in atoms and molecules,
which are a few electron volts, the thermal energy is 
tiny. Of course, the precise division between soft and 
“hard” materials is more subtle, but this serves as a rule 
of thumb. In this article we will touch on some of the 
classic soft materials and discuss statistical mechani- 
cal effects. Quantities like pressure, forces, and energy 
are all to be considered Boltzmann-weighted statistical 
averages. We will in turn consider point-like particles 
(colloids), line-like objects (polymers), two-dimensional 
sheets (membranes), and finally the three-dimensional 
continuum theories of liquid crystals. 
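
The room-temperature energy scale quoted above is simple arithmetic, sketched here. The value of Boltzmann's constant is the one given in the text; the joule-to-electron-volt factor is the standard value of the elementary charge and is not taken from the text.

```python
K_B = 1.381e-23       # Boltzmann's constant in J/K, as quoted in the text
E_CHARGE = 1.602e-19  # joules per electron volt (standard value, assumed)


def thermal_energy_ev(T):
    """Thermal energy k_B T, converted from joules to electron volts."""
    return K_B * T / E_CHARGE


kT = thermal_energy_ev(300.0)
print(kT, 1.0 / kT)  # energy in eV, and how many such units make up 1 eV
```

The result is close to 1/40 of an electron volt, roughly a hundred times smaller than a binding energy of a few electron volts.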

2 Colloids 

Colloidal suspensions consist of microscopic particles, 
the “colloids,” suspended in a fluid, often water. Ide- 
alized, the colloids are taken to be absolutely rigid, 
incompressible spheres that interact exclusively via 
excluded volume; that is, their only interaction is that
they cannot overlap in space. For spheres of radius $R$,
this could be modeled as a pair potential $U(r)$ tak-
ing the value 0 when $r \geq 2R$ and $\infty$ when $r < 2R$.
At first glance, because the energy scale is infinite it
might seem like this is not a soft system. We define
the entropy of any collection of colloidal particles as
$S = k_B \ln \Omega$, where $\Omega$ measures the volume of the con-
figuration space of the colloids. Because the energy of
any allowed finite-energy configuration remains 0, the
free energy, defined as $F = E - TS$, is entirely entropic.
Indeed, the problem of hard spheres reduces to a com- 
binatorics problem: how many configurations of N hard 
spheres of radius R are possible at fixed volume V? 

2.1 The Equation of State 

The relation among pressure $p$, temperature $T$, and
density $n = N/V$ is known as the equation of state.
In the case of point particles, the equation of state is
the famous ideal gas law, $p = n k_B T$, where $k_B$ is Boltz-
mann’s constant defined above. The ideal gas law is
accurate for point particles, but what happens for par-
ticles of finite radius? There is an expansion in powers
of the volume fraction $\phi = n v_0$, where $v_0 = \frac{4}{3}\pi R^3$ is
the volume per particle. Known as the virial expansion,
its first few terms are

$\frac{p}{n k_B T} = 1 + 4\phi + 10\phi^2 + 18.36\phi^3 + 28.22\phi^4 + 39.82\phi^5 + 53.34\phi^6 + \cdots$.

A simple but brilliant approximation to this is the
Carnahan-Starling formula:

$\frac{p}{n k_B T} = \frac{1 + \phi + \phi^2 - \phi^3}{(1 - \phi)^3} = 1 + 4\phi + 10\phi^2 + 18\phi^3 + 28\phi^4 + 40\phi^5 + 54\phi^6 + \cdots$,

which has integer coefficients that are remarkably close
to the hard-won virial coefficients. This is a highly com-
pact and highly accurate relation for low to modest
$\phi$. Note, however, that the Carnahan-Starling formula
suggests that the pressure diverges at volume frac-
tion $\phi = 1$. This is certainly not the case, though,
since the maximum packing fraction of hard spheres
is $\phi = \pi\sqrt{2}/6 \approx 0.74$.
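
The integer coefficients of the Carnahan-Starling series can be generated exactly by multiplying the numerator by the expansion $1/(1 - \phi)^3 = \sum_k \frac{1}{2}(k + 1)(k + 2)\phi^k$. The sketch below does this and lists the coefficients next to the virial coefficients quoted in the text; the function and variable names are illustrative.

```python
def carnahan_starling(phi):
    """Carnahan-Starling compressibility factor p / (n k_B T)."""
    return (1.0 + phi + phi**2 - phi**3) / (1.0 - phi) ** 3


def carnahan_starling_series(order):
    """Exact integer Taylor coefficients of the formula up to phi^order."""
    # Coefficients of 1 / (1 - phi)^3 are the triangular numbers.
    inv_cube = [(k + 1) * (k + 2) // 2 for k in range(order + 1)]
    numerator = [1, 1, 1, -1]  # 1 + phi + phi^2 - phi^3
    return [
        sum(numerator[j] * inv_cube[k - j]
            for j in range(len(numerator)) if j <= k)
        for k in range(order + 1)
    ]


VIRIAL = [1, 4, 10, 18.36, 28.22, 39.82, 53.34]  # values quoted in the text
for cs, virial in zip(carnahan_starling_series(6), VIRIAL):
    print(cs, virial)
```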

The virial expansion also explains a fluctuation-
induced force known as the depletion interaction. Trun-
cating at second order gives

$p = k_B T (n + 4 v_0 n^2) \approx \frac{N k_B T}{V - \frac{1}{2}(8 N v_0)}$.

Comparing with the ideal gas law $p = N k_B T / V$, we see
that the total volume of the system has been reduced
by $4 N v_0$, a measure of the excluded volume from the
other spheres. Note that a hard sphere of radius $R$
excludes a volume of radius $2R$; a second sphere can-
not have its center anywhere in that volume and so
each isolated sphere occupies or creates an excluded
volume of $8 v_0$. Generalizing this, Asakura and Oosawa
argued that two inclusions in a purely entropic fluid
will attract in order to decrease the excluded volume
or, equivalently, increase the free volume. Consider, for
instance, two marked spheres with volume $v_0$ in the
colloidal solution. Since each sphere excludes a vol-
ume of $8 v_0$, the free volume available to the remaining
spheres is reduced by $16 v_0$ when the marked spheres
are far apart. However, if the two spheres are brought
together, their excluded volumes can overlap and it fol-
lows that the available volume will increase, reaching its
maximum when the two spheres touch, at which point
their excluded volume is $\frac{27}{2} v_0$. The change in the ideal
gas law suggests a free energy of the form

$F = -k_B T N \ln[(V - V_e)/N]$,

where $V_e$ is the excluded volume.

If large spheres of radius $R_L$ are placed in a col-
loidal solution of small spheres of radius $R_S$, then each
excludes a volume of $\frac{4}{3}\pi (R_L + R_S)^3$; when the two large
spheres touch, the extra free volume is $\frac{2}{3}\pi R_S^2 (3 R_L + 2 R_S)$.
For $r = R_L/R_S \gg 1$, this implies an entropic
free energy gain of $\frac{3}{2} k_B T r \phi_S$ and so the depletion
force is proportional to the volume fraction of small
colloids, $\phi_S$.
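
The overlap volumes behind these estimates follow from the standard formula for the lens-shaped intersection of two equal spheres of radius $\rho$ whose centers are a distance $d$ apart, $\frac{1}{12}\pi(4\rho + d)(2\rho - d)^2$. That formula is not given in the text and is assumed here. The sketch evaluates the excluded volume of two touching spheres and the depletion free energy gain in units of $k_B T$, taking the gain to be the number density of small spheres times the extra free volume, which is consistent with the free energy written above.

```python
import math


def sphere_volume(radius):
    """Volume of a sphere of the given radius."""
    return 4.0 / 3.0 * math.pi * radius**3


def overlap_volume(rho, d):
    """Intersection volume of two spheres of radius rho, centers d apart."""
    if d >= 2.0 * rho:
        return 0.0
    return math.pi * (4.0 * rho + d) * (2.0 * rho - d) ** 2 / 12.0


def excluded_volume_pair(R, d):
    """Volume excluded by two spheres of radius R whose centers are d apart.

    Each sphere excludes a sphere of radius 2R; the overlap is counted once.
    """
    return 2.0 * sphere_volume(2.0 * R) - overlap_volume(2.0 * R, d)


def depletion_gain(R_L, R_S, phi_S):
    """Free energy gain, in units of k_B T, when two large spheres touch."""
    extra_free_volume = overlap_volume(R_L + R_S, 2.0 * R_L)
    return phi_S * extra_free_volume / sphere_volume(R_S)


v0 = sphere_volume(1.0)
print(excluded_volume_pair(1.0, 2.0) / v0)             # two touching spheres
print(depletion_gain(100.0, 1.0, 0.1), 1.5 * 100.0 * 0.1)
```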

2.2 Packing of Hard Spheres 

At the other extreme, we have close-packed crystals, 
in which spheres touch neighboring spheres. In two 
dimensions, the triangular lattice, where each disk 
touches six adjacent neighbors, has an area fraction of 
$\pi/(2\sqrt{3}) \approx 0.91$. In three dimensions, the densest close packing
consists of layers of two-dimensionally close-packed 
spheres stacked on top of each other, with the spheres 
on one layer fitting into the pockets on the next layer. 
Because there are two equivalent but different sets of 
pockets, there are an infinite number of degenerate 
close packings. It is standard to label the first layer 
A and then label the second layer B or C to indi- 
cate on which set of pockets the next layers sit. Thus, 
for instance, the face-centered cubic (FCC) lattice is 
ABCABCABC . . . , while the hexagonal close packed lat-
tice is ABABAB . . . . Both of these configurations have
volume fraction $\pi/\sqrt{18} \approx 0.74$, as does any other
sequence of packing. A random sequence of As, Bs,
and Cs could be called random close packed, but it
should not (see the next subsection). Away from this
close packing, for densities near $\phi_{\mathrm{FCC}} = \pi/\sqrt{18}$, Kirk-
wood introduced “free volume theory” to calculate the
equation of state. The volume fraction can be reduced 
by increasing the volume of the sample or, completely 
equivalently, by shrinking the spheres. Consider break- 
ing space up into cells of equal volume, each of which 
contains one lattice site through the Voronoi tessella- 
tion. Because the spheres have been slightly shrunk, 
they all have room to jiggle about in their cells. By con- 
struction, if each sphere is rigorously kept inside its 
cell, then there are no overlaps and there is no inter- 
action energy. The entropic contribution is again just 
the logarithm of the free volume, the volume available 
to the center of each sphere. For the FCC lattice, the
pressure is

$p = -\left. \frac{\partial F}{\partial V} \right|_{N,T} = \frac{n k_B T}{1 - (\phi/\phi_{\mathrm{FCC}})^{1/3}}$.

Note that this expression diverges at $\phi = \phi_{\mathrm{FCC}}$, as
it must, since the lattice cannot be compressed any
further. This is in contrast to the low-volume-fraction
Carnahan-Starling formula that fails to capture close
packing. This is not a surprise, since it is known
that there is a discontinuous or first-order transition
between the hard-sphere fluid and solid at a volume
fraction around $\phi \approx 0.5$. As a result, it is unreasonable
to hope that the analytic behavior of the fluid phase
should carry over to the solid phase.
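
The contrast between the two equations of state near close packing can be seen by evaluating both at a volume fraction just below $\phi_{\mathrm{FCC}}$. The following sketch is illustrative only; the function names and the sample volume fractions are assumptions.

```python
import math

PHI_FCC = math.pi / math.sqrt(18.0)  # close-packed volume fraction


def free_volume_compressibility(phi):
    """p / (n k_B T) from free volume theory for the FCC lattice."""
    return 1.0 / (1.0 - (phi / PHI_FCC) ** (1.0 / 3.0))


def carnahan_starling_compressibility(phi):
    """p / (n k_B T) from the Carnahan-Starling fluid formula."""
    return (1.0 + phi + phi**2 - phi**3) / (1.0 - phi) ** 3


for phi in (0.60, 0.70, 0.74):
    print(phi, free_volume_compressibility(phi),
          carnahan_starling_compressibility(phi))
```

The free volume expression grows without bound as $\phi$ approaches $\phi_{\mathrm{FCC}}$, while the fluid formula remains finite there.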

2.3 Is There “Random Packing”? 

Though it is ill-defined from a mathematical perspec- 
tive, there is a notion of “random close packing.” Unlike 
the random A, B, C stacking that one might consider 
for the densest sphere packings, this notion of random 
is associated with the densest fluid state, an arrange-
ment of spheres with no particular translational sym-
metry but that cannot be packed any more densely. This
is not precise, as local rearrangements can be made
to increase the density without apparently generating
long-range crystalline order. Despite this, numerous
experiments and numerical simulations have given a
value $\phi_{\mathrm{rcp}} \approx 0.64 \pm 0.02$ repeatedly and reliably; for
example, this corresponds to pouring marbles into a 
bucket or packing sand together haphazardly. Why this 
number is apparently universal and whether it can be 
defined precisely remain open questions. 

3 Polymers 

Polymers are everywhere: from the plastics on which we
sit, that cocoon us in our cars and airplanes, and that we 
eat as thickening agents, to proteins, ribonucleic acid, 
and the all-important informatic deoxyribonucleic acid 
(DNA) molecules. DNA, in particular, is a linear poly- 
mer, without the branches and junctions that are com- 
mon and often uncontrollable in synthetic polymers. 
Biopolymers are, in general, much more uniform in 
length and structure owing to the magnificent molec-
ular machinery of the cell. As a result, they are often
the subject of study not for their biological properties 
but rather because of their purity. 

3.1 Random Walks 

How does one model a polymer? One can start with 
a microscopic model with chemical bonds connecting 


sections of molecule that are free or almost free to 
rotate about the bond axis (for single bonds only!). How- 
ever, the elasticity of long rods has a universal behavior 
that allows us to avoid microscopic details. We consider
a simple bending energy written in terms of the unit
tangent vector $\mathbf{t}$:

$E = \frac{1}{2} k_B T L_p \int \left| \frac{d\mathbf{t}}{ds} \right|^2 ds$,

where $k_B T L_p$ is the bending modulus written in terms
of the temperature $T$ and a length known as the per-
sistence length, $L_p$, which may be temperature depen-
dent. The probability of the tangent pointing along $\mathbf{t}_1$
at $s_1$ given $\mathbf{t} = \mathbf{t}_0$ at $s_0$ is given by the functional inte-
gral (with measure $[d\mathbf{t}]$ over all functions satisfying the
boundary conditions)

$P(\mathbf{t}_1, s_1 | \mathbf{t}_0, s_0) = \int_{\mathbf{t}_0}^{\mathbf{t}_1} \exp\{-E/(k_B T)\} \, [d\mathbf{t}]$,

and it satisfies the diffusion equation on the sphere.
This probability distribution implies that the autocor-
relation function, or thermal average (denoted by $\langle \cdot \rangle$),
is given by $\langle \mathbf{t}(s') \cdot \mathbf{t}(s) \rangle = e^{-|s' - s|/L_p}$, so that, at dis-
tances longer than the persistence length, the tangent
vectors are decorrelated. It is therefore only a matter
of length scale before a long linear object will behave
as if it is composed of independent “freely jointed” ele-
ments of length on the order of $L_p$. Indeed, integrat-
ing the tangent autocorrelation function, we find that
the mean-squared end-to-end displacement for a chain
of length $L$ is $\langle [\mathbf{R}(L) - \mathbf{R}(0)]^2 \rangle = 2 L_p L$ for $L \gg L_p$.
$2L_p$ is known as the Kuhn length, the length of each
independent element.
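
The integration mentioned above can be carried out explicitly. Integrating $e^{-|s' - s|/L_p}$ over $0 \le s, s' \le L$ gives $2 L_p L - 2 L_p^2 (1 - e^{-L/L_p})$, which reduces to $2 L_p L$ for $L \gg L_p$. The sketch below compares this closed form with a direct numerical double integral; the grid size and the function names are assumptions made for illustration.

```python
import math


def mean_square_end_to_end(L, L_p):
    """Closed-form double integral of exp(-|s - s'| / L_p) over [0, L]^2."""
    return 2.0 * L_p * L - 2.0 * L_p**2 * (1.0 - math.exp(-L / L_p))


def mean_square_end_to_end_numeric(L, L_p, n=400):
    """Midpoint-rule approximation of the same double integral."""
    h = L / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            # Midpoints s_i and s_j are separated by |i - j| * h.
            total += math.exp(-abs(i - j) * h / L_p)
    return total * h * h


print(mean_square_end_to_end(10.0, 1.0),
      mean_square_end_to_end_numeric(10.0, 1.0))
print(mean_square_end_to_end(1000.0, 1.0) / (2.0 * 1.0 * 1000.0))
```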

3.2 Self-Avoiding Walks and Flory Theory 

Polymers differ from random walks in one critical and 
consequential way: each element of the polymer chain, 
a monomer, cannot occupy an already occupied region 
of space. The polymer is a self-avoiding walk. There are 
sophisticated methods in statistical mechanics based 
on lattice models of polymers, where self-avoidance 
translates into no more than single occupancy of any 
site. However, a clever argument, due to Paul Flory, 
allows us to estimate how the mean-squared end-to- 
end displacement of the polymer chain scales with the 
chain length, $L$. A measure of the size of the polymer in
three dimensions is the radius of gyration $R_g$, defined
through the average

$R_g^2 = \frac{1}{L} \int_0^L R^2(s) \, ds$.

Alternatively, the mean-squared displacement is six
times the radius of gyration squared,

$\langle [\mathbf{R}(L) - \mathbf{R}(0)]^2 \rangle = 6 R_g^2$,

so both will scale in the same way. Both lengths govern 
the hydrodynamics and light scattering of the polymer 
chains. 

For a chain of length L, the number of free “joints” 
or monomers is $N = L/(2L_p)$. The standard “phan-
tom random walk” or “Gaussian chain” has a Gaussian
probability distribution, $P(R)$, and so the free energy
$F = -k_B T \ln Z \propto k_B T (R^2/(2N) - \alpha(d - 1) \ln R)$, where
$\alpha$ is a constant that depends on $L_p$ and $d$ is the dimen-
sionality of space. The logarithmic term arises from
the measure factor after integrating over the isotropic
probability distribution. This free energy is interesting
in its own right; in vector form,

$F(\mathbf{R}) \propto k_B T |\mathbf{R}|^2 / N$,

so the ground state has $\mathbf{R} = 0$. Pulling the polymer away
from its ground state gives a restoring force $-\nabla F \propto
-T\mathbf{R}/N$, a Hookean spring with a stiffness that scales
as $T/N$. The elasticity of rubber arises from entropy; a
heated rubber band shrinks because $T$ grows.

To this free energy, Flory added a term to account 
for self-avoidance. In a similar vein to the notion 
of excluded volume and entropy loss, each time a 
monomer overlaps with another monomer, their mutu- 
al steric hindrance lowers the number of orientational 
conformations, lowering the entropy and raising the 
free energy. To model this entropic interaction, which 
is proportional to $k_B T$, we write the average monomer
density as $\rho \propto N/R_g^d$. Neglecting monomer-monomer
correlations, the number of near misses is $\rho N$, so the
Flory free energy reads

$F \propto k_B T \left[ \frac{R_g^2}{2N} + v \frac{N^2}{R_g^d} - \alpha(d - 1) \ln R_g \right]$,

where $v > 0$ is also some constant. Minimizing $F$ by
varying $R_g$ for fixed $N$ gives

$0 = \frac{R_g}{N} - v d \frac{N^2}{R_g^{d+1}} - \frac{\alpha(d - 1)}{R_g}$.

One might be concerned that the two arbitrary con-
stants $v$ and $\alpha$ would make any prediction useless, but
in the long-polymer limit $N \to \infty$ it is only a matter
of balancing two of the three terms with each other. One
finds that $R_g \propto N^{3/(d+2)}$ for $d < 4$ and that $R_g \propto N^{1/2}$
for $d > 4$. In the dimension $d = 4$, known as the upper
critical dimension, all terms contribute and the more
precise statistical mechanical models predict logarith-
mic corrections to the scaling behavior. The Flory expo-
nent $\nu_F = 3/(d + 2)$ for $d < 4$ is known to be exact when
$d = 2$ and differs by a small amount from the more pre-
cise prediction $\nu \approx 0.588$ compared with $\nu_F = 0.6$ when
$d = 3$.
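
The balance of terms can be checked by solving the stationarity condition numerically for two large values of $N$ and reading off the effective exponent. In the sketch below the constants $v$ and $\alpha$ are both set to 1, an arbitrary choice, and the root is found by bisection on a logarithmic scale; the function names and the values of $N$ are assumptions.

```python
import math


def flory_radius(N, d, v=1.0, alpha=1.0):
    """Radius that makes the Flory free energy stationary, found by bisection.

    Solves R^2 / N - v d N^2 / R^d - alpha (d - 1) = 0, which is the
    stationarity condition multiplied by R; the left side increases with R.
    """
    def g(R):
        return R * R / N - v * d * N * N / R**d - alpha * (d - 1)

    lo, hi = 1e-6, 1e20
    for _ in range(300):
        mid = math.sqrt(lo * hi)  # bisect in log(R)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


def effective_exponent(d, N1=1e8, N2=1e10):
    """Slope of log R against log N between two large chain lengths."""
    return math.log(flory_radius(N2, d) / flory_radius(N1, d)) / math.log(N2 / N1)


for d in (2, 3, 5):
    print(d, effective_exponent(d))
```

For $d = 2$ and $d = 3$ the slope is close to $3/(d + 2)$, while for $d = 5$ it is close to $\frac{1}{2}$, the random-walk value.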

Above four dimensions, Flory’s result agrees with the 
standard random walk. To see this, note that a standard 
random walk has a fractal or Hausdorff dimension of 
2. In more than four dimensions, two two-dimensional 
sets do not generically intersect and so one would 
not expect any correction to the random-walk scaling. 
Finally, though it may be tempting to view the large-$N$
limit as a saddle point in some sort of steepest-descent
calculation, there is not, at the current time, any known
systematic expansion for $\nu_F$ in inverse powers of $N$.

Note that this discussion presumes that the polymer
dissolves well in the surrounding solvent, known as a
good solvent when this condition is met. On the other
hand, if the polymer self-attracts more strongly than it
dissolves, $R_g \sim L^{1/3}$. The transition from the swollen
state to the compact conformations is known as the
theta point.

3.3 Polymer Melts: Where Self Is Lost and 
Self-Avoidance Vanishes 

In solution, we may consider $N_p$ polymers at number
density $n = N_p/V = c/N$, where $c$ is the monomer
density. In the previous sections we considered single
polymers in solution. This is applicable whenever the
polymer volume fraction $\phi = n \frac{4}{3}\pi R_g^3 = c \frac{4}{3}\pi R_g^3 / N < 1$,
that is, when it is in the regime in which the polymers do
not overlap. Because the monomers of different poly-
mers can commingle, it is possible to consider concen-
trations $c > c^* = 3N/(4\pi R_g^3)$, the overlap concentra-
tion. It is essential to note that this concentration is
lower than would be expected from Gaussian chains
because of self-avoidance: $c^* \sim \phi^* \sim N^{1-3\nu}$, where
$R_g \sim N^\nu$. Fortunately, the $c \gg c^*$ regime is amenable
to scaling analysis.

The polymer melt regime corresponds to the com-
pletely incompressible limit where the volume fraction
$\phi = 1$ everywhere. Note that this regime can be attained
through increasing either $c$ or $R_g$; as a result, we can
also consider the less dense, semidilute regime in which
the volume fraction $\phi$ is both much smaller than 1
and much larger than $\phi^* = c^*(2L_p)^3$. Consider a spe-
cific chain in this regime and pick a monomer on it. In
this regime we can define a correlation length scale or
“mesh size” $\xi$ that is roughly the distance over which
a single polymer segment does not interact with other
segments. Note that $\xi$ cannot depend on the degree of
polymerization $N$ for long polymers since the interac-
tion with some other chain might as well be an interac-
tion with a distant point of the same chain. By dimen-
sional analysis, $\xi = R_g f(\phi^*/\phi)$ for some function
$f(\cdot)$. However, for $\phi^*/\phi$ small, the $N$ dependence in
$R_g$ must be canceled by the $N$ dependence in $\phi^*$, and
so $f(x) = x^m$ must be a simple power law. As $R_g \sim N^\nu$
and $\phi^* \sim N^{1-3\nu}$, it follows that $\nu + m(1 - 3\nu) = 0$ or
$m = -\nu/(1 - 3\nu)$.

Each of these correlation “blobs” has $g$ monomers
of a single swollen polymer, and so $\xi = (2L_p) g^\nu$
and thus the number of such blobs is $b = N/g =
N/[\xi/(2L_p)]^{1/\nu}$.

Since we have already accounted for chain-chain 
interaction, a polymer in the semidilute regime can be 
thought of as a string of b uncorrelated blobs with- 
out self-avoidance, so that the radius of gyration in the 
semidilute regime is 

$R(\phi) = \xi b^{1/2} \sim \xi^{(2\nu-1)/2\nu} N^{1/2} \sim N^{1/2} \phi^{-(2\nu-1)/(2(3\nu-1))}$.

We see that in this regime the end-to-end distance of
the polymer scales as $N^{1/2}$, though there is a remnant
volume fraction dependence. Physically, in a good sol-
vent, $\nu$ must be at least $\frac{1}{2}$ since the phantom chain sets
a lower bound, and, if the original polymers were phan-
tom chains, $\nu = \frac{1}{2}$ and the $\phi$ dependence vanishes.
Note also that in the melt regime, where $\phi = 1$, we
recover the random walk result. Typically, $\nu$ is taken
to be the Flory value of $\nu_F = \frac{3}{5}$, which gives $R(\phi)$ a
$\phi^{-1/8}$ dependence. Again, this exponent is within a few
percent of more precise calculations.
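
The exponent bookkeeping of the blob argument can be done in exact rational arithmetic. Starting from $\xi \sim \phi^{-\nu/(3\nu - 1)}$, $g \sim \xi^{1/\nu}$, and $R^2 = \xi^2 N/g$, the sketch below returns the power of $\phi$ carried by each quantity; the function name and the dictionary layout are illustrative assumptions.

```python
from fractions import Fraction


def semidilute_exponents(nu):
    """Powers of phi in the blob picture for a swelling exponent nu."""
    m = nu / (3 * nu - 1)         # xi = R_g (phi*/phi)^m, so xi ~ phi^(-m)
    xi_exp = -m
    g_exp = xi_exp / nu           # g ~ xi^(1/nu) monomers per blob
    r2_exp = 2 * xi_exp - g_exp   # R^2 = xi^2 N / g
    return {"xi": xi_exp, "g": g_exp, "R^2": r2_exp, "R": r2_exp / 2}


print(semidilute_exponents(Fraction(3, 5)))  # Flory value
print(semidilute_exponents(Fraction(1, 2)))  # phantom chain
```

With the Flory value the mean-squared size carries $\phi^{-1/4}$ and the size itself $\phi^{-1/8}$; for a phantom chain the dependence on $\phi$ disappears.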

The dynamics of chains in the dilute, semidilute, and 
melt regimes continues to be an active area of research. 

4 Membranes and Emulsions

So far we have discussed point-like objects (colloids) 
and one-dimensional objects (polymers). It is natural 
to move forward to the description of two-dimensional 
objects. These take two forms in soft matter. First, 
they exist as membranes such as the lipid bilayers in 
cell walls, soap films and bubbles, and freely float- 
ing polymerized sheets, ranging from highly ordered 
ones, such as graphene, to highly disordered ones, 
such as the spectrin networks of red blood cells. The 
second, more abstract, notion of a two-dimensional 
object is an interface. Emulsions are made by mixing 


two incompatible fluids together, along with a surfac- 
tant that allows them to mix on a supermolecular scale. 
The interface between the “oil” phase and the “water” 
phase, laden with surfactant, is also a surface. Unlike 
its bilayer cousin, however, the two sides of this mem- 
brane can be distinguished. This asymmetry can change 
the structure of the ground states, as we will discuss 
below. 

4.1 The Young-Laplace Law 

When we have lipid bilayers or surfactant monolayers 
in equilibrium with their single molecules in solution, 
there is a surface tension that acts as the chemical 
potential for area. In soap films, for instance, the soap 
molecules are in equilibrium between the surface and 
the fluid. There is another limit for these films when the
individual molecules stretch, but we do not consider
this high-tension, nonlinear limit here. 

Consider a vesicle or bubble with total volume $V$ and
a constant surface tension $\gamma$. The variation of the free
energy at fixed volume is

$dF = -A \, d\gamma - p \, dV$.

Allowing the area to fluctuate gives the Gibbs free
energy $G = F + \gamma A$, so $dG = \gamma \, dA - p \, dV$. In equilibrium,
$dG = 0$ and we have $p = \gamma (dA/dV)$. How does the area
vary with the volume? This is just $2H$, twice the mean
curvature. We are led to the Young-Laplace law,

$p = 2\gamma H = \gamma \left( \frac{1}{R_1} + \frac{1}{R_2} \right)$,

which relates the radii of curvature $R_1$ and $R_2$ to the
pressure difference between the inside and the outside 
of the bubble. We have to be careful with signs, how- 
ever. Recall that the relative signs of the radii of curva- 
ture are fixed by geometry, but the overall sign is not. 
To fix the sign we pick the outward normal to the sur- 
face and measure with respect to that. Fortunately, this 
also gives us a sense to measure the pressure differ- 
ence between inside and outside. It follows that, if two 
bubbles are in contact with each other, the one with 
higher pressure bulges into the lower-pressure bubble, 
which fortunately agrees with intuition. For spherical 
bubbles this means that the smaller bubbles bulge into 
the larger bubbles. 
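
For a sphere the statement that $dA/dV$ equals twice the mean curvature can be checked directly, since $A = 4\pi R^2$ and $V = \frac{4}{3}\pi R^3$ give $dA/dV = 2/R$. The sketch below does this by finite differences and evaluates the pressure jump $\gamma(1/R_1 + 1/R_2)$ across a single interface. The surface tension used, 0.072 N/m, is a typical value for a water-air interface and is an illustrative assumption, as are the function names.

```python
import math


def laplace_pressure(gamma, R1, R2):
    """Pressure jump across one interface: gamma * (1/R1 + 1/R2)."""
    return gamma * (1.0 / R1 + 1.0 / R2)


def dA_dV_sphere(R, h=1e-6):
    """Finite-difference estimate of dA/dV for a sphere of radius R."""
    area = lambda r: 4.0 * math.pi * r**2
    volume = lambda r: 4.0 / 3.0 * math.pi * r**3
    return (area(R + h) - area(R - h)) / (volume(R + h) - volume(R - h))


gamma = 0.072  # N/m, illustrative value
small = laplace_pressure(gamma, 1e-3, 1e-3)  # sphere of radius 1 mm
large = laplace_pressure(gamma, 2e-3, 2e-3)  # sphere of radius 2 mm
print(small, large, dA_dV_sphere(2.0))
```

The smaller sphere has the larger pressure, in line with the remark that smaller bubbles bulge into larger ones.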

4.2 Helfrich-Canham Free Energy 

The free energy we have used to derive the shape of 
fluid membranes is not useful for studying local fluc-
tuations of the surface. Helfrich and Canham proposed
the following free energy functional, quadratic in the
inverse radii of curvature:

$F = \frac{1}{2} \int (\kappa H^2 + \bar{\kappa} K) \, dA$,

where $H$ is the mean curvature, $K$ is the Gaussian cur-
vature, and $\kappa$ and $\bar{\kappa}$ are two independent bending mod-
uli. (If applicable, this functional can be supplemented
with a volume constraint.) In a one-component mem-
brane, we would expect these moduli to be constants
and, as a result, the second term, $\int \bar{\kappa} K \, dA$, reduces to a
topological invariant for the membrane.

There are two interesting embellishments of this 
energy. The first comes up whenever the two sides 
of the interface are different, in the case of an oil- 
water interface in an emulsion, for instance. Because 
the asymmetry identifies an “inside” and an “outside,” 
the ambiguity of the surface normal is no longer an 
issue. As a result, the sign of H is physical and terms 
linear in the mean curvature are allowed. Adding a con- 
stant to the free energy, we can write the free energy 
as 

$F' = \frac{1}{2} \int [\kappa (H - H_0)^2 + \bar{\kappa} K] \, dA$,

where $H_0$ is the preferred mean curvature. This leads to
bent ground states and, in combination with boundary 
conditions, can induce a variety of complex morpholo- 
gies and can be used to rationalize phases of diblock 
copolymers and lipid monolayers. 

Finally, note that $\kappa$, $\bar{\kappa}$, and $H_0$ can vary over the membrane,
either due to quenched-in or slow-moving impurities
or because of composition variation in multicomponent
systems. The latter (phase separation within
a membrane and the subsequent localization of high
or low curvature) plays an important role in biological
processes such as budding and fusion.

4.3 Tethered Membranes 

The membranes we have discussed are fluid: there is no
order within the membrane and the molecules are free 
to flow. But a plastic sheet is not like this. The molecules 
have been glued in place, often through polymeriza- 
tion, and the connectivity and topology of the surface 
elements are no longer free to vary as they are in the 
liquid. As described in mechanics of solids [IV.32 §3], 
elastic deformations are captured via the strain tensor 
$2u_{ij} = g_{ij} - \delta_{ij}$, where $g_{ij}$ is the induced metric of
the otherwise flat surface. Recall the standard elastic
energy in terms of $u_{ij}$ and the Lamé constants $\mu$ and $\lambda$,

\[ F_{\mathrm{el}} = \tfrac{1}{2} \int [\lambda (u_{ii})^2 + 2\mu\, u_{ij} u_{ij}]\, dA. \]


However, adding this to the curvature energy is not as 
straightforward as it may seem! Indeed, Gauss’s theorema
egregium relates the metric directly to the Gaussian
curvature: $K = R_{1212}/g$, where $R_{ijkl}$ is the Riemann
curvature tensor and $g$ is the determinant of the metric.
In other words, Gaussian curvature requires in-plane
strain; bending and stretching are necessarily coupled.
To lowest order in $u_{ij}$,

\[ K = 2\partial_1 \partial_2 u_{12} - \partial_1^2 u_{22} - \partial_2^2 u_{11}, \]

and so the elastic and curvature deformations are 
coupled degrees of freedom. 

5 Liquid Crystals 

Is “liquid crystal” an oxymoron or simply an unfortu- 
nate name? Neither! Though there are many ways to 
characterize whether a material is liquid or solid, here 
we will use an approach based on broken symmetries. 
Recall that the gas and liquid phases join at their crit- 
ical point in the phase diagram, so, from a symmetry 
perspective, they are both the same, and we will term 
them fluids. A fluid does not support static shear: if one 
exerts a step shear strain on one surface of a fluid (arbi- 
trarily slowly), there will be no strain on the opposing 
surface. A crystal will support a step strain in all direc- 
tions. Similarly, apply a finite rotation to the top surface 
of a fluid and the bottom surface will not rotate along 
with it, while crystals will support torques. Liquid crys- 
tals are in between these two cases. Some can trans- 
mit torques and not shear, some can transmit shear 
but only in a reduced set of directions; all of them are 
interesting. 

5.1 Maier-Saupe Theory 

Obviously, liquid crystals are of special interest be- 
cause of their optical properties. They are typically 
rod-like molecules with differing dielectric constants 
$\epsilon_\parallel$ along the long direction and $\epsilon_\perp$ perpendicular to the
long direction; since it is only a direction, its length
and sign are arbitrary. We thus choose a unit vector $\mathbf{n}$
defined up to sign to denote this direction. The dielectric
tensor $\epsilon_{ij}$ is written as the sum of an isotropic part
and a traceless symmetric tensor $Q_{ij} = S(n_i n_j - \tfrac{1}{3}\delta_{ij})$
constructed from the directions $\mathbf{n}$ with magnitude $S$:

\[ \epsilon_{ij} = \frac{\epsilon_\parallel + 2\epsilon_\perp}{3}\, \delta_{ij} + (\epsilon_\parallel - \epsilon_\perp) Q_{ij}. \]

$S$, known as the Maier-Saupe order parameter, characterizes
the amount of anisotropic order. When $S$ vanishes,
the system is optically isotropic. Since $Q_{ij}$ is a
thermodynamic average over the directions $\boldsymbol{\nu}_\alpha$ of the
individual molecules labeled by $\alpha$, it follows that
$Q_{ij} = \langle \nu_i \nu_j - \tfrac{1}{3}\delta_{ij} \rangle$. Note that the average is over molecules
and fluctuations. Multiplying both sides by $n_i n_j$ and
taking the trace yields
$S = \tfrac{3}{2} \langle (\mathbf{n} \cdot \boldsymbol{\nu})^2 - \tfrac{1}{3} \rangle = \langle P_2(\cos\theta) \rangle$,
where $P_2(\cdot)$ is the second Legendre polynomial [II.29]
and $\theta$ is the angle between the molecular direction $\boldsymbol{\nu}_\alpha$
and $\mathbf{n}$. When $S = 1$ the long axis of every molecule is
always along $\mathbf{n}$; when $S = -\tfrac{1}{2}$ the molecules are all
perpendicular to $\mathbf{n}$, the so-called discotic phase.
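
The limiting values of the order parameter quoted above can be checked numerically. The following is a minimal sketch, assuming a small hand-picked set of molecular directions in place of a true thermodynamic average; the function name `order_parameter` and the three sample ensembles are hypothetical and serve only as illustration.

```python
import math

def order_parameter(directions, n):
    """Return S = <P_2(cos theta)> = <(3/2)(n . nu)^2 - 1/2> for unit vectors."""
    total = 0.0
    for nu in directions:
        cos_theta = sum(a * b for a, b in zip(n, nu))
        total += 1.5 * cos_theta ** 2 - 0.5
    return total / len(directions)

n = (0.0, 0.0, 1.0)  # director along z (illustrative choice)

# Every molecule along +n or -n: the sign of a molecule's axis does not matter.
aligned = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]

# Every molecule perpendicular to n, spread evenly around the z-axis.
in_plane = [(math.cos(2 * math.pi * k / 12), math.sin(2 * math.pi * k / 12), 0.0)
            for k in range(12)]

# A crude stand-in for an isotropic sample: the six coordinate directions.
isotropic = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
             (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]

S_aligned = order_parameter(aligned, n)
S_perpendicular = order_parameter(in_plane, n)
S_isotropic = order_parameter(isotropic, n)
```

The three samples reproduce the cases discussed in the text: full alignment, molecules confined to the plane perpendicular to $\mathbf{n}$, and an optically isotropic state.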

The transition from the isotropic phase with $S = 0$
above $T = T_c$ to an ordered phase $S \neq 0$ below $T = T_c$
is modeled via Landau theory with free energy density

\[
\begin{aligned}
f &= f_0 + a(T - T_c) \operatorname{Tr} Q^2 + b \operatorname{Tr} Q^3 + c \operatorname{Tr} Q^4 \\
  &= f_0 + \tfrac{2}{3} a (T - T_c) S^2 + \tfrac{2}{9} b S^3 + \tfrac{2}{9} c S^4 .
\end{aligned}
\]

Because in general $\operatorname{Tr} Q^3 \neq 0$, this indicates a discontinuous
first-order phase transition between the nematic
and isotropic states.
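
The numerical factors relating the traces of powers of $Q$ to powers of $S$ follow from the uniaxial form $Q_{ij} = S(n_i n_j - \tfrac{1}{3}\delta_{ij})$. The sketch below checks them with exact rational arithmetic, assuming the arbitrary rational unit vector $(3/5, 4/5, 0)$ for the director and $S = 1$; the helper names are hypothetical.

```python
from fractions import Fraction

def matmul(a, b):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def trace(a):
    return sum(a[i][i] for i in range(3))

def uniaxial_q(S, n):
    """Q_ij = S (n_i n_j - delta_ij / 3) for a unit vector n."""
    third = Fraction(1, 3)
    return [[S * (n[i] * n[j] - (third if i == j else 0)) for j in range(3)]
            for i in range(3)]

n = (Fraction(3, 5), Fraction(4, 5), Fraction(0))  # exact unit vector
Q = uniaxial_q(Fraction(1), n)                     # S = 1 isolates the prefactors
Q2 = matmul(Q, Q)
Q3 = matmul(Q2, Q)
Q4 = matmul(Q3, Q)

trace_q = trace(Q)
coefficients = (trace(Q2), trace(Q3), trace(Q4))
```

Because the traces are rotation invariant, the same factors hold for any choice of director.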

Note that a 3 x 3 traceless symmetric tensor has five 
independent components (three angles and two inde- 
pendent eigenvalues), whereas only three parameters 
appeared in the discussion above. In general, 

\[ Q_{ij} = S[n_i n_j - \tfrac{1}{3}\delta_{ij}] + W[m_i m_j - \tfrac{1}{2}(\delta_{ij} - n_i n_j)], \]

where m is a unit vector perpendicular to n and W mea- 
sures the biaxial order. There are, at present, only a few 
experimentally realized biaxial nematic phases, and we 
therefore focus on the uniaxial nematic ($W = 0$) below.

5.2 Frank Free Energy 

The Frank free energy is the basis for studying elastic 
distortions of the nematic liquid crystal ground state. 
Starting with the unit vector n, the director, and recall- 
ing that it is defined up to sign, we require a free energy 
invariant under $\mathbf{n} \to -\mathbf{n}$. It should be noted that reducing
the description to a vector is not always possible
and requires that the director field be orientable.
However, in orientable patches, we have the Frank free
energy $F_{\mathrm{nem}}[\mathbf{n}] = \int f_{\mathrm{nem}}\, d^3x$, where $f_{\mathrm{nem}}$ is the free
energy density:

\[
\begin{aligned}
f_{\mathrm{nem}} = \tfrac{1}{2} \{ & K_1 [\mathbf{n}(\nabla \cdot \mathbf{n})]^2 + K_2 [\mathbf{n} \cdot (\nabla \times \mathbf{n})]^2 + K_3 [(\mathbf{n} \cdot \nabla)\mathbf{n}]^2 \\
& + 2K_{24} \nabla \cdot [(\mathbf{n} \cdot \nabla)\mathbf{n} - \mathbf{n}(\nabla \cdot \mathbf{n})] \}.
\end{aligned}
\]

The elastic constants $K_1$, $K_2$, $K_3$, and $K_{24}$ are known,
respectively, as the splay, twist, bend, and saddle-splay
moduli. Note that each term has a precise geometric
meaning: $\mathbf{n}(\nabla \cdot \mathbf{n})$ is twice the mean curvature vector
of a surface with unit normal $\mathbf{n}$; $\mathbf{n} \cdot (\nabla \times \mathbf{n})$ measures the
deviation from the Frobenius integrability condition of
the director field; $(\mathbf{n} \cdot \nabla)\mathbf{n}$ is the curvature of the integral
lines of the director field; and finally, the saddle
splay is twice the negative of the Gaussian curvature
of a surface with unit normal $\mathbf{n}$. The interpretations
for the splay and saddle-splay break down, of course,
whenever $\mathbf{n} \cdot (\nabla \times \mathbf{n}) \neq 0$, though they still measure
elastic distortions. Note that the saddle-splay term is 
a total derivative and will therefore not contribute to 
the extremal equation for the director. However, it will 
contribute at boundaries, including at the locations of 
topological defects. 

To study fluctuations around the ground state, for 
instance when the director is uniform along the z-axis 
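$\mathbf{n}_0 = \hat{z}$, it is usual to expand $\mathbf{n} = \hat{z} + \delta\mathbf{n}$. Note that $\delta\mathbf{n}$
is a vector in the $xy$-plane, since deviations of the unit
director are necessarily orthogonal to $\hat{z}$. In this case,
the free energy in Fourier space is

\[ F_{\mathrm{nem}} = \tfrac{1}{2} \int (2\pi)^{-3}\, \delta n_i(\mathbf{q})\, \delta n_j(-\mathbf{q})\, A_{ij}(\mathbf{q})\, d^3q, \]

where $A_{ij}$ is

\[ A_{ij} = [K_1 q_\perp^2 + K_3 q_z^2] P^{L}_{ij} + [K_2 q_\perp^2 + K_3 q_z^2] P^{T}_{ij}, \]

with $P^{L}_{ij} = q_i q_j / q_\perp^2$ and $P^{T}_{ij} = \delta_{ij} - P^{L}_{ij}$ the longitudinal
and transverse projection operators, and with
$\mathbf{q}_\perp = q_x \hat{x} + q_y \hat{y}$.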

5.3 Smectics 

The geometric interpretation of the Frank free energy 
becomes powerful in the smectic liquid crystal phase. 
This phase occurs at lower temperatures or higher den- 
sities than the nematic phase and, in addition to the 
director order already present in the nematic, the smec- 
tic has an additional one-dimensional, periodic, density 
modulation. To represent this, we construct a density 
field $\rho = \rho_0 + \rho_1 \cos[2\pi\phi(\mathbf{x})/a_0]$, where $\rho_0$ is the background
density, $\rho_1$ is the smectic order parameter, $a_0$
is the lattice constant, and $\phi(\mathbf{x})$ is the phase of the
one-dimensional order. The ground state of the smectic has
$\phi = \mathbf{k} \cdot \mathbf{x}$, where $\mathbf{k}$ is a unit vector along the periodic
direction. When $\mathbf{k} \parallel \mathbf{n}$ this is known as the smectic A
phase, and when $(\mathbf{k} \cdot \mathbf{n})(\mathbf{k} \times \mathbf{n}) \neq 0$ it is known as the
smectic C phase. Many other embellishments of order 
and periodicity are possible. 

We may interpret this system as being composed 
of layers, a distance $a_0$ apart, lying at the level sets
of $\phi(\mathbf{x}) = m a_0$ with $m \in \mathbb{Z}$. Restricting our discussion
to the smectic A phase, the layer normal $\mathbf{N} = \mathbf{n}$,
and we therefore write $\mathbf{n} = \nabla\phi / |\nabla\phi|$. In this case
$\mathbf{n} \cdot (\nabla \times \mathbf{n}) = 0$, and the terms in the Frank energy
enjoy their full geometrical interpretation. The free
energy vanishes identically for flat layers that are all
parallel to each other. These layers, however, need not
have a constant period; none of the terms in the Frank
energy set the spacing. The smectic A free energy density
$f_{\mathrm{sm}} = f_c + f_{\mathrm{nem}}$ has an additional term, the compression
energy, that will vanish when $|\nabla\phi| = 1$; for
instance, a term such as $f_c = \tfrac{1}{2} B(|\nabla\phi| - 1)^2$ will favor
equal spacing. Again, fluctuations around the ground
state $\phi = z$ can be studied by writing $\phi = z - u$ and
expanding to quadratic order in $u$ and lowest nontrivial
order in derivatives:

\[ F_{\mathrm{sm}} = \tfrac{1}{2} \int \{ B(\partial_z u)^2 + K_1 (\nabla_\perp^2 u)^2 \}\, d^3x. \]

In this approximation, $\delta\mathbf{n} = -\nabla_\perp u$. It should be noted
that the nonlinear elasticity is essential for studying 
large deformations and, more importantly, fluctuations 
and topological defects. 

5.4 Cholesterics 

In some sense, cholesterics represent the opposite situation
in which the nematic director is nowhere integrable.
To $f_{\mathrm{nem}}$ we add $f^{*}_{\mathrm{nem}} = K_2 q_0\, \mathbf{n} \cdot (\nabla \times \mathbf{n})$. Note
that under spatial inversion, $\nabla \to -\nabla$ and $f^{*}_{\mathrm{nem}}$ changes
sign. This term is therefore chiral and can only appear
if the constituent molecules are not invariant under
spatial inversion. The ground state of the cholesteric
has a pitch axis, $\mathbf{P}$, perpendicular to all the molecules
and along which the molecular orientation rotates; for
example, $\mathbf{n}_0 = (\cos q_0 z, \sin q_0 z, 0)$ when $\mathbf{P} = \hat{z}$. Since
the cholesteric has one-dimensional periodic order,
the fluctuations are described by a smectic-like energy
functional ($F_{\mathrm{sm}}$ above), with $B$ and $K_1$ replaced by combinations
of the Frank constants and $q_0$. Writing a general
deformation in terms of functions $\theta$ and $\phi$ as
$\mathbf{n} = [\sin\theta \cos(q_0 z + \phi), \sin\theta \sin(q_0 z + \phi), \cos\theta]$, $u$
in the expression for $F_{\mathrm{sm}}$ is replaced by $\phi$.
IV.34 Control Theory

Anders Rantzer and 
1 Introduction

Feedback refers to a situation where the output of a 
dynamical system is connected to its input. For exam- 
ple, a feedback loop is created if room temperature in 
a building is measured and used to control the heat- 
ing. Feedback is ubiquitous in nature as well as in engi- 
neering. Our body uses feedback to control body tem- 
perature, glucose levels, blood pressure, and countless 
other quantities. Similarly, feedback loops are crucial in 
all branches of engineering, such as the process indus- 
try, power networks, vehicles, and communication sys- 
tems. 

Control theory is a branch of applied mathematics 
devoted to analysis and synthesis of feedback systems. 
A wide range of mathematics is used, and the purpose 
of this article is to illustrate this. After some histor- 
ical notes are given in section 2, we introduce some 
basic control engineering problems in sections 3 and 4. 
Section 5 illustrates how the theory of analytic func- 
tions can be used to derive fundamental limitations 
on achievable control performance. Multivariable con- 
trol problems are discussed in section 6 using concepts 
from linear algebra and functional analysis. The impor- 
tant special case of linear quadratic control is pre- 
sented in section 7, together with the idea of dynamic 
programming. Extensions to nonlinear systems are dis- 
cussed in section 8. Finally, section 9 describes some 
current research challenges in the area of distributed 
control. 

2 A Brief History 

The use of feedback control in engineering dates back 
to at least the nineteenth century. A prime example is 
the centrifugal governor— an essential component in 
the Watt steam engine— the basic principles of which 
were discussed by J. C. Maxwell in his 1868 paper “On 
governors.” From this early period until World War II, 
almost a hundred years later, control technology was 
developed independently in different branches of engineering,
such as the process industry, the aerospace
industry, and telecommunications. 

It was not until the 1940s and 1950s, stimulated 
by the war effort, that the scientific subject of control 
theory was created and the main ideas were collected 
into a common mathematical framework, emphasizing 
a frequency-domain viewpoint of the subject. Several 
important new ideas, such as the Nyquist criterion and 
the Bode relations, came out of this process. 

A second wave of progress, starting in the 1960s, 
was stimulated by the space race and the introduc- 
tion of computer technology. Theories for optimal con- 
trol were created, with fundamental contributions by 
L. P. Pontryagin and V. A. Yakubovich in Russia and 
R. Bellman and R. Kalman in the United States. Unlike 
earlier work, these new contributions to optimal con- 
trol theory emphasized time-domain models in terms 
of differential and difference equations. 

Between 1980 and 2010 two key words came to dom- 
inate the research arena: robustness and optimization. 
The need to quantify the effects of uncertainty and 
unmodeled dynamics led to new tools for robustness 
analysis and $H_\infty$-optimal control synthesis. Furthermore,
efficient new algorithms for convex optimization
are increasingly being used for the analysis, synthesis, 
and implementation of modern control systems. 

3 A Simple Case of Proportional Feedback 

Many of the basic ideas of control theory can be under- 
stood in the context of linear time-invariant systems. 
The analysis of such systems is considerably simplified
by the fact that their input-output relationship is
completely characterized by the response to sinusoids.
A general frequency-domain representation with input
$u(t)$ and output $y(t)$ takes the form

\[ Y(s) = P(s)U(s), \qquad (1) \]

where

\[ Y(s) = \int_0^\infty e^{-st} y(t)\, dt \quad \text{and} \quad U(s) = \int_0^\infty e^{-st} u(t)\, dt. \]

The function $P(s)$ is called the transfer function. In
steady state, a sinusoidal input $u(t) = \sin \omega t$ gives the
sinusoidal output

\[ y(t) = |P(i\omega)| \sin(\omega t + \arg P(i\omega)). \]

Of particular interest are complex numbers $z$ and $p$
such that $P(z) = 0$ and $P(p) = \infty$. These are called
zeros and poles of $P$, respectively.


Next, consider a simple feedback control law u(t) = 
k[r(t) - y(t)], where k is a constant and r is a refer- 
ence value for the output, such that the input u is pro- 
portional to the deviation between the output y and 
its reference value r. The control law in the frequency 
domain has the form 

\[ U(s) = k[R(s) - Y(s)]. \qquad (2) \]

Eliminating $U$ from (1) and (2) gives $Y(s) = T(s)R(s)$,
where

\[ T(s) = \frac{kP(s)}{1 + kP(s)}. \]

In particular, we see that, if $kP(s)$ is significantly bigger
than 1, then $T(s) \approx 1$ and $Y(s) \approx R(s)$. Hence, at first
sight it may look like all that is needed to make the 
output y(t) approximately follow the reference r(t) is 
to apply the feedback law (2) with sufficiently large gain 
k. There are, however, several complicating issues. 

A central issue is stability. The input-output relation- 
ship (1) is said to be stable if every bounded input gives 
a bounded output. It turns out that in the representation
$U(s) = \int_0^\infty e^{-st} u(t)\, dt$, a bounded $u(t)$ corresponds
to a $U(s)$ that is bounded and analytic in the
right half-plane $\{s \mid \operatorname{Re} s > 0\}$. Similarly, the input-output
relationship described by the transfer function
$P(s)$ is stable if and only if $P(s)$ is bounded and analytic
in the right half-plane.

To conclude, feedback interconnection of a stable
process $Y(s) = P(s)U(s)$ with the controller $U(s) =
k[R(s) - Y(s)]$ gives a stable closed-loop map from
the reference signal $R$ to the output $Y$ if and only if
$kP(s)/(1 + kP(s))$ is analytic and bounded in the right
half-plane. However, as we will see in the next section,
there are also degrees of stability and instability.
This will be quantified by considering the effects
of disturbances and measurement errors.
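
To see how the stability requirement limits the usable gain, the following sketch takes the illustrative process $P(s) = (s + 1)^{-4}$, which is an assumption made here (the text leaves $P$ general at this point). For this choice the poles of $T(s)$ are the roots of $(s + 1)^4 + k = 0$, which can be written in closed form. The function names are hypothetical.

```python
import cmath
import math

def closed_loop_poles(k):
    """Roots of (s + 1)^4 + k = 0, the poles of T = kP/(1 + kP) for P = (s + 1)^-4."""
    radius = k ** 0.25
    return [-1 + radius * cmath.exp(1j * math.pi * (2 * m + 1) / 4)
            for m in range(4)]

def is_stable(k):
    """True if every closed-loop pole lies in the open left half-plane."""
    return max(pole.real for pole in closed_loop_poles(k)) < 0

def closed_loop_gain(s, k):
    """T(s) = kP(s) / (1 + kP(s)) for the assumed process."""
    P = (s + 1) ** -4
    return k * P / (1 + k * P)

# Residual of the characteristic equation at each computed pole (should vanish).
residuals = [abs((pole + 1) ** 4 + 3.0) for pole in closed_loop_poles(3.0)]

low_frequency_gain = closed_loop_gain(0.01j, 3.0)
```

Under this assumption the closed loop is stable only for $k < 4$, so the static value $T(0) = k/(1 + k)$ cannot be pushed above 0.8 by proportional feedback alone.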


4 A More General Control Loop 

We will now discuss a more general control structure 
like the one illustrated in figure 1. The process is still 
represented by a scalar transfer function P(s). 

The controller consists of two transfer functions, the 
feedback part C(s) and the feedforward part F(s). The 
control objective is to keep the process output x close 
to the reference signal r, in spite of a load disturbance 
d. The measurement y is corrupted by noise n. 
Several types of specifications are relevant: 

(I) reduce the effects of load disturbances, 

(II) control the effects of measurement noise, 

(III) reduce sensitivity to process variations, and 

(IV) make the output follow command signals. 

Figure 1 A control system with three external signals:
the reference value r(t), an input disturbance d(t), and a 
measurement error n(t). 


A useful synthesis approach is to first design C(s) 
to meet specifications (I), (II), and (III), and then design 
F(s) to deal with the response to reference changes, 
(IV). However, the two steps are not completely inde- 
pendent: a poor feedback design will also have a nega- 
tive influence on the response to reference signals. 

The following relations hold between the frequency- 
domain descriptions of the closed-loop signals: 


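\[
\begin{aligned}
X(s) &= \frac{PCF}{1 + PC} R(s) - \frac{PC}{1 + PC} N(s) + \frac{P}{1 + PC} D(s), \\
V(s) &= \frac{CF}{1 + PC} R(s) - \frac{C}{1 + PC} N(s) + \frac{1}{1 + PC} D(s), \\
Y(s) &= \frac{PCF}{1 + PC} R(s) + \frac{1}{1 + PC} N(s) + \frac{P}{1 + PC} D(s).
\end{aligned}
\]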


Note that the signals in the feedback loop are charac- 
terized by six transfer functions: 


\[
\frac{P(s)C(s)F(s)}{1 + P(s)C(s)}, \quad \frac{P(s)C(s)}{1 + P(s)C(s)}, \quad \frac{P(s)}{1 + P(s)C(s)},
\]

\[
\frac{C(s)F(s)}{1 + P(s)C(s)}, \quad \frac{C(s)}{1 + P(s)C(s)}, \quad \frac{1}{1 + P(s)C(s)}.
\]


To fully understand the properties of the closed-loop 
system, it is necessary to look at all these transfer 
functions. (In particular, they all have to be stable.) It 
can be strongly misleading to show only properties of 
a few input-output maps, e.g., a step response from 
reference signal to process output. 
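
The six transfer functions are easy to evaluate pointwise in frequency. The sketch below assumes the example $P(s) = (s + 1)^{-4}$, a controller $C(s) = 0.8(0.5s^{-1} + 1)$, and a prefilter chosen so that $TF = (0.5s + 1)^{-4}$; the controller exponent is our reading of a partly illegible figure caption, and the function name and dictionary keys are hypothetical.

```python
def gang_of_six(s):
    """Evaluate the six closed-loop transfer functions at the complex frequency s."""
    P = (s + 1) ** -4                  # assumed process
    C = 0.8 * (0.5 / s + 1)            # assumed feedback controller
    S = 1 / (1 + P * C)                # sensitivity function
    T = P * C * S                      # complementary sensitivity function
    F = (0.5 * s + 1) ** -4 / T        # prefilter chosen so that T*F = (0.5 s + 1)^-4
    return {
        "TF": T * F,       # reference r to output x
        "T": T,            # noise n to output x (up to sign)
        "SP": S * P,       # disturbance d to output x
        "CFS": C * F * S,  # reference r to process input v
        "CS": C * S,       # noise n to process input v (up to sign)
        "S": S,            # disturbance d to process input v
    }

mid = gang_of_six(0.1j)
low = gang_of_six(0.001j)
high = gang_of_six(100j)
```

Sweeping `s = 1j * omega` over a logarithmic grid and plotting the magnitudes gives amplitude curves of the kind described for the frequency responses.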

The properties of the different transfer functions 
can be illustrated in several ways, by time responses 
or frequency responses. For a particular example, 
in figures 2 and 3 we show first the six frequency 
response amplitudes, and then the corresponding six 
step responses. 

It is worthwhile to compare the frequency plots and 
the step responses and to relate their shapes to the 
specifications (I)-(IV). 


[Figure 2 panels: (a) TF, (b) T, (c) SP; frequency axis from $10^{-1}$ to $10^{1}$.]

Figure 2 Frequency response amplitudes for $P(s) = (s + 1)^{-4}$
and $C(s) = 0.8(0.5s^{-1} + 1)$ when $TF = (0.5s + 1)^{-4}$.
Here, the notation $S = (1 + PC)^{-1}$ and $T = PC(1 + PC)^{-1}$
is used.

Figure 3 Step responses for $P(s) = (s + 1)^{-4}$ and
$C(s) = 0.8(0.5s^{-1} + 1)$ when $TF = (0.5s + 1)^{-4}$.


(I) Disturbance rejection 

Parts (c) and (f) of figure 2 show the effect of the disturbance
$d$ in process output $x$ and input $v$, respectively.
The resulting process error should not be too
large and should settle to zero quickly enough. Corre- 
sponding step responses are shown in parts (c) and (f) 
of figure 3. 

(II) Suppression of measurement noise 

Figure 2(b) shows good attenuation of measurement 
noise above the “cutoff” frequency of 1 Hz (which 
in this example is mainly an effect of the process 
dynamics). 

(III) Robustness to process variations 

The robustness to process variations is determined
by the sensitivity function $S = (1 + PC)^{-1}$ and the
complementary sensitivity function $T = PC(1 + PC)^{-1}$.
In fact, the closed-loop system remains stable as long
as the relative error in the process model is less than
$|T|^{-1}$. Most process models are inaccurate at high frequencies,
so the complementary sensitivity function $T$
should be small for high frequencies.

(IV) Command response 

Figure 3(a) shows how the process output x responds 
to a step in $r$. Using the prefilter $F(s)$, it is possible
to get a better step response here than in part (b). The 
price to pay is that the corresponding response in the 
control signal gets higher amplitude. 

5 Fundamental Limitations

Clearly, the six transfer functions discussed in the pre- 
vious section are not independent. In particular, the 
following identity holds trivially: 

\[ S(i\omega) + T(i\omega) = 1. \]

The first term describes the influence of load distur- 
bances on process input. This should be small. The 
second term describes robustness to model errors and 
the influence of measurement noise. Ideally, this term 
should also be small, so controller design involves a 
trade-off between the two requirements. The essence 
of the conflict is that disturbances cannot be rejected 
unless measurements can be trusted. Usually, this is 
resolved by frequency separation: measurement noise 
is typically high-frequency dominant, so $T(i\omega)$ should
be small at high frequencies. This makes it impossible
to remove the effects of fast load disturbances at high
frequencies, but we can still cancel them on a slower
timescale by making $S(i\omega) \approx 0$ at low frequencies (see
figure 4). 

Figure 4 Magnitude specifications on $T$ and $S$ can (approximately)
be interpreted as specifications on the loop transfer
function $P(i\omega)C(i\omega)$, which should have small norm at
high frequencies and large norm at low frequencies.

A popular approach to control synthesis, known as
loop shaping, is to focus on the shape of the loop transfer
function and to keep modifying the controller $C$
until the desired shape of $PC$ is obtained. However, a
seriously complicating factor in loop shaping is the stability
requirement, which mathematically means that
all six of the transfer functions in section 4 need to be
analytic in the right half-plane. This restricts the possibility
to shape the transfer functions. In particular,
Bode’s integral formula shows that the effort required
to make the sensitivity function $S(i\omega)$ small is always
a trade-off between different frequency regions: if $P(s)$,
$C(s)$, and $S(s) = (1 + P(s)C(s))^{-1}$ are stable and
$s^2 P(s)C(s)$ is bounded, then it follows from Cauchy’s
integral formula [IV.1 §7] for analytic functions that

\[ \int_0^\infty \log |S(i\omega)|\, d\omega = 0. \]

In the cases where there are unstable poles $p_i$ in
$P(s)C(s)$, the integral formula changes into

\[ \int_0^\infty \log |S(i\omega)|\, d\omega = \pi \sum_i \operatorname{Re} p_i \]

(see figure 5). Hence, unstable process poles make it
harder to push down the sensitivity function! The faster
the unstable modes, the harder it is. This can be used as
an argument for why controllers with right half-plane
poles should generally be avoided.
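
Bode’s integral formula can be checked numerically. The sketch below uses two illustrative loop transfer functions that are assumptions of this example: $2(s + 1)^{-4}$, which is stable in open loop, and $8/((s - 1)(s + 2)(s + 3))$, which has one unstable pole at $s = 1$. Both give a stable closed loop and decay fast enough for the formula to apply. The integral is truncated at a finite frequency and evaluated with the trapezoidal rule; all names are hypothetical.

```python
import math

def log_abs_sensitivity(loop, omega):
    """log|S(i omega)| with S = 1/(1 + L) for a loop transfer function L."""
    return -math.log(abs(1 + loop(1j * omega)))

def bode_integral(loop, omega_max=200.0, steps=200_000):
    """Trapezoidal approximation of the integral of log|S| over [0, omega_max]."""
    h = omega_max / steps
    total = 0.5 * (log_abs_sensitivity(loop, 0.0)
                   + log_abs_sensitivity(loop, omega_max))
    for i in range(1, steps):
        total += log_abs_sensitivity(loop, i * h)
    return total * h

def stable_loop(s):
    return 2 / (s + 1) ** 4

def unstable_loop(s):
    return 8 / ((s - 1) * (s + 2) * (s + 3))

integral_stable = bode_integral(stable_loop)      # formula predicts 0
integral_unstable = bode_integral(unstable_loop)  # formula predicts pi * 1
```

The truncation at `omega_max` is harmless here because $\log|S(i\omega)|$ decays like $\omega^{-4}$ for both loops.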

To further illustrate fundamental limitations im- 
posed by plant dynamics, we will discuss the dynamics 
of a bicycle. A torque balance for the bicycle (figure 6) 
can be modeled as 

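\[ J \frac{d^2\theta}{dt^2} = m g \ell \theta + \frac{m V_0 \ell}{b} \left( V_0 \beta + a \frac{d\beta}{dt} \right). \]

The transfer function from steering angle $\beta$ to tilt angle
$\theta$ is

\[ \frac{m V_0 \ell}{b} \, \frac{a s + V_0}{J s^2 - m g \ell}. \]

This system has an unstable pole $p$ with time constant

\[ p^{-1} = \sqrt{J/(m g \ell)} \quad (\approx 0.5 \text{ seconds}). \]

Moreover, the transfer function has a zero $z$ with

\[ z^{-1} = -a/V_0 \quad (\approx 0.05 \text{ seconds}). \]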

Figure 5 The amplitude curve of the sensitivity function
always encloses the same area below the level $|S| = 1$ as
it does above it. This invariance is sometimes referred to
as the “water bed” effect: if the designer tries to push the
magnitude of the sensitivity function down at some point,
it will inevitably pop up somewhere else!

Figure 6 A schematic of a bicycle: (a) the top view and
(b) the rear view. The physical parameters have typical values
as follows: mass $m = 70$ kg; rear-to-center distance
$a = 0.3$ m; height over ground $\ell = 1.2$ m; center-to-front
distance $b = 0.7$ m; moment of inertia $J = 120$ kg m$^2$; speed
$V_0 = 5$ m s$^{-1}$; and acceleration of gravity $g = 9.81$ m s$^{-2}$.

Riding the bicycle at normal speed, the zero is not
really an obstacle for control. However, if one tries to
ride the bicycle backward, $V_0$ gets a negative sign and
the zero becomes unstable. Such zeros are sometimes
called “non-minimum-phase zeros,” and they further limit
the achievable control performance. In particular, riding
the bicycle backward at slow speed ($\approx 0.7$ m s$^{-1}$),
there is an unstable pole-zero cancelation, and the
bicycle becomes impossible to stabilize.
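
The numbers in the bicycle example can be reproduced from the parameter values listed in the caption of figure 6. The following sketch simply evaluates the pole and zero of the transfer function; the variable names are hypothetical, and the results should be read as order-of-magnitude checks on the rounded values quoted in the text.

```python
import math

# Parameter values from the caption of figure 6.
m = 70.0     # mass [kg]
a = 0.3      # rear-to-center distance [m]
ell = 1.2    # height over ground [m]
J = 120.0    # moment of inertia [kg m^2]
V0 = 5.0     # forward speed [m/s]
g = 9.81     # acceleration of gravity [m/s^2]

# Unstable pole: positive root of J s^2 - m g ell = 0.
pole = math.sqrt(m * g * ell / J)
pole_time_constant = 1 / pole

# Zero: root of a s + V0 = 0.
zero = -V0 / a
zero_time_constant = 1 / zero

# Riding backward (V0 < 0) puts the zero at |V0| / a in the right half-plane.
# It coincides with the unstable pole when |V0| = a * pole.
critical_reverse_speed = a * pole
```

With these values the pole time constant comes out at roughly 0.4 seconds and the critical reverse speed at just under 0.8 m s$^{-1}$, in line with the rounded figures above.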

It is not hard to see that unstable open-loop dynamics
put a fundamental constraint on the necessary speed
of control: the feedback loop must be faster than the
time constant of the fastest unstable pole.

Figure 7 The step response for a process with transfer function
$(1 - s)/(s + 1)^2$. The response is initially negative, which
is typical for processes with an unstable zero.

Unstable zeros also give rise to fundamental limita- 
tions. For a process with an unstable zero z, a step 
input generally yields an output that initially goes in
the “wrong direction” (see figure 7). In fact, the definition
of an unstable zero shows that the step response
$y(t)$ must satisfy

\[ 0 = \int_0^\infty e^{-zt} y(t)\, dt, \]

so $y(t)$ cannot have the same sign for all $t$.
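
For the process $(1 - s)/(s + 1)^2$ of figure 7, the step response can be written in closed form by partial fractions as $y(t) = 1 - e^{-t} - 2t e^{-t}$ (a short calculation added here, not taken from the text). The sketch below uses it to check both the initial undershoot and the integral constraint at the unstable zero $z = 1$; the function names are hypothetical.

```python
import math

def step_response(t):
    """Step response of (1 - s)/(s + 1)^2, obtained by partial fractions."""
    return 1 - math.exp(-t) - 2 * t * math.exp(-t)

def weighted_integral(z, t_max=40.0, steps=200_000):
    """Trapezoidal approximation of the integral of exp(-z t) y(t) over [0, t_max]."""
    h = t_max / steps

    def integrand(t):
        return math.exp(-z * t) * step_response(t)

    total = 0.5 * (integrand(0.0) + integrand(t_max))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h

early_value = step_response(0.1)           # shortly after the step
late_value = step_response(10.0)           # close to the final value 1
constraint = weighted_integral(1.0)        # should vanish at the zero z = 1
```

The response starts out negative and ends near one, and the weighted integral vanishes to within the quadrature error, as the constraint requires.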

Think of the bicycle again. Riding it backward, we 
would operate with rear-wheel steering. Hence, when 
turning left, the center of mass will initially move to 
the right. 

The delayed response due to an unstable zero gives 
a fundamental limitation on the possible speed of con- 
trol: the feedback loop cannot be faster than the time 
constant of the slowest unstable zero. Similarly, the pres- 
ence of time delays makes it impossible to achieve fast 
control: the feedback loop cannot be faster than the time 
delay. 

Formal arguments about fundamental limitations 
due to unstable poles and zeros can be obtained using 
the theory of analytic functions. Recall that a controller 
is stabilizing if and only if the closed-loop transfer 
functions are analytic in the right half-plane. 

If $p$ is an unstable pole of $P(s)$, then the complementary
sensitivity function must satisfy $T(p) = 1$
regardless of the choice of controller $C$. The fact that
$T$ has a hard constraint in the right half-plane also has
consequences on the imaginary axis. In particular, it follows
from the maximum modulus theorem for analytic
functions that the specification

\[ |T(i\omega)| < \frac{2}{\sqrt{1 + \omega^2/p^2}} \quad \text{for all } \omega \]

is impossible to satisfy (see figure 8). This is a rigorous
mathematical statement of the heuristic idea that the
feedback loop needs to be at least as fast as the fastest
unstable pole.

Figure 8 Limitation from unstable pole. The complementary
sensitivity function $T(i\omega) = P(i\omega)C(i\omega)/(1 +
P(i\omega)C(i\omega))$ should be small for high frequencies in order
to tolerate measurement noise and model errors. However,
if $P(s)$ has an unstable pole $p > 0$, then no stabilizing controller
$C$ can push $|T(i\omega)|$ entirely below the curve above.
As a consequence, the unstable pole $p$ gives a lower limit
on the bandwidth of the closed-loop system.

Figure 9 Limitation from unstable zero. The sensitivity
function $S(i\omega) = (1 + P(i\omega)C(i\omega))^{-1}$ should be small for
low frequencies in order to reject disturbances and follow
reference signals at least on a slow timescale. However,
if $P(z) = 0$, $z > 0$, then no stabilizing controller $C$
can push the sensitivity function $|S(i\omega)|$ entirely below the
curve above. As a consequence, the unstable zero $z$ puts
an upper limit on the speed of disturbance rejection in the
closed-loop system.

Similarly, if $z$ is an unstable zero of $P(s)$, then
the sensitivity function must satisfy $S(z) = 1$ for
every stabilizing controller $C$. As a consequence, the
specification

\[ |S(i\omega)| < \frac{2}{\sqrt{1 + z^2/\omega^2}} \quad \text{for all } \omega \]

is impossible to satisfy (see figure 9).
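
One way to see where the two bounds come from is sketched below, under the additional assumptions (made here for illustration) that $p$ and $z$ are real and positive and that the auxiliary functions $f$ and $h$ are bounded and analytic in the right half-plane, so that the maximum modulus theorem applies to them:

```latex
f(s) = T(s)\Bigl(1 + \frac{s}{p}\Bigr): \qquad
f(p) = 2\,T(p) = 2, \qquad
|f(i\omega)| = |T(i\omega)|\sqrt{1 + \omega^2/p^2}
\;\Longrightarrow\;
\sup_{\omega} |T(i\omega)|\sqrt{1 + \omega^2/p^2} \ge 2 ,

h(s) = S(s)\Bigl(1 + \frac{z}{s}\Bigr): \qquad
h(z) = 2\,S(z) = 2, \qquad
|h(i\omega)| = |S(i\omega)|\sqrt{1 + z^2/\omega^2}
\;\Longrightarrow\;
\sup_{\omega} |S(i\omega)|\sqrt{1 + z^2/\omega^2} \ge 2 .
```

In both cases the value 2 attained inside the right half-plane cannot exceed the supremum on the imaginary axis, so a strict bound by 2 at every frequency is ruled out.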

The situation becomes even worse if there is both 
an unstable pole p and an unstable zero z, especially 
if they are close to each other. It follows from the
maximum modulus theorem that

\[ \max_{\omega \in \mathbb{R}} |S(i\omega)| \geq \left| \frac{z + p}{z - p} \right| \]

for every stabilizing $C$. If $S$ is very large, then the
same is true for $T$, since $S + T = 1$. Hence, if $z \approx p$,
then poor robustness to model errors and amplification 
of measurement noise make the system impossible to 
control. 


6 Optimizing Multivariable Controllers 

So far, we have only discussed systems with one input 
and one output. However, all the previous results 
have counterparts for multivariable systems, where 
$P(s)$ and $C(s)$ are matrices. The main difference with
multivariable systems is that the number of relevant 
input-output maps is large, and there is therefore a 
stronger need to organize the variables and computa- 
tions involved in control synthesis. It is common to use 
a framework with four categories of variables (see fig- 
ure 10). The controller is a map from the measurement 
vector y to the input vector u. Usually, the controller 
should be chosen such that the transfer matrix from 
disturbances w to errors z becomes “small” in some 
sense. 

It turns out that all closed-loop transfer functions 
from w to z that are achievable using a stabilizing 
controller can be written in the Youla parametrization 
form: 


\[ P_{zw}(s) - P_{zu}(s) Q(s) P_{yw}(s), \qquad (3) \]

where $P_{zw}(s)$, $P_{zu}(s)$, and $P_{yw}(s)$ are stable transfer
matrices fixed by the process, and where every stable
$Q(s)$ corresponds to a stabilizing controller.

Figure 10 A framework for optimization of multivariable
controllers. The vector $w$ contains external signals, such as
disturbances, measurement errors, and operator set points.
The vector $z$ contains signals that should be kept small,
usually deviations between actual values and desired values.
The controller is chosen to minimize the effect of $w$
on $z$.

The Youla parametrization is particularly simple
to derive for stable processes; let the multivariable
feedback system be given by the equations

\[
\begin{bmatrix} z \\ y \end{bmatrix} =
\begin{bmatrix} P_{zw}(s) & P_{zu}(s) \\ P_{yw}(s) & P_{yu}(s) \end{bmatrix}
\begin{bmatrix} w \\ u \end{bmatrix}, \qquad (4)
\]

\[ u = -C(s) y. \qquad (5) \]

Eliminating $u$ and $y$ gives (3), with

\[ Q(s) = C(s)[I + P_{yu}(s)C(s)]^{-1}. \]
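
The elimination step can be checked in the scalar case by evaluating both sides at a single complex frequency. The sketch below assumes made-up scalar transfer functions for the four process blocks and the controller; they are chosen only for illustration, and all names are hypothetical.

```python
def closed_loop_direct(Pzw, Pzu, Pyw, Pyu, C):
    """Map from w to z obtained by eliminating u and y directly."""
    # y = Pyw w + Pyu u with u = -C y gives y = Pyw w / (1 + Pyu C).
    y_per_w = Pyw / (1 + Pyu * C)
    u_per_w = -C * y_per_w
    return Pzw + Pzu * u_per_w

def youla_parameter(Pyu, C):
    """Scalar version of Q = C (1 + Pyu C)^-1."""
    return C / (1 + Pyu * C)

def closed_loop_youla(Pzw, Pzu, Pyw, Q):
    """Map from w to z in the Youla form Pzw - Pzu Q Pyw."""
    return Pzw - Pzu * Q * Pyw

s = 0.3 + 2.0j                      # arbitrary test frequency
Pzw = 1 / (s + 1)                   # illustrative stable process blocks
Pzu = 2 / (s + 3)
Pyw = 1 / (s + 2)
Pyu = 1 / (s + 1) ** 2
C = 0.8 * (1 + 0.5 / s)             # illustrative controller

Q = youla_parameter(Pyu, C)
direct = closed_loop_direct(Pzw, Pzu, Pyw, Pyu, C)
via_youla = closed_loop_youla(Pzw, Pzu, Pyw, Q)
recovered_C = Q / (1 - Pyu * Q)     # inverting the relation between C and Q
```

The last line illustrates that the map between $C$ and $Q$ can be inverted, so a design carried out in terms of $Q$ can be translated back into a controller.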

Exploiting the stability of the process P, it is straight- 
forward to verify that the closed-loop system is sta- 
ble if and only if Q(s) is stable. In the more general 
case of an unstable process, the Youla parametrization 
can be derived by first applying a stabilizing preliminary
feedback and then proceeding as above. Given the
Youla parametrization of all achievable transfer functions
from $w$ to $z$, the control synthesis problem can
be formulated as the problem of selecting a $Q(s)$ that
gives the desired properties of $P_{zw} - P_{zu} Q P_{yw}$. Recall
that w is a vector of external signals, and we know 
from section 4 that frequency separation of different 
signals is often essential. A natural design approach is 
therefore to select $Q(s)$ to minimize

\[ \| W_z (P_{zw} - P_{zu} Q P_{yw}) W_w \|, \]

where $W_z$ and $W_w$ are frequency weights selected to
emphasize relevant frequencies for the different signals.
This approach connects control theory to the
mathematical theory of functional analysis.

The choice of norm is important. The two most
common norms are the $H_2$-norm

\[ \|G\|_{H_2} = \sqrt{ \frac{1}{2\pi} \int_{-\infty}^{\infty} \operatorname{tr}\bigl( G(i\omega)^* G(i\omega) \bigr)\, d\omega } \]

and the $H_\infty$-norm

\[ \|G\|_{H_\infty} = \max_{\omega} \bar{\sigma}(G(i\omega)), \]

where $\bar{\sigma}(\cdot)$ denotes the largest singular value. Analytic
formulas can be given for the optimal $Q(s)$ and the
corresponding controller $C(s)$ for both the $H_2$-norm
and the $H_\infty$-norm. Some key ideas of the theory will
be described in the next section.


7 Linear Quadratic Control

The frequency-domain viewpoint of control theory
described in earlier sections dominated control theory
before 1960 and has been of central importance ever
since. However, between 1960 and 1980 there was a
second wave of development, triggered by the introduction
of computers for simulation and implementation.
Process models in state-space form, either differential
equations or difference equations, then started to play
a central role. With reference to figure 10, a state-space
model in continuous time can be expressed as follows:

\[
\text{Process} \quad
\begin{cases}
\dfrac{d}{dt} x(t) = A x(t) + B u(t) + w_x(t), \\[4pt]
y(t) = C x(t) + w_y(t).
\end{cases}
\]

Here, the control objective is stated in terms of the map
from $w = (w_x, w_y)$ to $z = (x, u)$. For example, in linear
quadratic Gaussian optimal control, $w$ is modeled
as Gaussian white noise and the control objective is to
minimize the variance expression

\[ \mathbf{E}(x^T Q x + u^T R u), \]

where $Q$ and $R$ are symmetric positive-semidefinite
matrices. This is equivalent to the $H_2$-norm minimization
discussed previously (see also figure 11).

It turns out that the optimal controller can be written
as a combination of two components:

\[
\text{Controller} \quad
\begin{cases}
\dfrac{d}{dt} \hat{x}(t) = A \hat{x}(t) + B u(t) + K[y(t) - C \hat{x}(t)], \\[4pt]
u(t) = -L \hat{x}(t).
\end{cases}
\]


The first is a state estimator (also called an observer or 
Kalman filter), which maintains a state estimate $\hat{x}(t)$
based on measurements of y up to time t. The second
is a state feedback law $u = -L\hat{x}$, which uses $\hat{x}(t)$ as if
it were a measurement of the true state x(t) and takes
control action accordingly (see figure 12). 

The main tuning parameters of the controller are the 
matrices K and L. They can be determined indepen- 
dently by solving two different optimization problems. 
The state feedback gain L is determined by solving the 
deterministic problem to compute 

$$V^*(x_0) = \min_u \int_0^\infty (x^T Q x + u^T R u)\, dt$$
Figure 11 Linear quadratic Gaussian control aims to minimize
the output variance when the input has a given Gaussian
distribution. This diagram illustrates two different probability
distributions for the paper thickness in a paper
machine. When the variance is small, a smaller mean value 
can be used without increasing the risk of violating quality 
requirements. 


subject to $\dot{x} = Ax + Bu$, $x(0) = x_0$. This problem
connects to classical calculus of variations [IV.6]
and can be conveniently solved using dynamic programming,
implying that the optimal cost $V^*(x_0)$ must
satisfy Bellman's equation:

$$0 = \min_u \Bigl( \underbrace{x_0^T Q x_0 + u^T R u}_{\text{current cost}} + \underbrace{\frac{\partial V^*}{\partial x}\,(Ax_0 + Bu)}_{\text{reduction of future cost}} \Bigr). \tag{6}$$

The solution is a quadratic function $V^*(x_0) = x_0^T S x_0$.
For linear dynamics and quadratic cost, Bellman's equation
reduces to the Riccati equation, which is solved for
S, and the accompanying optimal control law $u = -Lx_0$
gives the desired L. 
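
For a scalar process the Riccati equation can be solved by hand, which gives a quick check of this recipe. The sketch below assumes a scalar model dx/dt = a x + b u with cost weights q and r (all numbers are hypothetical, chosen only for illustration). Writing V*(x) = s x^2, Bellman's equation (6) reduces to 2as - b^2 s^2/r + q = 0, and the minimizing input is u = -Lx with L = bs/r.

```python
import math


def scalar_lqr(a, b, q, r):
    """Return (s, L) for the scalar problem dx/dt = a*x + b*u with cost
    integral of q*x**2 + r*u**2, where s is the positive root of the
    scalar Riccati equation 2*a*s - (b*s)**2 / r + q = 0 and L = b*s/r."""
    s = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    return s, b * s / r


# Hypothetical data: an unstable process (a > 0) that feedback must stabilize.
a, b, q, r = 1.0, 1.0, 3.0, 1.0
s, L = scalar_lqr(a, b, q, r)

# Residual of the Riccati equation and the closed-loop pole a - b*L.
residual = 2 * a * s - (b * s) ** 2 / r + q
closed_loop_pole = a - b * L
```

With these numbers s = 3 and L = 3, so the closed-loop dynamics dx/dt = (a - bL)x = -2x are stable even though the open-loop process is not.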

The Kalman filter gain K is determined to minimize 
the variance of the estimation error $E|x - \hat{x}|^2$. It turns
out that this problem is the dual of the state feedback
problem and can also be solved by dynamic programming.
In this way, the linear quadratic Gaussian optimal
control problem was completely solved in the 1960s.
With the terminology of section 6, this provided a solution
to the $H_2$-norm optimization problem. Moreover,
the structure of the observer-based controller is impor- 
tant regardless of optimality aspects, since it gives an 
interpretation to all the controller states. This is useful 
in, for example, diagnosis and fault detection. 

In the 1980s it was proved that $H_\infty$-optimal controllers
can be derived in a way that is similar to the



Figure 12 The solution of the linear quadratic Gaussian 
optimal control problem has a very clean structure, where 
the controller has the same number of states as the process 
model. Every controller state can be interpreted as an 
estimate of the corresponding state $x_j$ in the process. The
optimal input is computed from the state estimates as if 
they were measurements of the true state. 


way in which linear quadratic Gaussian controllers are 
derived. However, the previous minimization has to be 
replaced by a dynamic game: 

$$V^*(x_0) = \min_u \max_w \int_0^\infty (x^T Q x + u^T R u - \gamma^2 w^T w)\, dt.$$

The input u should therefore be selected to minimize
the cost under the assumption that the disturbance w
acts in the worst possible way. If $V^*(x_0)$ is finite, it
means that the resulting control law makes the $H_\infty$-norm
of the map from w to $z = (Q^{1/2}x, R^{1/2}u)$ smaller
than $\gamma$. The optimum is found by iteration over $\gamma$.

In many control applications it is necessary to put 
hard constraints on states and input variables. This 
results in nonlinear controllers, which cannot be rep- 
resented by a transfer matrix. Instead, it is common 
to implement such controllers using model-predictive 
control (MPC), solving constrained optimization prob- 
lems repeatedly in real time: at every sampling time, 
the state is measured and an optimal trajectory is com- 
puted for a fixed time horizon into the future. The com- 
puted trajectory determines the control action for the 
next sampling interval. After this, the state is measured 
again and a new trajectory is optimized. The method 
is therefore also called receding-horizon control (see
figure 13). 
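
The receding-horizon loop can be sketched in a few lines. The example below is only an illustration under stated assumptions: a hypothetical scalar discrete-time process x[k+1] = a x[k] + b u[k], a hard input bound |u| <= u_max, a quadratic cost, and a plain projected-gradient method standing in for the constrained optimizer (practical MPC software would normally call a dedicated quadratic-programming solver). All names and numbers are invented for the sketch.

```python
def solve_horizon(x0, a, b, r, u_max, horizon, iters=500):
    """Projected-gradient solution of the finite-horizon problem
    min sum_k (x[k+1]**2 + r*u[k]**2)
    subject to x[k+1] = a*x[k] + b*u[k] and |u[k]| <= u_max."""
    # Safe step size from a Frobenius-norm bound on the Hessian of the cost.
    frob = sum((horizon - d) * (a ** d * b) ** 2 for d in range(horizon))
    step = 1.0 / (2.0 * (r + frob))
    u = [0.0] * horizon
    for _ in range(iters):
        # Forward pass: predicted states x[0..horizon].
        x = [x0]
        for k in range(horizon):
            x.append(a * x[k] + b * u[k])
        # Backward (adjoint) pass: lam is the sensitivity of the cost to
        # x[k+1], from which the gradient with respect to u[k] follows.
        lam = 0.0
        grad = [0.0] * horizon
        for k in reversed(range(horizon)):
            lam = 2.0 * x[k + 1] + a * lam
            grad[k] = 2.0 * r * u[k] + b * lam
        # Gradient step followed by projection onto the input constraint.
        u = [max(-u_max, min(u_max, uk - step * gk)) for uk, gk in zip(u, grad)]
    return u


# Receding-horizon loop: optimize, apply the first input, measure, repeat.
a, b, r, u_max, horizon = 1.0, 1.0, 0.1, 1.0, 5
x = 5.0
states, inputs = [x], []
for _ in range(20):
    plan = solve_horizon(x, a, b, r, u_max, horizon)
    u0 = plan[0]          # only the first planned input is applied
    x = a * x + b * u0    # stands in for measuring the new state
    inputs.append(u0)
    states.append(x)
```

In this run the input sits on its bound while the state is far from the origin and leaves the bound once the state is small, which is the kind of nonlinear behavior that a transfer matrix cannot represent.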

MPC controllers have been used in the process indus- 
try since the 1970s, but the development of theory and 
supporting software has been most rapid during the 
past fifteen years. 

8 Nonlinear Control 

A remarkable feature of feedback is that it tends to 
reduce the effects of nonlinearities. In fact, one of the 
Figure 13 In MPC a finite-horizon optimal control problem
is solved at every time step. The state is measured and
an optimal control sequence (dashed) is computed for a 
fixed time horizon into the future. The computed trajectory 
determines the control action (solid) for the next sampling 
interval. After this, the state is measured again and a new 
trajectory is optimized. 

most important early success stories of feedback con- 
trol was Black’s invention of the feedback amplifier in 
1927. The invention made it possible to build telecom- 
munication lines over long distances by reducing the 
inevitable nonlinear distortion in the amplifiers. 

For this reason, much of control theory has been 
developed based on linear models. Even though nonlin- 
ear effects are always present, they can often be ignored 
in the design of feedback controllers, since they tend 
to be attenuated by the feedback mechanism. Never- 
theless, when high performance is required, nonlinear 
effects need to be considered. 1 

The most straightforward approach to dealing with 
nonlinearities is to ignore them at the design stage but 
still take them into account in the verification of the 
final solution. Usually, verification is done by extensive 
simulations using nonlinear models of high accuracy. 
However, it is also possible to do more formal analysis 
using a nonlinear process model. We will next describe 
some tools for this purpose. 

The most well-known concept for the analysis of nonlinear
systems is Lyapunov functions, i.e., nonnegative
functions of the state x that are decreasing along all
the trajectories of the system. The existence of such a
function proves stability of the nonlinear system. However,
it is often desirable to also go beyond stability and
prove bounds on the input-output gain from disturbance
w to error z. This can be done using the slightly
more general concept of storage functions. For example, 


1. Black's feedback amplifier traded power for reduced distortion. 
However, modern efforts to build energy-efficient electronics are pushing
researchers in the opposite direction: not to give away power
losses but instead to accept some level of nonlinear distortions. 


to prove that the output integral $\int |z|^p\, dt$ is bounded by
the input integral $\int |w|^p\, dt$ along all trajectories starting
and ending at an equilibrium, it is sufficient to find
a storage function V(x) such that

$$\frac{d}{dt}V(x(t)) \le |w(t)|^p - |z(t)|^p$$

along all trajectories. Given V, the inequality can often 
be verified algebraically for each t separately. 
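
As a small worked illustration of this pointwise check (the system is an assumed example, not one taken from the text), consider the scalar process $\dot{x} = -x + w$ with $z = x$, the exponent $p = 2$, and the candidate storage function $V(x) = x^2$:

```latex
\frac{d}{dt}V(x(t)) = 2x(-x + w) = -2x^2 + 2xw
  \le -2x^2 + (x^2 + w^2) = |w|^2 - |z|^2 .
```

The only ingredient is the elementary bound $2xw \le x^2 + w^2$, which holds for each t separately. Integrating along a trajectory that starts and ends at the equilibrium x = 0 then gives $\int |z|^2\, dt \le \int |w|^2\, dt$, i.e., an input-output gain of at most 1.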

Lyapunov functions and storage functions can often 
be interpreted as measures of energy content in the sys- 
tem. A system is stable if the energy content decreases 
along all trajectories. Similarly, if the input integral 
measures the amount of energy that is injected into 
the system through the input, and the output integral 
measures how much energy is extracted through the 
output, then the input-output gain can be at most 1. 
Such interpretations are particularly useful when the 
states of the system have physical meaning. 

A common way to derive Lyapunov functions and 
storage functions is via linearization. If a linear feed- 
back law is designed based on a linearized process 
model, it often comes together with a quadratic Lya- 
punov function. For example, if it was obtained by 
optimization of a quadratic criterion, then the optimal 
cost defines a quadratic Lyapunov function. If an $H_\infty$-optimal
control law was obtained by solving a min-max
quadratic game, then the solution comes with
a corresponding quadratic storage function. In both 
cases, the quadratic functions obtained in the linear 
design can be used for verification of a closed-loop 
system involving the nonlinear process. An alterna- 
tive approach is to abandon the linear controllers and 
instead use the quadratic Lyapunov/storage functions 
as starting points for the design of nonlinear feedback 
laws. 

Nonlinear controllers can also be designed using 
dynamic programming and Bellman’s equation. A com- 
mon problem formulation is to associate every control 
law u = p(x) with a cost 

$$V^p(x_0) = \int_0^\infty \ell(x(t), u(t))\, dt,$$

where $\dot{x} = f(x,u)$, $u = p(x)$, and $x(0) = x_0$, and
then to find the control law that gives the minimal cost
$V^*(x_0)$. The corresponding Bellman equation is generally
very difficult to solve, but it can still be used to
derive approximately optimal control laws. For example,
given a function V and a control law p, if the
“Bellman inequalities”

$$\ell(x, p(x)) + \frac{\partial V}{\partial x} f(x, p(x)) \le 0 \le \alpha\,\ell(x, u) + \frac{\partial V}{\partial x} f(x, u)$$

hold for all x, u, and for some $\alpha > 1$, then p approximates
the optimal control law in the sense that $V^p(x) \le \alpha V^*(x)$.

The statement is very general and widely applicable.
If $\alpha = 1$, the inequalities reduce to the classical
Bellman equation. Conversely, if $\alpha$ is big, then the
inequalities are easier to satisfy, but the control law is 
likely to be further away from the optimum. The main 
difficulty is, in general, to come up with good candi- 
dates p and V for which the Bellman inequalities can 
be verified. Two main approaches have been discussed. 
One is to use linearization, which gives solutions that 
are also approximately optimal for the nonlinear prob- 
lem. The second approach is to use MPC: the infinite- 
horizon cost can be approximated by a finite-horizon 
time-discrete cost, which can be optimized online. This 
means that p is defined implicitly in terms of solutions 
to finite-dimensional optimization problems. The Bell- 
man inequalities can then be used to form conclusions 
about the performance of the MPC controller. 

9 Ongoing Research: Distributed Control 

A new trend in control theory has emerged in the 
twenty-first century. The aim is to provide understand- 
ing and methodology for the control of large-scale sys- 
tems, such as the power grid, the Internet, and living 
organisms. Classical control theory is insufficient for 
several reasons. 

• Control action is taken in many different locations, 
but with only partial access to measurements and 
process dynamics (see figure 14). 

• Even if optimal distributed controllers are com- 
putable, they are generally extremely complex to 
implement. 

• The design and functionality of control and com- 
munication systems are increasingly intertwined 
with complex software engineering. 

All of these issues have been recognized and dis- 
cussed for a long time. However, encouraging progress 
has recently been made, leading to very stimulating 
developments in control theory. 

Dual decomposition is an old idea for the solution 
of large-scale optimization problems. The idea is, for 
example, applicable to the minimization of an objective 
function with a large number of terms that are coupled 
by relatively few shared variables. The method means 
that the coupling between shared variables is removed 
and is replaced by a penalty for disagreement. The 



Figure 14 The term “distributed control” refers to a situ- 
ation in which action is taken in many different locations 
but with only partial access to measurements and pro- 
cess dynamics. In the diagram, “P” denotes fixed process 
dynamics, while “C” denotes controllers to be designed. 

penalty is given by a price variable, also called a dual 
variable or Lagrange multiplier, which is then updated 
using a gradient algorithm. 
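
The mechanism can be illustrated on a toy problem. The sketch below assumes a hypothetical objective with two terms, (x1 - 1)^2 and (x2 - 3)^2, whose variables are required to agree (x1 = x2); the step size `alpha` and the iteration count are arbitrary choices for the illustration. Each term is minimized separately for a given price, and the price is then moved along the gradient of the dual function, which is the disagreement x1 - x2.

```python
def dual_decomposition(alpha=0.5, iters=60):
    """Minimize (x1 - 1)**2 + (x2 - 3)**2 subject to x1 == x2 by pricing
    the disagreement between the two local copies of the shared variable."""
    price = 0.0  # dual variable (Lagrange multiplier)
    x1 = x2 = 0.0
    for _ in range(iters):
        # Each term is optimized on its own, given only the current price.
        x1 = 1.0 - price / 2.0   # argmin of (x1 - 1)**2 + price * x1
        x2 = 3.0 + price / 2.0   # argmin of (x2 - 3)**2 - price * x2
        # Gradient update of the price, driven by the disagreement.
        price += alpha * (x1 - x2)
    return x1, x2, price


x1, x2, price = dual_decomposition()
```

The iterates approach the agreed value x1 = x2 = 2 with price -2. Note that the two local minimizations never exchange anything except the price and the values of the shared variable.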

Dual decomposition has the important advantage 
that once the price variables are fixed, individual terms 
of the objective function can be optimized indepen- 
dently with access to only their own local variables. The 
need for communication in MPC applications is significant,
since gradient updates of price variables require
comparison of shared variables along the entire opti- 
mization horizon. Nevertheless, the idea has shown 
great promise for use in the control of large-scale 
systems. 

A second branch of research is devoted to problems 
with limited communication capabilities in the con- 
troller. It was already recognized by the 1960s that 
such problem formulations often lead to highly com- 
plex nonlinear controllers, even for very simple pro- 
cesses. The reason for this is that there is an incen- 
tive for the controller to use the process as a commu- 
nication device. This incentive disappears as soon as 
the controller has access to communication links that 
are faster than the information transfer of the plant, 
an assumption which is often reasonable with modern 
technology. Hence, an interesting branch of research is 
now the investigation of how to derive optimal (linear)
controllers when communication is limited but is faster 
than the process. 

A third research field is devoted to a particular class 
of systems known as positive systems, or (in the nonlin- 
ear context) monotone systems. Positive systems arise, 
for example, in the study of transportation networks 
and vehicle formations. Unlike general linear systems, 
the positivity property makes it possible to do stability 
analysis and control synthesis using linear Lyapunov 
functions rather than quadratic ones. This makes a dramatic
difference for large-scale systems, since the number
of optimization parameters then grows linearly
with the number of states and system components. 
Moreover, verification of the final solution can be done 
in a distributed way, where bounds on the input-output 
gain can be checked component by component. 
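
A linear Lyapunov function for a positive system can be checked with a handful of comparisons. The sketch below rests on assumptions that are not spelled out in the text: the system is dx/dt = Ax with nonnegative off-diagonal entries in A (a Metzler matrix, so that nonnegative states stay nonnegative), and the candidate is V(x) = v^T x with v > 0. Along trajectories dV/dt = (A^T v)^T x, which is negative for every nonzero x >= 0 when each entry of A^T v is negative. The matrices and function names are hypothetical.

```python
def is_metzler(A):
    """True if all off-diagonal entries of the square matrix A are >= 0."""
    n = len(A)
    return all(A[i][j] >= 0 for i in range(n) for j in range(n) if i != j)


def linear_lyapunov_certificate(A, v):
    """Check that V(x) = v^T x is a valid linear Lyapunov function for
    dx/dt = A x on the nonnegative orthant: v > 0 and A^T v < 0 entrywise."""
    n = len(A)
    At_v = [sum(A[i][j] * v[i] for i in range(n)) for j in range(n)]
    return is_metzler(A) and all(vi > 0 for vi in v) and all(c < 0 for c in At_v)


A_stable = [[-2.0, 1.0], [1.0, -3.0]]
A_unstable = [[1.0, 0.0], [0.0, -1.0]]
ok = linear_lyapunov_certificate(A_stable, [1.0, 1.0])
not_ok = linear_lyapunov_certificate(A_unstable, [1.0, 1.0])
```

The certificate has one parameter per state (the entries of v), whereas a quadratic function x^T S x needs on the order of n^2 parameters, which is the scaling advantage referred to above.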

Altogether, these different directions of progress 
(and others) show that control theory is still in a very 
active phase of development, and an article like this 
written ten years from now would probably include 
important sections that go far beyond what has been 
presented here. 

Acknowledgments. The illustration in figure 11 is from 
Åström and Wittenmark (2011), and the bicycle example is
from Åström et al. (2005).
IV.35 Signal Processing

John G. McWhirter and Ian Proudler 


1 Introduction 

Since the advent of integrated circuit technology and 
the resulting exponential surge in the availability of 
affordable, high-performance, digital computers, digi- 
tal signal processing (DSP) has become an important 
topic in applied mathematics. The purpose of this arti- 
cle is to provide a simplistic overview of this topic 
in order to illustrate where and why mathematics is 
required. 

Since most naturally occurring signals, such as elec- 
tromagnetic waves, seismic disturbances, or acoustic 


sounds, are continuous in nature, they must be sam- 
pled and digitized prior to digital signal processing. 
This is the role of an analog-to-digital converter, which 
takes the output from a sensor such as an antenna 
or a microphone and produces a sequence of uni- 
formly sampled values x(n) (n = 0,1,2,...) repre- 
senting the measured signal at the corresponding time 
instances $t_n$.

Low-frequency signals are usually digitized directly, 
resulting in a sequence of real numbers. High-fre- 
quency signals of a sinusoidal nature are often repre- 
sented in terms of their phase and amplitude relative to 
a given high-frequency reference tone (pure sinusoid), 
resulting in a signal of much lower frequency, whose 
values are represented by complex numbers. This pro- 
cess, generally referred to as down conversion, is par- 
ticularly important for electromagnetic signals in radio 
communications, for example. 

The rate at which a signal is sampled depends on 
its spectral content and, in particular, on the highest- 
frequency component that it contains (possibly after 
down conversion to a lower-frequency band). Assuming 
that the highest frequency, measured in hertz (cycles 
per second), is f, the signal must be sampled every
$\Delta t$ seconds, where $\Delta t \le 1/(2f)$. This fundamental
rule was established by Shannon, and the corresponding
sampling frequency $f_s$ must satisfy $f_s \ge 2f$ (the
Nyquist rate). Having decided on (or been restricted to)
a sampling frequency $f_s$, it is important that any signal
to be digitized at this rate does not contain any
frequency components for which $f > f_s/2$, since these
will be indistinguishable from their cyclic counterparts,
i.e., those with the same frequency mod $f_s$. This cyclic
wraparound in the frequency domain is referred to 
as aliasing, and it can cause severe signal distortion. 
To prevent this from occurring in practice, signals 
must be strictly band limited before analog-to-digital 
conversion. 
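
Aliasing is easy to reproduce numerically. The sketch below uses hypothetical numbers (a sampling frequency of 100 Hz and tones at 10 Hz and 110 Hz) and shows that two sinusoids whose frequencies differ by exactly the sampling frequency produce the same samples.

```python
import math


def sample(freq_hz, fs_hz, n_samples):
    """Uniform samples of cos(2*pi*f*t) taken at t_n = n / fs."""
    return [math.cos(2.0 * math.pi * freq_hz * n / fs_hz) for n in range(n_samples)]


fs = 100.0                      # hypothetical sampling frequency in Hz
low = sample(10.0, fs, 50)      # 10 Hz lies below fs/2 = 50 Hz
high = sample(110.0, fs, 50)    # 110 Hz = 10 Hz + fs lies above fs/2

# The two sample sequences agree up to rounding error.
max_diff = max(abs(p - q) for p, q in zip(low, high))
```

Once the samples have been taken, nothing in the data distinguishes the 110 Hz tone from the 10 Hz one, which is why the band limiting has to happen before conversion.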

In accordance with the discussion above, the input to 
a DSP device is a sequence of signal values x(n), gen- 
erally referred to as a time series, which may be pro- 
cessed on a sample-by-sample basis or in convenient 
blocks depending on the application. 

2 Digital Finite Impulse Response Filtering 

One of the most important tasks in DSP is the appli- 
cation of a particular linear time-invariant (LTI) oper- 
ator A that is defined in terms of its finite-length 
response to a unit impulse [1,0,...]. The impulse 
Figure 1 A finite impulse response filter structure based
on a tapped delay line, where $z^{-1}$ denotes a unit delay.


response may be expressed as $A = [a_0, a_1, \ldots, a_{L-1}]$.
In response to a general open-ended input time series
x(n), $n \in \mathbb{Z}$, it produces the output sequence
$y(n) = \sum_{l=0}^{L-1} a_l x(n-l)$. In other words, it generates a convolution
of the input sequence with its impulse response.
In DSP, this type of operator is referred to as a finite
impulse response (FIR) filter. Elements $a_l$ of the impulse
response are therefore regarded as filter coefficients, 
the order of the filter being L - 1. The reason for think- 
ing of this LTI operator as a filter will become apparent 
shortly. 

A digital FIR filter may be implemented quite natu- 
rally using a tapped delay line, as illustrated schemati- 
cally in figure 1. The tapped delay line is assumed to be 
filled with zeros initially. Sample x(0) arrives at time $t_0$
and is multiplied by coefficient $a_0$ to produce y(0). The
sample x(0) is stored and moves one place along the
delay chain on the next clock cycle, where it is multiplied
by coefficient $a_1$. At the same time, the next sample
x(1) arrives. It is multiplied by $a_0$, and the product
$a_0 x(1)$ is added to $a_1 x(0)$, thus generating y(1). The
process continues on every clock cycle, even when the
tapped delay line is full, in which case one sample (the
oldest) is forgotten every time. In effect, the stored data
vector shifts one place to the right every sample time
and it is this property that will lead to a Toeplitz data
matrix [I.2 §18], as discussed in section 8.
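
The clock-by-clock description above translates almost directly into code. The following is a minimal sketch with hypothetical coefficient values; the function name `fir_filter` is introduced here for illustration only.

```python
def fir_filter(coeffs, samples):
    """Direct-form FIR filter built on a tapped delay line."""
    delay_line = [0.0] * len(coeffs)   # filled with zeros initially
    output = []
    for x in samples:
        # The stored data shift one place along the line and the oldest
        # sample is forgotten; the new sample enters at the front.
        delay_line = [x] + delay_line[:-1]
        # Each tap is weighted by its coefficient and the products are summed.
        output.append(sum(a * d for a, d in zip(coeffs, delay_line)))
    return output


# Feeding in a unit impulse returns the coefficients themselves.
impulse_response = fir_filter([0.5, 0.3, 0.2], [1.0, 0.0, 0.0, 0.0, 0.0])
running_sum = fir_filter([1.0, 1.0], [1.0, 2.0, 3.0])
```

The response to the unit impulse is the coefficient list followed by zeros, which is the sense in which the impulse response is finite.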

Let the input to the filter A take the form of a uniformly
sampled sinusoid $x(n) = e^{i\omega n}$, where, for convenience,
$\omega = 2\pi f$ is the angular frequency in radians
per sample time. It follows that the output sequence
$y(n) = \sum_{l=0}^{L-1} a_l e^{i\omega(n-l)}$ can be written as $y(n) = A(\omega)e^{i\omega n}$,
where $A(\omega) = \sum_{l=0}^{L-1} a_l e^{-i\omega l}$ is referred to
as the frequency response of the filter A. $A(\omega)$ is,
of course, the Fourier transform of the finite discrete
sequence of filter coefficients, and its inverse is given
by

$$a_l = \frac{1}{2\pi}\int_{-\pi}^{\pi} A(\omega)e^{i\omega l}\, d\omega.$$

By taking the Fourier transform of the filter output 
sequence y(n) as defined above, it is easy to show that

$$Y(\omega) = \sum_{l=0}^{L-1} a_l \sum_{n=-\infty}^{\infty} x(n-l)e^{-i\omega n}$$

and, hence, by substitution of indices, that $Y(\omega) = A(\omega)X(\omega)$.
This is, of course, just a manifestation
of the FOURIER CONVOLUTION THEOREM [II.19], which
states that the Fourier transform of a convolution is 
simply the product of the Fourier transforms of the 
individual sequences. Clearly, then, the effect of con- 
volving a digital sequence with a chosen set of filter 
coefficients is to modify the frequency content of the 
signal by multiplying its spectral components by those 
of the filter. For example, a low-pass filter is designed 
to multiply the high-frequency components by zero 
and the low-frequency components by one to a suit- 
ably high degree of approximation. With an FIR filter of 
the type outlined above, a large number of filter coeffi- 
cients may be required to achieve the accuracy required 
for low-pass, high-pass, or band-pass filters, so digital 
filters often utilize infinite impulse response filters, as 
discussed in section 5. 
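
The frequency response of a given set of coefficients can be evaluated directly from its definition. The sketch below uses a hypothetical two-tap averaging filter as a crude low-pass example and also checks numerically that a sampled complex sinusoid passes through the filter multiplied by the frequency response.

```python
import cmath
import math


def frequency_response(coeffs, omega):
    """A(omega) = sum over l of a_l * exp(-i * omega * l)."""
    return sum(a * cmath.exp(-1j * omega * l) for l, a in enumerate(coeffs))


coeffs = [0.5, 0.5]   # two-tap moving average (illustrative choice)

gain_dc = abs(frequency_response(coeffs, 0.0))           # lowest frequency
gain_nyquist = abs(frequency_response(coeffs, math.pi))  # highest frequency

# Filter output for the input x(n) = exp(i*omega*n), evaluated at one n,
# compared with the prediction A(omega) * exp(i*omega*n).
omega, n = 0.3, 7
y_n = sum(a * cmath.exp(1j * omega * (n - l)) for l, a in enumerate(coeffs))
predicted = frequency_response(coeffs, omega) * cmath.exp(1j * omega * n)
```

The gain is 1 at zero frequency and essentially 0 at omega = pi, but the transition between the two is very gradual, which hints at why sharp filters need many coefficients.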

3 Discrete Fourier Transform and 
Fast Fourier Transform 

Section 2 introduced the Fourier transform of a finite
discrete sequence $a_l$ as $A(\omega) = \sum_{l=0}^{L-1} a_l e^{-i\omega l}$, which
describes, for example, the frequency response of the
corresponding FIR filter. This is defined for any frequency
$\omega$. However, due to the finite length of the
sequence, not all values of $A(\omega)$ are independent. In
fact, the entire response may be inferred from the
response at the L discrete frequency values $\omega_n = 2\pi n/L$
$(n = 0, 1, \ldots, L-1)$, i.e., in terms of the discrete
Fourier transform (DFT) values

$$A_n \equiv A\Bigl(\frac{2\pi n}{L}\Bigr) = \sum_{l=0}^{L-1} a_l W_L^{-nl},$$

where $W_L = e^{2\pi i/L}$ constitutes the Lth root of unity in
the complex number field. The original sequence $a_l$ can
easily be recovered from these values (referred to as the
DFT coefficients) by means of the inverse DFT, which is
given by

$$a_l = \frac{1}{L}\sum_{n=0}^{L-1} A_n W_L^{nl} \quad (l = 0, 1, \ldots, L-1).$$
The inverse DFT can easily be proved by substituting
for $A_n$ and noting that the geometric series sums to
$L\delta_{lm}$, where $\delta_{lm}$ denotes the Kronecker delta, which
takes the value unity when l = m and zero otherwise.
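
The DFT pair can be checked numerically with a direct implementation of the two sums. The sketch below is a plain O(L^2) evaluation for a short hypothetical sequence, using the sign convention in which the forward transform carries the negative exponent.

```python
import cmath


def dft(a):
    """Forward DFT: A_n = sum over l of a_l * exp(-2*pi*i*n*l/L)."""
    L = len(a)
    return [sum(a[l] * cmath.exp(-2j * cmath.pi * n * l / L) for l in range(L))
            for n in range(L)]


def inverse_dft(A):
    """Inverse DFT: a_l = (1/L) * sum over n of A_n * exp(+2*pi*i*n*l/L)."""
    L = len(A)
    return [sum(A[n] * cmath.exp(2j * cmath.pi * n * l / L) for n in range(L)) / L
            for l in range(L)]


a = [1.0, 2.0, 0.0, -1.0]
A = dft(a)
recovered = inverse_dft(A)
```

The coefficient A[0] is simply the sum of the sequence, and the round trip reproduces the original values up to rounding error.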

In section 2 it was pointed out that the Fourier 
transform of the convolution of a discrete sequence 
with a set of filter coefficients is equivalent to the 
product of their individual Fourier transforms. A sim- 
ilar result applies in the case of fixed-length discrete 
sequences transformed using the DFT but the equivalence
involves a cyclic convolution. Let $H_n$ and $X_n$
be the DFT coefficients corresponding to the finite-length
sequences $h_k$ and $x_k$, and define $Y_n = H_n X_n$
$(n = 0, 1, \ldots, L-1)$ so that

$$y_l = \frac{1}{L}\sum_{n=0}^{L-1} X_n W_L^{nl} \sum_{k=0}^{L-1} h_k W_L^{-kn}.$$

Swapping the order of the two summations leads to

$$y_l = \sum_{k=0}^{L-1} h_k\, \frac{1}{L}\sum_{n=0}^{L-1} X_n W_L^{n\langle l-k\rangle_L},$$

where for any integer r, $\langle r\rangle_L$ denotes r mod L.
Consequently, $y_l = \sum_{k=0}^{L-1} h_k x_{\langle l-k\rangle_L}$, which takes the

form of a cyclic convolution. This proves to be a very 
useful relationship, although, in order for it to apply 
to a general (noncyclic) convolution, it is necessary to 
use a technique such as zero padding, whereby the 
input sequence of length M, say, is extended to length 
2 M - 1 by adding new zero elements so that the out- 
put of the cyclic convolution contains, as a subse- 
quence, the M terms of the original noncyclic convo- 
lution. At first sight one might ask why one bothers 
doing a finite-length convolution in this roundabout 
manner. The answer, of course, lies in the ubiquitous 
FAST FOURIER TRANSFORM [II.10] (FFT), which, for a
transform of length N, where $N = 2^n$, requires only
$O(nN)$ arithmetic operations as opposed to $O(N^2)$ for
the standard DFT. This reduction is so significant for
large values of N that in order to compute a convolution
of length M, which would normally require $O(M^2)$
arithmetic operations, it is well worth the overhead of
zero padding the sequence so that $M = 2^m$, computing
the FFT of both convolution sequences, multiplying
the resulting FFT coefficients pairwise, and computing
the inverse FFT. In this context it should be noted
that the FFT is not just an efficient algorithm for trans- 
forming to the frequency domain in the classical sense 
but is also a very important “computing engine,” find- 
ing application throughout the field of DSP and in the 
practical application of LTI operators more generally. 
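
The zero-padding recipe can be sketched as follows. This is an illustrative implementation only: it uses a textbook recursive radix-2 FFT rather than an optimized library routine, pads both sequences to the next power of two that is at least as long as the noncyclic result, and compares the outcome with a direct evaluation of the convolution sum. All function names and test sequences are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.
    With inverse=True the result still has to be divided by len(x)."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft_convolve(h, x):
    """Noncyclic convolution of two real sequences via zero padding."""
    out_len = len(h) + len(x) - 1
    size = 1
    while size < out_len:
        size *= 2
    H = fft(list(h) + [0.0] * (size - len(h)))
    X = fft(list(x) + [0.0] * (size - len(x)))
    y = fft([p * q for p, q in zip(H, X)], inverse=True)
    return [v.real / size for v in y[:out_len]]


def direct_convolve(h, x):
    """Reference implementation of the convolution sum."""
    y = [0.0] * (len(h) + len(x) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y


h = [1.0, 2.0, 3.0]
x = [0.0, 1.0, 0.5, -1.0]
fast = fft_convolve(h, x)
slow = direct_convolve(h, x)
```

Because of the padding, the cyclic wraparound falls entirely on zero entries, so the two results agree up to rounding error.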


4 The z-Transform 

A fundamental and widely used mathematical tool in 
DSP is the z-transform, which includes the Fourier 
transform as a special case but has much broader rele- 
vance. The z-transform of a time series x (n) is denoted 
by X(z) and is defined as $X(z) = \sum_{n=-\infty}^{\infty} x(n)z^{-n}$, where
$z \in \mathbb{C}$. If z is represented as $z = re^{i\omega}$, then it is evident
that X(z) is equivalent to the Fourier transform
of the associated sequence $x(n)r^{-n}$, which converges
provided $\sum_{n=-\infty}^{\infty} |x(n)r^{-n}| < \infty$. In effect, the sequence
of values x(n) constitutes the coefficients of a Laurent
series expansion of X(z) about the origin in the complex
plane. Clearly, the z-transform of a sequence is
meaningful only at values of z for which the doubly
infinite sum converges. This defines the region of convergence
(ROC), which can be analyzed by splitting the
transform into the sum of its causal component, involving
only the coefficients for which $n \ge 0$, and its anticausal
component, involving only coefficients for which
n < 0. In general, the causal component converges at
values of z for which |z| is large enough, i.e., $|z| > r_c$ for
some value of $r_c$. Conversely, the anticausal component
converges provided |z| is small enough, i.e., $|z| < r_a$
for some value of $r_a$. Provided $r_c < r_a$, the ROC constitutes
an annulus centered on the origin of the complex
plane. If the unit circle lies within the ROC, then the
z-transform evaluated on the unit circle is identical to the
Fourier transform. It is worth noting that the inverse
z-transform is given by

$$x(n) = \frac{1}{2\pi i}\oint_C X(z)z^{n-1}\, dz,$$

where C denotes a counterclockwise closed contour
within the ROC that encloses the origin. Its validity
can be demonstrated using the Cauchy integral
theorem [IV.1 §7], which states that

$$\frac{1}{2\pi i}\oint_C z^{k-1}\, dz = \begin{cases} 1 & \text{if } k = 0, \\ 0 & \text{otherwise.} \end{cases}$$

Note that the z-transform is a linear operator. Note also
that, if y(n) = x(n - m) for a fixed integer m, i.e., the
sequence y(n) is simply the sequence x(n) delayed
by m sample intervals, then $Y(z) = z^{-m}X(z)$. For this
reason, $z^{-1}$ is generally referred to as the unit delay
operator. The final property noted here is particularly
useful and constitutes a generalization of the Fourier
convolution theorem noted earlier. It states that, if
two sequences y(n) and x(n) are related according
to $y(n) = \sum_{l=0}^{L-1} a_l x(n-l)$, then Y(z) = A(z)X(z). In
other words, the z-transform of a discrete convolution
Figure 2 A digital infinite impulse response filter composed
of two tapped delay lines. The second one filters the out- 
put sequence and feeds the result back to be added to the 
output of the first. 

is the product of the individual z-transforms. Although 
L can be infinitely large, we have chosen to express the 
relationship here as it applies to an FIR filter with coefficients
$a_l$ $(l = 0, 1, \ldots, L-1)$. For this reason, an FIR
filter is often represented for analytic purposes by its
z-transform, which (for causal filters) takes the form of
a polynomial in $z^{-1}$.
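
A short worked example may help to fix the idea of the region of convergence (the sequence is an assumed illustration, not one used in the text). Take the purely causal sequence $x(n) = a^n$ for $n \ge 0$ and $x(n) = 0$ for $n < 0$, with a fixed complex number a:

```latex
X(z) = \sum_{n=0}^{\infty} a^n z^{-n} = \frac{1}{1 - a z^{-1}}, \qquad |z| > |a| .
```

The geometric series converges only outside the circle of radius $r_c = |a|$, and there is no anticausal component to impose an outer radius. If $|a| < 1$, the unit circle lies within the ROC and setting $z = e^{i\omega}$ gives the Fourier transform $1/(1 - a e^{-i\omega})$; if $|a| \ge 1$, the Fourier transform of the sequence does not exist even though the z-transform does.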

5 Digital Infinite Impulse Response Filtering 

FIR filters are represented mathematically by a finite 
discrete convolution (LTI operator). Another important 
class of digital filters involves a discrete LTI operator 
with infinite impulse response. It has the form 

$$y(n) = \sum_{l=0}^{L-1} a_l x(n-l) - \sum_{m=1}^{M-1} b_m y(n-m).$$

The key point to note here is that the filter output
y(n) consists of the filtered input samples, as in an
FIR filter, as well as delayed values of the filter output
y(n), $n \in \mathbb{N}^+$, linearly combined using the coefficient
vector $[b_1, \ldots, b_{M-1}]$. Feeding the output of
the filter back in this recursive manner, as illustrated 


in figure 2, leads to a filter whose response to a unit 
impulse is no longer finite in duration. As a result, 
the bounded-input bounded-output stability of these 
infinite impulse response (IIR) filters is no longer guar- 
anteed and careful analysis is required. The IIR fil- 
ter equation incorporates two distinct FIR filters, one 
applied to the input sequence and the other applied 
to the output sequence, before it is fed back. In terms 
of the associated z-transforms it may be expressed in 
the form B(z)Y(z) = A(z)X(z) (where $b_0 = 1$), so
Y(z) = T(z)X(z), where the overall transfer function
is given by the rational form T(z) = A(z)/B(z). The
zeros of the polynomial A(z) are referred to as the
zeros of the filter, while the zeros of B(z) are referred
to as the poles of the filter. The location of the poles in 
the complex plane is a key property for characterizing 
the ROC of T(z). Taken together, the location of the 
poles and zeros plays an important role in determining 
the characteristics of an IIR filter. 
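
The recursion can be sketched with two delay lines, one for past inputs and one for past outputs. The example below is a minimal illustration with hypothetical coefficients: a first-order filter with T(z) = 1/(1 - 0.5 z^{-1}), which has a single pole at z = 0.5.

```python
def iir_filter(a, b, samples):
    """y(n) = sum_l a[l]*x(n-l) - sum_{m>=1} b[m]*y(n-m), with b[0] == 1."""
    x_hist = [0.0] * len(a)          # delay line for the input
    y_hist = [0.0] * (len(b) - 1)    # delay line for the fed-back output
    out = []
    for x in samples:
        x_hist = [x] + x_hist[:-1]
        y = sum(al * xl for al, xl in zip(a, x_hist))
        y -= sum(bm * ym for bm, ym in zip(b[1:], y_hist))
        y_hist = [y] + y_hist[:-1]
        out.append(y)
    return out


# Response of the first-order example to a unit impulse.
impulse = [1.0] + [0.0] * 5
response = iir_filter([1.0], [1.0, -0.5], impulse)
```

The response halves at every step but never becomes exactly zero, so the impulse response is infinite in duration. Because the pole lies inside the unit circle the response decays; moving the pole outside the unit circle (for example b = [1.0, -2.0]) makes it grow without bound, which is the stability issue mentioned above.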

The design of a digital filter, either FIR or IIR, 
intended to achieve specific objectives such as given 
pass-band frequencies, stop-band frequencies, and roll- 
off rates is now a well-established discipline. It is not 
possible to design a filter of finite order so that it has 
arbitrary properties. The approach therefore basically 
amounts to curve fitting. Techniques such as least- 
squares minimization, linear programming, minimax 
optimization, and Chebyshev approximation can all be 
used. 

6 Correlation 

So far in this article we have considered only deter- 
ministic signals, whereas, in practice, most measured 
signals are subject to random measurement noise and 
other statistical fluctuations. In the simplest case these 
can be modeled as an additive white Gaussian noise 
process with zero mean, so the measured sequence 
is described by $y'(n) = y(n) + v_n$, where $v_n$ represents
a random noise sequence whose values are
taken from a Gaussian distribution with $E[v_n] = 0$ and
$E[v_i v_j^*] = \sigma^2\delta_{ij}$. Throughout this article, $E[\cdot]$ denotes
the statistical expectation operator. 

Correlation is another important procedure that 
arises in DSP. The correlation between two open-ended 
random time series x(k) and y(k) is given by
$C_{xy}(n) = E[x(k)y^*(n+k)]/(\sigma_x \sigma_y)$, where
$\sigma_x = E[|x(k)|^2]^{1/2}$ and $\sigma_y = E[|y(k)|^2]^{1/2}$
are normalization factors here assumed to equal unity.
The correlation is a function of the relative shift in
sample time between the two sequences and, making the
usual assumptions about ergodicity and wide-sense
stationarity (stationarity over a suitably long time
interval), it is estimated as
$c_{xy}(n) = \sum_{k=-K}^{K} x(k)y^*(n+k)$, where the value
of K is set as large as necessary (or as large as possible
in any practical situation). The specific case where
the sequence y(k) is identical to the sequence x(k)
yields the autocorrelation function $C_{xx}(n)$ with its
corresponding estimate $c_{xx}(n) = \sum_{k=-K}^{K} x(k)x^*(n+k)$.
The autocorrelation function is of particular interest
since the Wiener-Khinchin theorem states that the
Fourier transform of the autocorrelation function is
equivalent to the power spectral density of the signal,
i.e., $C_{xx}(\omega) = |X(\omega)|^2$, where $X(\omega)$ denotes the
Fourier transform of x(k). Assuming that the summation
limits are infinite, this can easily be deduced as a
simple corollary of the Fourier convolution theorem.
The Wiener-Khinchin theorem is useful as there are
numerous situations in DSP where it is simpler, or more 
convenient, to measure the autocorrelation function 
than the spectrum of the underlying signal. Note, how- 
ever, that the power spectral density does not provide 
any phase information. 
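
The relationship between the autocorrelation function and the power spectral density can be checked numerically. The following is a minimal sketch only: it assumes a short real-valued test signal and replaces the infinite sums by circular (periodic) ones, so that the identity holds exactly for the discrete Fourier transform. The function names are illustrative and do not come from the text.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (adequate for short sequences)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]

def circular_autocorrelation(x):
    """c_xx(m) = sum_k x(k) x(m + k) for real x, indices taken modulo len(x)."""
    n = len(x)
    return [sum(x[k] * x[(m + k) % n] for k in range(n)) for m in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 0.5]                  # arbitrary real test signal
power_spectrum = [abs(v) ** 2 for v in dft(x)]  # |X(omega)|^2
spectrum_of_autocorr = dft(circular_autocorrelation(x))

# Largest discrepancy between the two sides of the identity.
max_gap = max(abs(a - b) for a, b in zip(spectrum_of_autocorr, power_spectrum))
print(max_gap)  # should be at roundoff level
```

For complex-valued signals the conjugation convention in the correlation sum has to be matched to the sign convention of the transform, which is why the sketch restricts itself to real data.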

The case where an open-ended time series y(k) is 
to be correlated with a finite sequence x(k) leads to 
a correlation function estimate of the form
$c_{xy}(n) = \sum_{k=0}^{K-1} x(k)y^*(n+k)$, where the number of
sample values for each n is simply the length K of the
finite sequence. Even if the value of K is limited, this
expression provides an unbiased estimator for the
correlation function and so is very useful in practice. For a
single instantiation of the sequence y(k), it can be 
viewed as a measure of the similarity between the finite 
sequence x(k) and a finite segment of the sequence 
y(k) beginning at the nth sample. The value of n for 
which this estimator takes its maximum value is of par- 
ticular interest in the context of active radar and sonar. 
In such systems, an electromagnetic or acoustic pulse 
whose shape is described by the sequence x(k) is trans- 
mitted by the antenna at sample time zero, say. After 
propagating to a reflective object in its path, some of 
the pulse energy returns to the antenna, delayed by its 
two-way time of flight and much weaker due to
propagation losses. This much weaker pulse is also
subject to electronic receiver noise and can be modeled as
$y(n) = \beta x(n-m) + v_n$, where the round-trip delay is m
sample intervals and, as before, $v_n$ represents additive
white Gaussian noise. In this situation,

$$c_{xy}(n) = \sum_{k=0}^{K-1} [\beta^* x(k)x^*(n+k-m) + x(k)v^*_{n+k}]$$

and, for a suitably designed pulse, the expected value of
$c_{xy}(n)$ will attain its maximum value for n = m. This
provides an estimate of the time of flight of the pulse, 
and hence the distance to the corresponding reflector. 
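
The delay estimate described above can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any particular radar system: the pulse is taken to be a 7-chip Barker code, the data are real, and the attenuation `beta`, noise level `sigma`, and `true_delay` are hypothetical values chosen for the example.

```python
import random

def correlate(x, y, n):
    """c_xy(n) = sum_{k=0}^{K-1} x(k) y(n + k) for real sequences."""
    return sum(x[k] * y[n + k] for k in range(len(x)))

def estimate_delay(x, y):
    """Return the shift n that maximizes the correlation estimate."""
    lags = range(len(y) - len(x) + 1)
    return max(lags, key=lambda n: correlate(x, y, n))

random.seed(0)
pulse = [1, 1, 1, -1, -1, 1, -1]          # 7-chip Barker code (illustrative)
true_delay, beta, sigma = 23, 0.5, 0.1    # hypothetical scenario parameters

# Received sequence: additive white Gaussian noise plus a delayed, weakened pulse.
received = [random.gauss(0.0, sigma) for _ in range(64)]
for k, p in enumerate(pulse):
    received[true_delay + k] += beta * p

estimated = estimate_delay(pulse, received)
print(estimated)
```

A pulse with a sharply peaked autocorrelation, such as the one assumed here, is what makes the maximum easy to locate in the presence of noise.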

Computing the correlation function estimate for this 
purpose is generally referred to as matched filtering. 
The reason for this becomes clearer by rewriting the 
estimator in the form
$c_{xy}(n) = \sum_{k=0}^{K-1} x(K-k-1)y^*(n+K-1-k)$
and introducing the order-reversed sequence
$\tilde{x}(k) = x(K-1-k)$ so that the estimator is given by
$\sum_{k=0}^{K-1} \tilde{x}(k)y^*(n+K-1-k)$. This takes the form of a
discrete convolution corresponding to a digital FIR filter
with coefficients $\tilde{x}(k)$. In effect, the correlation
function can be estimated by computing a convolution with
the coefficients in reverse order.
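
The equivalence between the correlation estimate and an FIR filter with order-reversed coefficients can be verified directly. The sketch below assumes real data and uses arbitrary illustrative sequences; `fir_filter` is a hypothetical helper that evaluates a convolution at a single output time.

```python
def correlation_estimate(x, y, n):
    """Direct form: sum_{k=0}^{K-1} x(k) y(n + k)."""
    return sum(x[k] * y[n + k] for k in range(len(x)))

def fir_filter(coeffs, signal, n):
    """FIR filter output at time n: sum_k coeffs[k] * signal[n - k]."""
    return sum(c * signal[n - k] for k, c in enumerate(coeffs))

x = [3.0, -1.0, 2.0, 0.5]
y = [0.2, 1.0, -0.7, 0.4, 2.5, -1.2, 0.9, 0.3, -0.6, 1.1]
K = len(x)
x_reversed = x[::-1]                      # order-reversed coefficients

shifts = range(len(y) - K + 1)
direct = [correlation_estimate(x, y, n) for n in shifts]
# The filter output at time n + K - 1 reproduces the correlation at shift n.
via_filter = [fir_filter(x_reversed, y, n + K - 1) for n in shifts]
```

The two lists agree up to roundoff, which is the sense in which a matched filter is simply an FIR filter whose coefficients are the pulse samples in reverse order.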

7 Adaptive Filters 

In the case of an adaptive filter, the coefficients are not 
specified in advance but must be computed as a func- 
tion of the input data with a view to achieving a partic- 
ular objective such as maximizing the output signal-to- 
noise ratio. This typically involves the minimization of 
a suitable cost function involving the filter coefficients 
(weight vector) and is sometimes implemented by feed- 
ing the filter output back to form a simple “closed-loop” 
control system. 

Consider, for example, a signal estimation problem in 
which a desired signal s(n) is subject to an unknown 
filtering operation H(z) such that the signal received
is given by $x(n) = \sum_k h_k s(n-k) + v(n)$, where v(n)
denotes an interference or noise signal uncorrelated
with s(n). With the aim of estimating the desired
signal, the received signal is processed by an adaptive
filter W(z) whose output is $d(n) = \sum_{k=0}^{K-1} w_k x(n-k)$.
Assume that a noisy version of the desired signal
denoted by $y(n) = s(n) + \eta(n)$ is also available, where
the noise $\eta(n)$ is again uncorrelated with s(n). The
coefficients of the adaptive filter are chosen to ensure
that the reference signal y(n) and the output of the
adaptive filter d(n) are as close as possible in terms
of their mean squared error $J = E[|e(n)|^2]$, where
e(n) = y(n) - d(n). This approach works because, as
we shall see below, the solution involves correlations 
between certain signals, and the two noise terms, being 
uncorrelated, have no effect on the result. An adaptive 
filter of this type is illustrated in figure 3. 

Figure 3 Schematic of a basic adaptive filter. The difference
between the output signal d(n) and the reference signal
y(n) is used to compute an update to the weight vector
using a suitable adaptive algorithm.

Minimizing the cost function J with respect to each
of the adaptive filter coefficients (often referred to as
weights) leads directly to the Wiener-Hopf equations

$$r(k) = \sum_{j=0}^{K-1} w_j q(k-j) \quad (k = 0, 1, \ldots, K-1),$$

where

$$r(k) = E[y(n)x^*(n-k)], \quad q(k) = E[x(n)x^*(n-k)]$$

denote the cross-correlation and auto-correlation
functions, respectively.

Because these equations take the form of convolu- 
tions, the solution may be written formally in terms of 
the corresponding z-transforms as W(z) = R(z)/Q(z). 
However, this expression is of little benefit when it 
comes to evaluating the filter weights in practice. 
Instead, the Wiener-Hopf equations may be written in 
the form Mw = r, where $w = [w_0, w_1, \ldots, w_{K-1}]^T$
is the weight vector and $M \in \mathbb{C}^{K \times K}$ is termed the
covariance matrix, with individual elements given by
$m_{ij} = q(i-j)$, and $r \in \mathbb{C}^K$ is known as the
cross-correlation vector with elements given by $r_i = r(i)$.
This matrix equation may be solved using conventional 
linear algebra techniques, ideally exploiting the fact 
that the covariance matrix is of Toeplitz form. 

Note that the Wiener-Hopf equations are formu- 
lated in terms of the ideal ensemble-averaged cross- 
correlation and auto-correlation components, which 
are not known and would have to be estimated from 
the data in any practical situation. It is generally more 
appropriate to estimate the filter coefficients directly 
from the data by defining a cost function f(w | D),
say (where D denotes the data). One can then solve
$w = \arg\min f(w \mid D)$. Various different cost
functions have been proposed. Some have the advantage
that a reference signal is not required (blind adaptive
filtering). For the specific scenario that led to the
Wiener-Hopf equations, an appropriate cost function
would be $f(w \mid D = [X, y]) = \|Xw - y\|_2$, where
$\|x\|_2 = (x^*x)^{1/2}$, in which case w is the solution to
the linear system Xw = y.

However, note that, as in the signal estimation prob- 
lem above, it is usually the case that y is a noisy copy 
of the desired signal, i.e., $y = s + \eta$. Furthermore, as
discussed below, it can be advantageous to take more 
measurements than the number of unknown filter coef- 
ficients, in which case there are more equations than 
unknowns and the matrix X is rectangular. Thus, in 
general, there is no exact solution to the linear system 
Xw = y and we have to turn to the least-squares 
solution [IV.10 §7.1]. There are many approaches to 
the least-squares solution of Xw = y, the appropriate 
choice depending on the properties of X. If X*X is of 
full rank, then w is the unique solution to the linear 
equations X*Xw = X*y, which are known as the normal
equations, and $w = (X^*X)^{-1}X^*y$ is known as the
Wiener solution. Clearly, when X*X is not of full rank 
other methods have to be used to find the least-squares 
solution. 
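
As a concrete illustration of the normal equations, the following sketch forms X*X and X*y explicitly and solves the resulting square system. It assumes real data and a full-rank X; the data matrix, the weights `w_true`, and the helper names are invented for the example, and a production code would normally call a linear algebra library instead.

```python
def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def wiener_solution(X, y):
    """Least-squares weights from X^T X w = X^T y (real data assumed)."""
    cols = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) for j in range(cols)]
            for i in range(cols)]
    rhs = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(cols)]
    return solve(gram, rhs)

# Rectangular example: five equations, two unknown weights.
X = [[1.0, 0.0], [2.0, 1.0], [0.0, 2.0], [-1.0, 0.0], [3.0, -1.0]]
w_true = [0.5, -2.0]
y = [sum(a * b for a, b in zip(row, w_true)) for row in X]   # noise-free
w = wiener_solution(X, y)
print(w)
```

With noise-free data the computed weights coincide with `w_true` up to roundoff; adding noise to `y` perturbs them by the term discussed in the following paragraph.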

The adaptive filter problem therefore requires finding
the solution to X*Xw = X*y, where, in the presence
of noise, $y = s + \eta$. Here, s is the noise-free
signal. The Wiener solution can then be written
$w = w_0 + (X^*X)^{-1}X^*\eta$, where $w_0$ is the noise-free
least-squares solution. Note that the ith element of $X^*\eta$
is $\sum_{k=0}^{K} x^*(k-i+1)\eta(k)$, which is an ergodic
estimate of the cross-correlation between x(n - i + 1) and
$\eta(n)$. In the derivation of the Wiener-Hopf equations
above, ensemble-averaged correlations were used, so
$E[x^*(n-i+1)\eta(n)] = 0$ and $w = w_0$. However, with
ergodic estimates this will not be the case as, in general,
$\sum_{k=0}^{K} x^*(k-i+1)\eta(k) \neq 0$ for finite K. Clearly, the
larger the value of K in the ergodic estimate, the smaller 
the estimated cross-correlation will be. This is equiva- 
lent to taking many measurements of a noisy quantity 
and averaging to improve the accuracy of the resulting 
estimate. This is why adaptive filtering problems often 
have rectangular X matrices. 

Given that, in the expression for w, $X^*\eta \neq 0$ and it is
multiplied by $(X^*X)^{-1}$, it is necessary to consider the
condition number [IV.10 §1] of the matrix X as well
as its rank. Inversion of an ill-conditioned matrix can
amplify any noise that is present. Here, the noise ($X^*\eta$)
comes from the signals and not from roundoff effects.
Forming and solving the normal equations has the dis- 
advantage that the 2-norm condition number of X*X is 
the square of that of X, which can cause the computed 
normal equations solution to be less accurate than the 
least-squares problem warrants in the presence of the 
noise. In this context, one particularly useful approach 
is to use the QR factorization [IV.10 §2] X = QR of
the matrix X, which allows the calculation of w without 
forming the covariance matrix X*X and thus squaring 
the condition number. 
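
A least-squares solution that avoids forming X*X can be sketched as follows. The example assumes real, full-rank data and, purely for brevity, builds the thin QR factorization by modified Gram-Schmidt; Householder reflections are the more usual choice in library code. All names and data are illustrative.

```python
def qr_least_squares(X, y):
    """Solve min ||X w - y||_2 through X = QR, then R w = Q^T y."""
    rows, cols = len(X), len(X[0])
    q = [[X[r][c] for r in range(rows)] for c in range(cols)]  # columns of X
    R = [[0.0] * cols for _ in range(cols)]
    for j in range(cols):
        # Normalize column j, then remove its component from later columns.
        R[j][j] = sum(v * v for v in q[j]) ** 0.5
        q[j] = [v / R[j][j] for v in q[j]]
        for k in range(j + 1, cols):
            R[j][k] = sum(a * b for a, b in zip(q[j], q[k]))
            q[k] = [b - R[j][k] * a for a, b in zip(q[j], q[k])]
    qty = [sum(a * b for a, b in zip(q[j], y)) for j in range(cols)]
    w = [0.0] * cols
    for j in reversed(range(cols)):                 # back substitution
        w[j] = (qty[j] - sum(R[j][k] * w[k] for k in range(j + 1, cols))) / R[j][j]
    return w

X = [[1.0, 0.0], [2.0, 1.0], [0.0, 2.0], [-1.0, 0.0], [3.0, -1.0]]
y = [0.5, -1.0, -4.0, -0.5, 3.5]      # equals X applied to [0.5, -2.0]
w_qr = qr_least_squares(X, y)
print(w_qr)
```

Only the triangular factor R is ever inverted, so the sensitivity of the computation is governed by the condition number of X rather than its square.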

8 Implementation 

A notable aspect of signal processing is the need to 
have low-complexity algorithms. This is because sig- 
nal processing is often to be found on equipment 
that has limited computing power, such as battery- 
powered devices. There is therefore great interest in 
devising algorithms that require as little computation 
as possible. 

The most notable method for reducing the computa- 
tion in a signal-processing algorithm is to calculate the 
solution to one problem based on that of another, i.e., 
to use recursion. For adaptive filters the most widely 
used method is a recursion in time: calculate the solu- 
tion at time n from that at time n - 1. This leads to 
the recursive least-squares (RLS) family of algorithms. 
The oldest member of this family is derived from the
Wiener solution by means of the Sherman-Morrison-Woodbury
formula [IV.10 §3]. This lemma expresses
the inverse of the matrix X*X in terms of the inverse
of a smaller matrix. Specifically, the data matrix at time
n can be written $X(n) = [X^T(n-1) \;\; x(n)]^T$ and
it is possible to express $(X^*(n)X(n))^{-1}$ in terms of
$(X^*(n-1)X(n-1))^{-1}$ and rank-one terms involving
x(n). In fact, when substituted into the Wiener solution
$w(n) = (X^*(n)X(n))^{-1}X^*(n)y(n)$, the resulting
formula can be simplified to an equation of the form
w(n) = w(n - 1) + g(y(n) - x*(n)w(n - 1)). The
vector g is known as the Kalman gain vector and is
equal to $(X^*(n)X(n))^{-1}x(n)$. It specifies the direction
of the update from w(n - 1) to w(n). The advantage
of this formula is that it requires only $O(N^2)$ operations
rather than the $O(N^3)$ required to invert an $N \times N$
matrix. Note that the last term in the update formula
can be seen as an a priori “fitting” error. This is a
classic recursive update formula that appears in many
iterative algorithms, such as the method of steepest
descent [IV.11 §4.1] in optimization and the Jacobi
iteration [IV.10 §9] for linear systems.
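
The time recursion can be sketched as a single update step. This is a minimal illustration for real data, in which the matrix `P` stands for the current estimate of the inverse of X*X and is updated by the rank-one (Sherman-Morrison) formula. The initialization of `P` as a large multiple of the identity is a common practical device assumed here, not something taken from the text, and `w_true` is a hypothetical system to be identified.

```python
import random

def rls_update(w, P, x, y):
    """One recursive least-squares step: returns the updated (w, P)."""
    n = len(w)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    g = [v / denom for v in Px]                      # Kalman gain vector
    error = y - sum(x[i] * w[i] for i in range(n))   # a priori fitting error
    w = [w[i] + g[i] * error for i in range(n)]
    # Rank-one update of the inverse; costs O(N^2) operations.
    P = [[P[i][j] - g[i] * Px[j] for j in range(n)] for i in range(n)]
    return w, P

random.seed(1)
w_true = [0.5, -2.0, 1.0]
w = [0.0, 0.0, 0.0]
P = [[1e6 if i == j else 0.0 for j in range(3)] for i in range(3)]
for _ in range(50):
    x = [random.gauss(0.0, 1.0) for _ in range(3)]
    y = sum(a * b for a, b in zip(w_true, x))        # noise-free measurement
    w, P = rls_update(w, P, x, y)
print(w)
```

Each step involves only matrix-vector products and an outer product, which is the source of the saving over re-solving the normal equations at every time step.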

An important signal-processing algorithm in this 
class is the least-mean-squares (LMS) algorithm. The 
LMS algorithm can be derived in several ways. The
original approach was statistical and was based on
minimizing the mean square error. Another approach is to
set the a posteriori fitting error y(n) - x*(n)w(n) to
zero while minimizing the modulus of the change in
the weight vector |w(n) - w(n - 1)|. In any event, the
LMS update formula is
w(n) = w(n - 1) + x(n)(y(n) - x*(n)w(n - 1)).
Note that this is the same as the RLS
update when the Kalman gain g = x(n), that is, when
X*(n)X(n) = I. The LMS algorithm therefore behaves
like the RLS algorithm when the input signal x(n) is
a white noise process. When the input signal does not 
have a white power spectral density, the LMS algorithm 
is slower to converge than the RLS algorithm. This can 
be shown to be related to the fact that x(n) is now
a poor estimate of the direction of steepest descent, 
unlike the Kalman gain vector. Nevertheless, the LMS 
algorithm is very simple to implement and has proved 
to be very robust to violations of the assumptions that 
were originally made in its derivation. In fact, the LMS 
algorithm can be derived using minimax optimization, 
where one minimizes the maximum possible error (for 
a given class of system). Broadly speaking, the LMS algo- 
rithm minimizes the energy of the a priori fitting errors 
given the worst-case input noise. This can be shown to
be related to a concept known as the $H_\infty$-norm, which
is the $L_\infty$-norm applied in the frequency domain: given
a transfer function T(z), $\|T(z)\|_{H_\infty} = \sup_\omega |T(e^{i\omega})|$.
As such, the LMS algorithm is a very useful tool. It 
is worth noting that the QR factorization-based least- 
squares algorithm family also includes time-recursive 
versions, although these will not be discussed further 
in this article. 
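
The simplicity of the LMS update is apparent from a short sketch. The version below assumes real data, a white Gaussian input, and an explicit step size `mu`; the step size is a practical assumption added for the example, since the update formula quoted in the text carries no such factor. The filter `w_true` is a hypothetical system to be identified.

```python
import random

def lms_identify(u, y, taps, mu):
    """Estimate FIR weights from input u and reference y with the LMS update."""
    w = [0.0] * taps
    for n in range(taps - 1, len(u)):
        x = [u[n - k] for k in range(taps)]          # delayed input values
        error = y[n] - sum(wk * xk for wk, xk in zip(w, x))   # a priori error
        w = [wk + mu * error * xk for wk, xk in zip(w, x)]
    return w

random.seed(2)
w_true = [0.5, -2.0, 1.0]
u = [random.gauss(0.0, 1.0) for _ in range(2000)]    # white input signal
y = [sum(w_true[k] * u[n - k] for k in range(3) if n - k >= 0)
     for n in range(len(u))]
w_lms = lms_identify(u, y, taps=3, mu=0.05)
print(w_lms)
```

Each update costs only a handful of multiplications per weight, with no matrix to store or invert, at the price of slower convergence when the input is not white.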

Another approach to reducing the computational 
load is to utilize any special structure in the problem. 
In adaptive filtering the rows of the data matrix X are
formed from the delayed signal values x(n - i) that
appear in the convolution that defines the filter, i.e.,
$y(n) = \sum_{i=0}^{p} a_i x(n-i)$. Each row of the data matrix
corresponds to a different value of n. Hence the data
matrix has a Toeplitz form. Thus, if $X_p(n)$ is the data
matrix for a pth-order filter at time n, then

$$X_p(n) = [X_{p-1}(n), b_p(n)] = [f_p(n), X_{p-1}(n-1)],$$

where $b_p(n)$ and $f_p(n)$ depend on the data. In fact,
$f_p(n)$ and $b_p(n)$ are related to specialized adaptive
filtering problems known as forward and backward
linear prediction, respectively. Here, the desired signal
y(n) is replaced by $f_p(n)$ or $b_p(n)$. This leads to a
rank-two update equation for the covariance matrix. It
is possible to incorporate time recursion with the order
recursion resulting in an algorithm with O(N)
operations. Examples of this are the fast transversal filters
algorithm and the RLS lattice algorithm. 

One issue with all of the recursive algorithms is 
numerical stability [I.2 §23]. The recursion formulas
are valid only if all of the inputs are accurate. Inaccu- 
rate input values and numerical roundoff errors lead to 
errors in the output values that are then to be used as 
inputs in the next iteration. This can lead to explosive 
divergence of the algorithm. Because of the complex- 
ity introduced by recursion, the numerical stability of 
signal-processing algorithms is often explored through 
computer simulation. There has been relatively little 
theoretical work but it has led to some provably sta- 
ble recursive algorithms as well as provably unstable 
ones. 

9 Channel Equalization 

Adaptive filters find application in many situations, 
such as echo cancelation, modern hearing aids, and 
seismology, but most notably they are used in mod- 
ern digital communications systems. Broadly speak- 
ing there are two uses: system identification and sys- 
tem inversion. In system identification, given a sys- 
tem G(z) we seek to find a filter H(z) that minimizes 
|| G(z)X(z) - H(z)X(z) || for some fixed signal X(z). 
That is, for the given X(z), H(z) behaves like G(z). In
system inversion, we seek to find a filter H(z) that
minimizes || H(z)G(z)X(z) - X(z) ||. That is, H(z) undoes
the effect of G(z). It is usual that the filter H(z) is to
be drawn from a given class of filters. Typically, the 
class of FIR filters is used as this leads to tractable 
algorithms, but see the discussion below on decision 
feedback equalizers. 

System identification is used to investigate a real- 
world system by generating a model for it (the filter). 
The properties of the model (the position of poles and 
zeros, stability, etc.) can be used to infer the corre- 
sponding properties of the system. An example of this 
is system monitoring. If a set of models is generated 
over time, changes in the system can be detected. Ide- 
ally, deleterious changes can be detected before they 
cause a catastrophic failure in the system. 

System inversion is more common and is used where 
the system G(z) is a physical phenomenon that has a
detrimental effect on the signal X(z) and we want to
mitigate this effect. The most common example of this 
is the transmission of radio signals. The radio waves 
can bounce off obstacles situated between the trans- 
mitter and the receiver, resulting in the reception of 
several signals each with different delays relative to 
one another. As these signals are coherent, the resul- 
tant summation can exhibit interference effects that 
adversely affect the ability of the receiver to recover 
the information in the radio signal. By incorporating an 
adaptive filter in the receiver, the effects of the
propagation can be reduced. This is known as channel
equalization. There is an obvious problem with system
inversion: the system G(z) may be ill-conditioned or, indeed,
it may not have an inverse. In such cases various
regularization techniques can be used; e.g., replace the
least-squares problem $w = \arg\min\{\|Xw - y\|_2\}$ by a
constrained one: $w = \arg\min\{\|Xw - y\|_2 : \|w\|_2 \le t\}$
for some threshold t. In this case, however, the
adaptive filter H(z) cannot completely mitigate the effects
of G(z).

A more practical issue is the accuracy with which 
the system G(z) can be inverted. In general, G(z) can 
have poles as well as zeros. Its inverse therefore also 
has poles and zeros (see section 5). Note that even 
if G(z) had only zeros, which is often the case in 
practice, its inverse will have poles. This causes an 
issue because most adaptive filter algorithms are based 
on an FIR filter; i.e., they have only zeros. Although 
some adaptive IIR filter algorithms have been
developed, they have significant bounded-input
bounded-output stability issues. Here we are referring not to
numerical stability but to system stability. An IIR
filter will be systematically unstable if it has poles
outside of the unit circle. Most adaptive IIR filter
algorithms are unable to ensure that this does not
happen. Although this is mathematically well understood,
it is very difficult to achieve in a signal-processing
algorithm (often the solution of simultaneous nonlinear
equations is required). Fortunately, it is possible to
generate an FIR filter with a response that is arbitrarily
close to that of an IIR filter, provided the FIR filter has
sufficiently many coefficients. Thus, in practice, most
system-inversion algorithms will use an adaptive FIR
filter algorithm. A notable exception is, again, in the
case of the transmission of radio signals. Recall that
an IIR filter uses previous outputs y(n - i) to
calculate the current output y(n) (see section 5). It is clear
that trying to determine the filter coefficients $a_i$ and
$b_i$ is going to be difficult since we cannot calculate the
previous outputs y(n - i) without already knowing the
filter coefficients. However, in a modern radio the
signals are digital; that is, they take only discrete values,
for example ±1. Thus, if the filter coefficients $a_i$ and
$b_i$ are correct, then $y(n) \in \{\pm 1\}$. When the coefficients
$a_i$ and $b_i$ are incorrect but close to the correct values,
the output of the filter y(n) will in general not be ±1
but we will find that y(n) is closer to one of the
allowable values (i.e., ±1) than the other. If we denote the
closest allowable value by d(n), then we may assume
that y(n) = d(n) is the correct answer. Following this
procedure we can replace the feedback of the previous
outputs y(n - i) by the “decisions” d(n - i), i.e.,
$y(n) = \sum_{i \ge 0} b_i x(n-i) + \sum_{i \ge 1} a_i d(n-i)$. The
calculation of the filter coefficients $a_i$ and $b_i$ is then more
tractable. Furthermore, provided the filter coefficients
can be initialized sufficiently close to their correct
values, any errors caused by incorrect “decisions” do not
seem to cause any problems. This type of adaptive IIR
filter is known as a decision-feedback equalizer.
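
The feedback of decisions rather than raw outputs can be illustrated with fixed coefficients. The sketch below is deliberately not adaptive: it assumes a hypothetical channel that adds a single echo of strength 0.6 with a one-sample delay, takes the equalizer coefficients to be already correct, and uses ±1 symbols with a small amount of Gaussian noise.

```python
import random

def decision_feedback_equalize(x, b, a):
    """y(n) = sum_i b[i] x(n-i) + sum_{i>=1} a[i-1] d(n-i), d(n) = sign(y(n))."""
    y, d = [], []
    for n in range(len(x)):
        feedforward = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        feedback = sum(a[i - 1] * d[n - i]
                       for i in range(1, len(a) + 1) if n - i >= 0)
        y.append(feedforward + feedback)
        d.append(1 if y[-1] >= 0 else -1)            # hard decision
    return y, d

random.seed(3)
symbols = [random.choice([-1, 1]) for _ in range(200)]
# Hypothetical channel: direct path plus one echo, plus receiver noise.
received = [symbols[n] + 0.6 * (symbols[n - 1] if n > 0 else 0)
            + random.gauss(0.0, 0.1) for n in range(len(symbols))]

# Feed-forward tap b_0 = 1; feedback tap a_1 = -0.6 cancels the echo.
outputs, decisions = decision_feedback_equalize(received, b=[1.0], a=[-0.6])
print(decisions == symbols)
```

Because the echo is cancelled using the previous decision, the output stays close to the transmitted symbol as long as that decision was correct, which is the property the adaptive version relies upon.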

The decision-feedback equalizer makes “hard”
decisions, in that $d(n) \in \{\pm 1\}$. An alternative approach is
to make “soft” decisions. Here, d(n) = Q(y(n)), where
the function Q maps $y(n) \in \mathbb{R}$ to $d(n) \in [-1, 1]$ such
that the distance between d(n) and +1 or -1, as
appropriate, is indicative of the “accuracy” of the decision.
A soft-decision-feedback equalizer usually works
better than a (hard-) decision-feedback equalizer. An
obvious choice of the function Q is the likelihood
function [V.11]. Here, one would have to make
assumptions about the statistics of the transmitted signal and
the receiver noise. Let $y(n) = b(n) + \varepsilon(n)$, where
$b(n) \in [-1, 1]$ is the correct value of the transmitted
signal and $\varepsilon(n) \in \mathbb{R}$ represents noise and
residual errors from incorrect decisions and filter coefficient
values. An application of Bayes’s theorem [V.11] then
gives $P(b(n) \mid y(n)) \propto P(y(n) \mid b(n))$. The term
$P(y(n) \mid b(n))$ is easily calculated given the
statistics of $\varepsilon(n)$ and is known as the likelihood function.
The use of statistical estimation techniques leads to 
the field of Bayesian signal processing (see section 12). 
This approach has led to very powerful algorithms such 
as turbo equalization. Here, one attempts to estimate 
the coefficients of an equalizer but it is done iteratively 
using estimated probability density functions (pdfs) to 
capture the likely statistics of the parameters. 
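
One possible soft-decision function can be written down explicitly under simple assumptions. The sketch below assumes that the transmitted symbol is +1 or -1 with equal prior probability and that the error term is zero-mean Gaussian with standard deviation `sigma`; the soft decision is then taken to be the posterior mean of the symbol. These modeling choices and the function names are assumptions made for illustration.

```python
import math

def posterior_plus_one(y, sigma):
    """P(b = +1 | y) from Bayes's theorem with equal priors and Gaussian noise."""
    like_plus = math.exp(-(y - 1.0) ** 2 / (2 * sigma ** 2))    # P(y | b = +1)
    like_minus = math.exp(-(y + 1.0) ** 2 / (2 * sigma ** 2))   # P(y | b = -1)
    return like_plus / (like_plus + like_minus)

def soft_decision(y, sigma):
    """Posterior mean of the symbol; lies in [-1, 1]."""
    p = posterior_plus_one(y, sigma)
    return p - (1.0 - p)

# Moderate values of y only: for |y| much larger than sigma both
# likelihoods can underflow, which this simple sketch does not guard against.
for y in (-1.2, -0.1, 0.0, 0.3, 2.0):
    print(y, soft_decision(y, sigma=0.5))
```

Under these assumptions the posterior mean reduces to tanh(y / sigma^2), so outputs near zero signal an unreliable decision while outputs near ±1 signal a confident one.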

10 Adaptive Beamforming (ABF) 

Another important signal-processing operation is the
beamformer. Time-series filters (section 2) process a
single time series and can vary the frequency content of
the output signal. Whereas a filter performs a
convolution, $y(n) = \sum_{i=0}^{K-1} w_i x(n-i)$, a beamformer processes
separate signals, $y(n) = \sum_{i=0}^{N-1} w_i x_i(n)$. The input
signals $x_i(n)$ for a beamformer come from physically
separate sensors, as illustrated in figure 4. A beamformer
can vary the content of its output signal according to
the direction of arrival of the input signals. To see this,
note that the signal picked up by each sensor from a
given source is just a delayed version of the source
signal. The delay will just be the time it takes for the radio
frequency (RF) wave to travel from the source to the
sensor. A modulated RF wave at time n can be written
$A(n)e^{-i\omega n}$, where A(n) is the signal amplitude and
$\omega$ is the angular frequency. The signal received at a
sensor is therefore $A(n - d/c)e^{-i\omega(n - d/c)}$, where d is
the distance from the source to the sensor and c is the
velocity of light in meters per sample time. If, to a good
approximation, A(n - d/c) = A(n), the RF signal is
called narrowband and the signal received by the
sensor is just a phase-shifted version of the source signal:
$A(n)e^{-i\omega n}e^{i\omega d/c}$. The case when $A(n - d/c) \neq A(n)$ is
called broadband (see section 11).

Figure 4 A schematic of an antenna array beamformer
where the elements are uniformly spaced along a straight
line. Each of the Y-shaped symbols denotes an individual
antenna. The received signals are weighted and combined
to produce the array output signal y(n).

In the narrowband case, the beamformer can apply
phase shifts to the signal from each sensor, via the
$w_i$, before summing them coherently. So, for example,
the beamformer could apply phase shifts that cancel 
out the phase shifts caused by the relative propaga- 
tion delays expected for a source in a given direction, 
in which case any signal from that direction would be 
summed constructively, leading to a large-amplitude 
signal. Signals from other directions will tend to par- 
tially cancel in the summation and could even sum to 
zero. In fact, since the output of the beamformer is 
the sum of N terms, it can be shown that there are 
N - 1 directions for which the beamformer response is 
exactly zero. These are known as the nulls of the beam- 
former. Thus a beamformer can “filter” signals on the 
basis of direction of arrival. Note that there is a require- 
ment on the position of the sensors that is equivalent to 
the Nyquist sampling criterion: the sensors need to be 
close enough together to distinguish the highest spatial 
frequency present, otherwise one gets spatial aliasing. 

From the above it is clear that, for a signal from a 
given direction, the vector of signals from the sensors 
x(n) is proportional to a vector that is uniquely deter- 
mined by the direction of arrival. This vector is known 
as the steering vector a. The proportionality constant is 
the instantaneous value of the transmitted signal s(n),
i.e., x(n) = s(n)a. A common architecture for a beam- 
former has the sensors equidistant from each other 
and in a straight line. This is known as a uniform lin- 
ear array, and it is illustrated in figure 4. In this case, 
the steering vectors look like sampled complex
exponentials: $e^{2\pi i n d \sin(\theta)/\lambda}$, where n indexes the sensors, d
is the sensor separation, $\theta$ is the direction of arrival
of the signal, and $\lambda$ is its wavelength. In such
circumstances, the term $d\sin(\theta)/\lambda$ is known as the spatial
frequency. Spatial frequency therefore has a sinusoidal 
dependency on the direction of arrival of the signal. 
In general, the steering vectors do not have this simple 
structure, but references to “spatial frequency” can still 
be found in the literature and, in general, beamformers 
are often referred to as spatial filters. There has been 
a lot of work on the design of fixed beamformers. As 
might be expected, there is much commonality with the 
design of fixed digital filters (see section 5). However,
the emphasis here is mostly on controlling the response 
to the wanted signal (the main lobe) and on rejecting 
unwanted signals (the side lobes). 
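
The response of a fixed beamformer on a uniform linear array, including its nulls, can be computed directly from the steering vectors. The sketch below makes several assumptions that are not in the text: the steering vector for spatial frequency u = d sin(θ)/λ is taken to have elements exp(2πinu), the array has N = 8 sensors, and the weights are uniform so that the main lobe points at broadside (u = 0).

```python
import cmath

def steering_vector(num_sensors, spatial_frequency):
    """Sampled complex exponential across the sensors of a uniform linear array."""
    return [cmath.exp(2j * cmath.pi * n * spatial_frequency)
            for n in range(num_sensors)]

def beamformer_response(weights, spatial_frequency):
    """Response w* a to a unit-amplitude signal with the given spatial frequency."""
    a = steering_vector(len(weights), spatial_frequency)
    return sum(w.conjugate() * an for w, an in zip(weights, a))

N = 8
weights = [complex(1.0 / N)] * N      # uniform weights, unit gain at broadside
main_lobe = abs(beamformer_response(weights, 0.0))
# For uniform weights the N - 1 nulls fall at spatial frequencies m / N.
null_depths = [abs(beamformer_response(weights, m / N)) for m in range(1, N)]
print(main_lobe, max(null_depths))
```

Changing the weights moves both the main lobe and the nulls, which is the freedom that fixed and adaptive beamformer designs exploit.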

Like an adaptive filter, an adaptive beamformer
calculates its coefficients (or “weights”) $w_i$ based on the
collected data. The mathematics is virtually identical
to that of an adaptive filter (i.e., the least-squares
solution to Xw = y). Here, too, techniques such as time
recursion (see section 8) are used to reduce the
computation required, but the data matrix X is no longer
Toeplitz, so there are no “fast” ABF algorithms. Another
difference is the expected signal content. In an adap- 
tive filter scenario one usually sees a significant (con- 
tiguous) range of frequencies present in the signal. In 
a beamforming application, one expects to see only a 
few discrete directions of arrival present. This leads to 
some algorithms that are specific to the beamforming 
application. 

In some situations, the spectrum of the data covari- 
ance matrix X* X will break into two distinct sets: a set 
of large eigenvalues and a set of small ones. The large 
eigenvalues, or, more specifically, the associated eigen- 
vectors, correspond to a subspace that contains infor- 
mation about the signals, while the small eigenvalues 
correspond to noise. This allows, for example, the noise 
to be rejected by means of an orthogonal projection of 
the data onto the “signal” subspace. This technique is 
in fact an example of principal-component analysis
[IV.17 §4] (PCA). In addition, by identifying those
steering vectors that are orthogonal to the “noise” sub- 
space, estimates of the direction of arrival of the sig- 
nals can be obtained. This is the basis of the MUSIC 
direction-of-arrival estimation algorithm. 

In the above example, the eigenvectors correspond- 
ing to the largest eigenvalues span a subspace that 
contains the signals. Note, however, that the eigenvec- 
tors do not correspond to steering vectors and fur- 
ther processing is required to separate the signals from 
one another. A simple approach is to set up a
constrained least-squares problem:
$w = \arg\min\{\|Xw - y\| : w^*c = 1\}$. Here, c is chosen so that the response
of the beamformer to a signal from a given direction
is unity; i.e., for an input signal with steering vector a,
x = s(n)a, we want w*x = s(n) so that w*a = 1 and
hence c = a. By trying all directions of arrival (or at 
least a finite set), the individual signals can be recov- 
ered. This technique is known as minimum-variance 
distortionless-response (MVDR) beamforming. Note that 
an MVDR beamformer is a mixture of an adaptive beam- 
former (the least-squares minimization) and a fixed 
one (the constraint). It is often advantageous to extend 
this concept and add further constraints to control 
other features of the beampattern, e.g., a derivative con- 
straint to control the width of the main lobe in the 
desired “look direction.” One issue with MVDR beam- 
forming is inaccuracies in the steering vectors either 
due to practical issues (e.g., calibration measurements) 
or due to the finite search set. If the steering vector is
not accurate, then the least-squares minimization can
adjust the beampattern to remove the desired signal. 
Various approaches to mitigating this effect have been 
investigated, the gradient constraint mentioned above 
being one of them. 

An alternative, non-search-based, technique for re- 
covering the signals is based on independent-compo- 
nent analysis (ICA). This is an extension of PCA. If we 
write X = SA, where S is a matrix of signal time series 
and A is a matrix of steering vectors, it can be shown 
that PCA results in a set of orthonormal basis vectors 
Y = SQ, where Q is a unitary matrix. If the signals S are 
not Gaussian, then S can be recovered from Y by find- 
ing a unitary transformation U that makes the columns 
of YU statistically independent (U = QP , where P is a 
permutation matrix). ICA is also known as blind signal 
separation since the matrix of steering vectors A is not 
known a priori. 

11 Broadband ABF and Multichannel Filtering

We have seen that temporal filtering and spatial fil- 
tering are useful practical operations. In some appli- 
cations it is necessary to apply both forms of filter- 
ing simultaneously, i.e., joint temporal and spatial fil- 
tering. Examples of this are sonar, space-time adaptive 
processing (STAP) radar, and multiple-input multiple- 
output communications. The rationale for using both 
forms of filtering is the need to control the frequency 
content of the signal as well as its direction of arrival. 
This technique is called broadband beamforming by 
people familiar with conventional beamforming. This 
is because beamforming relies on applying time delays 
to the received signals. In conventional beamforming, 
the signals have narrow bandwidths and time delays 
are equivalent to phase shifts. When the signals are 
broadband, interpolation filters are required to imple- 
ment time delays that are not integer multiples of the 
sample period (see section 1). Therefore, instead of
the beamformer output being $y(n) = \sum_{i} w_i x_i(n)$,
the products $w_i x_i(n)$ are replaced by filters: $y(n) =
\sum_{i} \sum_{j} w_{ij} x_i(n - j)$. The oldest example of
broadband ABF is found in passive sonar. Here, acoustic
underwater signals are detected using arrays of
hydrophones. Those familiar with time series filtering 
refer to joint temporal and spatial filtering as multi- 
channel filtering. A multichannel filter is the same as 
the (scalar-valued) filter outlined in section 7 but with 
vector-valued time series instead. 
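The filter-and-sum structure just described can be sketched in a few lines. This is an illustrative implementation only, assuming real-valued data, finite-length filters, and zero signal before the first sample; the function name `broadband_beamformer` and the example weights are hypothetical.

```python
import numpy as np

def broadband_beamformer(x, w):
    """y(n) = sum_i sum_j w[i, j] * x[i, n - j], with x[i, m] = 0 for m < 0.

    x: array of shape (channels, samples); w: array of shape (channels, taps).
    """
    n_channels, n_samples = x.shape
    y = np.zeros(n_samples, dtype=np.result_type(x, w))
    for i in range(n_channels):
        # FIR-filter channel i and keep the first n_samples outputs.
        y += np.convolve(x[i], w[i])[:n_samples]
    return y

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 1.0, 0.0, -1.0]])
w = np.array([[0.5, 0.5],      # two-tap averaging filter on channel 0
              [1.0, 0.0]])     # single weight (narrowband-style) on channel 1
y = broadband_beamformer(x, w)
print(y)
```

With a single tap per channel the function reduces to the narrowband weighted sum, which is the sense in which the multichannel filter generalizes the conventional beamformer.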

Another example of broadband ABF is STAP radar, 
which is effectively a combination of a phased-array 


radar and a pulse-Doppler radar. As we have seen, the 
former radar is able to separate the radar echoes based 
on their direction of arrival. The latter can separate 
the radar echoes based on the velocity of the reflector 
(i.e., aircraft) by exploiting the Doppler effect. The radar 
echo from a moving aircraft will be Doppler shifted 
by virtue of its motion. This frequency shift can be 
detected by sending out multiple radar pulses and pro- 
cessing the resulting echoes as a time series. A fre- 
quency filter can then separate the echoes based on the 
velocity of the aircraft. One advantage of pulse-Doppler 
radar is that the echo from a low-flying aircraft can be 
separated from that due to reflectors on the ground 
since either the latter will be stationary or it will be mov- 
ing very slowly with respect to the aircraft. By combin- 
ing ABF and pulse-Doppler processing, STAP radar can 
separate the echoes based on both direction of arrival 
and velocity. In effect, it creates a two-dimensional fil- 
ter in a space with direction of arrival on one axis and 
velocity on the other. 
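The pulse-Doppler half of this idea can be illustrated with a toy simulation. The sketch below covers only the Doppler (velocity) axis, not the full space-time filter, and assumes an idealized noise-free echo, a single range cell, and a Doppler shift that falls exactly on an FFT bin; the pulse repetition frequency and Doppler value are made-up numbers.

```python
import numpy as np

n_pulses = 64
prf = 1000.0                     # pulse repetition frequency in Hz (assumed)
f_doppler = 250.0                # Doppler shift of the moving target in Hz (assumed)
m = np.arange(n_pulses)

clutter = 10.0 * np.ones(n_pulses)                       # stationary ground return
target = 1.0 * np.exp(2j * np.pi * f_doppler * m / prf)  # moving-target return
echoes = clutter + target

# Process the pulse-to-pulse echoes as a time series.
spectrum = np.fft.fft(echoes)
freqs = np.fft.fftfreq(n_pulses, d=1.0 / prf)

# Null the zero-Doppler bin (a crude clutter filter), then pick the strongest bin.
spectrum_no_clutter = spectrum.copy()
spectrum_no_clutter[0] = 0.0
detected_doppler = freqs[np.argmax(np.abs(spectrum_no_clutter))]
print(detected_doppler)
```

Although the ground return is ten times stronger than the target echo, it occupies a different Doppler bin and is removed without affecting the target.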

The standard approach to processing “space-time” 
data is to use the FFT to transform the data into a 
time-frequency space and then process each frequency 
slice separately using standard spatial algorithms. This 
has the advantage of computational efficiency as there 
is a “dimensionality curse” associated with moving 
from one dimension to two. The use of the FFT, which
is computationally efficient, being $O(N \log N)$, transforms
the two-dimensional problem to multiple one-dimensional
ones. However, strictly speaking, the one-dimensional
problems are not independent, and treating them as such
incurs an approximation error. Fortunately, this error
tends to reduce as the order of the
FFT increases. An alternative approach to overcoming 
the dimensionality curse is to attempt to exploit the 
structure of the problem (see the discussion of fast 
adaptive filters in section 8). In the case of ABF, the 
data matrix is “block Toeplitz” rather than Toeplitz, 
but a similar computational reduction can nevertheless 
be obtained by exploiting related multichannel forward 
and backward linear prediction problems. 

A more recent approach to this problem involves a 
generalization of matrix algorithms over the complex 
field to related algorithms over the ring of polynomials. 
In this context, a polynomial matrix (PM) $A(z) \in \mathbb{C}^{n \times m}$
is simply an $n \times m$ matrix whose elements are polynomials
over $\mathbb{C}$. $A(z)$ represents an $n \times m$ FIR filter
and corresponds to its z-transform, except that it is
treated as a polynomial in the indeterminate variable
$z^{-1}$ rather than a function of $z$ to be evaluated at a
point in the complex plane. The paraconjugate of a PM
$A(z) \in \mathbb{C}^{n \times m}$ is denoted by $\tilde{A}(z) = A_*^{T}(1/z)$, where
the asterisk denotes complex conjugation of the polynomial
coefficients and the superscript $T$ denotes transposition.
A parasymmetric PM $M(z) \in \mathbb{C}^{n \times n}$
is one that satisfies $\tilde{M}(z) = M(z)$, while a paraunitary
PM $H(z) \in \mathbb{C}^{n \times n}$ (which corresponds to a multichannel
allpass filter) satisfies $H(z)\tilde{H}(z) = \tilde{H}(z)H(z) = I$. As
usual, $I$ denotes the unit matrix.
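The paraunitary condition can be verified directly on the filter coefficients. The sketch below assumes a PM is stored as an array whose entry `A[k]` is the coefficient matrix of $z^{-k}$, and that the paraconjugate conjugates and transposes each coefficient while reversing the sign of the delay; the helper name `para_product` and the example filter are hypothetical. Only the product $H(z)\tilde{H}(z)$ is checked.

```python
import numpy as np

def para_product(A):
    """Coefficients of A(z) times its paraconjugate, indexed by lag -K..K.

    A has shape (K + 1, n, m); A[k] multiplies z**(-k).
    Returns an array of shape (2K + 1, n, n); entry K + d multiplies z**(-d).
    """
    K = A.shape[0] - 1
    n = A.shape[1]
    out = np.zeros((2 * K + 1, n, n), dtype=complex)
    for k in range(K + 1):
        for l in range(K + 1):
            # A[k] z^{-k} times (A[l] conjugate-transposed) z^{+l}.
            out[K + k - l] += A[k] @ A[l].conj().T
    return out

# Hypothetical example: a rotation followed by a one-sample delay in channel 2,
# i.e. H(z) = diag(1, z^{-1}) R.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
H = np.zeros((2, 2, 2))
H[0] = np.diag([1.0, 0.0]) @ R   # coefficient of z^0
H[1] = np.diag([0.0, 1.0]) @ R   # coefficient of z^{-1}

P = para_product(H)              # expect identity at lag 0 and zero elsewhere
print(np.round(P.real, 12))
```

A filter of this delay-and-rotate form is lossless on every channel combination, which is why it serves as a simple example of a multichannel allpass filter.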

Algorithms have recently been developed for approximating
the eigenvalue decomposition (EVD), the singular-value
decomposition [II.32] (SVD), and the QR
decomposition of polynomial matrices. For example,
the EVD of a parasymmetric PM $M(z) \in \mathbb{C}^{n \times n}$ (this is
a polynomial eigenvalue decomposition (PEVD)) can be
computed as $H(z) M(z) \tilde{H}(z) \approx D(z)$, where $H(z) \in
\mathbb{C}^{n \times n}$ is paraunitary and $D(z) \in \mathbb{C}^{n \times n}$ is diagonal. The
exact decomposition does not exist for $H(z) \in \mathbb{C}^{n \times n}$,
but the PM solution can lie arbitrarily close to a cor- 
responding matrix of continuous functions for which 
it does exist. The PEVD defined above is clearly a gen- 
eralization of the conventional matrix EVD to which it 
reduces for a PM of order zero. Just as the matrix EVD 
(or SVD) plays a fundamental role in the separation of 
signal and noise subspaces for narrowband ABF, so the 
PEVD (or polynomial SVD) may be used in broadband 
ABF. However, the range of potential applications for 
PM decomposition techniques in general is very much 
wider. 

12 Bayesian Signal Processing/Parameter Estimation

There are two main approaches to developing signal- 
processing algorithms: deterministic and statistical. 
The algorithms for adaptive filtering and ABF given 
above are deterministic, in the sense that we obtain a 
set of parameters (e.g., filter coefficients or beamformer
weights) by solving a least-squares minimization prob- 
lem. These algorithms can, however, be seen as spe- 
cial cases of more general parameter-estimation algo- 
rithms. The filter coefficients and beamformer weights 
can be seen as parameters that are to be estimated 
from the received data. Furthermore, when these esti- 
mated parameters are used in a filter or beamformer, 
one obtains an estimate of some signal of interest. For 
example, the MVDR beamformer recovers an estimate 
of the signal at a given location. 

In general, parameter estimation is seen as a statis- 
tical estimation problem. The least-squares algorithms 
result when one has a linear problem with Gaussian 


noise. The most popular signal-processing approach 
to parameter estimation is based on Bayes’s theorem
[V.11]. If a variable $x$ is dependent on a parameter $\theta$,
then $P(\theta \mid x) \propto P(x \mid \theta) P(\theta)$. The pdf $P(x \mid \theta)$ is
known, since it comes from the physics of the situation,
and it encodes the dependency that $x$ has on $\theta$. Thus,
on receipt of the measurement $x$, the a priori pdf $P(\theta)$
can be updated to the a posteriori pdf $P(\theta \mid x)$ merely
by multiplying the former by $P(x \mid \theta)$ (known as the
likelihood function). However, such a product will gener- 
ally not have a closed-form expression. An exception to 
this rule is linear problems where the noise is Gaussian. 
In this case all variables are Gaussian and closed-form 
expressions for the means and variances are easy to 
find. In general, however, the pdfs have to be represented
in some computationally tractable form. One option 
is to model an arbitrary pdf as the sum of Gaussians 
(known as a Gaussian mixture). Another is to model 
the pdf by a discrete probability distribution. The lat- 
ter approach is known as particle filtering. These mod- 
eling techniques are powerful as they allow nonlinear 
and non-Gaussian problems to be tackled. However, as 
with any modeling problem, there are issues to do with 
parameter choice, e.g., the number of Gaussians in the 
mixture, and the number of bins in the discrete-valued 
distribution. A particularly important issue is loss of 
resolution. Bayes’s theorem requires us to multiply the
a priori pdf by the likelihood function. If the modeled 
a priori pdf has a zero value for some value of the vari- 
able, then the a posteriori pdf will also have a zero 
value for this value. Techniques are therefore needed to 
ensure that the modeled pdf has a zero value only when 
this is justified by the data. Note that these modeling 
approaches tend to require more computation than an 
algorithm based on closed-form expressions. 
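The discrete-distribution approach, and the loss-of-resolution issue, can be illustrated with a one-parameter grid. This is a minimal sketch, assuming a scalar parameter observed in additive Gaussian noise; the grid, the measurement value `x_obs`, and the noise level `sigma` are made-up illustrative choices, and a practical particle filter would use adaptive rather than fixed sample points.

```python
import numpy as np

def grid_posterior(prior, likelihood):
    """Bayes update on a grid: posterior is proportional to likelihood * prior."""
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()

theta = np.linspace(-5.0, 5.0, 201)          # grid of parameter values
x_obs, sigma = 1.5, 1.0                      # hypothetical measurement and noise level

# Likelihood P(x | theta) for x = theta + Gaussian noise (constant factors cancel).
likelihood = np.exp(-0.5 * ((x_obs - theta) / sigma) ** 2)

flat_prior = np.full(theta.size, 1.0 / theta.size)
posterior = grid_posterior(flat_prior, likelihood)
map_estimate = theta[np.argmax(posterior)]

# A prior that is exactly zero for theta > 0 can never recover mass there,
# even though the measurement points to theta near 1.5.
truncated_prior = np.where(theta <= 0.0, 1.0, 0.0)
truncated_prior /= truncated_prior.sum()
truncated_posterior = grid_posterior(truncated_prior, likelihood)
print(map_estimate, truncated_posterior[theta > 0].sum())
```

The second update shows why a modeled prior should be zero only where the data justify it: no measurement can restore probability to a region the prior has excluded.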

An important signal-processing area that is related to 
parameter estimation is detection theory. Here, we wish 
to decide if an event has taken place based on an obser- 
vation. An example of this is the detection of aircraft 
using radar: the signal received by the radar could con- 
sist of the echo from the aircraft plus noise or it could 
just be noise. What is required is some optimal test that 
can confirm the presence of the aircraft. Typically, the 
radar signal is compared with a threshold. If the signal 
exceeds the threshold, then the aircraft is said to have 
been detected. Clearly, there could be false detection 
(false positives) and missed detection (false negatives). 
Detection theory uses statistical arguments to calculate 
the threshold with given properties such as a constant 
false alarm rate. 
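As a simple illustration of how a threshold follows from a required false alarm rate, the sketch below assumes zero-mean Gaussian noise of known standard deviation and a single real-valued sample per decision. Practical constant-false-alarm-rate detectors estimate the noise level from the data, which is not attempted here; the function name `cfar_threshold` and the numerical values are hypothetical.

```python
from statistics import NormalDist
import random

def cfar_threshold(noise_sigma, p_false_alarm):
    """Threshold t with P(noise > t) = p_false_alarm for zero-mean Gaussian noise."""
    return NormalDist(mu=0.0, sigma=noise_sigma).inv_cdf(1.0 - p_false_alarm)

threshold = cfar_threshold(noise_sigma=2.0, p_false_alarm=0.01)

# Monte Carlo check on noise-only data: count how often noise alone exceeds the threshold.
rng = random.Random(1)
trials = 200_000
false_alarms = sum(rng.gauss(0.0, 2.0) > threshold for _ in range(trials))
empirical_rate = false_alarms / trials
print(threshold, empirical_rate)
```

Lowering the allowed false alarm probability raises the threshold, which in turn increases the chance of a missed detection for a weak echo.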



IV. 36. Information Theory 


13 Tracking 

As mentioned in the previous section, adaptive filters 
and beamformers can be seen as devices for estimating 
unknown parameters. In this case, however, the param- 
eters are constants. If the unknown parameters are time 
varying, the problem is one of tracking. 

Since the estimation of N parameters requires at 
least N pieces of data, it is not possible to estimate 
more than one arbitrary time-varying parameter from 
a single time series. It is therefore conventional to 
assume that the parameters evolve in a known manner;
for example, $\theta(n) = F(\theta(n - 1) \mid \Phi)$, where $\Phi$ are
(known) parameters of the function F. Given this model 
for the time evolution of the parameter, it is then pos- 
sible to formulate a parameter-estimation algorithm. 
As with adaptive filtering and beamforming, one can 
take a deterministic (i.e., least-squares) approach or a 
Bayesian approach. In the former case one ends up with 
the well-known Kalman filter, which is optimum for lin- 
ear systems and Gaussian noise. In the latter case one 
ends up with a more powerful algorithm but with the 
computational issues mentioned above. 
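A scalar Kalman filter shows the predict/update structure in its simplest form. The sketch below assumes a linear evolution model $\theta(n) = a\,\theta(n-1) + u(n)$ and a direct noisy measurement $x(n) = \theta(n) + v(n)$; the process-noise term $u$, the parameter names `a`, `q`, `r`, and the simulated data are illustrative assumptions rather than anything specified in the text.

```python
import numpy as np

def kalman_filter(x, a, q, r, theta0=0.0, p0=1.0):
    """Scalar Kalman filter for theta(n) = a*theta(n-1) + u(n), x(n) = theta(n) + v(n).

    q and r are the variances of the process noise u and measurement noise v.
    Returns the filtered estimates of theta(n).
    """
    theta_hat, p = theta0, p0
    estimates = np.empty(len(x))
    for n, x_n in enumerate(x):
        # Predict using the known evolution model.
        theta_pred = a * theta_hat
        p_pred = a * a * p + q
        # Update with the new measurement.
        gain = p_pred / (p_pred + r)
        theta_hat = theta_pred + gain * (x_n - theta_pred)
        p = (1.0 - gain) * p_pred
        estimates[n] = theta_hat
    return estimates

# Simulate a slowly varying parameter observed in noise.
rng = np.random.default_rng(0)
a, q, r = 0.99, 0.01, 1.0
theta = np.zeros(500)
for n in range(1, theta.size):
    theta[n] = a * theta[n - 1] + rng.normal(scale=np.sqrt(q))
x = theta + rng.normal(scale=np.sqrt(r), size=theta.size)

estimates = kalman_filter(x, a, q, r)
mse_raw = np.mean((x - theta) ** 2)
mse_filtered = np.mean((estimates - theta) ** 2)
print(mse_raw, mse_filtered)
```

Because the model constrains how quickly the parameter can change, the filtered estimate tracks it with a much smaller error than the raw measurements.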
1 “A Mathematical Theory of Communication”

Rarely does a scientific discipline owe its existence to 
a single paper. Authored in 1948 by Claude Shannon 
(1916-2001), “A mathematical theory of communica- 
tion” is the Magna Carta of the information age and 
information theory’s big bang. Using the tools of prob- 
ability theory, it formulates the central optimization 
problems in data compression and transmission, and 
finds the best achievable performance in terms of the 
statistical description of the information sources and 
communication channels by way of information mea- 
sures such as entropy and mutual information. After 
a glimpse at the state of the art as it was in 1948, we 
elaborate on the scope of Shannon's masterpiece in the 
rest of this section. 

1.1 Communication Theory before the Big Bang 

Motivated by the improvement in telegraphy trans- 
mission rate that could be achieved by replacing the 
Morse code by an optimum code, both Nyquist (1924) 
and Hartley (1928) recognized the need for a measure 
of information devoid of “psychological factors” and 
put forward the logarithm of the number of choices 
as a plausible alternative. Küpfmüller (1924), Nyquist
(1928), and Kotel’nikov (1933) studied the maximum 
telegraph signaling speed sustainable by band-limited 
linear systems at a time when Fourier analysis of sig- 
nals was already a standard tool in communication 
engineering. Inspired by the telegraph studies, Hart- 
ley put forward the notion that the “capacity of a 
system to carry information” is proportional to the 
time-bandwidth product, a notion further elaborated
by Gabor (1946). However, those authors failed to grap- 
ple with the random nature of both noise and the 
information-carrying signals. At the same time, the idea 
of using mathematics to design linear filters for com- 
batting additive noise optimally had been put to use 
by Kolmogorov (1941) and Wiener (1942) for minimum 
mean-square error estimation and by North (1943) for 
the detection of radar pulses. 

Communication systems such as FM and PCM in the 
1930s and spread spectrum in the 1940s had opened 
up the practical possibility of using transmission band- 
width as a design parameter that could be traded off for 
reproduction fidelity and robustness against noise.

1.2 The Medium

In the title of Shannon's paper, “communication” refers 
to 

• communication across space, namely, information-transmission
systems like radio and television
broadcasting, telephone wires, coaxial cables,
optical fibers, microwave links, and wireless tele- 
phony; and 

• communication across time, namely, information- 
storage systems, which typically employ magnetic 
(tape and disks), optical (CD, DVD, and BD), and 
semiconductor (volatile and flash) media. 

Although, at some level, all transmission and storage 
media involve physical continuously variable analog 
quantities, it is useful to model certain media such 
as optical disks, computer memory, or the Internet as 
digital media that transmit or record digital signals 
(zeros/ones or data packets) with a certain reliability 
level. 

1.3 The Message 

The message to be stored or transmitted may be 

• analog (such as sensor readings, audio, images, 
video, or, in general, any message intended for the 
human ear/eye) or 

• digital (such as text, software, or data files). 

An important difference between analog and digital 
messages is that, since noise is unavoidable in both 
sensing and transmission, it is impossible to recon- 
struct exactly the original analog message from the 
recorded or transmitted information. Lossy reproduc- 
tion of analog messages is therefore inevitable. Even 
when, as is increasingly the case, sensors of analog 
signals output quantized information, it is often con- 
ceptually advantageous to treat those signals as analog 
messages. 

1.4 The Coat of Arms 

Shannon's theory is a paragon of e pluribus unum. 
Indeed, despite the myriad and diversity of commu- 
nication systems encompassed by information theory, 
its key ideas and principles are all embracing and are 
applicable to any of them. 

Reproduced from Shannon’s paper, figure 1 encompasses
most cases (see section 9) of communication
across time or space between one sender and one
destination.

Figure 1 A schematic of a general communication system
(this is figure 1 in “A mathematical theory of communication”).

The purpose of the encoder (or transmitter, in figure 1)
is to translate the message into a signal suitable
for the transmission or storage medium. Conversely,
the decoder (or receiver, in figure 1) converts
the received signal into an exact or approximate replica 
of the original message. 

The communication medium that connects the trans- 
mitter to the receiver is referred to as the channel. 
Several notable examples, classified according to the 
various combinations of the nature of message and 
medium, are listed below. 

Analog message, analog medium. Radio broadcasting 
and long-distance telephony were the primary appli- 
cations of the first analog modulation systems, such 
as AM, SSB, and FM, developed in the early twenti- 
eth century. With messages intended for the ear/eye 
and the radio frequency spectrum as the medium, 
all current systems for radio and television (wireless) 
broadcasting are also examples of this case. However, 
in most modern systems (such as DAB and HDTV) the 
transmitter and receiver perform an internal inter- 
mediate conversion to digital, for reasons that are 
discussed in section 4. 

Analog message, digital medium. This classification 
includes the audio compact disc, MP3, DVD, and 
Voice over Internet Protocol (VoIP). So “digital audio” 
or “digital video” refers to the medium rather than 
the message. 

Digital message, analog medium. The earliest examples
of optical and electrical systems for the transmission
of digital information were the wired telegraph
systems invented in the first half of the nineteenth
century, while the second half of the century
saw the advent of Marconi’s wireless telegraph.
Other examples developed prior to 1948 include teletype,
fax, and spread spectrum. The last four decades
of the twentieth century saw the development of
increasingly fast general-purpose modems to transmit
bit streams through analog media such as the
voice-band telephone channel and radio frequency
bands. Currently, modems that use optical, DSL, and
CATV media to access the Internet are ubiquitous.

Digital message, digital medium. This classification
includes data storage in an optical disk or flash
memory.

Figure 2 A data-compression system.

Figure 3 A data-transmission system.

Whether one is dealing with messages, channel 
inputs, or channel outputs, Shannon recognized that 
it is mathematically advantageous to view continuous- 
time analog signals as living in a finite-dimensional vec- 
tor space. The simplest example is a real-valued signal 
of bandwidth B and (approximate) duration T, which 
can be viewed as a point in the Euclidean space of 
dimension $2BT$. To that end, Shannon gave a particularly
crisp version of the sampling theorem, precursors
of which had been described by E. Whittaker (1915),
J. Whittaker (1929), and Kotel’nikov (1933), who discovered
how to interpolate losslessly the sampled values
of band-limited functions.
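For reference, the interpolation formula behind this statement can be written out. The form below is the standard one, stated under the assumption that the signal $x(t)$ is band-limited to $B$ Hz and sampled every $1/(2B)$ seconds; the symbols $x$, $t$, and $k$ are notation introduced here for illustration.

```latex
x(t) = \sum_{k=-\infty}^{\infty} x\!\left(\frac{k}{2B}\right)
       \frac{\sin\bigl(\pi(2Bt - k)\bigr)}{\pi(2Bt - k)} .
```

Since the samples are spaced $1/(2B)$ apart, a signal of approximate duration $T$ is described by roughly $2B \times T = 2BT$ sample values, which is the dimension quoted above.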

Three special cases of figure 1, dealt with in each of 
the next three sections, merit particular attention. 

2 Lossless Compression 

Although communication across time or space is al- 
ways subject to errors or failures, it is useful to con- 
sider the idealized special case of figure 1 shown in 
figure 2, in which there is no channel and the input to 
the decoder is a digital sequence equal to the encoder 
output. This setup, also known as source coding, mod- 
els the paradigm of compression in which the encoder 
acts as the compressor and the decoder acts as the 
decompressor. The task of the encoder is to remove 
redundancy from the message, which can be recov- 
ered exactly or approximately at the decoder from the 
compressed data itself. 

Lossless, or reversible, conversion is possible only if 
the message is digital. Morse, Huffman, TIFF, and PDF 
are examples of lossless compression systems, where 


message redundancy (unequal likelihoods of the vari- 
ous choices) is exploited to compact the data by assign- 
ing shorter binary strings to more likely messages. As 
we discuss more precisely in section 6, the goal is to 
obtain a compression/decompression algorithm that 
generates, on average, the shortest encoded version of 
the message. 
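The principle of assigning shorter binary strings to more likely messages can be made concrete with a Huffman code. The following is a minimal sketch, assuming the symbol probabilities are known in advance; the function name `huffman_code` and the four-symbol source are hypothetical.

```python
import heapq
from math import log2

def huffman_code(probabilities):
    """Return a dict symbol -> binary string for a dict symbol -> probability."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if counter == 1:
        return {sym: "0" for sym in probabilities}
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # the two least likely subtrees
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(probs)
average_length = sum(p * len(code[s]) for s, p in probs.items())
entropy = sum(p * log2(1.0 / p) for p in probs.values())
print(code, average_length, entropy)
```

For this source, whose probabilities are all powers of 1/2, the average codeword length equals the entropy of the source; in general it exceeds the entropy by less than one bit per symbol.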

If the source is stationary, universal data compres- 
sors exploit its redundancy without prior knowledge 
of its probabilistic law. Found in every computer oper- 
ating system (e.g., ZIP), the most widely used universal 
data compressors were developed by Lempel and Ziv 
between 1976 and 1978. 
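The behavior of such a compressor is easy to observe with Python's standard `zlib` module, whose DEFLATE format combines Lempel-Ziv (LZ77-style) parsing with Huffman coding. The sketch below is only a demonstration on two made-up inputs, one highly repetitive and one random; it does not reproduce the original Lempel-Ziv algorithms themselves.

```python
import os
import zlib

redundant = b"abracadabra " * 1000         # highly repetitive message
random_bytes = os.urandom(len(redundant))  # essentially incompressible message

compressed_redundant = zlib.compress(redundant, 9)
compressed_random = zlib.compress(random_bytes, 9)

ratio_redundant = len(compressed_redundant) / len(redundant)
ratio_random = len(compressed_random) / len(random_bytes)
restored = zlib.decompress(compressed_redundant)
print(ratio_redundant, ratio_random, restored == redundant)
```

The compressor is given no statistical model of either input, yet it shrinks the redundant message dramatically and leaves the random one essentially unchanged in size, while decompression recovers the original exactly.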

3 Lossy Compression 

Depending on the nature of the message, we can dis- 
tinguish two types of lossy compression. 

Analog-to-digital. Early examples of analog-to-digital 
coding (such as the vocoder and pulse-code modula- 
tion (PCM)) were developed in the 1930s. The vocoder 
was the precursor to the speech encoders used in cel- 
lular telephony and in VoIP, while PCM remains in 
widespread use in telephony and in the audio com- 
pact disc. The conceptually simplest analog-to-digital 
compressor, used in PCM, is the scalar quantizer, 
which partitions the real line into $2^k$ segments, each of
which is assigned a unique $k$-bit label. JPEG [VII.7 §5]
and MPEG are contemporary examples of lossy compressors
for images and audio/video, respectively.
Even if the inputs to those algorithms are finite-precision
numbers, their signal processing treats
them as real numbers.

Digital-to-digital. Even in the case of digital messages, 
one may be willing to tolerate a certain loss of infor- 
mation for the sake of economy of transmission 
time or storage space (e.g., when emailing a digital 
image or when transmitting the analog-to-digitally 
compressed version of a sensor reading). 
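The scalar quantizer mentioned under analog-to-digital compression above can be sketched as follows. This is a minimal illustration, assuming equal-width segments on a nominal range, with the two end segments absorbing everything outside it, and reproduction by segment midpoints; the function name `quantize` and the range $[-1, 1]$ are hypothetical choices.

```python
import numpy as np

def quantize(x, k, lo=-1.0, hi=1.0):
    """Uniform k-bit scalar quantizer on [lo, hi]: returns (labels, reconstruction)."""
    levels = 2 ** k
    step = (hi - lo) / levels
    # Integer label of the segment containing each sample (clipped to the range).
    labels = np.clip(np.floor((x - lo) / step), 0, levels - 1).astype(int)
    # Reproduce each sample by the midpoint of its segment.
    reconstruction = lo + (labels + 0.5) * step
    return labels, reconstruction

x = np.array([-1.0, -0.3, 0.0, 0.26, 0.99])
labels, x_hat = quantize(x, k=3)
max_error = np.max(np.abs(x - x_hat))
print(labels, x_hat, max_error)
```

Each label fits in $k = 3$ bits, and within the nominal range the reproduction error never exceeds half a segment width, which is the sense in which the compression is lossy but controlled.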

4 Data Transmission 

Figure 3 depicts the paradigm, also known as channel 
coding, in which the message input to the encoder is 
incompressible or nonredundant, in the sense that it
is chosen equiprobably from a finite set of alternatives
(such as fair coin flips or “pure” bits, i.e., independent
binary digits equally likely to be 0 or 1). The task of 
the encoder is to add redundancy to the message in 
order to protect it from channel noise and facilitate its 
recovery by the decoder from the noisy channel out- 
put. In general, this is done by assigning codewords to 
each possible message, which are different enough to 
be distinguishable at the decoder as long as the noise 
is not too severe. For example, in the case of a digi- 
tal medium the encoder may use an error-correcting 
code that appends redundant bits to the binary mes- 
sage string. In the case of an analog medium such as a 
telephone channel, the codewords are continuous-time 
waveforms. Based on the statistical knowledge of the 
channel and the codebook (assignment of messages to 
codewords) used by the encoder, the decoder makes an 
intelligent guess about the transmitted message. 

Remarkably, Shannon predicted the performance of 
the best possible codes at a time when very few error- 
correcting codes were known. Hamming, a coworker 
at Bell Laboratories, had just invented his namesake 
code (see applied combinatorics and graph theory 
[IV.37 §4]) that appends three parity-check bits to every
block of four information bits in a way that makes all 
sixteen codewords differ from each other in at least 
three positions. Therefore, the decoder can correct any 
single error affecting every encoded block of seven 
bits. 
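The properties of the Hamming code just described can be checked by brute force. The sketch below uses one common systematic choice of generator matrix `G` and parity-check matrix `H`; other equivalent conventions exist, and the function names `encode` and `decode` are hypothetical.

```python
import itertools
import numpy as np

# Systematic generator and parity-check matrices (one common convention).
G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])

def encode(message_bits):
    """Map 4 information bits to a 7-bit codeword (3 parity-check bits appended)."""
    return np.array(message_bits) @ G % 2

def decode(received):
    """Correct at most one bit error and return the 4 information bits."""
    received = np.array(received).copy()
    syndrome = H @ received % 2
    if syndrome.any():
        # The syndrome equals the column of H at the error position.
        position = int(np.flatnonzero((H.T == syndrome).all(axis=1))[0])
        received[position] ^= 1
    return received[:4]

codewords = [encode(m) for m in itertools.product([0, 1], repeat=4)]
min_distance = min(int(np.sum(a != b))
                   for a, b in itertools.combinations(codewords, 2))
print(min_distance)
```

Because every pair of codewords differs in at least three positions, a received block with a single flipped bit is still closer to the transmitted codeword than to any other, which is what the syndrome decoder exploits.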

5 Compression/Transmission 

Figure 4 illustrates another special case of figure 1 in 
which the transmitter consists of the source encoder, 
or compressor, followed by the channel encoder, and 
the receiver consists of the channel decoder followed 
by the source decoder, or decompressor. This archi- 
tecture capitalizes on the solutions found in the spe- 
cial cases in sections 2, 3, and 4. To that end, in 
the scheme shown in figure 4 the interfaces between 
source and channel encoders, and between channel 
and source decoders, are digital regardless of the mes- 
sage or medium. Inspired by the teachings of informa- 
tion theory, in which the bit emerges as the univer- 
sal currency, the modular design in figure 4 is preva- 
lent in most modern systems for the transmission 
of analog messages through either digital or analog 
media. It allows the source encoding/decoding system 
to be tailored particularly to the message, disregarding 
the nature of the channel. Analogously, it allows the
channel encoding/decoding system to be focused on
the reliable transmission of nonredundant bits by combatting
the channel noise, disregarding the nature of
the original message. In this setup, the source encoder
removes redundancy from the message in a way that
is tuned to the information source, while the channel
encoder adds redundancy in a way that is tuned to the
channel. Under widely applicable sufficient conditions,
such modular design is asymptotically optimal (in the
sense of section 6) in the limit in which the length of
the message goes to infinity and when both source and
channel operate in the ergodic regime.

Figure 4 A separate compression/transmission system.

6 Performance Measures 

The basic performance measures depend on the type 
of system under consideration. 

Lossless compression. The compression rate (in bits 
per symbol) is the ratio of encoded bits to the number 
of symbols in the digital message. 

Lossy compression. The quality of reproduction is 
measured by a distortion function of the original and 
reproduced signals, e.g., in the case of analog signals, 
the mean-square error (energy of the difference sig- 
nal), and in the case of binary messages, the bit error 
rate. The rate (in bits per second, or per symbol) of 
a lossy compression system is the ratio of encoded 
bits to the duration of the message. 

Data transmission. For a given channel and assuming 
that the message is incompressible, the performance 
of a data-transmission system is determined by the
rate and the error probability. The rate (in bits per
second, or per symbol) is the ratio of message duration
to the time it takes to send it through the channel.
Depending on the application, the reliability of
the transmission is measured by the bit error rate or
by the probability that the entire message is decoded 
correctly. 

Joint compression/transmission. In the general case, 
the rate is measured as in the data-transmission case,
with reliability measured either by a distortion mea- 
sure or by the probability that the entire message is 
decoded correctly, depending on the nature of the 
message and the application. 

7 Fundamental Limits 

Instead of delving into the analysis and design of 
specific transmission systems or codes, the essence 
of Shannon’s mathematical theory is to explore the 
best performance that an optimum encoder/decoder 
system (simply referred to as the code) can achieve. 
Information theory obtains fundamental limits with- 
out actually deriving the optimal codes, which are 
often unknown. For the three problems formulated by 
Shannon, the fundamental limits are as follows. 

Lossless compression: the minimum achievable com- 
pression rate. 

Lossy compression: the rate-distortion function, which 
is the minimum compression rate achievable as a 
function of the allowed average level of distortion.

Data transmission: the channel capacity, defined as
the maximum transmission rate compatible with van- 
ishing error probability. Capacity is often given in 
terms of channel parameters such as transmitted 
power. Before Shannon’s paper, the common wisdom 
was that vanishing error probability would necessar- 
ily entail vanishing rate of information transmission. 

The fundamental limits are very useful to the engi- 
neer because they offer a comparison of the perfor- 
mance of any given system with that ultimately achiev- 
able. Although, in Shannon's formulation, the growth of 
computational complexity as a function of the message 
size is not constrained in any way, decades of research 
on the constructive side of compression and transmis- 
sion have yielded algorithms that can approach the 
Shannon limits with linear complexity. Often, informa- 
tion theory leads to valuable engineering conclusions 
that reveal that simple (or modular) solutions may per- 
form at or near optimum levels. For example, as we 
mentioned, there is no loss in achievable performance 
if one follows the principle of separate compression/ 
transmission depicted in figure 4. Fundamental limits 
can be, and often are, used to sidestep the need for 


cumbersome analysis in order to debunk performance 
claims made for a given system. 

The fundamental limits turn out to depend crucially 
on the duration of the message. Since Shannon’s 1948 
paper, information theory has focused primarily, but 
not exclusively, on the fundamental limits in the regime 
of asymptotically long messages. By their very nature, 
the fundamental limits for a given source or channel 
are not technology dependent, and they do not become 
obsolete with improvements in hardware/software. On 
the contrary, technological advances pave the way 
for the design of coding systems that approach the 
ideal fundamental limits increasingly closely. Although 
the optimum compression and transmission systems 
are usually unknown, the methods of proof of the 
fundamental limits often suggest features that near- 
optimum practical communication systems ought to 
have, thereby offering design guidelines to approach 
the fundamental limits. Shannon’s original proof of his 
channel coding theorem was one of the first nontrivial 
instances of the probabilistic method, now widely used 
in discrete mathematics; to show the existence of an 
object that satisfies a certain property it is enough to 
find a probability distribution on the set of all objects 
such that those satisfying the property have nonzero 
probability. In his proof, Shannon computed an upper 
bound to the error probability averaged with respect 
to an adequately chosen distribution on the set of all 
codes; at least one code must have error probability not 
exceeding the bound. 

8 Information Measures 

The fundamental performance limits turn out to be 
given in terms of so-called information measures, 
which have units such as bits. In this section we list 
the three most important information measures. 

Entropy: a measure of the randomness of a discrete
distribution $P_X$ defined on a finite or countably infinite
alphabet $\mathcal{A}$, defined as

$$H(X) = \sum_{a \in \mathcal{A}} P_X(a) \log \frac{1}{P_X(a)}.$$

In the limit as $n \to \infty$, a stationary ergodic random
source $(X_1, \ldots, X_n)$ can be losslessly encoded at its
entropy rate

$$\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n),$$

a limit that is easy to compute in the case of Markov
chains. In the simplest case, asymptotically, $n$ flips
of a coin with bias $p$ can be compressed losslessly at
any rate exceeding $h(p)$ bits per coin flip with

$$h(p) = p \log \frac{1}{p} + (1 - p) \log\left(\frac{1}{1 - p}\right),$$

which is the entropy of the biased coin source. The 
ubiquitous linear-time Lempel-Ziv universal data- 
compression algorithms are able to achieve, asymp- 
totically, the entropy rate of ergodic stationary 
sources. Therefore, at least in the long run, univer- 
sality incurs no penalty. 

Relative entropy: a measure of the dissimilarity be- 
tween two distributions P and Q defined on the same 
measurable space $(\mathcal{A}, \mathcal{F})$, defined as

$$D(P\|Q) = \int \log\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right) \mathrm{d}P.$$

Relative entropy plays a central role not only in infor- 
mation theory but also in the analysis of the ability to 
discriminate between data models, and in particular 
in large-deviation results, which explore the exponen- 
tial decrease (in the number of observations) of the 
probability of very unlikely events. Specifically, if n 
independent data samples are generated with prob- 
ability distribution Q, the probability that they will 
appear to be generated from a distribution in some 
class $\mathcal{P}$ behaves as

$$\exp\left(-n \inf_{P \in \mathcal{P}} D(P\|Q)\right).$$

Relative entropy was introduced by Kullback and 
Leibler in 1951 with the primary goal of extending 
Shannon’s measure of information to nondiscrete 
cases. 

Mutual information: a measure of the dependence 
between two (not necessarily discrete) random vari- 
ables X and Y given by the relative entropy between 
the joint measure and the product of the marginal 
measures: 

$$I(X;Y) = D(P_{XY} \| P_X \times P_Y).$$

Note that $I(X;X) = H(X)$ if $X$ is discrete.

For stationary channels that behave ergodically, the 
channel capacity is given by 

$$C = \lim_{n \to \infty} \frac{1}{n} \max I(X_1, \ldots, X_n; Y_1, \ldots, Y_n),$$

where the maximum is over all joint distributions
of $(X_1, \ldots, X_n)$, and $(Y_1, \ldots, Y_n)$ are the channel
responses to $(X_1, \ldots, X_n)$. If the channel is stationary
memoryless, then the formula boils down to

$$C = \max I(X;Y).$$


The capacity of a channel that erases a fraction $\delta$ of
the codeword symbols (drawn from an alphabet $\mathcal{A}$)
is

$$C = (1 - \delta) \log |\mathcal{A}|,$$

as long as the location of the erased symbols is 
known to the decoder and the nonerased symbols 
are received error free. In the case of a binary channel
that introduces errors independently with probability
$\delta$, the capacity is given by

$$C = 1 - h(\delta),$$

while in the case of a continuous-time additive Gaussian
noise channel with bandwidth $B$, transmission
power $P$, and noise strength $N$, the capacity is

$$C = B \log\left(1 + \frac{P}{BN}\right) \text{ bits per second},$$

a formula that dispels the pre-1948 notion that the 
information-carrying capacity of a communication 
channel is proportional to its bandwidth and that is 
reminiscent of the fact that in a cellular phone the 
stronger the received signal the faster the download. 
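
The three capacity formulas above are simple enough to tabulate. The sketch below is only an illustration under stated assumptions: logarithms are taken to base 2, h is assumed to be the binary entropy function, and the function and parameter names are ours.

```python
import math

def binary_entropy(p):
    """h(p) in bits; assumed here to be the binary entropy function."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def erasure_capacity(delta, alphabet_size):
    """(1 - delta) log|A|, in bits per symbol."""
    return (1 - delta) * math.log2(alphabet_size)

def binary_error_capacity(delta):
    """1 - h(delta), in bits per channel use."""
    return 1 - binary_entropy(delta)

def gaussian_capacity(bandwidth, power, noise):
    """B log(1 + P/(BN)), in bits per second."""
    return bandwidth * math.log2(1 + power / (bandwidth * noise))

# With P/N fixed at 1, widening the band helps less and less:
for bandwidth in (1.0, 10.0, 100.0, 1000.0):
    print(bandwidth, gaussian_capacity(bandwidth, power=1.0, noise=1.0))
```

The printed capacities increase with the bandwidth but level off below log2(e), about 1.44 bits per second, rather than growing in proportion to it.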

In lossy data compression of a stationary ergodic
source $(X_1, X_2, \ldots)$, the rate compatible with a given
per-sample distortion level d under a distortion measure
$\mathsf{d}\colon \mathcal{A}^2 \to [0, \infty]$ is given by

$$R(d) = \lim_{n \to \infty} \frac{1}{n} \min I(X_1, \ldots, X_n; Y_1, \ldots, Y_n),$$

where the minimum is taken over the joint distribution
of source $X^n$ and reproduction $Y^n$, with given
$P_{X^n}$, and such that

$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[\mathsf{d}(X_i, Y_i)] \le d.$$

For stationary memoryless sources, just as for capacity
we obtain a “single-letter” expression $R(d) = \min I(X;Y)$.

It should be emphasized that the central concern 
of information theory is not the definition of infor- 
mation measures but the theorems that use them to 
describe the fundamental limits of compression and 
transmission. However, it is rewarding that entropy, 
mutual information, and relative entropy, as well
as other related measures, have found applications in 
many fields beyond communication theory, including 
probability theory, statistical inference, ergodic theory, 
computer science, physics, economics, life sciences, 
and linguistics.

9 Beyond Figure 1

Work on the basic paradigm in figure 1 continues to 
this day, not only to tackle source and channel models 
inspired by new applications and technologies but in 
furthering the basic understanding of the capabilities 
of coding systems, particularly in the nonasymptotic 
regime. However, in order to analyze models of interest 
in practice, many different setups have been studied 
since 1948 that go beyond the original. We list a few of 
the ones that have received the most attention. 

Feedback. A common feature of many communication 
links is the availability of another communication 
channel from receiver to transmitter. In what way 
can knowledge of the channel output aid the trans- 
mitter in a more efficient selection of codewords? 
In 1956 Shannon showed that, in the absence of 
channel memory, capacity does not increase even 
if the encoder knows the channel output instanta- 
neously and noiselessly. Nevertheless, feedback can 
be quite useful to improve transmission rate in the 
nonasymptotic regime and in the presence of channel 
memory. 

Separate compression of dependent sources of infor- 
mation. Suppose that there is one decompressor that 
receives the encoded versions of several sources pro- 
duced by individual compressors. If, instead, a single 
compressor had access to all the sources, it could 
exploit the statistical dependence among them to 
encode at a rate equal to the overall entropy. Sur- 
prisingly, in 1973 Slepian and Wolf showed that even 
in the completely decentralized setup the sum of 
the encoded rates can be as low as in the central- 
ized setting and still the decompressor is able to cor- 
rectly decode with probability approaching 1. In the 
lossy setting the corresponding problem is not yet 
completely solved. 

Multiple-access channel. If, as in the case of a cellular 
wireless telephony system, a single receiver obtains 
a signal with mutually interfering encoded streams 
produced by several transmitters, there is a trade-off 
among the achievable rates. The channel capacity is 
no longer a scalar but a capacity region. 

Interference channel. As in the case of a wired tele- 
phone system subject to crosstalk, in this model 
there is a receiver for each transmitter, and the signal 
it receives not only contains the information trans- 
mitted by the desired user but is contaminated by 
the signals of all other users. It does not reduce to
a special case of the multiple-access setup because
each receiver is required to decode reliably only the
message of its desired user.

Broadcast channel. A single transmitter sends a code- 
word, which is received by several geographically 
separated receivers. Each receiver is therefore con- 
nected to the transmitter by a different communica- 
tion channel, but all those channels share the same 
input. If the broadcaster intends to send different 
messages to the various destinations, there is again 
a trade-off among the achievable rates. 

Relay channel. The receiver obtains both a signal from 
the transmitter and a signal from a relay, which 
itself is allowed to process the signal it receives from 
the transmitter in any way it wants. In particular, 
the relay need not be able to fully understand the 
message sent by the transmitter. 

Inspired by various information technologies, a num- 
ber of information-theoretic problems have arisen that 
go beyond issues of eliminating redundancy (for com- 
pression) or adding redundancy (for transmission in 
the presence of noise). Some examples follow. 

Secrecy. Simultaneously with communication theory, 
Shannon established the basic mathematical theory 
of cryptography and showed that iron-clad privacy 
requires that the length of the encryption key be 
as long as that of the message. Most modern cryp- 
tographic algorithms do not provide that level of 
security; they rely on the fact that certain compu- 
tational problems, such as integer factorization, are 
believed to be inherently hard. A provable level of 
security is available using an information-theoretic 
approach pioneered by Wyner (1975), which guar- 
antees that the eavesdropper obtains a negligible 
amount of information about the message. 

Random number generation for system simulation. 
Random processes with prescribed distributions can 
be generated by a deterministic algorithm driven by a 
source of random bits. A key quantity that quantifies 
the “complexity” of the generated random process is 
the minimal rate of the source of random bits neces- 
sary to accomplish the task. The resolvability of a sys- 
tem is defined as the minimal randomness required 
to generate any desired input so that the output dis- 
tributions are approximated with arbitrary accuracy. 
In 1993 Han and Verdú showed that the resolvability
of a system is equal to its channel capacity.

Minimum description length. In the 1960s Kolmogorov
and others took a nonprobabilistic approach to
the compression of a message, which, like universal
lossless compression, uses no prior knowledge: the 
algorithmic complexity of the message is the length 
of the shortest program that will output the message. 
Although this notion is useful only asymptotically, 
it has important links with information theory and 
has had an impact in statistical inference, primarily 
through the minimum description length statistical 
modeling principle put forward by Rissanen in 1978: 
the message is compressed according to a certain 
distribution, which is chosen from a predetermined 
model class and is also communicated to the decom- 
pressor. The distribution is chosen so that the sum 
of the lengths of its description and the compressed 
version of the message are minimized. 

Inequalities and convex analysis. A principle satis- 
fied by information measures is that processing can- 
not increase either the dependence between input 
and output as measured by mutual information or 
the relative entropy between any pair of distribu- 
tions governing the input of the processor. Mathe- 
matically, the nonnegativity of relative entropy and 
those data processing principles are translated into 
convex inequalities, which have been used success- 
fully in the rederivation of various inequalities, such 
as those of Hadamard and Brunn-Minkowski, and in 
the discovery of new inequalities. 

Portfolio theory. One possible approach to portfolio 
selection [V.10] (for a given number of stocks) is to 
choose the log-optimal portfolio, which maximizes 
the asymptotic appreciation growth rate. When their 
distribution is known, a simplistic model of indepen- 
dent identically distributed stock prices leads to lim- 
iting results with a strong information-theoretic fla- 
vor. Just as in data compression, under assumptions 
of stationarity and ergodicity, it is possible to deal 
with more realistic scenarios in which the distribu- 
tion is not known a priori and the stock prices are 
interdependent. 

Identification. Suppose that the transmitter sends the 
identity of an addressee to a multitude of possible 
users. Each user is interested only in finding out 
whether it is indeed the addressee or not. Allowing, 
as usual, a certain error probability, this setup can 
be captured as in figure 1, except that the decoder is 
free to declare a list of several messages (addresses) 
to be simultaneously “true.” Each user simply checks 
whether its identity is in the list or not. How many 
messages can be transmitted while guaranteeing vanishing
probability of erroneous information? The
surprising answer found by Ahlswede and Dueck in
1989 is that the number of addresses grows doubly
exponentially with the number of channel uses.
Moreover, the second-order exponent is equal to the 
channel capacity. 

Finally, we mention the discipline of quantum infor- 
mation theory, which deals with the counterparts of 
the fundamental limits discussed above for quantum 
mechanical models of sources and channels. Probabil- 
ity measures, conditional probabilities, and bits trans- 
late into density matrices, self-adjoint linear operators, 
and qubits. The quantum channel coding theorem was 
proved by Holevo in 1973, while the quantum source 
coding theorem was proved by Schumacher in 1995.

IV.37 Applied Combinatorics and Graph Theory

Peter Winkler 

1 Introduction 

Combinatorics and graph theory are the cornerstones 
of discrete mathematics, which has seen an explosion 
of activity since the middle of the twentieth century. 
The main reason for this explosion is the plethora of 
applications in a world where digital (as opposed to 
analog) computing has become the norm. Once consid- 
ered more “recreational” than serious, combinatorics 
and graph theory now boast many fundamental and 
useful results, adding up to a cogent theory. Our objec- 
tive here is to present the most elementary of these 
results in a format useful to those who may run into 
combinatorial problems in applications but have not 
studied combinatorics or graph theory. 

Accordingly, we will begin each section with a (not 
necessarily serious, but representative) problem, intro- 
ducing the basic techniques, algorithms, and theorems 
of combinatorics and graph theory in response. 

We will assume basic familiarity with mathematics 
but none with computer science. Proofs, sometimes 
informal, are included when they are useful and short.
Algorithms will be discussed informally, with the term
“efficient” used for those that can be executed in time 
bounded by a small-degree polynomial in the input 
length. 

2 Counting Possibilities 

Most counting problems can be phrased as, “How
many ways are there to do X?” where X is some task,
either real or imaginary. If there are several alterna- 
tive approaches, we may be able to make use of the 
following rule. 

The addition rule. If there are k different approaches
to the task, and the ith approach can be executed in $n_i$
ways, then the total number of different ways to do the
task is the sum $\sum_{i=1}^{k} n_i$.

If, on the other hand, we can break up the task into 
stages, we can make use of the following alternative.

The multiplication rule. If the task involves k stages,
the ith of which can be done in $n_i$ ways regardless of
choices made in previous stages, then the total number
of different ways to do the task is the product $\prod_{i=1}^{k} n_i$.

Sometimes it is easier to count the things that are 
not wanted, in which case we can make use of the 
subtraction rule. 

The subtraction rule. If a set A consists of the elements
of C other than those of a subset B of C, then |A| =
|C| - |B|.

Overcounting can be useful, and when everything is 
overcounted the same number of times, the following 
rule is the remedy. 

The division rule. If the number n is obtained when 
every element of a set S is counted k times, then |S| = 
n/k. 

An arrangement, or ordering, of a set A of distin- 
guishable objects is called a permutation of A; from 
the multiplication rule, we see that, if |A| = n, then the
number of permutations of A is n!. If we want only to
select and order k of the elements of A, the number of
ways of doing so is n(n - 1)(n - 2) ··· (n - k + 1), which
we can also write as n!/(n - k)!. (But note that calculating
this number by computing n! and then dividing
by (n - k)! might be a mistake, as it involves numbers
that are unnecessarily large.)

Suppose we wish to select k of the n objects in A 
but not to order them. We have then overcounted by
a factor of k!, so the number of so-called combinations
of k objects out of n is n!/(k!(n - k)!), which we call
“n choose k” and denote by $\binom{n}{k}$. These expressions
are called “binomial coefficients” on account of their
appearance in the formula

$$(x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$

The binomial coefficients possess an astonishing 
number of nice properties. You can easily verify, for 
example, that 




Note that logically

$$\binom{n}{0} = \binom{n}{n} = 1,$$

since there is just one way to select the empty set from 
A and one way to pick the whole set A from itself. This 
agrees with our formula if we stipulate that 0! = 1. 
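
The counting formulas of this section are easy to check by machine. The sketch below (with hypothetical helper names `falling_factorial` and `choose`) computes n(n - 1) ··· (n - k + 1) as a running product, so that n! is never formed, and builds the binomial coefficient one factor at a time; the results are compared with `math.perm` and `math.comb` from the Python standard library (version 3.8 or later).

```python
from math import comb, perm

def falling_factorial(n, k):
    """n(n-1)...(n-k+1): ordered selections of k out of n objects."""
    result = 1
    for i in range(k):
        result *= n - i
    return result

def choose(n, k):
    """Binomial coefficient, keeping intermediate values small."""
    k = min(k, n - k)            # "n choose k" equals "n choose n - k"
    if k < 0:
        return 0
    result = 1
    for i in range(1, k + 1):
        # After this step result is "n - k + i choose i", so the division is exact.
        result = result * (n - k + i) // i
    return result

print(falling_factorial(10, 3), perm(10, 3))   # 720 720
print(choose(52, 5), comb(52, 5))              # 2598960 2598960
print(choose(7, 0), choose(7, 7))              # 1 1

# The binomial formula with x = 2, y = 3, n = 5: both sides equal 5**5.
print(sum(choose(5, k) * 2**k * 3**(5 - k) for k in range(6)))   # 3125
```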


3 Finding a Stable Matching 

Some readers will recall mass public weddings per- 
formed by the Reverend Sun Myung Moon (1920-2012) 
under the banner of the Unification Church. Let us 
stretch our imaginations a bit and suppose that n men 
and n women wish to participate in such a ceremony 
but have not actually agreed upon their precise mates 
yet. Each submits a list of the n members of the opposite
sex in preference order; these 2n lists are submitted
to the church elders, who somehow determine a
matching, that is, a set of n man-woman pairs to be 
married on the fateful day. 

One might devise many reasonable criteria by which 
such a matching could be chosen; a particularly desir- 
able one, from the church’s point of view especially, 
is that the matching be stable. This means that there 
should not be any man-woman pair (say, Alice and Bill) 
who are not to be married to each other in the ceremony 
but who would rather be married to each other than 
to the persons they are supposed to marry. Such an 
Alice and Bill, whom we term an “unstable pair,” would 
then be tempted to run away together and mess up the 
ceremony. 

Remarkably, regardless of the preference lists, a sta- 
ble matching is guaranteed to exist; moreover, there 
is an efficient algorithm to find one. Devised by David 
Gale and Lloyd Shapley in 1962, the algorithm is one of 
the most elegant in combinatorics and is widely used
around the world, perhaps most famously in the annual
matching of medical-school graduates with internships 
in the United States. 

The algorithm proceeds as a series of mass propos- 
als, from men to women, say. The game begins with 
every man proposing to the highest-ranked woman on 
his list; each woman accepts (perhaps temporarily) only 
the highest ranking of the men who propose to her, 
rejecting the others immediately. 

In subsequent rounds, engaged men are idle, while 
each unengaged man proposes to the highest-ranked 
woman on his list who has not previously rejected him: 
namely, the woman just beneath the one who rejected 
him on the previous round. As before, each woman 
then rejects all proposals other than her highest-ranked 
proposing male. If she is unengaged or the latter is 
higher ranked (on her list, of course) than her current 
fiancé, she accepts the new proposal (again temporarily)
and dumps the old beau. A woman might remain
engaged to one man through many rounds but if, at 
any time, she gets a proposal from a man she likes bet- 
ter, she will unceremoniously dump her old beau and 
sign up with the best new one. Her old beau will then 
reenter the market at the next round, proposing to the 
next woman on his preference list. 

The algorithm terminates when every male (and thus 
every female) is engaged; those final engagements con- 
stitute the output matching. 

Suppose, for example, that the preferences are as 
shown in figure 1. 

In the first round Alan proposes to Donna while Bob 
and Charlie propose to Emily; Emily rejects Charlie, 
while the other proposals are provisionally accepted. In 
the next round Charlie proposes to his second choice, 
Donna; Donna accepts, putting Alan back into bache- 
lorhood. In the third round Alan proposes to Emily, 
who accepts while ejecting Bob. In the fourth round Bob 
proposes to Donna, but she is happy with Charlie. In 
the fifth round Bob proposes to Flora and is accepted, 
ending the process with Alan matched to Emily, Bob to 
Flora, and Charlie to Donna. 
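
The rounds of proposals can be simulated in a few lines. The sketch below is one possible rendering of the procedure just described, not a reference implementation. The preference lists of figure 1 are not reproduced here, so the lists in the code are a hypothetical set chosen only to be consistent with the five rounds narrated above; the names `gale_shapley`, `men_prefs`, and `women_prefs` are likewise ours.

```python
def gale_shapley(men_prefs, women_prefs):
    """Men-proposing rounds; returns ({man: woman}, number of rounds)."""
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}   # index of the next woman to try
    fiance = {}                               # woman -> man she is engaged to
    engaged_men = set()
    rounds = 0
    while len(engaged_men) < len(men_prefs):
        rounds += 1
        proposals = {}                        # woman -> proposers this round
        for m in men_prefs:
            if m not in engaged_men:
                w = men_prefs[m][next_choice[m]]
                next_choice[m] += 1
                proposals.setdefault(w, []).append(m)
        for w, suitors in proposals.items():
            best = min(suitors, key=lambda m: rank[w][m])
            current = fiance.get(w)
            if current is None or rank[w][best] < rank[w][current]:
                if current is not None:
                    engaged_men.discard(current)   # the old beau is dumped
                fiance[w] = best
                engaged_men.add(best)
    return {m: w for w, m in fiance.items()}, rounds

# Hypothetical preference lists (most preferred first), consistent with the text.
men_prefs = {
    "Alan": ["Donna", "Emily", "Flora"],
    "Bob": ["Emily", "Donna", "Flora"],
    "Charlie": ["Emily", "Donna", "Flora"],
}
women_prefs = {
    "Donna": ["Charlie", "Alan", "Bob"],
    "Emily": ["Alan", "Bob", "Charlie"],
    "Flora": ["Alan", "Bob", "Charlie"],
}
matching, rounds = gale_shapley(men_prefs, women_prefs)
print(rounds, matching)   # 5 rounds; Alan-Emily, Bob-Flora, Charlie-Donna
```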

As with any algorithm, you should be asking yourself: 
does it always terminate, and, if so, does it necessarily 
terminate in a solution to the problem? 

Observe first that once a woman becomes engaged 
(which happens as soon as she gets her first proposal), 
she never becomes unengaged; she only “trades up.” 
Men, on the other hand, are lowering their expecta- 
tions as they accumulate rejections. But note that a 
man cannot run out of prospects because, if he were
rejected by every woman, all n women would have to
be engaged, which is impossible while he, one of the n men, is not.
During every round, either the number of engaged men goes up or
at least one man is rejected and has his expectations
lowered. Thus, the algorithm must end after at most
$n^2$ rounds of proposals.

Figure 1 Finding a stable matching.

Is the result a stable matching? Yes. Suppose for the 
sake of reaching a contradiction that Alice and Bill are 
an unstable pair: Alice is matched to Clint (say) but 
prefers Bill, while Bill is matched to Dora and prefers 
Alice. But then, to get engaged to Dora, Bill must at 
some point have been rejected by Alice; at that time, 
Alice must have had a fiancé she preferred to Bill. Since
she could only have traded up after that, she could not 
have ended up with Clint. 
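
Stability is also easy to test directly: look for a man and a woman who each rank the other above their assigned partner. The sketch below does this for a given matching; the function name `unstable_pairs` and the three-couple preference lists are hypothetical and serve only as an illustration.

```python
def unstable_pairs(matching, men_prefs, women_prefs):
    """All (man, woman) pairs who prefer each other to their assigned partners."""
    husband = {w: m for m, w in matching.items()}
    pairs = []
    for m, prefs in men_prefs.items():
        for w in prefs:
            if w == matching[m]:
                break          # every later woman is less preferred by m
            ranking = women_prefs[w]
            if ranking.index(m) < ranking.index(husband[w]):
                pairs.append((m, w))
    return pairs

# Hypothetical preference lists, most preferred first.
men_prefs = {
    "Alan": ["Donna", "Emily", "Flora"],
    "Bob": ["Emily", "Donna", "Flora"],
    "Charlie": ["Emily", "Donna", "Flora"],
}
women_prefs = {
    "Donna": ["Charlie", "Alan", "Bob"],
    "Emily": ["Alan", "Bob", "Charlie"],
    "Flora": ["Alan", "Bob", "Charlie"],
}
stable = {"Alan": "Emily", "Bob": "Flora", "Charlie": "Donna"}
shaky = {"Alan": "Donna", "Bob": "Emily", "Charlie": "Flora"}
print(unstable_pairs(stable, men_prefs, women_prefs))   # []
print(unstable_pairs(shaky, men_prefs, women_prefs))    # [('Charlie', 'Donna')]
```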

Usually, in practice (e.g., with the medical-school 
graduates and internships), no actual proposals are 
made; preference lists are submitted and a computer 
simulates the algorithm. 

You might reasonably be asking: what happens if we 
do not have two sexes but just some even number of 
people who are to be paired up? In this version, often 
called the “stable roommates problem,” each person 
has a preference list involving all the other people. Alas, 
for the roommates there may not be a stable matching;
nor (as far as we know) is there an efficient algorithm
that is guaranteed to find a stable matching when one 
exists. Too bad, because there is a nice new application: 
organ transplants. 
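
That a stable matching of roommates may fail to exist can be seen already with four people. The brute-force sketch below tries every way of pairing them up; the preference lists are a hypothetical instance (not taken from the text) in which A, B, and C each rank one another cyclically above D, and the helper names are ours.

```python
def perfect_matchings(people):
    """Yield every way to split an even-sized list of people into pairs."""
    if not people:
        yield []
        return
    first, rest = people[0], people[1:]
    for i, partner in enumerate(rest):
        for tail in perfect_matchings(rest[:i] + rest[i + 1:]):
            yield [(first, partner)] + tail

def is_stable(pairs, prefs):
    """True if no two people prefer each other to their assigned roommates."""
    partner = {}
    for a, b in pairs:
        partner[a], partner[b] = b, a

    def prefers(x, y):         # does x rank y above x's current roommate?
        return prefs[x].index(y) < prefs[x].index(partner[x])

    people = list(prefs)
    return not any(prefers(x, y) and prefers(y, x)
                   for i, x in enumerate(people) for y in people[i + 1:])

# Hypothetical instance with no stable matching: whoever rooms with D is
# preferred by someone who would gladly take him or her away.
cyclic = {"A": ["B", "C", "D"], "B": ["C", "A", "D"],
          "C": ["A", "B", "D"], "D": ["A", "B", "C"]}
# Hypothetical instance with exactly one stable matching.
paired = {"A": ["B", "C", "D"], "B": ["A", "C", "D"],
          "C": ["D", "A", "B"], "D": ["C", "A", "B"]}

for prefs in (cyclic, paired):
    print([m for m in perfect_matchings(list(prefs)) if is_stable(m, prefs)])
```

Exhaustive search of this kind is feasible only for very small groups, since the number of pairings grows rapidly with the number of people.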

Getting a kidney transplant used to require finding 
one person who was willing to donate a kidney and who 
matched the kidney seeker in blood type (and possi- 
bly other characteristics as well). Nowadays, multiple 
transplants have made things much easier. Suppose 
Alice is willing to donate a kidney to Bob, but their 
blood types are incompatible. Meanwhile, elsewhere in 
the world, Cassie needs a kidney and has found a will- 
ing but poorly matched donor, Daniel. The point is that 
Alice and Daniel may be a good match, likewise Bob 
and Cassie. The four are brought together; by agree- 
ment, and simultaneously (so no one can chicken out 
between operations), Alice’s kidney goes to Cassie while 
Daniel’s goes to Bob. 

Finding such pairs obviously requires some central- 
ized database. Ideally, the database is used to rank, 
for each patient-donor pair such as Alice and Bob, all 
other patient-donor pairs according to how well their 
donor matches Bob. An algorithm that solves the stable 
roommate problem could be very useful. 

In practice, heuristic algorithms seem to do a good 
job of finding exchangeable pairs. These days, larger 
cycles of pairs are sometimes organized; also useful 
are chains, sometimes quite long ones, catalyzed by a 
single generous individual who is willing to donate a 
kidney to the patient pool. For the kidney application 
and much more, Alvin Roth and Lloyd Shapley won the 
Nobel Memorial Prize in Economic Sciences in 2012. 

There are many other consequences and generaliza- 
tions of the stable marriage theorem; we mention just 
one curious fact. In many cases there is more than one 
stable solution, and of course the Gale-Shapley algo- 
rithm (as presented here) finds just one. You might 
think the algorithm is neutral between the sexes or even 
favors women, but in fact it is not hard to show that 
in it every man gets the best match that he can get in 
any stable marriage, and every woman gets the worst 
match that she could get in any stable marriage! In the 
annual matching of medical-school graduates to intern- 
ships, it used to be the case that the internships did the 
proposing. Now it is the other way around. 
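
The asymmetry between proposers and proposees shows up even with two couples. The sketch below lists all stable matchings of a tiny hypothetical instance by brute force. In this instance the two men have different first choices, so men-proposing rounds would stop after one round with every man holding his first choice; the code confirms that this matching is stable, is the best stable outcome for each man, and is the worst stable outcome for each woman.

```python
from itertools import permutations

def is_stable(matching, men_prefs, women_prefs):
    """True if no man and woman prefer each other to their assigned partners."""
    husband = {w: m for m, w in matching.items()}
    for m, prefs in men_prefs.items():
        for w in prefs[:prefs.index(matching[m])]:   # women m prefers to his wife
            ranking = women_prefs[w]
            if ranking.index(m) < ranking.index(husband[w]):
                return False
    return True

def stable_matchings(men_prefs, women_prefs):
    """Brute force over all n! matchings; only sensible for tiny n."""
    men, women = list(men_prefs), list(women_prefs)
    for order in permutations(women):
        matching = dict(zip(men, order))
        if is_stable(matching, men_prefs, women_prefs):
            yield matching

# Hypothetical instance: each man's first choice ranks him last.
men_prefs = {"A": ["X", "Y"], "B": ["Y", "X"]}
women_prefs = {"X": ["B", "A"], "Y": ["A", "B"]}
found = list(stable_matchings(men_prefs, women_prefs))
print(found)

# Best stable partner for each man, worst stable partner for each woman.
best_for_men = {m: min((s[m] for s in found), key=men_prefs[m].index)
                for m in men_prefs}
worst_for_women = {w: max((m for s in found for m in s if s[m] == w),
                          key=women_prefs[w].index)
                   for w in women_prefs}
print(best_for_men, worst_for_women)
```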

4 Correcting Errors 

Have you ever wondered why it is that, when you email 
a multi-megabyte file to a friend or colleague, it almost
always comes through with not a single error? Can it be
that the bits we send through air, wire, and fiber have
error probability less than one in a trillion?

Figure 2 A code that corrects one error.

The answer to that question is no: if we sent bits that 
carefully, the Internet would be a lot slower than it is. 
Errors do get made, but they are corrected. How? 

We are going to consider only the alphabet {0,1}, 
although much of what we say below can be general- 
ized to larger alphabets. An error-correcting code is a 
set of sequences (“codewords”) of some given length 
(say, n) that are used instead of sequences of a fixed 
shorter length (say, k), with the idea that a small num- 
ber of errors in a codeword can be rectified. Figure 2 
shows an example with n = 7 and k = 4. 

The code pictured above is called a Hamming(7,4) 
code, and it has the following properties. First, there 
are 16 = $2^4$ codewords, one for each binary sequence
of length 4, as we see. The codewords have length 7. 
To use the code, suppose the message we wish to send 
is 110010100110. We break that up into “blocks” of 
length 4: 1100 1010 0110. Each block is translated 
into its corresponding codeword: 1100011 1010101 
0110110, and the concatenation 110001110101010110110
is transmitted.

At the other end, the received bit sequence is bro- 
ken up into substrings of length 7, each of which is 
then decoded to recover the original intended mes- 
sage. Decoding is easy, with this particular code, when 
there are no errors; the first four bits of each codeword 
identify the message sequence. 

The disadvantages of the scheme are obvious. It takes 
work (well, computer work) to code and decode, and we 
end up sending nearly twice as many bits as we need to. 
The gain is that we can now correct errors: in particular, 
as long as there is no more than one flipped bit in each 
codeword, we can recover the exact original message. 

To verify this, let us see how the code can be gen- 
erated. First we label the seven codeword bit-positions 
with the numbers from one to seven in binary. It does
not actually matter how the labels are assigned, but it
turns out to be convenient to let positions 5, 6, and 7
have labels 100, 010, and 001 (binary for 4, 2, and 1).
We give the first four places the labels 011, 101, 110,
and 111, respectively.

Figure 3 Labeling and nimber addition to create a Hamming(7,4) code.

These labels will be treated not as binary numbers 
but as binary “nimbers,” equivalently, as vectors in 
the three-dimensional vector space $\{0,1\}^3$ over a two-element
field. That means that nimbers are added without
carry; in other words, using the rules that 0 + 0 = 0,
1 + 0 = 0 + 1 = 1, and 1 + 1 = 0, each column of numbers
is added independently. For example, 011 + 101 = 110.
Addition is therefore the same as subtraction; in particular,
adding any nimber to itself gives 000. Given any
seven-digit word, we compute its signature by adding, 
as nimbers, the labels of the positions where a 1 is 
found. Thus, for example, the signature of 0100100 is 
101 + 100 = 001. 

The position labeling and an addition table for three- 
digit nimbers are shown in figure 3. 

The codewords are exactly the seven-digit words with 
signature 000. That makes the set of codewords a linear
subspace of $\{0,1\}^7$, which goes a long way to making
the code easy to deal with. Computing the code of
a given four-digit message is a snap because the first 
four digits of the code are the same as those of the 
message and the other three function as “check bits.” 
If the labels of those of the first four positions that hold a 1
sum to 110, say, then the last three digits must be exactly 1, 1, 0.
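
Nimber addition of three-digit labels is the bitwise exclusive-or of the corresponding integers, so signatures and check bits take only a few lines to compute. The sketch below uses the labels exactly as stated in the text (011, 101, 110, 111, 100, 010, 001 for positions 1 to 7); the names `LABELS`, `signature`, and `encode` are ours. Since the codewords depend on how the labels are assigned, the output need not coincide with every entry of figure 2.

```python
# Labels of positions 1..7, as stated in the text.
LABELS = [0b011, 0b101, 0b110, 0b111, 0b100, 0b010, 0b001]

def signature(word):
    """Nimber (XOR) sum of the labels of the positions holding a 1."""
    sig = 0
    for bit, label in zip(word, LABELS):
        if bit == "1":
            sig ^= label
    return sig

def encode(block):
    """Append three check bits to a four-bit block so that the signature is 000."""
    check = signature(block + "000")       # nimber sum over the message bits only
    return block + format(check, "03b")    # positions 5, 6, 7 carry labels 100, 010, 001

print(format(signature("0100100"), "03b"))   # 001, as in the text
print(encode("1010"))                        # 1010101
```

Appending the check bits works because positions 5, 6, and 7 carry the labels 100, 010, and 001: writing the message signature into them adds that signature to itself, which gives 000.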

The key property possessed by the set of codewords
is that any two codewords differ in at least three of the
seven positions. We know this because if two seven-digit
words differ in only two places, with labels abc
and def, then their signatures differ by abc - def, which
cannot be 000 because distinct positions have distinct labels.
They cannot, therefore, both be codewords.
(The argument is even easier if they differ in only
one position.) Thus, if a bit in a codeword is flipped,
it still differs in at least two places from all the other
codewords, so there is only one codeword it could have
come from. Determining that codeword is again easy:
if the signature of the received word is ghi ≠ 000, then
it must be the bit whose label is ghi that got flipped.
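
The correction rule translates directly into code: compute the signature of the received word and, if it is not 000, flip the bit whose label equals it. The following sketch assumes the labels stated in the text; the function names are ours.

```python
# Labels of positions 1..7, as stated in the text.
LABELS = [0b011, 0b101, 0b110, 0b111, 0b100, 0b010, 0b001]

def signature(word):
    """Nimber (XOR) sum of the labels of the positions holding a 1."""
    sig = 0
    for bit, label in zip(word, LABELS):
        if bit == "1":
            sig ^= label
    return sig

def correct(word):
    """Repair a seven-bit word in which at most one bit was flipped."""
    sig = signature(word)
    if sig == 0:
        return word                       # already a codeword
    i = LABELS.index(sig)                 # position whose label is the signature
    flipped = "1" if word[i] == "0" else "0"
    return word[:i] + flipped + word[i + 1:]

def decode(word):
    """Correct the word, then read the message off the first four bits."""
    return correct(word)[:4]

print(correct("1010101"))   # unchanged: signature 000
print(correct("1110101"))   # second bit repaired, giving 1010101
print(decode("1010100"))    # 1010
```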

The Hamming codes are said to be “perfect one-error- 
correcting codes,” meaning that every word of length 
n is either a codeword or is one bit-flip away from a 
unique codeword. Another way to think of it is that 
(in the Hamming(7,4) code) each codeword has seven 
neighbors, obtained by flipping one bit; together with 
the codeword they make a “ball” of size $2^3$, and these
$2^4$ balls then partition the whole space of seven-digit
words, whose size is of course $2^7$.
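
The partition into balls can be confirmed by exhaustive enumeration, since there are only 128 words in all. The sketch below again assumes the labels stated in the text, with helper names of our own choosing.

```python
from itertools import product

# Labels of positions 1..7, as stated in the text.
LABELS = [0b011, 0b101, 0b110, 0b111, 0b100, 0b010, 0b001]

def signature(word):
    """Nimber (XOR) sum of the labels of the positions holding a 1."""
    sig = 0
    for bit, label in zip(word, LABELS):
        if bit == "1":
            sig ^= label
    return sig

def ball(center):
    """The word itself together with its seven one-bit-flip neighbors."""
    flip = {"0": "1", "1": "0"}
    return {center} | {center[:i] + flip[center[i]] + center[i + 1:]
                       for i in range(7)}

words = ["".join(bits) for bits in product("01", repeat=7)]
codewords = [w for w in words if signature(w) == 0]
covered = [w for c in codewords for w in ball(c)]

# 16 balls of 8 words each, with no word counted twice: 16 * 8 = 128.
print(len(codewords), len(covered), len(set(covered)))   # 16 128 128
```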

There is in fact an even easier one-error-correcting 
code than the Hamming(7,4) code, with n = 3 and 
k = 1: just repeat each bit three times. If you receive 
“110” you will know that the third bit was flipped, and 
that the codeword should therefore have been “111” 
and that the intent had been to send the bit “1”. More- 
over, this code tolerates an error every three bits, while 
the previous could handle only one error every seven 
bits. But the simple code has a “rate” of 1/3, mean- 
ing that it sends only one bit of information per three 
bits transmitted; the Hamming(7,4) code has the better
rate 4/7. 

In practice (e.g., on the Internet), codes with much 
greater block length, that can correct several errors, 
are used. Sophisticated error-correcting codes are also 
used in transmitting information (e.g., pictures) back 
to Earth from outer space. A simple kind of error- 
correcting code is used in most bank account numbers, 
to either detect errors or (as above) correct them. 

Coding theory is a big subject with lots of linear alge- 
bra as well as combinatorics in it; moreover, it is a 
lively area of research, chockablock with theorems and 
applications. 

Note that the public often equates “codes” with 
“secret codes,” but we are not talking about cryptog- 
raphy here. Error correcting is often done “on top of” 
cryptography in the following sense: a message is first 
encrypted using a secret code (this might not result in 
any lengthening), to create what is called cyphertext. 
The cyphertext is then coded with an error-correcting
code before it is sent. At the other end, it is decoded
to remove transmission errors and recover the correct 
cyphertext; then the cyphertext is decrypted to get the 
original message. Whew! 

There is, nonetheless, an amusing direct application 
of error-correcting codes to espionage. Suppose that 
you are a spy in deep cover and your only means of 
communication with your headquarters is as follows. 
A local radio station transmits an unpredictable string 
of seven bits every day, and you can flip any one of 
those bits before they are broadcast. How many bits 
of information can you transmit to headquarters that 
way? 

You can clearly transmit one bit by controlling the 
parity of the string, that is, whether the string has an 
even or an odd number of 1s. And you certainly cannot
communicate more than three bits because there
are only 8 = $2^3$ actions you can take (change any of
the seven bits, or change none). But what good can 
these choices do if your control does not know what 
the original sequence was? 

It is easy using our Hamming(7,4) code. To communicate
the three-bit message abc, just make the signature
of the broadcast abc. How do you know you can
do this? If the broadcast string was going to have signature
def, compute the sum of abc and def as nimbers
to get (say) ghi, and change the bit with label ghi. If, by
chance, the broadcast string already had signature abc, 
do not change anything. Your control only has to com- 
pute the signature of what she hears on the radio in 
order to get your message. 
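
The spy's procedure can be checked over all 128 possible broadcasts and all 8 messages. The sketch below assumes the position labels stated earlier (011, 101, 110, 111, 100, 010, 001) and represents the three-bit message as an integer from 0 to 7; the names `embed` and `signature` are ours.

```python
# Labels of positions 1..7, as stated in the text.
LABELS = [0b011, 0b101, 0b110, 0b111, 0b100, 0b010, 0b001]

def signature(word):
    """Nimber (XOR) sum of the labels of the positions holding a 1."""
    sig = 0
    for bit, label in zip(word, LABELS):
        if bit == "1":
            sig ^= label
    return sig

def embed(broadcast, message):
    """Flip at most one bit of the broadcast so that its signature is the message."""
    ghi = signature(broadcast) ^ message     # nimber sum of abc and def
    if ghi == 0:
        return broadcast                     # the signature is already right
    i = LABELS.index(ghi)                    # the bit whose label is ghi
    flipped = "1" if broadcast[i] == "0" else "0"
    return broadcast[:i] + flipped + broadcast[i + 1:]

sent = embed("0100100", 0b110)
print(sent, format(signature(sent), "03b"))   # the control reads off 110
```

The control needs only the `signature` function; she never learns, and has no need to know, what the station originally intended to broadcast.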

Note that if your message is 000, the strings your 
control might get are exactly the Hamming(7,4) code- 
words from figure 2. 

5 Designing a Network 

As the head of a new company, or perhaps the emperor 
of a new country, you may need to design a commu- 
nications or transportation network to connect your 
offices or your cities. How can you do this as cheaply 
as possible? 

This and many other questions can be formulated 
in terms of graph theory [II.16], in which objects or
places are abstracted as dots, any two of which may or 
may not bear a particular relation to each other. Graph 
theory is equally often thought of as a subfield and as a 
sister field of combinatorics. Like combinatorics, it was 
not highly regarded among “serious” mathematicians 
(between 1850 and 1950, roughly) until its importance
in computer science emerged. The advent of the Internet,
which begs to be modeled as a graph, helped quite
a bit as well. 

A graph G = (V, E) is a set (finite, for us) V of vertices
together with a collection E of two-element subsets of V
called edges. The reason for this nomenclature is that
the vertices and edges of a polyhedron constitute an
archetypal graph. The degree of a vertex is the number
of edges containing it; a graph is regular if all its
vertices have the same degree. If {u, v} ∈ E, we write
u ~ v and say that u is adjacent to v.

A path (of length k) in a graph is a sequence
$v_0, v_1, \ldots, v_k$ of vertices such that $v_i \sim v_{i+1}$ for each i,
0 ≤ i < k. We require that the $v_i$ are all distinct except
that possibly $v_0 = v_k$, in which case the path is said
to be closed and the sequence $v_0, v_1, \ldots, v_{k-1}$ is said
to constitute a cycle. A nonclosed path is said to connect
its first and last vertices, and if any two vertices
in V can be so connected, G itself is connected. The set 
of vertices adjacent to a given vertex v is called the 
neighborhood of v and is denoted N(v). 

A graph that has no more edges than it needs to be 
connected is called a tree. A tree has no cycles (if it had 
one, deleting one edge of the cycle could not possibly 
destroy connectivity). It is easily verified that all of the 
following are equivalent for a graph G with |V| = n:

(1) G is a tree;

(2) G is connected and has at most n - 1 edges;

(3) G is cycle-free and has at least n - 1 edges;

(4) between any two vertices of G is a unique path; 

(5) either n = 1 or G has at least two vertices of 
degree 1 (“leaves”), the deletion of any of which, 
together with its incident edge, results in a tree. 

A transportation network is a graph in which the ver- 
tices represent terminals and edges represent direct 
connections between pairs of nodes; communications 
networks are modeled similarly. We can already deduce 
that a network with n vertices that is “efficient” in the 
sense that it has no more edges than it needs to be 
connected is a tree with n - 1 edges. But which n - 1 
edges? Suppose that the network is to be set up from 
scratch, and for each pair {u,v} of nodes there is an
associated cost C(u,v) = C(v,u) of building an edge 
between them. The cost of a tree is the sum of the cost 
of its edges; is there a good way to find the cheapest 
tree? 

Figure 4 The sixteen trees on vertex set {a, b, c, d}.

In fact, this could be one of the world's easiest problems;
there is no way to go wrong. You can simply start by taking the
cheapest edge, then the next cheapest edge, etc., only making
sure that no cycle is made; this is a classic example of what is
called a greedy algorithm. Alternatively (although this takes
longer), you could imagine that you begin with all the edges
(the “complete graph” $K_n$), then eliminate edges, starting
from the most expensive. Here you must ensure that you never
disconnect the graph; if removing an edge would do so, that edge
is kept (for good).
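
The cheapest-edge-first procedure can be sketched in a few lines of Python. This is only an illustration: the function name `cheapest_tree`, the representation of costs as a dictionary keyed by unordered pairs, and the sample costs are all assumptions of ours, and the cycle test uses a standard union-find structure that the text does not prescribe.

```python
from itertools import combinations

def cheapest_tree(nodes, cost):
    """Greedy spanning tree: scan pairs cheapest-first, skip any that would close a cycle."""
    parent = {v: v for v in nodes}  # union-find forest over the nodes

    def find(v):
        # Follow parent pointers to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for u, v in sorted(combinations(nodes, 2), key=lambda e: cost[frozenset(e)]):
        ru, rv = find(u), find(v)
        if ru != rv:          # u and v are not yet connected, so no cycle is made
            parent[ru] = rv
            tree.append((u, v))
    return tree

# Hypothetical symmetric costs C(u, v) on four nodes.
raw = {("a", "b"): 1, ("a", "c"): 4, ("a", "d"): 3,
       ("b", "c"): 2, ("b", "d"): 5, ("c", "d"): 6}
cost = {frozenset(pair): c for pair, c in raw.items()}
tree = cheapest_tree("abcd", cost)
total = sum(cost[frozenset(e)] for e in tree)
```

With these sample costs the greedy choice keeps the edges of cost 1, 2, and 3 and rejects the rest.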

6 Enumerating Trees 

Suppose you belong to a special-interest group, and you 
wish to set up a notification tree. The idea is this: as 
soon as anyone obtains some information of interest 
to the group (such as the time and place of the next 
meeting), she notifies her neighbors on the tree, each 
of whom notifies all of their neighbors other than the 
one from whom she got the information, and so on. 

How many ways are there to choose such a tree? Here 
is an intuition test: if the group has, say, ten members, 
would you guess that the number of possible trees is 
less than a hundred million, or more? 

Very importantly, the vertices of the trees we are 
counting here are labeled, in this case by people. Thus, 
for example, there are three possible trees if there are 
only three group members (any of the three could be 
the center vertex). A little more work will convince you 
that there are sixteen trees with four labeled vertices, 
shown in figure 4. 

Beneath each tree you see a two-letter sequence called the
Prüfer code of that tree. The multiplication rule tells us that
there are $16 = 4^2$ two-letter sequences of symbols chosen from
the set {a,b,c,d}; here, there is one for each tree.

How is the Prüfer code determined? We find the lowest-lettered
leaf (that is, the leaf labeled by the earliest letter in the
alphabet) of the tree and delete it, writing down not the label
of the leaf but the label of the vertex it was adjacent to. We
then repeat this process until only two vertices remain; thus,
if the original tree had $n$ vertices, we will end up with a
sequence of $n - 2$ elements, possibly with some repetitions,
from a label set of size $n$.

The process is reversible. Suppose we have been given a sequence
$x_1, x_2, \dots, x_{n-2}$ of $n - 2$ elements from an ordered
set of size $n$; we claim there is exactly one way to
reconstruct the labeled tree from which it arose. Note that the
missing labels must correspond to the leaves of the tree; thus,
the lowest missing label (say, $a$) is attached to $x_1$. We now
cross out $x_1$ from the sequence and repeat; $a$ no longer
counts as a missing label, since it is taken care of. But now
$x_1$ will join the list of missing labels, unless it appears
again later in the sequence.

When we reach the end of the sequence, we have put in n - 2
edges and exactly two labels remain missing; joining these two
gives the final edge, for a total of n - 1 edges that connect
all the labels to make our tree.
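
Both directions of the correspondence are short enough to write out. The following Python sketch is ours (the names `prufer_encode` and `prufer_decode` and the edge-list representation are assumptions); it follows the deletion and reconstruction rules just described and then checks them on the four-letter label set of figure 4.

```python
from itertools import product

def prufer_encode(labels, edges):
    """Repeatedly delete the lowest leaf, recording its neighbor, until two vertices remain."""
    nbrs = {v: set() for v in labels}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    code = []
    while len(nbrs) > 2:
        leaf = min(v for v in nbrs if len(nbrs[v]) == 1)
        (neighbor,) = nbrs.pop(leaf)      # a leaf has exactly one neighbor
        nbrs[neighbor].discard(leaf)
        code.append(neighbor)
    return code

def prufer_decode(labels, code):
    """Rebuild the tree: attach the lowest missing label to each code entry in turn."""
    code = list(code)
    missing = set(labels) - set(code)     # labels that do not appear in the code
    edges = []
    for i, x in enumerate(code):
        leaf = min(missing)
        edges.append((leaf, x))
        missing.remove(leaf)              # this label is now taken care of
        if x not in code[i + 1:]:
            missing.add(x)                # x does not appear again, so it becomes missing
    edges.append(tuple(sorted(missing)))  # the last two labels are joined
    return edges

labels = "abcd"
trees = set()
for code in product(labels, repeat=len(labels) - 2):
    edges = prufer_decode(labels, code)
    assert prufer_encode(labels, edges) == list(code)
    trees.add(frozenset(frozenset(e) for e in edges))
```

Decoding every two-letter sequence and encoding the result returns the original sequence, and the decoded trees are all distinct, which is the bijection in miniature. The sketch assumes at least two labels.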

We have proved (albeit informally) that there is a 
one-to-one correspondence between trees with vertices 
labeled by a set of n elements and sequences of length 
n - 2 of elements from that set. One consequence of 
this is Cayley’s theorem (1854). 

Theorem (Cayley’s theorem). The number of labeled trees on $n$
vertices is $n^{n-2}$.

This means that the number of possible notification 
trees for ten people is exactly a hundred million, so if 
you guessed that it was smaller or larger, you lose. 

Prüfer’s correspondence has other useful properties. For
example, we have already seen that leaves do not appear in the
code; more generally, the number of times a label appears in the
code is one less than its degree. So, suppose that we want David
($d$), Eleanor ($e$), and Fred ($f$) to be leaves of our
notification tree (because they are not so reliable), while
George ($g$) has degree 3. Then our tree’s code will have no
$d$, $e$, or $f$ in it, but two $g$s. There are
$\binom{8}{2} = 28$ ways to decide which places in the sequence
to put the $g$s, and each of the other six entries can be chosen
in $10 - 4 = 6$ ways, so now there are only
$28 \cdot 6^6 = 1\,306\,368$ ways to make the tree.
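
The arithmetic is easy to check by machine. Enumerating all $10^8$ codes would be slow, so the sketch below evaluates the formula directly and then verifies the same counting argument by brute force on a smaller, hypothetical instance of our own: six people a through f, with d and e required to be leaves and f required to have degree 3, so that codes have length 4.

```python
from itertools import product
from math import comb

# Ten people: choose the two positions for g, then fill six entries from six labels.
formula = comb(8, 2) * 6**6

# Smaller analog (an assumption for illustration): labels a..f, codes of length 4,
# no d or e in the code, and exactly two occurrences of f.
small = sum(
    1
    for code in product("abcdef", repeat=4)
    if "d" not in code and "e" not in code and code.count("f") == 2
)
small_formula = comb(4, 2) * 3**2
```

Here `small` counts codes one by one while `small_formula` applies the multiplication rule; the two agree.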

7 Maximizing a Flow 

Suppose you are running an oil supply company whose 
pipelines constitute a graph G. Each edge e of G is a pipe 
with capacity c(e) (measured, perhaps, in gallons per 
minute). You need to move a lot of oil from vertex s (the 
“source”) to vertex t (the “tsink”); what’s the maximum 
flow rate you can achieve in this task? 

To define a “flow” abstractly we need to represent each edge
$\{u, v\}$ by two arcs, $(u, v)$ (thought of as an edge going
from $u$ to $v$) and $(v, u)$. A flow is a function $f$ from the
arcs to the nonnegative real numbers that satisfies the
following property.

For any vertex $u$ other than $s$ or $t$,
$\sum_{v \sim u} f(v, u) = \sum_{v \sim u} f(u, v)$.

This condition says that the amount flowing into a 
vertex is equal to the amount flowing out, except at the 
source (where, normally, we only want stuff flowing out) 
and the tsink (where stuff only flows in). The amount 
leaving the source is then equal to the amount arriving 
at the tsink, and we call that amount the magnitude of 
the flow, denoted |f|.

A flow is valid if it respects capacities, that is, if
$f(u, v) \le c(u, v)$ for each arc $(u, v)$. (Initially, the
capacity of each arc $(u, v)$ is the capacity of the edge
$\{u, v\}$ that gave birth to it.) Our object is to maximize
$|f|$ subject to $f$ being a valid flow.

The “max-flow problem” was formulated in the 1950s 
by T. E. Harris (modeling Soviet rail traffic) and solved 
by, among others, Lester Ford and Delbert Fulkerson 
in 1955. Their algorithm and theorem are presented 
below. Since that time, there have been many alterna- 
tives, variations, and improvements, e.g., by Dinitz, by 
Edmonds and Karp, and by Goldberg and Tarjan. 

The Ford-Fulkerson algorithm proceeds by iteration and
theoretically might not terminate when the capacities are not
rational. If the capacities are integers, it yields an integer
flow that is maximal among all flows. The idea of the algorithm
is that given some flow $f_i$, we “subtract” it from the current
graph $G_i$ to create what is called a residual network
$G_{i+1}$, and then we look for a flow on this that can be added
to $f_i$.

We can start with the zero flow and look for any path in
$G_0 = G$ from $s$ to $t$. We can then send whatever is the
minimum capacity (say, $c$) of the arcs on the path from $s$ to
$t$.

If a step of the path is from $u$ to $v$, then in the residual
graph $G_1$ we must reduce the capacity of the arc $(u, v)$ by
$c$. Equally importantly, we must increase the capacity of the
reverse arc $(v, u)$ by $c$ because we may later wish to send
stuff from $v$ to $u$ and, in so doing, we get to cancel stuff
sent from $u$ to $v$ as well as refill the pipe in the other
direction. So, in the residual graphs the two arcs constituting
a single edge will generally have different capacities, one of
which may be zero. (We can have that in the original graph, too,
if we want.)

At the next step we look for a path in $G_1$ from $s$ to $t$
involving arcs with positive capacity, and we repeat the
procedure. If we can find no such path, we are done.

Figure 5 A maximum flow and a minimum cut (value 9) are found.


How do we know there is no “positive” path? We can 
check each vertex, starting with the neighbors of s and 
then moving to their neighbors, etc., to see whether 
flow can be sent there. Let S be the set of all vertices, 
s itself included, to which we can still send material. If
S does not contain t, it means that no flow can be sent 
out of S in our current residual graph. 

Figure 5 shows four Ford-Fulkerson steps leading to 
a total flow of value 9 and also to a “cut” (see below) of 
the same value. The heavy black line in each residual 
graph shows the chosen positive path. 

When the algorithm terminates, after step $k$, say, the sum
$f = f_1 + f_2 + \cdots + f_k$ of our flows is a maximum flow,
and we can prove it. The reason we cannot escape the set $S$ of
vertices that we can still ship to can only be that $f$ has used
all the original arcs from $S$ to $V \setminus S$ to full
capacity, and no valid flow on $G$ can do better than that. Any
set of vertices containing $s$ and not $t$ is called a cut; the
capacity of a cut is the sum of the capacities of the arcs
leaving it. No flow's magnitude can exceed the capacity of any
cut; succinctly, “max flow $\le$ min cut.” But when the
Ford-Fulkerson algorithm terminates, we have correctly deduced
that max flow = min cut. In fact, this is true whether the
algorithm terminates or not, but we will not complete the proof
here.
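
The residual-graph procedure and the final set $S$ can be sketched as follows. The text leaves the choice of positive path open; this sketch assumes breadth-first search (the choice associated with Edmonds and Karp), and the function name, the dictionary representation of arc capacities, and the small pipeline are all illustrative assumptions rather than the network of figure 5.

```python
from collections import deque

def max_flow_min_cut(capacity, s, t):
    """capacity maps arcs (u, v) to nonnegative numbers. Returns (flow value, set S)."""
    residual = dict(capacity)
    for (u, v) in capacity:
        residual.setdefault((v, u), 0)    # make sure every reverse arc exists
    nbrs = {}
    for (u, v) in residual:
        nbrs.setdefault(u, []).append(v)
    value = 0
    while True:
        # Search outward from s along arcs that still have positive capacity.
        pred = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in pred and residual[(u, v)] > 0:
                    pred[v] = u
                    queue.append(v)
        if t not in pred:
            return value, set(pred)       # S: everything still reachable from s
        # Recover the path from s to t and its minimum residual capacity c.
        path, v = [], t
        while pred[v] is not None:
            path.append((pred[v], v))
            v = pred[v]
        c = min(residual[arc] for arc in path)
        for (u, v) in path:
            residual[(u, v)] -= c         # use up capacity in the forward direction
            residual[(v, u)] += c         # and allow it to be canceled later
        value += c

# Hypothetical pipes; each edge {u, v} becomes two arcs of equal capacity.
pipes = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
capacity = {}
for (u, v), c in pipes.items():
    capacity[(u, v)] = c
    capacity[(v, u)] = c
value, S = max_flow_min_cut(capacity, "s", "t")
cut_capacity = sum(c for (u, v), c in capacity.items() if u in S and v not in S)
```

When the search fails to reach $t$, the set of vertices it did reach is exactly the set $S$ of the argument above, and the capacity of that cut equals the flow value found.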

The point is that there are efficient algorithms 
(linear programming [IV.11 §3] is one method) to
determine both the maximum flow and the minimum 
cut in any s-t network. The fact that these quantities 
are equal (which is itself a special case of linear pro- 
gramming duality) is elementary but powerful; we will 
see one of its myriad consequences in the next section. 

8 Assigning Workers to Jobs 

Suppose you are managing a business and have jobs to 
fill, together with a pool of workers from which to fill 
them. The difficulty is that each worker is qualified to 
do only certain jobs. Can the positions be filled? 

The problem seems similar to finding a stable match- 
ing; we do want a matching, but here the criterion is 
simply that each job is matched to a worker who is 
qualified to do it. Clearly, such a matching might not 
be available; for example, there might be a job no one 
in the pool is qualified to do. How can we tell when all 
the positions can be filled? 

The situation is modeled nicely by what we call a bipartite
graph, one whose vertices can be split into two parts (here,
workers and jobs) in such a way that all edges contain a vertex
from each part. For this problem, let $X$ be the set of jobs and
$Y$ the pool of workers; we draw an edge from $x \in X$ to
$y \in Y$ if job $x$ can be performed by worker $y$. A matching
is a disjoint set of edges. A complete matching (also called “a
matching that covers $X$”) is a disjoint set of edges that
covers every vertex in $X$, i.e., fills every job.

To find a complete matching, we clearly need $|Y| \ge |X|$, and,
as mentioned before, we need every $x \in X$ to have degree at
least 1. In fact, we can generalize these two requirements as
follows. For any set of jobs $A \subseteq X$, let $N(A)$ be the
set of workers who are qualified to do at least one job in $A$.
We then need $|N(A)| \ge |A|$ to have any chance of success. In
words, then, in order for a complete matching to be possible, we
require that for every set of $k$ jobs, we need to have at least
$k$ workers who are qualified to do one or more of the jobs in
the set.

Figure 6 A bipartite graph and its corresponding network.

That this set of criteria is sufficient is known as 
Hall’s marriage theorem and is attributed to Philip Hall 
(1935). 

Theorem (Hall’s marriage theorem). Let $G$ be a bipartite graph
with parts $X$ and $Y$. A matching that covers $X$ is then
possible if and only if for every subset $A \subseteq X$, the
neighborhood $N(A)$ of $A$ in $Y$ has size at least the size of
$A$.

Proof. If there is a matching covering $X$, then any subset $A$
of $X$ is matched to a set in $Y$ of equal size, so $A$’s
neighborhood in $Y$ must have been at least the size of $A$.

What remains is to show that, if $G$ does not have a matching
that covers $X$, then there must be a subset $A$ of $X$ with
$|N(A)| < |A|$. We do this by creating from $G$ an $s$-$t$
network $H$, as follows. We add a source $s$ adjacent to every
vertex in $X$, and a tsink $t$ adjacent to every vertex in $Y$.
All arcs from $s$ to $X$, $X$ to $Y$, and $Y$ to $t$ are given
capacity 1, all other arcs 0 (see figure 6).

If there is a complete matching $M$, we can use the edges of $M$
and the new edges to get a flow on $H$ of magnitude $|X|$.
Otherwise, by the max-flow min-cut theorem, there is a cut $S$
in $H$ of capacity less than $|X|$; let $S_X = S \cap X$,
$S_Y = S \cap Y$. The capacity of $S$ is the number of arcs
flowing out of it, which are of three types: edges from $s$ to
$X \setminus S_X$, numbering $|X| - |S_X|$; edges from $S_X$ to
$Y \setminus S_Y$, numbering, say, $b$; and edges from $S_Y$ to
$t$, numbering $|S_Y|$. But $b + |S_Y|$ is at least the size of
the neighborhood $N(S_X)$ of $S_X$ in $G$ because every such
neighbor has an edge counted in $|S_Y|$ if it is in $S_Y$ and an
edge counted in $b$ if it is not. Thus
$|X| - |S_X| + |N(S_X)| < |X|$, and therefore
$|N(S_X)| < |S_X|$, violating Hall's condition. □
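
For small instances both sides of the theorem can be checked directly. The Python sketch below is ours: `complete_matching` searches for a matching by augmenting paths, which amounts to running the flow argument on the unit-capacity network $H$, while `hall_violator` tests Hall's condition by brute force over all sets of jobs (exponential, so suitable for toy inputs only). The job and worker names are hypothetical.

```python
from itertools import combinations

def complete_matching(qualified):
    """qualified maps each job to the set of workers able to do it.
    Returns a dict job -> worker covering every job, or None if there is none."""
    holder = {}                            # worker -> job currently assigned

    def place(job, seen):
        # Try to give `job` a worker, reassigning earlier jobs if necessary.
        for w in qualified[job]:
            if w not in seen:
                seen.add(w)
                if w not in holder or place(holder[w], seen):
                    holder[w] = job
                    return True
        return False

    for job in qualified:
        if not place(job, set()):
            return None
    return {job: w for w, job in holder.items()}

def hall_violator(qualified):
    """Return a set A of jobs with |N(A)| < |A|, or None if Hall's condition holds."""
    jobs = list(qualified)
    for r in range(1, len(jobs) + 1):
        for A in combinations(jobs, r):
            if len(set().union(*(qualified[x] for x in A))) < r:
                return set(A)
    return None

fillable = {"x1": {"y1", "y2"}, "x2": {"y1"}, "x3": {"y2", "y3"}}
stuck = {"x1": {"y1"}, "x2": {"y1"}, "x3": {"y1", "y2"}}
matching = complete_matching(fillable)
```

In the second instance, jobs x1 and x2 compete for the single worker y1, so no complete matching exists and the pair {x1, x2} violates Hall's condition.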

9 Distributing Frequencies

Suppose you are setting up wireless phone towers in 
a developing country. Each tower communicates with 
the cell phones in its range using one of a small set 
of reserved frequencies. (Many cell phones can use the 
same frequency; in one method, called “time-division 
multiplexing," the tower divides each millisecond into 
parts and assigns one part to each active phone.) How 
can the frequencies be assigned so that no two tow- 
ers that are close enough to cause interference use the 
same frequency? 

We naturally model the towers by a graph G whose 
vertices are the towers, two of which are adjacent if they 
have a potential interference problem. We then need 
to assign frequencies— think of them as colors— to the 
vertices in such a way that no two adjacent vertices get 
the same color. If we can get such a “proper” coloring 
with k colors, G is said to be “k-colorable.” The smallest 
k for which G is k-colorable is the chromatic number of 
G, denoted $\chi(G)$.

Bipartite graphs are 2-colorable, since we can use one 
color for each part. But suppose G is given to us without 
identifying parts; can we tell when it is 2-colorable? 

The answer is yes; there is an efficient algorithm for 
this. We may assume that G is connected (otherwise, 
we execute the algorithm on each connected piece of 
G). Pick any vertex v and color it red; then color all of 
its neighbors blue. Now color all their neighbors red, 
and so on until all the vertices of G are colored. 

When will this fail? In order for two neighboring ver- 
tices, say x and y, to get the same color, each must 
be reachable from v by paths of the same parity (both 
even length, or both odd). Let z be the closest point to x 
and y that shares this property with v ; the paths from 
z to x and to y, together with the edge from x to y, 
then form a cycle in G of odd length. 

But if there is an odd cycle in G, it was never possible 
to color G with two colors. The algorithm will there- 
fore work whenever G is 2-colorable and will (quickly) 
come a cropper otherwise. Moreover, we have derived 
an equivalent condition for 2-colorability. 

Theorem. A graph $G$ has $\chi(G) \le 2$ if and only if $G$
contains no odd cycle.
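
The red/blue procedure translates almost word for word into code. In this sketch (the function name and the adjacency-dictionary input are our assumptions) the coloring spreads outward from a starting vertex in each connected piece, and the function gives up as soon as two neighbors receive the same color.

```python
from collections import deque

def two_color(nbrs):
    """nbrs maps each vertex to its neighbors.
    Returns a red/blue coloring, or None if the graph is not 2-colorable."""
    color = {}
    for start in nbrs:                     # one pass per connected piece
        if start in color:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in color:
                    color[v] = "blue" if color[u] == "red" else "red"
                    queue.append(v)
                elif color[v] == color[u]:
                    return None            # neighbors share a color: an odd cycle exists
    return color

square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
square_coloring = two_color(square)
triangle_coloring = two_color(triangle)
```

The 4-cycle is colored successfully, while the triangle, being an odd cycle, is rejected.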

Alas, the situation with larger numbers of colors is 
apparently quite different; unless P = NP, there is 
no efficient general algorithm for determining when 
$\chi(G) \le k$ for fixed $k > 2$.

However, there are efficient algorithms for a related problem.
Suppose the cell towers need to communicate with each other,
using a special set $S$ of frequencies that cause interference
only when two messages at the same frequency are both being
transmitted, or both being received, by the same tower.

Then, if |S| = k and some tower wants to transmit 
messages to more than k other towers at the same time, 
or to receive from more than k other towers at the same 
time, it is out of luck. So each tower limits its objectives 
accordingly, but does that mean that frequencies can be 
assigned to messages in such a way that no interference 
occurs? 

Here, the graph we want (call it H) needs to have two 
vertices for every tower, one for its role as a transmit- 
ter, the other for its role as a receiver. We put an edge 
between a transmitter node x and a receiver node y if 
tower x wants to send a message to tower y. H is therefore
a bipartite graph, and we could properly color its
vertices with two colors if we wished.

However, it is not towers but messages that need fre- 
quencies here; in other words, we need to color the 
edges, not the vertices, of H, and we want no vertex 
to be in two edges of the same color. Equivalently, we 
want the set of edges that get any particular color to 
constitute a matching. If we have a vertex (transmitter 
or receiver) of degree d and if d exceeds the number k 
of available colors, then as we have already noted, we 
are stuck. Surprisingly, and very usefully, we can, if k
is at least the maximum degree, always assign colors
without a conflict. 

Theorem. Let $H$ be a bipartite graph all of whose vertices have
degree at most $k$. The edges of $H$ can then be colored from a
palette of $k$ colors in such a way that no vertex is contained
in two edges of the same color.

This theorem is often attributed to Dénes Kőnig
(1931), although there are closely related results, both 
earlier and later, by others. To prove it, and indeed 
to actually find such a coloring, it suffices to find a 
matching that covers all the vertices of degree k; we 
can do that using Hall's marriage theorem or the max- 
flow min-cut theorem. We then paint all the edges 
in the matching with the kth color and then remove 
those edges to get a graph of maximum degree at most 
k - 1. Now we repeat the process until we are down to 
degree 0, with all edges colored. 
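
A coloring can also be computed without extracting matchings one at a time. The sketch below does not follow the proof outlined above; it uses a different standard technique, inserting messages one by one and, when the two endpoints disagree about which color is free, swapping two colors along an alternating path. The function name, the tagging of each tower as transmitter ("T") or receiver ("R"), and the four-tower example are our assumptions, and the code assumes no message is listed twice and that $k$ is at least the maximum degree.

```python
def color_edges(edges, k):
    """edges: distinct (transmitter, receiver) pairs; k >= maximum degree.
    Returns a dict mapping each pair to a color in range(k)."""
    at = {}       # at[(vertex, color)] = the neighbor joined to vertex by that color
    color = {}

    def free(w):
        # Smallest color not yet used at vertex w.
        return next(c for c in range(k) if (w, c) not in at)

    def paint(p, q, c):
        at[(p, c)], at[(q, c)] = q, p
        t, r = (p, q) if p[0] == "T" else (q, p)
        color[(t[1], r[1])] = c

    for x, y in edges:
        u, v = ("T", x), ("R", y)
        a, b = free(u), free(v)
        # Walk from v along edges colored a, b, a, ...; the graph is bipartite,
        # so this path can never arrive at u.
        path, w, c = [], v, a
        while a != b and (w, c) in at:
            nxt = at[(w, c)]
            path.append((w, nxt, c))
            w, c = nxt, (b if c == a else a)
        for p, q, c in path:
            del at[(p, c)], at[(q, c)]
        for p, q, c in path:
            paint(p, q, b if c == a else a)   # swap the two colors along the path
        paint(u, v, a)                        # color a is now free at both ends
    return color

towers = "ABCD"
messages = [(x, y) for x in towers for y in towers if x != y]
freq = color_edges(messages, 3)
```

In the example every tower sends to and receives from the three others, so three frequencies are necessary, and the theorem says they suffice.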

10 Avoiding Crossings 

Suppose you want to build a one-layer microchip with components
connected as in a particular graph $G$. Can you do it? More
abstractly, given $G$, can you draw it in the plane so that no
two edges cross? You might choose to draw the graph $K_4$ (four
vertices, every pair constituting an edge) as pictured in
figure 7(a), but with a little care you can redraw it as
pictured in part (b), with no crossings.

Figure 7 Two drawings of the complete graph on four vertices.

A graph drawn on the plane without crossings is 
called a plane graph, and a graph that can be drawn 
that way is said to be planar. (The edges of a plane 
graph need not be straight line segments, although in 
fact they can always be made so.) 

The famous “four-color map theorem,” proved in 
1976 by Kenneth Appel and Wolfgang Haken with computer
help, says that every planar graph is 4-colorable. The 
connection between coloring the (contiguous) countries 
on a map and graph coloring is as follows. Given a map, 
we associate to every country a point in that country, 
connecting two points in adjacent countries by an edge 
that crosses their common border. Properly coloring 
the vertices of this “dual” graph is equivalent to color- 
ing the countries in such a way that no two countries 
with a common border get the same color. 

The four-color theorem may seem like a somewhat 
frivolous result — after all, most cartographers have 
plenty of colors and routinely use more than four— 
but, in fact, it is a fundamental and important theorem 
of graph theory. Planar graphs themselves constitute a 
crucial class of graphs, one that we would be singling 
out even if we did not live in space, had no use for 
planes, and had no aversion to crossings. 

Since the graph $K_5$ requires five colors, it cannot be planar,
if you believe Appel and Haken. But let us see this directly.
Our primary tool is the Jordan curve theorem, which says that a
simple closed curve in the plane divides the plane into two
connected regions (the bounded region inside the curve, and the
outside region). Assume that $K_5$ is drawn on the plane without
crossings, with vertices labeled 1-5. Note that $K_5$ has
$\binom{5}{2} = 10$ edges. The five edges constituting the cycle
1-2-3-4-5-1 make a closed curve, and therefore of the five
“noncycle” edges, either at least three must fall inside the
curve or at least three must fall outside.

Suppose there are three inside (we can turn the argument inside
out to do the other case). We may assume that one is the edge
from 1 to 4 (say), but then the only other noncycle edges that
do not cross it are those from 1 to 3 and from 2 to 4, and these
two cross each other, so we are stuck.

The complete bipartite graph $K_{3,3}$, in which (say) vertices
1, 2, and 3 are adjacent to 4, 5, and 6, is of course
2-colorable, but that does not make it planar. Indeed, assuming
it were, consideration of the simple closed curve made by the
edges of the cycle 1-4-2-5-3-6-1 would again lead to a
contradiction.

Adding new vertices along the edges of $K_5$ or $K_{3,3}$,
thereby creating a “subdivision,” makes no difference 
to the above proofs. The remarkable fact is that we have 
now reached sufficient criteria. The following result, 
usually known as Kuratowski’s theorem, was proved 
by Kazimierz Kuratowski in 1930, as well as by others 
around that time. 

Theorem (Kuratowski’s theorem). A graph $G$ is planar if and
only if it does not contain, as a subgraph, any copy of $K_5$ or
$K_{3,3}$, or any subdivision thereof.

We will not give a proof here, but we do note that 
efficient algorithms exist to determine whether a given 
graph G is planar and, if it is, to construct a drawing of 
G with no crossings. 

11 Delivering the Mail

Suppose you have signed up to be a postman and wish 
to devise a route that will traverse every street in your
district exactly once. When is this possible? When it is, 
how can you find such a route? 

In the abstract, you are given a graph G and wish 
to devise a walk (a sequence of vertices, not necessarily 
distinct, any two consecutive vertices of which are adja- 
cent) that traverses each edge exactly once. Perhaps you 
would also like to start at a particular vertex s and end 
at a particular vertex t, possibly the same one. Such a 
walk is called an Eulerian tour or, if it starts and ends 
at the same point, an Eulerian circuit. 

Euler famously observed, in connection with the problem of
touring the Königsberg (now Kaliningrad) bridges, that such a
walk is not possible if $G$ has more than two vertices of odd
degree. The reason for this is simply that a vertex that is not
the first or last must be exited the same number of times it is
entered, thus using up its edges in pairs. What Euler did not do
is prove that the possession of not more than two vertices of
odd degree, together with connectedness, is sufficient as well
as necessary for such a walk to exist.

Theorem. If G is a connected graph with no vertices of 
odd degree, it has an Eulerian circuit. 

Proof. Let P be a maximum-length walk traversing no 
edge twice. We claim that it is an Eulerian circuit. 

Note first that if $P$ starts at $u$ it must end there as well,
since $u$ is the only place it could get stuck. Why? To get
stuck at a vertex $v \ne u$, you must have run out of edges
containing $v$ after using $k$ of them to get to $v$ and $k - 1$
to escape from $v$, for some $k$. But $v$, like all vertices of
$G$, is supposed to have even degree.

If P is not an Eulerian tour, then because G is con- 
nected there is an unused edge of G that connects a 
vertex of P, say v, with some vertex w (that may or
may not be on P). Make a new walk Q starting at w, 
then traversing the edge {v, w} to v, then following P 
forward to u, then following the first part of P from 
u back to v. Q is longer than P (and is not even stuck 
yet), so our contradiction has been reached. □ 

If G is connected and has precisely two vertices of 
odd degree, say x and y, we can connect them by an 
edge to get a graph all of whose vertices have even 
degree, and then we can use the theorem to find an 
Eulerian circuit P. Tossing the new edge out of P gives 
an Eulerian tour beginning at x and ending at y. The 
new edge may duplicate an edge already present in G, 
but the theorem and proof above work fine for “multi- 
graphs,” which are allowed to have more than one edge 
containing the same two vertices. 

In fact, the theorem and its proof also work fine 
for directed graphs, also called “digraphs,” with arcs 
instead of edges that can be traversed only in one direc- 
tion; some may even be loops from a vertex back to 
itself. The necessary, and again sufficient, condition for 
existence of an Eulerian circuit in a connected digraph 
D is that the “indegree” (the number of arcs entering a 
vertex) must be equal to the outdegree for every vertex 
in D. Whether for graphs or digraphs, the proof above 
implicitly contains one of many efficient algorithms for 
actually generating Eulerian tours or circuits. 
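
One such algorithm is sketched below for undirected multigraphs. It is the stack-based splicing method usually credited to Hierholzer, which is closely related to, but not literally, the maximal-walk argument of the proof; the function name and the edge-list input are our assumptions, and the code presumes the graph is connected with all degrees even.

```python
def eulerian_circuit(edges, start):
    """edges: list of (u, v) pairs of a connected multigraph with all degrees even.
    Returns the vertices of a closed walk using every edge exactly once."""
    incident = {}
    for i, (u, v) in enumerate(edges):
        incident.setdefault(u, []).append(i)
        incident.setdefault(v, []).append(i)
    used = [False] * len(edges)
    stack, circuit = [start], []
    while stack:
        u = stack[-1]
        # Discard edges at u that were already traversed from their other end.
        while incident[u] and used[incident[u][-1]]:
            incident[u].pop()
        if incident[u]:
            i = incident[u].pop()
            used[i] = True
            a, b = edges[i]
            stack.append(b if a == u else a)   # walk along the unused edge
        else:
            circuit.append(stack.pop())        # stuck at u: emit it and back up
    return circuit

# Two triangles sharing vertex 1; every degree is even.
streets = [(1, 2), (2, 3), (3, 1), (1, 4), (4, 5), (5, 1)]
tour = eulerian_circuit(streets, 1)
```

Each edge is examined a bounded number of times, so the running time is proportional to the number of edges.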

The Eulerian circuit theorem for digraphs has a nice application
to de Bruijn sequences. Suppose you have written some software
for a device that has $k$ buttons and you wish to ascertain that
no sequence of $n$ button pushes will disable the device. You
could separately enter all $k^n$ possible sequences, but it
might make sense to save time by making use of overlap. For
example, if there are just two buttons (labeled 0 and 1) and
$n = 3$, you could enter the sequence 0001011100, which captures
all eight length-three binary strings as substrings. Such a
sequence, which contains exactly once each string of length $n$
over an alphabet of size $k$, is called a deBruijn($k$, $n$)
sequence, thus named on account of a 1946 paper by Nicolaas
Govert de Bruijn. The sequences themselves date back to
antiquity; one appears in Sanskrit prosody.

Typically, de Bruijn sequences are thought of as circular, so
the above sequence might be shortened to 00010111 with the
understanding that you are allowed to go “around the corner” to
get 110 and 100. Since you cannot go around the corner in time,
checking your device requires that you append a copy of the
first $n - 1$ characters of a circular sequence to the end.

To see that deBruijn($k$, $n$) sequences exist for any positive
integers $k$ and $n$, we create a digraph $D(k, n)$ as follows.
The vertices of $D(k, n)$ are all $k^{n-1}$ sequences of length
$n - 1$ from our alphabet of size $k$. Arcs correspond to
sequences $x_1, \dots, x_n$ of length $n$ and run from the
vertex $x_1, \dots, x_{n-1}$ to the vertex $x_2, \dots, x_n$;
note that this will be a loop when all the $x_i$ happen to be
the same character.

The outdegree of every vertex in $D(k, n)$ will be $k$, since we
can add an $n$th character in $k$ ways, and similarly the
indegrees will also all be $k$. To see that $D(k, n)$ is
connected, let $x$ be one of the alphabetic characters, and
observe that we can get from any vertex to $xx \cdots x$ by
repeatedly adding $x$ to the end and from $xx \cdots x$ to any
vertex by adding that vertex’s characters one by one.

Applying the theorem gives an Eulerian circuit in $D$, exactly
what is required for a deBruijn($k$, $n$) sequence. In fact, it
can be shown that there are $(k!)^{k^{n-1}} / k^n$ Eulerian
circuits in $D(k, n)$, and thus the same number of de Bruijn
sequences, considered as cycles. Pictured in figure 8 is the
digraph $D(4, 2)$, together with one of the 20 736 resulting
deBruijn(4, 2) sequences.
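
The construction is short enough to run. In the sketch below (our own illustration; the function name is an assumption, the alphabet is passed as a string, and $n \ge 2$ is assumed) the vertices are the strings of length $n - 1$, each vertex has one outgoing arc per character, and an Eulerian circuit is traced with the same stack-based method commonly used for undirected graphs. The last line evaluates the counting formula for $k = 4$, $n = 2$.

```python
from itertools import product
from math import factorial

def de_bruijn(alphabet, n):
    """Circular sequence containing every length-n string over `alphabet` exactly once.
    Assumes n >= 2 and that `alphabet` is a string of distinct characters."""
    # unused[v] lists the characters whose arcs out of vertex v are still untraversed.
    unused = {"".join(p): list(alphabet) for p in product(alphabet, repeat=n - 1)}
    start = alphabet[0] * (n - 1)
    stack, circuit = [start], []
    while stack:
        v = stack[-1]
        if unused[v]:
            ch = unused[v].pop()
            stack.append((v + ch)[1:])     # arc v + ch leads to the vertex of its last n-1 characters
        else:
            circuit.append(stack.pop())
    circuit.reverse()                      # arcs are directed, so restore forward order
    # Each arc contributes the final character of the vertex it enters.
    return "".join(v[-1] for v in circuit[1:])

binary = de_bruijn("01", 3)
# To test a device linearly, append the first n - 1 characters.
linear = binary + binary[:2]
windows = {linear[i:i + 3] for i in range(len(binary))}

count = factorial(4) ** (4 ** (2 - 1)) // 4 ** 2   # number of circular deBruijn(4, 2) sequences
```

The set `windows` collects the length-three substrings of the linearized sequence; there are eight of them, all different.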

It is a curiosity that, given any digraph, one can effi- 
ciently compute the precise number of Eulerian cir- 
cuits, while for ordinary (undirected) graphs, no one 
knows an efficient algorithm that can even estimate 
that number. (In many cases, algorithms that work 
for digraphs work just as well for undirected graphs 
because you can replace an edge by an arc in each direc- 
tion, as we did above in section 7. But doing that here 
would result in walks that traverse every edge twice.)

Figure 8 The digraph D(4, 2) and the de Bruijn
sequence arising from an Eulerian circuit.

We end this section, and indeed our whole discussion of applied
combinatorics and graph theory, with one additional observation.
Instead of the digraph $D(k, n)$, we could have created an
undirected graph $G(k, n)$ whose vertices are all the sequences
of length $n$, two being connected if they overlap in a string
of length $n - 1$. A deBruijn($k$, $n$) sequence would then
constitute a cycle in $G(k, n)$ that hits every vertex (instead
of every edge) exactly once; such a cycle is called a Hamilton
circuit.

But computability with respect to Hamilton circuits 
is startlingly different from the Eulerian case. Unless 
P = NP, there is no efficient way to determine whether 
an input graph has a Hamilton circuit or, if it does, to 
find one. It is hard to say why, apart from noting that 
in one case there is a theorem that provides an easy 
algorithm while in the other case no known theorem 
comes to the rescue. It seems that some combinatorial 
tasks are easy, and some are hard; that is just the way 
it is. 

We have tried here to present some of the combina- 
torial problems that are easy, once you know how to 
do them. We hope that all your problems are similarly 
straightforward, but if they are not, some (of the many) 
books that will get you deeper into the above topics are 
listed below.

IV.38 Combinatorial Optimization

Jens Vygen

Combinatorial optimization problems arise in numer- 
ous applications. In general, we look for an optimal ele- 
ment of a finite set. However, this set is too large to be 
enumerated; it is implicitly given by its combinatorial 
structure. The goal is to develop efficient algorithms by 
understanding and exploiting this structure. 

1 Some Important Problems 

First we give some classical examples. We refer to the article
on graph theory [II.16] for basic notation. In a directed graph,
we denote by $\delta^+(X)$ and $\delta^-(X)$ the set of edges
leaving and entering $X$, respectively; here, $X$ can be a
vertex or a set of vertices. In an undirected graph, $\delta(X)$
denotes the set of edges with exactly one endpoint in $X$.

1.1 Spanning Trees 

In this problem we are given a finite connected undirected graph
$(V, E)$ (so $V$ is the set of vertices and $E$ the set of
edges) and weights on the edges, i.e., $c(e) \in \mathbb{R}$ for
all $e \in E$. The task is to find a set $T \subseteq E$ such
that $(V, T)$ is a (spanning) tree and $\sum_{e \in T} c(e)$ is
minimized. (Recall that a tree is a connected graph without
cycles.)

A set V of eight points in the Euclidean plane is 
shown on the left of the figure below. Assuming that 
(V, E) is the complete graph on these points (every pair 
of vertices is connected by an edge) and c is the Euclid- 
ean distance, the right-hand side shows an optimal 
solution. 



1.2 Maximum Flows 

Given a finite directed graph (V, E), two vertices $s, t \in V$
(source and sink), and capacities $u(e) \in \mathbb{R}_{\ge 0}$ for all
$e \in E$, we look for an s-t flow $f : E \to \mathbb{R}_{\ge 0}$ with
$f(e) \le u(e)$ for all $e \in E$ and $f(\delta^-(v)) = f(\delta^+(v))$ for all
$v \in V \setminus \{s, t\}$ (flow conservation: the total incoming
flow equals the total outgoing flow at any vertex except
s and t). Here $f(F)$ abbreviates $\sum_{e \in F} f(e)$ for a set F of edges.
The goal is to maximize $f(\delta^-(t)) - f(\delta^+(t))$,
i.e., the total amount of flow shipped from s to t. This
is called the value of f.

The figure below illustrates this: the left-hand side
displays an instance, with the capacities shown next to
the edges, and the right-hand side shows an s-t flow of
value 7. This is not optimal.


1.3 Matching 

Given a finite undirected graph (V, E), find a matching
$M \subseteq E$ that is as large as possible. (A matching is a set
of edges whose endpoints are all distinct.)

1.4 Knapsack 

Given $n \in \mathbb{N}$, positive integers $a_i$, $b_i$ (the profit and
weight of item i, for i = 1, ..., n), and B (the knapsack's
capacity), find a subset $I \subseteq \{1, \dots, n\}$ with $\sum_{i \in I} b_i \le B$
such that $\sum_{i \in I} a_i$ is as large as possible.

1.5 Traveling Salesman 

Given a finite set X with metric d, find a bijection
$\pi : \{1, \dots, n\} \to X$ such that the length of the
corresponding tour,

$\sum_{i=1}^{n-1} d(\pi(i), \pi(i+1)) + d(\pi(n), \pi(1))$,

is as small as possible.

1.6 Set Covering 

Given a finite set U and subsets $S_1, \dots, S_n$ of U, find the
smallest collection of these subsets whose union is U,
i.e., $I \subseteq \{1, \dots, n\}$ with $\bigcup_{i \in I} S_i = U$ and $|I|$ (the number
of elements in the set I) minimized.

2 General Formulation and Goals 

2.1 Instances and Solutions 

These problems have many common features. 

In each case there are infinitely many instances, each 
of which can be described (up to renaming) by a finite 
set of bits and in some cases by a finite set of real 
numbers. 

For each instance, there is a set of feasible solutions.
This set is finite in most cases. In the maximum-flow
problem it is actually infinite, but even here one can,
without loss of generality, restrict to a finite set of
solutions (see below).

Given an instance and a feasible solution, we can eas- 
ily compute its value. For example, in the matching 
problem, the instances are the finite undirected graphs; 
for each instance G, the set of feasible solutions are 
the matchings in G; and for each matching, its value is 
simply its cardinality. 

Even if the number of feasible solutions is finite, it 
cannot be bounded by a polynomial in the instance 
size (the number of bits that is needed to describe the 
instance). For example, there are $n^{n-2}$ trees (V, T) with
V = {1, ..., n} (this is Cayley's formula). Similarly, the
number of matchings on n vertices, of subsets of an n- 
element set, and of permutations on n elements grow 
exponentially in n. One cannot enumerate all of them 
in reasonable time except for very small n. 

Whenever an instance contains real numbers, we 
assume that we can do elementary operations with 
them, or we assume them to be rationals with binary 
encoding. 

2.2 Algorithms 

The main goal of combinatorial optimization is to 
devise efficient algorithms for solving such problems. 

Efficient usually means in polynomial time, that is, 
the number of elementary steps can be bounded by a 
polynomial in the instance size. Of course, the faster 
the better. 

Solving a problem usually means always (i.e., for 
every given instance) computing a feasible solution 
with optimum value. 

We give an example of an efficient algorithm solving 
the spanning tree problem in section 3. 

However, for NP-hard [I.4 §4.1] problems (like the
last three examples in our list), an efficient algorithm 
that solves the problem does not exist unless P = NP, 
and consequently one is satisfied with less (see sec- 
tion 5). 

2.3 Other Goals 

Besides developing algorithms and proving their cor- 
rectness and efficiency, combinatorial optimization 
(and related areas) also comprises other work: 

• analyzing combinatorial structures, such as graphs, 
matroids, polyhedra, hypergraphs; 

• establishing relations between different combinatorial
optimization problems (reductions, equivalence,
bounds, relaxations);

• proving properties of optimal (or near-optimal)
solutions;

• studying the complexity of problems and establish- 
ing hardness results; 

• implementing algorithms and analyzing their prac- 
tical performance; and 

• applying combinatorial optimization problems to 
real-world problems. 


3 The Greedy Algorithm 

The spanning tree problem has a very simple solution: 
the greedy algorithm does the job. We can start with the 
empty set and successively pick a cheapest edge that 
does not create a cycle until our subgraph is connected. 
Formally, 

(1) sort $E = \{e_1, \dots, e_m\}$ so that $c(e_1) \le \dots \le c(e_m)$,

(2) let T be the empty set, and

(3) for i = 1, ..., m do

• if $(V, T \cup \{e_i\})$ contains no cycle,

• then add $e_i$ to T.

In our example, the first four steps would add 
the four shortest edges (shown on the left-hand side 
below). Then the dotted edge is examined, but it is not 
added as it would create a cycle. The right-hand side 
shows the final output of the algorithm. 
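
The greedy steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vertices are labelled 0, ..., n-1, edges are given as (cost, u, v) triples, and the cycle test uses a union-find structure (one common way to reach a running time of about O(m log n), not necessarily the implementation the text has in mind). The function name `kruskal` and the small instance are hypothetical.

```python
def kruskal(n, edges):
    """Greedy minimum spanning tree of a connected undirected graph.

    n     -- number of vertices, labelled 0..n-1 (an assumption of this sketch)
    edges -- list of (cost, u, v) triples
    Returns the list of chosen edges.
    """
    parent = list(range(n))

    def find(x):
        # Root of the component containing x (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for cost, u, v in sorted(edges):      # step (1): nondecreasing cost
        ru, rv = find(u), find(v)
        if ru != rv:                      # step (3): adding (u, v) closes no cycle
            parent[ru] = rv
            tree.append((cost, u, v))
    return tree


# Hypothetical instance with four vertices and five edges.
edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3), (5, 1, 3)]
tree = kruskal(4, edges)
total = sum(cost for cost, _, _ in tree)
```

On this instance the edge of cost 3 is examined but rejected, because its endpoints are already connected by the two cheaper edges.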



This algorithm can be easily implemented so that it
runs in O(nm) time, where n = |V| and m = |E|.
With a little more care, a running time of O(m log n)
can be obtained. This is therefore a polynomial-time
algorithm.

This algorithm computes a maximal set T such that
(V, T) contains no cycle. In other words, (V, T) is a tree.
It is not completely obvious that the output (V, T) is
always an optimal solution, i.e., a tree with minimum
weight. Let us give a nice and instructive proof of this
fact.

3.1 Proof of Correctness

Let (V, T*) be an optimal tree, and choose T* so that
$|T^* \cap T|$ is as large as possible. Suppose $T^* \ne T$.

All spanning trees have exactly |V| - 1 edges, implying
that $T^* \setminus T \ne \emptyset$. Let $j \in \{1, \dots, m\}$ be the smallest
index with $e_j \in T^* \setminus T$.

Since the greedy algorithm did not add $e_j$ to T,
there must be a cycle with edge set $C \subseteq \{e_j\} \cup (T \cap
\{e_1, \dots, e_{j-1}\})$ and $e_j \in C$.

$(V, T^* \setminus \{e_j\})$ is not connected, so there is a set $X \subset V$
with $\delta(X) \cap T^* = \{e_j\}$. (Recall that $\delta(X)$ denotes the
set of edges with exactly one endpoint in X.)

Now $|C \cap \delta(X)|$ is even (any cycle enters a set X the
same number of times that it leaves X), so it is at least
two. Let $e_i \in (C \cap \delta(X)) \setminus \{e_j\}$. Note that i < j and thus
$c(e_i) \le c(e_j)$.

Let $T^{**} := (T^* \setminus \{e_j\}) \cup \{e_i\}$. Then (V, T**) is a tree
with $c(T^{**}) = c(T^*) - c(e_j) + c(e_i) \le c(T^*)$. So T**
is also optimal. But T** has one more edge in common
with T (the edge $e_i$) than T*, contradicting the choice
of T*.

3.2 Generalizations

In general (and for any of the other problems above),
no simple greedy algorithm will always find an optimal
solution.

The reason that the greedy approach works for spanning
trees is that here the feasible solutions form the
bases of a matroid. Matroids are a well-understood
combinatorial structure that can in fact be characterized
by the optimality of the greedy algorithm.

Generalizations such as optimization over the intersection
of two matroids or minimization of submodular
functions (given by an oracle) can also be solved in
polynomial time, with more complicated combinatorial
algorithms.


4 Duality and Min-Max Equations 

The relationships between different problems can lead 
to many important insights and algorithms. We give 
some well-known examples. 

4.1 The Max-Flow Min-Cut Theorem 

We begin with the maximum-flow problem and its relation
to s-t cuts. An s-t cut is the set of edges leaving
X (denoted by $\delta^+(X)$) for a set $X \subset V$ with $s \in X$ and
$t \notin X$.

The total capacity of the edges in such an s-t cut,
denoted by $u(\delta^+(X))$, is an upper bound on the value
of any s-t flow f in (G, u). This is because this value
is precisely $f(\delta^+(X)) - f(\delta^-(X))$ for every set X containing
s but not t, and $0 \le f(e) \le u(e)$ for all
$e \in E$.

The famous max-flow min-cut theorem says that the
upper bound is tight: the maximum value of an s-t flow
equals the minimum capacity of an s-t cut. In other
words, if f is any s-t flow with maximum value, then
there is a set $X \subset V$ with $s \in X$, $t \notin X$, $f(e) = u(e)$ for
all $e \in \delta^+(X)$, and $f(e) = 0$ for all $e \in \delta^-(X)$.

Indeed, if no such set exists, we can find a directed
path P from s to t in which each edge e = (v, w) is
either an edge of G with $f(e) < u(e)$ or the reverse:
e' := (w, v) is an edge of G with $f(e') > 0$. (This follows
from letting X be the set of vertices that are reachable
from s along such paths.)

Such paths are called augmenting paths because 
along such a path we can augment the flow by increas- 
ing it on forward edges and decreasing it on back- 
ward edges. Some flow algorithms (but generally not 
the most efficient ones) start with the all-zero flow and 
successively find an augmenting path. 
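
As an illustration of augmenting paths, here is a minimal Python sketch that starts with the all-zero flow and repeatedly augments along a shortest augmenting path found by breadth-first search. The choice of shortest paths, the function name `max_flow`, and the small instance are assumptions of this sketch (the instance is not the one in the figure). When no augmenting path remains, the set X of vertices still reachable from s is returned as well, and its cut certifies optimality.

```python
from collections import deque


def max_flow(capacity, s, t):
    """capacity maps each directed edge (v, w) to a nonnegative number.

    Returns (value, X): the maximum flow value and the set X of vertices
    reachable from s by augmenting paths when the algorithm stops.
    """
    # residual[v][w]: how much more flow can be pushed from v to w
    residual = {s: {}, t: {}}
    for (v, w), u in capacity.items():
        residual.setdefault(v, {})
        residual.setdefault(w, {})
        residual[v][w] = residual[v].get(w, 0) + u
        residual[w].setdefault(v, 0)

    value = 0
    while True:
        pred = {s: None}                  # breadth-first search tree
        queue = deque([s])
        while queue and t not in pred:
            v = queue.popleft()
            for w, r in residual[v].items():
                if r > 0 and w not in pred:
                    pred[w] = v
                    queue.append(w)
        if t not in pred:                 # no augmenting path is left
            return value, set(pred)
        path, w = [], t
        while pred[w] is not None:        # walk back from t to s
            path.append((pred[w], w))
            w = pred[w]
        delta = min(residual[v][w] for v, w in path)
        for v, w in path:
            residual[v][w] -= delta       # forward edge: increase flow
            residual[w][v] += delta       # backward edge: allows undoing it
        value += delta


# Hypothetical instance (not the one from the figure).
capacity = {("s", "a"): 4, ("s", "c"): 3, ("a", "b"): 2, ("a", "c"): 3,
            ("c", "b"): 2, ("c", "t"): 3, ("b", "t"): 4}
value, X = max_flow(capacity, "s", "t")
cut = sum(u for (v, w), u in capacity.items() if v in X and w not in X)
```

The capacity `cut` of the edges leaving X equals `value`, which is the max-flow min-cut theorem on this instance.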

The figure below shows how to augment the flow 
shown in section 1.2 by one unit along the path a-c-b- 
t (shown in bold on the left). The resulting flow, with 
value 8 (shown on the right), is optimal, as is proved by 
the s-t cut $\delta^+(\{s, a, c\}) = \{(a, b), (c, t)\}$ of capacity 8.


The above relation also shows that for finding an 
s-t cut with minimum capacity, it suffices to solve 
the maximum-flow problem. This can also be used to 
compute a minimum cut in an undirected graph or to 
compute the connectivity of a given graph. 

Any s-t flow can be decomposed into flows on s-t 
paths, and possibly on cycles (but cyclic flow is redun- 
dant as it does not contribute to the value). This decom- 
position can be done greedily, and the list of paths 
is then sufficient to recover the flow. This shows that 
one can restrict to a finite number of feasible solutions 
without loss of generality. 

4.2 Disjoint Paths 

If all capacities are integral (i.e., are integers), one can
find a maximum flow by always augmenting by 1 along
an augmenting path, until none exists anymore. This
is not a polynomial-time algorithm (because the number
of iterations can grow exponentially in the instance
size), but it shows that in this case there is always an
optimal flow that is integral. An integral flow can be
decomposed into integral flows on paths (and possibly
cycles).

Hence, in the special case of unit capacities an inte- 
gral flow can be regarded as a set of pairwise edge- 
disjoint s-t paths. Therefore, the max-flow min-cut 
theorem implies the following theorem, due to Karl 
Menger. Let (V, E) be a directed graph and let $s, t \in V$.
Then the maximum number of paths from s to t that
are pairwise edge disjoint equals the minimum number
of edges in an s-t cut.

Other versions of Menger's theorem exist, for instance,
for undirected graphs and for (internally)
vertex-disjoint paths.

In general, finding disjoint paths with prescribed
endpoints is difficult; for example, it is NP-complete
[I.4 §4.1] to decide whether, in a given directed graph
with vertices s and t, there is a path P from s to t and a
path Q from t to s such that P and Q are edge disjoint.

4.3 Linear Programming Duality 

The maximum-flow problem (and also generalizations 
like minimum-cost flows and multicommodity flows) 
can be formulated as linear programs in a straightfor- 
ward way. 

Most other combinatorial optimization problems 
involve binary decisions and can be formulated natu- 
rally as (mixed-) integer linear programs. We give an 
example for the matching problem. 

The matching problem can be written as the integer 
linear program 

$\max\{\mathbf{1}^T x : Ax \le \mathbf{1},\ x_e \in \{0, 1\}\ \forall e \in E\}$,

where A is the vertex-edge-incidence matrix of the
given graph G = (V, E), $\mathbf{1} = (1, 1, \dots, 1)^T$ denotes an
appropriate all-one vector (so $\mathbf{1}^T x$ is just an abbreviation
of $\sum_{e \in E} x_e$), and $\le$ is meant componentwise. The

feasible solutions to this integer linear program are 
exactly the incidence vectors of matchings in G. 

Solving integer linear programs is NP-hard in gen- 
eral (see section 5.2), but linear programs (without inte- 
grality constraints) can be solved in polynomial time 
(see continuous optimization [IV.11 §3]). This is one
reason why it is often useful to consider the linear
relaxation, which here is

$\max\{\mathbf{1}^T x : Ax \le \mathbf{1},\ x \ge 0\}$,

where 0 and $\mathbf{1}$ denote appropriate all-zero and all-one
vectors, respectively. The entries of x can now be any
real numbers between 0 and 1.

The dual linear program (LP) is

$\min\{y^T \mathbf{1} : y^T A \ge \mathbf{1}^T,\ y \ge 0\}$.

By weak duality, every dual feasible vector y yields an
upper bound on the optimum. (Indeed, if x is the incidence
vector of a matching M and $y \ge 0$ with $y^T A \ge \mathbf{1}^T$,
then $|M| = \mathbf{1}^T x \le y^T A x \le y^T \mathbf{1}$.)

If G is bipartite, it turns out that these two LPs actu- 
ally have integral optimal solutions. The minimal inte- 
gral feasible solutions of the dual LP are exactly the 
incidence vectors of vertex covers (sets $X \subseteq V$ such
that every edge has at least one endpoint in X). 

In other words, in any bipartite graph G, the maxi- 
mum size of a matching equals the minimum size of a 
vertex cover. This is a theorem of Dénes Kőnig. It can
also be deduced from the max-flow min-cut theorem. 

For general graphs, this is not the case, as, for exam- 
ple, the triangle (the complete graph on three ver- 
tices) shows. Nevertheless, the convex hull of incidence 
vectors of matchings in general graphs can also be 
described well; it is 

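$\left\{ x : Ax \le \mathbf{1},\ x \ge 0,\ \sum_{e \in E[A]} x_e \le \left\lfloor \tfrac{|A|}{2} \right\rfloor\ \forall A \subseteq V \right\}$,

where, for a vertex set $A \subseteq V$, E[A] denotes the set of edges whose endpoints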
both belong to A. This was shown by Jack Edmonds 
in 1965, who also found a polynomial-time algorithm 
for the matching problem. In contrast, the problem of 
finding a minimum vertex cover in a given graph is 
NP-hard. 
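
For the triangle, the gap between the integer program and its relaxation can be checked by hand. The following worked computation is a sketch using the notation of this section; the labels $e_{12}, e_{13}, e_{23}$ for the three edges are introduced here only for illustration.

```latex
% Triangle: V = \{1,2,3\},\quad E = \{e_{12}, e_{13}, e_{23}\}.
\begin{align*}
  x &= \bigl(\tfrac12, \tfrac12, \tfrac12\bigr):
      & Ax &= (1,1,1)^T \le \mathbf{1},
      & \mathbf{1}^T x &= \tfrac32
      && \text{(primal feasible)},\\
  y &= \bigl(\tfrac12, \tfrac12, \tfrac12\bigr):
      & y^T A &= (1,1,1) \ge \mathbf{1}^T,
      & y^T \mathbf{1} &= \tfrac32
      && \text{(dual feasible)}.
\end{align*}
% By weak duality both linear programs have optimum value 3/2, whereas a
% maximum matching has 1 edge and a minimum vertex cover has 2 vertices.
% The additional inequality for the vertex set A = V,
\[
  x_{e_{12}} + x_{e_{13}} + x_{e_{23}} \le \bigl\lfloor \tfrac{3}{2} \bigr\rfloor = 1,
\]
% cuts off the fractional point x.
```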

5 Dealing with NP-Hard Problems 

The other three problems mentioned in section 1 (knap- 
sack, traveling salesman, and set covering) are NP-hard: 
they have a polynomial-time algorithm if and only if 
P = NP. 

Since most researchers believe that P ≠ NP, they gave
up looking for polynomial-time algorithms for NP-hard 
problems. Algorithms are sought with weaker proper- 
ties, for example ones that 

• solve interesting special cases in polynomial time; 

• run in exponential time but faster than trivial enu- 
meration; 

• always compute a feasible solution whose value is 
at most k times worse than the optimum (so-called 
k-approximation algorithms (see section 5.1)); 

• are efficient or compute good solutions for most 
instances, in some probabilistic model; 


• are randomized (use random bits in their computa- 
tion) and are expected to behave well; or 

• run fast and produce good results in practice, 
although there is no formal proof (heuristics).

5.1 Approximation Algorithms 

From a theoretical point of view, the notion of approx- 
imation algorithms has proved to be most fruitful. For 
example, for the knapsack problem (section 1.4) there 
is an algorithm that for any given instance and any
given number $\varepsilon > 0$ computes a solution at most
$1 + \varepsilon$ times worse than the optimum, and whose running
time is proportional to $n^2/\varepsilon$. For the traveling
salesman problem [VI.18] (see section 1.5), there is a
3/2-approximation algorithm.

For set covering (section 1.6) there is no constant-factor
approximation algorithm unless P = NP. But consider
the special case where we ask for a minimum vertex
cover in a given graph G; here, U is the edge set of G
and $S_i = \delta(v_i)$ for $i = 1, \dots, n$, where $V = \{v_1, \dots, v_n\}$
is the vertex set of G. Here, we can use the above-mentioned
fact that the size of any matching in G is
a lower bound. Indeed, if we take any (inclusion-wise)
maximal matching M (e.g., one found by the greedy
algorithm), then the 2|M| endpoints of the edges in M
form a vertex cover. As |M| is a lower bound on the
optimum, this is a simple 2-approximation algorithm.
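
The argument translates directly into code. The following Python sketch builds a maximal matching greedily and returns the endpoints of its edges; the function name and the example graph are hypothetical, and the graph is assumed to have no loops.

```python
def vertex_cover_2approx(edges):
    """Greedy maximal matching and the vertex cover formed by its endpoints.

    edges -- list of pairs (v, w) with v != w (no loops, an assumption here)
    Returns (matching, cover) with len(cover) == 2 * len(matching).
    """
    matching = []
    cover = set()
    for v, w in edges:
        # Take the edge only if it shares no endpoint with the matching so far.
        if v not in cover and w not in cover:
            matching.append((v, w))
            cover.update((v, w))
    return matching, cover


# Hypothetical example: a 5-cycle with one chord.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)]
matching, cover = vertex_cover_2approx(edges)
```

Every edge that was skipped has an endpoint in `cover` (that is why it was skipped), so `cover` is indeed a vertex cover, and any vertex cover needs at least `len(matching)` vertices.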

5.2 Integer Linear Optimization 

Most classical combinatorial optimization problems 
can be formulated as integer linear programs 

$\min\{c^T x : Ax \le b,\ x \in \mathbb{Z}^n\}$.

This includes all problems discussed in this chapter, 
except the maximum-flow problem, which is in fact a 
linear program. The variables are often restricted to 0 
or 1. Sometimes, some variables are continuous, while 
others are discrete: 

$\min\{c^T x : Ax + By \le d,\ x \in \mathbb{R}^m,\ y \in \mathbb{Z}^n\}$.

Such problems are called mixed-integer linear pro- 
grams. 

Discrete optimization comprises combinatorial opti- 
mization but also general (mixed-) integer optimization 
problems with no special combinatorial structure. 

For general (mixed-) integer linear optimization all
known algorithms have exponential worst-case running
time. The most successful algorithms in practice
use a combination of cutting planes and branch
and bound (see sections 6.2 and 6.7). These are implemented
in advanced commercial software. Since many
practical problems (including almost all classical com- 
binatorial optimization problems) can be described as 
(mixed-) integer linear programs, such software is rou- 
tinely used in practice to solve small and medium-sized 
instances of such problems. However, combinatorial 
algorithms that exploit the specific structure of the 
given problem are normally superior and are often the 
only choice for very large instances. 

6 Techniques 

Since good algorithms have to exploit the structure 
of the problem, every problem requires different tech- 
niques. Some techniques are quite general and can be 
applied for a large variety of problems, but in many 
cases they will not work well. Nevertheless, we list the 
most important techniques that have been applied suc- 
cessfully to several combinatorial optimization prob- 
lems. 

6.1 Reductions

Reducing an unknown problem to a known (and solved) 
problem is of course the most important technique. To 
prove hardness, one proceeds the other way round: we 
reduce a problem that we know to be hard to a new 
problem (that must then also be hard). If reductions 
work in both directions, the problems can actually be regarded
as equivalent.

6.2 Enumeration Techniques 

Some problems can be solved by skillful enumeration. 
Dynamic programming is such a technique. It works 
if optimal solutions arise from optimal solutions to
"smaller" problems by simple operations. Dijkstra's
shortest-path algorithm [VI.10] is a good example.
Many algorithms on trees use dynamic programming. 

Branch and bound is another well-known enumera- 
tion technique. Here, one enumerates only parts of a 
decision tree because lower and upper bounds tell us 
that the unvisited parts cannot contain a better solu- 
tion. How well this works mainly depends on how good 
the available bounds are. 

6.3 Reducing or Decomposing the Instance 

Often, an instance can be preprocessed by removing
irrelevant parts. In other cases one can compute a
smaller instance or an instance with a certain structure
whose solution implies a solution of the original
instance.

Another well-known technique is divide and conquer
[I.4 §3]. In some problems, instances can be
decomposed/partitioned into smaller instances whose 
solutions can then be combined in some way. 

6.4 Combinatorial or Algebraic Structures 

If the instances have a certain structure (like planarity 
or certain connectivity or sparsity properties of graphs, 
cross-free set families, matroid structures, submodular 
functions, etc.), this must usually be exploited. 

Also, optimal solutions (of relaxations or the origi- 
nal problem) often have a useful structure. Sometimes 
(e.g., by sparsification or uncrossing techniques) such a 
structure can be obtained even if it is not there to begin 
with. 

Many algorithms compute and use a combinatorial 
structure as a main tool. This is often a graph struc- 
ture, but sometimes an algebraic view can reveal cer- 
tain properties. For instance, the Laplacian matrix of 
a graph has many useful properties. Sometimes sim- 
ple properties, like parity, can be extremely useful and 
elegant. 

6.5 Primal-Dual Relations 

We discussed linear programming duality, a key tool for 
many algorithms, above. Lagrangian duality can also 
be useful for nonlinear problems, and sometimes other 
kinds of duality, like planar duality or dual matroids, 
are very useful. 

6.6 Improvement Techniques 

It is natural to start with some solution and iteratively 
improve it. The greedy algorithm and finding augment- 
ing paths can be considered as special cases. In general, 
some way of measuring progress is needed so that the 
algorithm will terminate. 

The general principle of starting with any feasible 
solution and iteratively improving it by small local 
changes is called local search. Local-search heuristics 
are often quite successful in practice, but in many cases 
no reasonable performance guarantees can be given. 

6.7 Relaxation and Rounding 

Relaxations can arise combinatorially (by allowing solutions
that do not have a certain property that was originally
required for feasible solutions) or by omitting
integrality constraints of a description as an optimization
problem over variables in $\mathbb{R}^n$.

Linear programming formulations can imply polyno- 
mial-time algorithms even if they have exponentially 
many variables or constraints (by the equivalence of 
optimization and separation). Linear relaxations can 
be strengthened by adding further linear constraints, 
called cutting planes. 

One can also consider nonlinear relaxations. In par- 
ticular, semidefinite relaxations have been used for 
some approximation algorithms. 

Of course, after solving a relaxation, the originally 
required property must be restored somehow. If a frac- 
tional solution is made integral, this is often called 
rounding. Note that rounding is used here in a gen- 
eral sense (deriving an integral solution from a frac- 
tional one), and not specifically meaning rounding to 
the nearest integer. Sophisticated rounding algorithms 
for various purposes have been developed. 

6.8 Scaling and Rounding 

Often, a problem becomes easier if the numbers in the 
instance are small integers. This can be achieved by 
scaling and rounding, of course at the cost of a loss 
of accuracy. The knapsack problem (see section 1.4) 
is a good example; the best algorithms use scaling 
and rounding and then solve the rounded instance by 
dynamic programming. 

In some cases a solution of the rounded instance can 
be used in subsequent iterations to obtain more accu- 
rate, or even exact, solutions of the original instance 
more quickly. 
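
To make scaling and rounding concrete, here is a Python sketch of the basic textbook scheme for the knapsack problem: profits are divided by a scaling factor and rounded down, and the rounded instance is solved exactly by dynamic programming over profit values. This is an assumption-laden illustration, not the fastest known algorithm; it returns a solution of value at least $(1 - \varepsilon)$ times the optimum, and its running time is polynomial in n and $1/\varepsilon$ but larger than the bound quoted in section 5.1. The function name and the instance are hypothetical.

```python
from fractions import Fraction


def knapsack_scaled(profits, weights, capacity, eps):
    """Approximate knapsack by scaling, rounding, and dynamic programming.

    Returns a list of item indices whose total weight is at most capacity.
    """
    # Items that do not fit on their own can never be used.
    items = [i for i in range(len(profits)) if weights[i] <= capacity]
    if not items:
        return []
    # Scaling factor; exact rational arithmetic avoids rounding surprises.
    scale = Fraction(eps) * max(profits[i] for i in items) / len(items)
    scaled = {i: profits[i] // scale for i in items}   # rounded-down profits

    # best[p] = (minimum weight, items) reaching scaled profit exactly p
    best = {0: (0, [])}
    for i in items:
        for p, (w, chosen) in list(best.items()):      # snapshot: item i used once
            q, w2 = p + scaled[i], w + weights[i]
            if w2 <= capacity and (q not in best or w2 < best[q][0]):
                best[q] = (w2, chosen + [i])
    return best[max(best)][1]


# Hypothetical instance: three items, capacity 50.
profits, weights = [60, 100, 120], [10, 20, 30]
chosen = knapsack_scaled(profits, weights, 50, 0.5)
value = sum(profits[i] for i in chosen)
```

Smaller values of `eps` give a finer rounding and therefore a larger table `best`, which is the trade-off between accuracy and running time mentioned above.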

6.9 Geometric Techniques 

The role that geometric techniques play is also becom- 
ing more important. Describing (the convex hull of) fea- 
sible solutions by a polyhedron is a standard technique. 
Planar embeddings of graphs (if they exist) can often be 
exploited in algorithms. Approximating a certain met- 
ric space by a simpler one is an important technique in 
the design of approximation algorithms. 

6.10 Probabilistic Techniques

Sometimes, a probabilistic view makes problems much
easier. For example, a fractional solution can be viewed
as a convex combination of extreme points, or as a
probability distribution. Arguing over the expectation
of some random variables can lead to simple algorithms
and proofs. Many randomized algorithms can
be derandomized, but this often complicates matters.

Further Reading 

Korte, B., and J. Vygen. 2012. Combinatorial Optimization:
Theory and Algorithms, 5th edn. Berlin: Springer.
Schrijver, A. 2003. Combinatorial Optimization: Polyhedra
and Efficiency. Berlin: Springer.


IV.39 Algebraic Geometry 

Frank Sottile 


Physical objects and constraints may be modeled by 
polynomial equations and inequalities. For this reason, 
algebraic geometry, the study of solutions to systems 
of polynomial equations, is a tool for scientists and 
engineers. Moreover, relations between concepts aris- 
ing in science and engineering are often described by 
polynomials. Whatever their source, once polynomials 
enter the picture, notions from algebraic geometry— 
its theoretical base, its trove of classical examples, and 
its modern computational tools— may all be brought to 
bear on the problem at hand. 

As a part of applied mathematics, algebraic geometry 
has two faces. One is an expanding list of recurring 
techniques and examples that are common to many 
applications, and the other consists of topics from 
the applied sciences that involve polynomials. Link- 
ing these two aspects are algorithms and software for 
algebraic geometry. 

1 Algebraic Geometry for Applications 

We present here some concepts and objects that are 
common in applications of algebraic geometry. 

1.1 Varieties and Their Ideals 

The fundamental object in algebraic geometry is a variety
(or an affine variety), which is a set in the vector
space $\mathbb{C}^n$ (perhaps restricted to $\mathbb{R}^n$ for an application)
defined by polynomials,

$V(S) := \{x \in \mathbb{C}^n \mid f(x) = 0\ \forall f \in S\}$,

where $S \subset \mathbb{C}[x] = \mathbb{C}[x_1, \dots, x_n]$ is a set of polynomials.
Common geometric figures (points, lines,
planes, circles, conics, spheres, etc.) are all algebraic
varieties. Questions about everyday objects may therefore
be treated with algebraic geometry.

The real points of the line x + y - 1 = 0 are shown
on the left-hand side of the figure below:

[Figure: the real points of the line (left) and the Riemann sphere, with the point $\infty$ marked (right).]

Its complex points are the Argand plane $\mathbb{C}$ embedded
obliquely in $\mathbb{C}^2$.

We may compactify algebraic varieties by adding
points at infinity. This is done in projective space $\mathbb{P}^n$,
which is the set of lines through the origin in $\mathbb{C}^{n+1}$
(or $\mathbb{RP}^n$ for $\mathbb{R}^{n+1}$). This may be thought of as $\mathbb{C}^n$ with
a $\mathbb{P}^{n-1}$ at infinity, giving directions of lines in $\mathbb{C}^n$.
The projective line $\mathbb{P}^1$ is the Riemann sphere on the
right-hand side of the above figure.

Points of $\mathbb{P}^n$ are represented by (n + 1)-tuples of
homogeneous coordinates, where $[x_0, \dots, x_n] = [\lambda x_0,
\dots, \lambda x_n]$ if $\lambda \ne 0$ and at least one $x_i$ is nonzero.
Projective varieties are subsets of $\mathbb{P}^n$ defined by homogeneous
polynomials in $x_0, \dots, x_n$.

To a subset Z of a vector space we associate the set 
of polynomials that vanish on Z: 

$I(Z) := \{f \in \mathbb{C}[x] \mid f(z) = 0\ \forall z \in Z\}$.

Let $f, g, h \in \mathbb{C}[x]$, with f, g vanishing on Z. Both f + g
and $h \cdot f$ then vanish on Z, which implies that I(Z) is an
ideal of the polynomial ring $\mathbb{C}[x_1, \dots, x_n]$. Similarly, if
I is the ideal generated by a set S of polynomials, then
V(S) = V(I).

Both V and I reverse inclusions with $S \subset I(V(S))$ and
$Z \subset V(I(Z))$, with equality when Z is a variety. Thus we
have the correspondence

$\{\text{ideals}\} \underset{I}{\overset{V}{\rightleftarrows}} \{\text{varieties}\}$

linking algebra and geometry. By Hilbert's Nullstellensatz,
this correspondence is bijective when restricted to
radical ideals ($f^N \in I \Rightarrow f \in I$). This allows ideas and
techniques to flow in both directions and is the source
of the power and depth of algebraic geometry.

The fundamental theorem of algebra asserts that a 
nonconstant univariate polynomial has a complex root. 
The Nullstellensatz is a multivariate version, for it is 
equivalent to the statement that, if $I \subsetneq \mathbb{C}[x]$ is a proper
ideal, then $V(I) \ne \emptyset$.


It is essentially for this reason that algebraic geom- 
etry works best over the complex numbers. Many appli- 
cations require answers whose coordinates are real 
numbers, so results from algebraic geometry are often 
filtered through the lens of the real numbers when used 
in applications. While this restriction to $\mathbb{R}$ poses significant
challenges for algebraic geometers, the generalization
from $\mathbb{R}$ to $\mathbb{C}$ and then on to projective space often
makes the problems easier to solve. The solution to this 
useful algebraic relaxation is often helpful in treating 
the original application. 


1.2 Parametrization and Rationality 


Varieties also occur as images of polynomial maps. For
example, the map $t \mapsto (t^2 - 1, t^3 - t) = (x, y)$ has as its
image the plane cubic $y^2 = x^3 + x^2$:



Given such a parametric representation of a variety 
(or any other explicit description), the implicitization 
problem asks for its ideal. 
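
For the parametrized cubic above, the implicit equation can be spot-checked numerically. The sketch below evaluates the map $t \mapsto (t^2 - 1, t^3 - t)$ at rational parameter values with exact arithmetic and tests the equation $y^2 = x^3 + x^2$; this only verifies that the image lies on the cubic and does not compute the ideal. The function names are hypothetical.

```python
from fractions import Fraction


def cubic_point(t):
    """Image of the parameter t under t -> (t^2 - 1, t^3 - t)."""
    return t * t - 1, t ** 3 - t


def on_cubic(x, y):
    """Does (x, y) satisfy the implicit equation y^2 = x^3 + x^2?"""
    return y * y == x ** 3 + x * x


# Exact rational sample parameters, so equality tests are reliable.
samples = [Fraction(p, q) for p in range(-5, 6) for q in range(1, 4)]
all_on_curve = all(on_cubic(*cubic_point(t)) for t in samples)

# The parameters t = 1 and t = -1 both map to the origin, where the
# curve crosses itself.
node_hits = (cubic_point(Fraction(1)), cubic_point(Fraction(-1)))
```

The identity behind the check is $y = t\,x$ and $t^2 = x + 1$, so $y^2 = t^2 x^2 = (x + 1)x^2$.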

The converse problem is more subtle: can a given
variety be parametrized? Euclid and Diophantus discovered
the rational parametrization of the unit circle
$x^2 + y^2 = 1$, $t \mapsto (x, y)$, where

$x = \frac{2t}{1 + t^2}$ and $y = \frac{1 - t^2}{1 + t^2}$. (1)

This is the source of both Pythagorean triples and the
rationalizing substitution $z = \tan(\tfrac{1}{2}\theta)$ of integral calculus.
Homogenizing by setting t = a/b, (1) gives
an isomorphism between $\mathbb{P}^1$ (with coordinates [a, b])
and the unit circle. Translating and scaling gives an
isomorphism between $\mathbb{P}^1$ and any circle.
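
The link to Pythagorean triples can be made explicit: substituting t = a/b into the rational parametrization of the circle and clearing denominators gives the integers $2ab$, $b^2 - a^2$, $a^2 + b^2$. The Python sketch below checks this with exact rational arithmetic; the function names are hypothetical.

```python
from fractions import Fraction


def circle_point(t):
    """Rational point of the unit circle with parameter t."""
    return 2 * t / (1 + t * t), (1 - t * t) / (1 + t * t)


def triple(a, b):
    """Integers (2ab, b^2 - a^2, a^2 + b^2) obtained from t = a/b."""
    return 2 * a * b, b * b - a * a, a * a + b * b


x, y = circle_point(Fraction(1, 2))     # the parameter t = 1/2
p, q, r = triple(1, 2)                  # the same point with denominators cleared
```

Here `(x, y)` is the point (4/5, 3/5), and `(p, q, r)` is the familiar triple (4, 3, 5).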

On the other hand, the cubic $y^2 = x^3 - x$ (on the
left-hand side below) has no rational parametrization:

[Figure: the real points of the cubic (left) and a torus (right).]

This is because the corresponding cubic in $\mathbb{P}^2$ is a curve
of genus one (an elliptic curve), which is a torus (see
the right-hand side above), and there is no nonconstant
map from the Riemann sphere $\mathbb{P}^1$ to the torus. However,
$(x, y) \mapsto x$ sends the cubic curve to $\mathbb{P}^1$ and is a two-to-one
map except at the branch points $\{-1, 0, 1, \infty\}$. In
fact, any curve with a two-to-one map to $\mathbb{P}^1$ having four
branch points has genus one.

A smooth biquadratic curve also has genus one. The
product $\mathbb{P}^1 \times \mathbb{P}^1$ is a compactification of $\mathbb{C}^2$ that is different
from $\mathbb{P}^2$. Suppose that $C \subset \mathbb{P}^1 \times \mathbb{P}^1$ is defined
by an equation that is separately quadratic in the two
variables s and t,

$a_{00} + a_{10} s + a_{01} t + \cdots + a_{22} s^2 t^2 = 0$,

where s and t are coordinates for the $\mathbb{P}^1$ factors. Analyzing
the projection onto one $\mathbb{P}^1$ factor, one can show
that the map is two-to-one, except at four branch
points, and so C has genus one.

1.3 Toric Varieties 

Varieties parametrized by monomials (toric varieties) 
often arise in applications, and they may be completely 
understood in terms of the geometry and combinator- 
ics of the monomials. 

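Let $\mathbb{C}^*$ be the nonzero complex numbers. An integer
vector $\alpha = (a_1, \dots, a_d) \in \mathbb{Z}^d$ is the exponent vector
of a Laurent monomial $t^\alpha := t_1^{a_1} \cdots t_d^{a_d}$, where
$t = (t_1, \dots, t_d) \in (\mathbb{C}^*)^d$ is a d-tuple of nonzero complex
numbers. Let $\mathcal{A} = \{\alpha_0, \dots, \alpha_n\} \subset \mathbb{Z}^d$ be a finite
set of integer vectors. The toric variety $X_{\mathcal{A}}$ is then the
closure of the image of the map

$\varphi_{\mathcal{A}} : (\mathbb{C}^*)^d \ni t \mapsto [t^{\alpha_0}, t^{\alpha_1}, \dots, t^{\alpha_n}] \in \mathbb{P}^n$.

The toric variety $X_{\mathcal{A}}$ has dimension equal to the dimension
of the affine span of $\mathcal{A}$, and it has an action of
$(\mathbb{C}^*)^d$ (via the map $\varphi_{\mathcal{A}}$) with a dense orbit (the image
of $\varphi_{\mathcal{A}}$).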

The implicitization problem for toric varieties is elegantly
solved. Assume that $\mathcal{A}$ lies on an affine hyperplane,
so that there is a vector $w \in \mathbb{R}^d$ with $w \cdot \alpha_i =
w \cdot \alpha_j$ ($\ne 0$) for all i, j, where "$\cdot$" is the dot product. For
$v \in \mathbb{R}^{n+1}$, write $\mathcal{A}v$ for $\sum_i \alpha_i v_i$.

Theorem 1. The homogeneous ideal of $X_{\mathcal{A}}$ is spanned
by binomials $x^u - x^v$, where $\mathcal{A}u = \mathcal{A}v$.

The assumption that we have w with $w \cdot \alpha_i = w \cdot \alpha_j$
for all i, j is mild. Given any $\mathcal{A}$, if we append a new
(d + 1)th coordinate of 1 to each $\alpha_i$ and set w =
$(0, \dots, 0, 1) \in \mathbb{R}^{d+1}$, then the assumption is satisfied
and we obtain the same projective variety $X_{\mathcal{A}}$.
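
Theorem 1 can be spot-checked on a small example. In the Python sketch below the exponent vectors are (i, 1) for i = 0, 1, 2, 3, that is, the integers 0, ..., 3 with a coordinate of 1 appended as just described; the monomial map then has coordinates $t_1^i t_2$. For two exponent pairs u, v with equal weighted sums of the exponent vectors, the binomial $x^u - x^v$ vanishes at a sample point of the image. The names `A`, `phi`, `A_times`, and `monomial` are hypothetical, and a check at one point is of course no proof.

```python
from fractions import Fraction
from math import prod  # available from Python 3.8

# Exponent vectors alpha_0, ..., alpha_3: the integers 0..3 with a 1 appended.
A = [(0, 1), (1, 1), (2, 1), (3, 1)]


def phi(t):
    """Monomial map t -> (t^alpha for alpha in A)."""
    return [prod(ti ** ai for ti, ai in zip(t, alpha)) for alpha in A]


def A_times(v):
    """The vector sum_i alpha_i * v_i."""
    return tuple(sum(alpha[k] * vi for alpha, vi in zip(A, v))
                 for k in range(len(A[0])))


def monomial(x, u):
    """The monomial x^u = x_0^{u_0} * ... * x_n^{u_n}."""
    return prod(xi ** ui for xi, ui in zip(x, u))


x = phi((Fraction(3, 2), Fraction(-5, 7)))   # a sample point of the image

# Two binomials predicted by the theorem: x0*x2 - x1^2 and x0*x3 - x1*x2.
pairs = [((1, 0, 1, 0), (0, 2, 0, 0)), ((1, 0, 0, 1), (0, 1, 1, 0))]
vanishes = [monomial(x, u) == monomial(x, v) for u, v in pairs]
```

These two binomials are among the familiar quadratic equations of the twisted cubic curve in $\mathbb{P}^3$.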


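Applications also use the tight relation between $X_{\mathcal{A}}$
and the convex hull $\Delta_{\mathcal{A}}$ of $\mathcal{A}$, which is a polytope
with integer vertices. The points of $X_{\mathcal{A}}$ with nonnegative
coordinates form its nonnegative part $X^+_{\mathcal{A}}$. This is
identified with $\Delta_{\mathcal{A}}$ through the algebraic moment map,
$\pi_{\mathcal{A}} : \mathbb{P}^n \dashrightarrow \mathbb{P}^d$, which sends a point x to $\mathcal{A}x$. (The broken
arrow means that the map is not defined everywhere.)
By Birch's theorem from statistics, $\pi_{\mathcal{A}}$ maps
$X^+_{\mathcal{A}}$ homeomorphically to $\Delta_{\mathcal{A}}$.

There is a second homeomorphism $\beta_{\mathcal{A}} : \Delta_{\mathcal{A}} \to X^+_{\mathcal{A}}$
given by polynomials. The polytope $\Delta_{\mathcal{A}}$ is defined by
linear inequalities,

$\Delta_{\mathcal{A}} := \{x \in \mathbb{R}^d \mid \ell_F(x) \ge 0\}$,

where F ranges over the codimension-one faces of $\Delta_{\mathcal{A}}$
and $\ell_F(F) = 0$, with the coefficients of $\ell_F$ coprime
integers. For each $\alpha \in \mathcal{A}$, set

$\beta_\alpha(x) := \prod_F \ell_F(x)^{\ell_F(\alpha)}$, (2)

which is nonnegative on $\Delta_{\mathcal{A}}$. For $x \in \Delta_{\mathcal{A}}$, set

$\beta_{\mathcal{A}}(x) := [\beta_{\alpha_0}(x), \dots, \beta_{\alpha_n}(x)] \in X^+_{\mathcal{A}}$.

While $\pi_{\mathcal{A}}$ and $\beta_{\mathcal{A}}$ are homeomorphisms between the
same spaces, they are typically not inverses.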

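A useful variant is to translate $X_\mathcal{A}$ by a nonzero
weight, $\omega = (\omega_0, \dots, \omega_n) \in (\mathbb{C}^*)^{n+1}$,

$$X_{\mathcal{A},\omega} := \{[\omega_0 x_0, \dots, \omega_n x_n] \mid x \in X_\mathcal{A}\}.$$

The ideal of this translated toric variety is spanned by binomials
$\omega^v x^u - \omega^u x^v$ with $\mathcal{A}u = \mathcal{A}v$ as in Theorem 1, and it
is parametrized by monomials via

$$\varphi_{\mathcal{A},\omega}(t) = (\omega_0 t^{\alpha_0}, \dots, \omega_n t^{\alpha_n}).$$

When the weights $\omega_i$ are positive real numbers, Birch's
theorem holds, $\pi_\mathcal{A} : X^+_{\mathcal{A},\omega} \to \Delta_\mathcal{A}$ is a homeomorphism, and we have the
parametrization $\beta_{\mathcal{A},\omega} : \Delta_\mathcal{A} \to X^+_{\mathcal{A},\omega}$, where the
components of $\beta_{\mathcal{A},\omega}$ are $\omega_i \beta_{\alpha_i}$.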

Example 2. When $\mathcal{A}$ consists of the standard unit
vectors (1, 0, ..., 0), ..., (0, ..., 0, 1) in $\mathbb{R}^{n+1}$, the toric
variety is the projective space $\mathbb{P}^n$, and $\varphi_\mathcal{A}$ gives the
usual homogeneous coordinates $[x_0, \dots, x_n]$ for $\mathbb{P}^n$.
The nonnegative part of $\mathbb{P}^n$ is the convex hull of $\mathcal{A}$,
which is the standard n-simplex, $\Delta^n$, and $\pi_\mathcal{A} = \beta_\mathcal{A}$ is
the identity map.

Example 3. Let $\mathcal{A} = \{0, 1, \dots, n\}$ so that $\Delta_\mathcal{A} = [0, n]$,
and choose weights $\omega_i = \binom{n}{i}$. Then $X_{\mathcal{A},\omega}$ is the closure
of the image of the map

$$t \mapsto \left[1, nt, \tbinom{n}{2}t^2, \dots, nt^{n-1}, t^n\right] \in \mathbb{P}^n,$$

which is the (translated) moment curve. Its nonnegative
part $X^+_{\mathcal{A},\omega}$ is the image of [0, n] under the map $\beta_{\mathcal{A},\omega}$
whose components are

$$\beta_i(x) = \frac{1}{n^n}\binom{n}{i}x^i(n - x)^{n-i}.$$

Replacing x by ny gives the Bernstein polynomial

$$\beta_{i,n}(y) = \binom{n}{i}y^i(1 - y)^{n-i}, \qquad (3)$$

and thus the moment curve is parametrized by the 
Bernstein polynomials. Because of this, we call the 
functions (2) generalized Bernstein polynomials. 

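The composition $\pi_\mathcal{A} \circ \beta_{\mathcal{A},\omega}(x)$ is

$$\frac{1}{n^n}\sum_{i=0}^{n} i\binom{n}{i}x^i(n - x)^{n-i} = \frac{nx}{n^n}\sum_{i=1}^{n}\binom{n-1}{i-1}x^{i-1}(n - x)^{n-i} = x,$$

as the last sum is $(x + (n - x))^{n-1}$. Similarly,
$(1/n)\pi_\mathcal{A} \circ \beta(y) = y$, where $\beta$ is the parametrization by the
Bernstein polynomials. The weights $\omega_i = \binom{n}{i}$ are essentially
the unique weights for which $\pi_\mathcal{A} \circ \beta_{\mathcal{A},\omega}(x) = x$.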
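The two identities used in Example 3 can be checked directly. The sketch below evaluates the Bernstein polynomials (3) in exact rational arithmetic and confirms that they sum to 1 and that the normalized moment map recovers the parameter y. The function names and the sample values n = 5, y = 1/3 are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def bernstein(i, n, y):
    """Bernstein polynomial beta_{i,n}(y) = C(n, i) y^i (1 - y)^(n - i)."""
    return comb(n, i) * y**i * (1 - y)**(n - i)

def moment_curve_point(n, y):
    """Point of the translated moment curve with parameter y in [0, 1]."""
    return [bernstein(i, n, y) for i in range(n + 1)]

def normalized_moment_map(point):
    """(1/n) * pi_A(point) for A = {0, 1, ..., n}: (1/n) * sum_i i * p_i."""
    n = len(point) - 1
    return sum(i * p for i, p in enumerate(point)) / n

y = Fraction(1, 3)                  # exact rational arithmetic
point = moment_curve_point(5, y)
total = sum(point)                  # partition of unity
recovered = normalized_moment_map(point)
```

Using `Fraction` avoids floating-point round-off, so the identities hold as exact equalities rather than approximately.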

Example 4. For positive integers m, n consider the
map $\varphi : \mathbb{C}^m \times \mathbb{C}^n \to \mathbb{P}(\mathbb{C}^{m \times n})$ defined by

$$(x, y) \mapsto [x_i y_j \mid i = 1, \dots, m,\ j = 1, \dots, n].$$

Its image is the Segre variety, which is a toric variety,
as the map $\varphi$ is $\varphi_\mathcal{A}$, where $\mathcal{A}$ is

$$\{e_i + f_j \mid i = 1, \dots, m,\ j = 1, \dots, n\} \subset \mathbb{Z}^m \oplus \mathbb{Z}^n.$$

Here, $\{e_i\}$ and $\{f_j\}$ are the standard bases for $\mathbb{Z}^m$ and
$\mathbb{Z}^n$, respectively.

If $z_{ij}$ are the coordinates of $\mathbb{C}^{m \times n}$, then the Segre
variety is defined by the binomial equations

$$z_{ij}z_{kl} - z_{il}z_{kj} = \begin{vmatrix} z_{ij} & z_{il} \\ z_{kj} & z_{kl} \end{vmatrix} = 0.$$

Identifying $\mathbb{C}^{m \times n}$ with m x n matrices shows that the
Segre variety is the set of rank-one matrices.

Other common toric varieties include the Veronese
variety, where $\mathcal{A}$ is $\mathcal{A}_{n,d} := n\Delta^d \cap \mathbb{Z}^{d+1}$, and the
Segre-Veronese variety, where $\mathcal{A}$ is $\mathcal{A}_{m,d} \times \mathcal{A}_{n,e}$. When
d = e = 1, $\mathcal{A}$ consists of the integer vectors in the m x n
rectangle

$$\mathcal{A} = \{(i, j) \mid 0 \leq i \leq m,\ 0 \leq j \leq n\}.$$


2 Algorithms for Algebraic Geometry 

Mediating between theory and examples and facilitat- 
ing applications are algorithms developed to study, 
manipulate, and compute algebraic varieties. These 
come in two types: exact symbolic methods and approx- 
imate numerical methods. 

2.1 Symbolic Algorithms 

The words algebra and algorithm share an Arabic root, 
but they are connected by more than just their his- 
tory. When we write a polynomial— as a sum of mono- 
mials, say, or as an expression such as a determi- 
nant of polynomials— that symbolic representation is 
an algorithm for evaluating the polynomial. 

Expressions for polynomials lend themselves to algorithmic
manipulation. While these representations and
manipulations have their origin in antiquity, and methods
such as Gröbner bases predate the computer age,
the rise of computers has elevated symbolic computation
to a key tool for algebraic geometry and its
applications.

Euclid's algorithm, Gaussian elimination, and Sylvester's
resultants are important symbolic algorithms that
are supplemented by universal symbolic algorithms
based on Gröbner bases. They begin with a term order
$\prec$, which is a well-ordering of all monomials that is
consistent with multiplication. For example, $\prec$ could
be the lexicographic order in which $x^u \prec x^v$ if the
first nonzero entry of the vector v - u is positive. A
term order organizes the algorithmic representation
and manipulation of polynomials, and it is the basis
for the termination of algorithms.

The initial term $\mathrm{in}_\prec f$ of a polynomial f is its term
$c_\alpha x^\alpha$ with the $\prec$-largest monomial in f. The initial
ideal $\mathrm{in}_\prec I$ of an ideal I is the ideal generated by initial
terms of polynomials in I. This monomial ideal is a
well-understood combinatorial object, and the passage
to an initial ideal preserves much information about I
and its variety.
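To make the lexicographic term order and the initial term concrete, here is a minimal sketch. It stores a polynomial as a dictionary from exponent vectors to coefficients; this representation, the function names, and the sample polynomial are assumptions made only for illustration.

```python
def lex_less(u, v):
    """True if x^u precedes x^v: the first nonzero entry of v - u is positive."""
    for ui, vi in zip(u, v):
        if vi != ui:
            return vi - ui > 0
    return False  # equal exponent vectors are not strictly ordered

def initial_term(poly):
    """Return (exponent, coefficient) of the lex-largest monomial of poly.

    poly maps exponent tuples to nonzero coefficients.
    """
    lead = None
    for exponent in poly:
        if lead is None or lex_less(lead, exponent):
            lead = exponent
    return lead, poly[lead]

# f = 3 x^2 y - 5 x y^3 + 7 y^4 in the variables (x, y)
f = {(2, 1): 3, (1, 3): -5, (0, 4): 7}
lead_exponent, lead_coefficient = initial_term(f)
```

Here the initial term is $3x^2y$: although $xy^3$ and $y^4$ have higher total degree, the lexicographic order compares the exponent of x first.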

A Gröbner basis for I is a finite set $G \subset I$ of polynomials
whose initial terms generate $\mathrm{in}_\prec I$. This set G
generates I and facilitates the transfer of information
from $\mathrm{in}_\prec I$ back to I. This information may typically
be extracted using linear algebra, so a Gröbner basis
essentially contains all the information about I and its
variety.

Consequently, a bottleneck in this approach to symbolic
computation is the computation of a Gröbner
basis (which has high complexity due in part to its
information content). Gröbner basis calculation also
appears to be essentially serial; no efficient parallel
algorithm is known.

The subject began in 1965 when Buchberger gave
an algorithm to compute a Gröbner basis. Decades of
development, including sophisticated heuristics and
completely new algorithms, have led to reasonably
efficient implementations of Gröbner basis computation.
Many algorithms have been devised and implemented
to use a Gröbner basis to study a variety. All of this
is embedded in freely available software packages that
are revolutionizing the practice of algebraic geometry
and its applications.

2.2 Numerical Algebraic Geometry 

While symbolic algorithms lie on the algebraic side 
of algebraic geometry, numerical algorithms, which 
compute and manipulate points on varieties, have a 
strongly geometric flavor. 

These numerical algorithms rest upon Newton's
method for refining an approximate solution to a system
of polynomial equations. A system $F = (f_1, \dots, f_n)$
of polynomials in n variables is a map $F : \mathbb{C}^n \to
\mathbb{C}^n$ with solutions $F^{-1}(0)$. We focus on systems with
finitely many solutions. A Newton iteration is the map
$N_F : \mathbb{C}^n \to \mathbb{C}^n$, where

$$N_F(x) = x - DF_x^{-1}(F(x)),$$

with $DF_x$ the Jacobian matrix of partial derivatives of
F at x. If $\xi \in F^{-1}(0)$ is a solution to F with $DF_\xi$
invertible, then when x is sufficiently close to $\xi$, $N_F(x)$ is
closer still, in that it has twice as many digits in
common with $\xi$ as does x. Smale showed that "sufficiently
close" may be decided algorithmically, which can allow
the certification of output from numerical algorithms.
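A minimal sketch of the Newton iteration for a system of two polynomials in two variables follows. The system (a circle of radius 2 and the hyperbola xy = 1), the starting point, and the use of Cramer's rule for the 2 x 2 linear solve are illustrative assumptions.

```python
from math import hypot

def F(p):
    """The polynomial system F(x, y) = (x^2 + y^2 - 4, x*y - 1)."""
    x, y = p
    return (x * x + y * y - 4.0, x * y - 1.0)

def jacobian(p):
    """Jacobian matrix DF of partial derivatives of F at p."""
    x, y = p
    return ((2.0 * x, 2.0 * y), (y, x))

def newton_step(p):
    """One Newton iteration N_F(p) = p - DF_p^{-1} F(p)."""
    (a, b), (c, d) = jacobian(p)
    f1, f2 = F(p)
    det = a * d - b * c
    # Solve DF * delta = F(p) by Cramer's rule.
    dx = (d * f1 - b * f2) / det
    dy = (a * f2 - c * f1) / det
    return (p[0] - dx, p[1] - dy)

def residual(p):
    """Euclidean norm of F(p)."""
    return hypot(*F(p))

point = (2.0, 0.5)              # approximate solution to be refined
history = [residual(point)]
for _ in range(6):
    point = newton_step(point)
    history.append(residual(point))
```

The list `history` records the residual after each step; near a nonsingular solution it shrinks roughly quadratically, which is the "twice as many digits" behavior described above.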

Newton iterations are used in numerical continuation.
For a polynomial system $H_t$ depending on a
parameter t, the solutions $H_t^{-1}(0)$ for $t \in [0, 1]$ form a
collection of arcs. Given a point $(x_t, t)$ of some arc and
a step $\delta t$, a predictor is called to give a point $(x', t + \delta t)$
that is near to the same arc. Newton iterations are then
used to refine this to a point $(x_{t+\delta t}, t + \delta t)$ on the arc.
This numerical continuation algorithm can be used to
trace arcs from t = 0 to t = 1.

We may use continuation to find all solutions to a system
F consisting of polynomials $f_i$ of degree d. Define
a new system $H_t = (h_1, \dots, h_n)$ by

$$h_i := t f_i + (1 - t)(x_i^d - 1).$$

At t = 0, this is $x_i^d - 1$, whose solutions are the dth
roots of unity. When F is general, $H_t^{-1}(0)$ consists of
$d^n$ arcs connecting these known solutions at t = 0 to
the solutions of $F^{-1}(0)$ at t = 1. These may be found
by continuation.
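The following sketch traces the arcs of this homotopy for a single polynomial in one variable. Two simplifications are assumptions not found in the text: the start system is multiplied by a fixed complex constant `gamma` (a common device for keeping the arcs away from singular points), and the predictor simply reuses the previous point before the Newton corrections. The function names and the sample polynomial are likewise illustrative.

```python
import cmath

def poly_eval(coeffs, x):
    """Horner evaluation; coeffs run from the highest degree to the constant."""
    value = 0j
    for c in coeffs:
        value = value * x + c
    return value

def poly_derivative(coeffs):
    """Coefficients of the derivative, in the same ordering."""
    n = len(coeffs) - 1
    return [c * (n - k) for k, c in enumerate(coeffs[:-1])]

def track_roots(coeffs, steps=400, corrections=3, gamma=0.6 + 0.8j):
    """Follow the arcs of h = t*f + (1 - t)*gamma*(x^d - 1) from t = 0 to t = 1."""
    d = len(coeffs) - 1
    dcoeffs = poly_derivative(coeffs)
    # Known solutions at t = 0: the dth roots of unity.
    roots = [cmath.exp(2j * cmath.pi * k / d) for k in range(d)]
    for step in range(1, steps + 1):
        t = step / steps
        for k, x in enumerate(roots):
            # Predictor: keep the previous point. Corrector: Newton iterations.
            for _ in range(corrections):
                h = t * poly_eval(coeffs, x) + (1 - t) * gamma * (x**d - 1)
                dh = t * poly_eval(dcoeffs, x) + (1 - t) * gamma * d * x**(d - 1)
                x = x - h / dh
            roots[k] = x
    return roots

# f(x) = x^2 - 5x + 6, whose roots are 2 and 3.
roots = track_roots([1, -5, 6])
```

Each arc is tracked independently of the others, so the inner loop over `roots` could be distributed across processors without change.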

While this Bézout homotopy illustrates the basic idea,
it has exponential complexity and may not be efficient.
In practice, other more elegant and efficient homotopy
algorithms are used for numerically solving systems of
polynomials.

These numerical methods underlie numerical alge- 
braic geometry, which uses them to manipulate and 
study algebraic varieties on a computer. The subject 
began when Sommese, Verschelde, and Wampler intro- 
duced its fundamental data structure of a witness 
set, as well as algorithms to generate and manipulate 
witness sets. 

Suppose we have a variety $V \subset \mathbb{C}^n$ of dimension
n - d that is a component of the zero set $F^{-1}(0)$ of
d polynomials $F = (f_1, \dots, f_d)$. A witness set for V consists
of a general affine subspace $L \subset \mathbb{C}^n$ of dimension
d (given by n - d affine equations) and (approximations
to) the points of $V \cap L$. The points of $V \cap L$ may
be numerically continued as L moves to sample points
from V.

An advantage of numerical algebraic geometry is that
path tracking is inherently parallelizable, as each of the
arcs in $H_t^{-1}(0)$ may be tracked independently. This
parallelism is one reason why numerical algebraic geometry
does not face the complexity affecting symbolic
methods. Another reason is that by computing approximate
solutions to equations, complete information
about a variety is never computed.

3 Algebraic Geometry in Applications 

We illustrate some of the many ways in which algebraic 
geometry arises in applications. 

3.1 Kinematics 

Kinematics is concerned with motions of linkages (rigid 
bodies connected by movable joints). While its origins 
were in the simple machines of antiquity, its importance
grew with the age of steam and today it is fundamental
to robotics [VI.14]. As the positions of a
linkage are solutions to a system of polynomial equa- 
tions, kinematics has long been an area of application 
of algebraic geometry. 

An early challenge important to the development of 
the steam engine was to find a linkage with a motion 
along a straight line. Watt discovered a linkage in 1784
that approximated straight line motion (tracing a curve
near a flex), and in 1864 Peaucellier gave the first linkage
with a true straight line motion (based on circle
inversion):



(When the bar B is rotated about its anchor point, the 
point P traces a straight line.) 

Cayley, Chebyshev, Darboux, Roberts, and others 
made contributions to kinematics in the nineteenth 
century. The French Academy of Sciences recognized 
the importance of kinematics, posing the problem of 
determining the nontrivial mechanisms with a motion 
constrained to lie on a sphere for its 1904 Prix Vaillant,
which was awarded to Borel and Bricard for their partial 
solutions. 

The four-bar linkage consists of four bars in the plane 
connected by rotational joints, with one bar fixed. A 
triangle is erected on the coupler bar opposite the fixed 
bar, and we wish to describe the coupler curve traced
by the apex of the triangle: 



To understand the motion of this linkage, note that if 
we remove the coupler bar C, the bars B and B' swing 
freely, tracing two circles, each of which we parame- 
terize by P 1 as in (1). The coupler bar constrains the 
endpoints of bars B and B' to lie a fixed distance apart. 
In the parameters s, t of the circles and if b, b', c
are the lengths of the corresponding bars, the coupler
constraint gives the equation

$$c^2 = \left(b\,\frac{1 - s^2}{1 + s^2} - b'\,\frac{1 - t^2}{1 + t^2}\right)^2 + \left(b\,\frac{2s}{1 + s^2} - b'\,\frac{2t}{1 + t^2}\right)^2 = b^2 + b'^2 - 2bb'\,\frac{(1 - s^2)(1 - t^2) + 4st}{(1 + s^2)(1 + t^2)}.$$


Clearing denominators gives a biquadratic equation in 
the variety $\mathbb{P}^1 \times \mathbb{P}^1$ that parametrizes the rotations of
bars B and B'. The coupler curve is therefore a genus- 
one curve and is irrational. The real points of a genus- 
one curve have either one or two components, which 
corresponds to the linkage having one or two assembly 
modes; to reach all points of a coupler curve with two 
components requires disassembly of the mechanism. 

Roberts and Chebyshev discovered that there are 
three linkages (called Roberts cognates) with the same 
coupler curve, and they may be constructed from one 
another using straightedge and compass. The nine-point
path synthesis problem asks for the four-bar linkages
whose coupler curve contains nine given points.
Morgan, Sommese, and Wampler used numerical con- 
tinuation to solve the equations, finding 4326 distinct 
linkages in 1442 triplets of Roberts cognates. Here is 
one linkage that solves this problem for the indicated 
nine points: 



Such applications in kinematics drove the early devel- 
opment of numerical algebraic geometry. 


3.2 Geometric Modeling 

Geometric modeling uses curves and surfaces to repre- 
sent objects on a computer for use in industrial design, 
manufacture, architecture, and entertainment. These 
applications of computer-aided geometric design and 
computer graphics are profoundly important to the 
world economy. 

Geometric modeling began around 1960 in the work
of de Casteljau at Citroën, who introduced what are
now called Bézier curves (they were popularized by
Bézier at Renault) for use in automobile manufacturing.

Bézier curves (along with their higher-dimensional
analogues rectangular tensor-product and triangular
Bézier patches) are parametric curves (and surfaces)
that have become widely used for many reasons, including
ease of computation and the intuitive method to
control shape by manipulating control points. They
begin with Bernstein polynomials (3), which are
nonnegative on [0, 1]. Expanding $1^n = (t + (1 - t))^n$ shows
that

$$1 = \sum_{i=0}^{n} \beta_{i,n}(t).$$

Given control points $b_0, \dots, b_n$ in $\mathbb{R}^2$ (or $\mathbb{R}^3$), we have
the Bézier curve

$$[0, 1] \ni t \mapsto \sum_{i=0}^{n} b_i\,\beta_{i,n}(t). \qquad (4)$$

Here are two cubic (n = 3) Bézier curves in $\mathbb{R}^2$:




By (4), a Bézier curve is the image of the nonnegative
part of the translated moment curve of example 3
under the map defined on projective space by

$$[x_0, \dots, x_n] \mapsto \sum_{i=0}^{n} x_i b_i.$$

On the standard simplex $\Delta^n$, this is the canonical map
to the convex hull of the control points.
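Formula (4) translates directly into code. The sketch below evaluates a cubic Bézier curve as a Bernstein-weighted combination of its control points; the control points and function names are hypothetical, chosen only for illustration.

```python
from math import comb

def bernstein(i, n, t):
    """Bernstein polynomial beta_{i,n}(t) = C(n, i) t^i (1 - t)^(n - i)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_point(control_points, t):
    """Point on the Bezier curve (4): sum_i b_i * beta_{i,n}(t)."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    return tuple(
        sum(bernstein(i, n, t) * p[k] for i, p in enumerate(control_points))
        for k in range(dim)
    )

# Hypothetical control points of a cubic (n = 3) curve in the plane.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 2.0), (4.0, 0.0)]
curve = [bezier_point(control, j / 10) for j in range(11)]
```

Because the weights are nonnegative and sum to 1, every computed point is a convex combination of the control points, so the curve stays inside their convex hull and starts and ends at the first and last control points.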

The tensor product patch of bidegree (m, n) has
basis functions

$$\beta_{i,m}(s)\,\beta_{j,n}(t)$$

for i = 0, ..., m and j = 0, ..., n. These are functions
on the unit square. Control points

$$\{b_{ij} \mid i = 0, \dots, m,\ j = 0, \dots, n\} \subset \mathbb{R}^3$$

determine the map

$$(s, t) \mapsto \sum_{i=0}^{m}\sum_{j=0}^{n} b_{ij}\,\beta_{i,m}(s)\,\beta_{j,n}(t),$$

whose image is a rectangular patch.

Bézier triangular patches of degree d have basis
functions

$$\frac{d!}{i!\,j!\,(d - i - j)!}\,s^i t^j (1 - s - t)^{d-i-j}$$

for $0 \leq i, j$ with $i + j \leq d$. Again, control points give
a map from the triangle with image a Bézier triangular
patch. Here are two surface patches:



These patches correspond to toric varieties, with tensor
product patches coming from Segre-Veronese surfaces
and Bézier triangles from Veronese surfaces. The
basis functions are the generalized Bernstein polynomials
$\omega_i \beta_{\alpha_i}$ of section 1.3, and this explains their
shape as they are images of $\Delta_\mathcal{A}$, which is a rectangle
for the Segre-Veronese surfaces and a triangle for the
Veronese surfaces.

An important question is to determine the intersec- 
tion of two patches given parametrically as F(x) and 
G(x) for x in some domain (a triangle or rectangle). 
This is used for trimming the patches or drawing the 
intersection curve. A common approach is to solve the 
implicitization problem for G, giving a polynomial g 
which vanishes on the patch G. Then g(F(x)) defines 
the intersection in the domain of F. This application 
has led to theoretical and practical advances in algebra 
concerning resultants and syzygies. 

3.3 Algebraic Statistics 

Algebraic statistics applies tools from algebraic geom- 
etry to questions of statistical inference. This is possi- 
ble because many statistical models are (part of) alge- 
braic varieties, or they have significant algebraic or 
geometric structures. 

Suppose that X is a discrete random variable with
n + 1 possible states, 0, ..., n (e.g., the number of tails
observed in n coin flips). If $p_i$ is the probability that X
takes value i,

$$p_i := P(X = i),$$

then $p_0, \dots, p_n$ are nonnegative and sum to 1. Thus p
lies in the standard n-simplex, $\Delta^n$. Here are two views
of it when n = 2:



A statistical model M is a subset of $\Delta^n$. If the point
$(p_0, \dots, p_n) \in M$, then we may think of X as being
explained by M.

Example 5. Let X be a discrete random variable whose
states are the number of tails in n flips of a coin with a
probability t of landing on tails and 1 - t of heads. We
may calculate that

$$P(X = i) = \binom{n}{i}t^i(1 - t)^{n-i},$$

which is the Bernstein polynomial (3) evaluated at
the parameter t. We call X a binomial random variable
or binomial distribution. The set of binomial distributions
as t varies gives the translated moment curve of
example 3 parametrized by Bernstein polynomials. This
curve is the model for binomial distributions. Here is a
picture of this curve when n = 2:

[Figure: the curve of binomial distributions inside the simplex, n = 2.]



Example 6. Suppose that we have discrete random
variables X and Y with m and n states, respectively.
Their joint distribution has mn states (cells in a table or
a matrix) and lies in the simplex $\Delta^{mn-1}$. The model of
independence consists of all distributions $p \in \Delta^{mn-1}$
such that

$$P(X = i, Y = j) = P(X = i)\,P(Y = j). \qquad (5)$$

It is parametrized by $\Delta^{m-1} \times \Delta^{n-1}$ (probability
simplices for X and Y), and (5) shows that it is the nonnegative
part of the Segre variety of example 4. The model
of independence therefore consists of those joint probability
distributions that are rank-one matrices.
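A quick check of this statement: the joint table of two independent variables is the outer product of their marginals, so all of its 2 x 2 minors (the binomials of example 4) vanish. The marginal values and function names below are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def joint_from_marginals(row_probs, col_probs):
    """Joint table P(X = i, Y = j) = P(X = i) * P(Y = j) under independence."""
    return [[r * c for c in col_probs] for r in row_probs]

def two_by_two_minors(table):
    """All minors z_ij * z_kl - z_il * z_kj with i < k and j < l."""
    m, n = len(table), len(table[0])
    return [
        table[i][j] * table[k][l] - table[i][l] * table[k][j]
        for i in range(m) for k in range(i + 1, m)
        for j in range(n) for l in range(j + 1, n)
    ]

row = [Fraction(1, 4), Fraction(3, 4)]
col = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
independent = joint_from_marginals(row, col)

# A joint distribution in which X and Y always agree is far from independent.
dependent = [[Fraction(1, 2), Fraction(0)], [Fraction(0), Fraction(1, 2)]]
```

For `independent` every minor is zero, while `dependent` has the nonzero minor 1/4, so it lies off the Segre variety.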

Other common statistical models called discrete
exponential families or toric models are also the
nonnegative part $X^+_{\mathcal{A},\omega}$ of some toric variety. For these, the
algebraic moment map $\pi_\mathcal{A} : \Delta^n \to \Delta_\mathcal{A}$ (or $u \mapsto \mathcal{A}u$) is
a sufficient statistic. For the model of independence, $\mathcal{A}u$
is the vector of row and column sums of the table u.

Suppose that we have data from N independent
observations (or draws), each from the same distribution
p(t) from a model M, and we wish to estimate the
parameter t best explaining the data. One method is to
maximize the likelihood (the probability of observing
the data given a parameter t). Suppose that the data are
represented by a vector u of counts, where $u_i$ is how
often state i was observed in the N trials. The likelihood
function is

$$L(t \mid u) = \binom{N}{u}\prod_{i=0}^{n} p_i(t)^{u_i},$$

where $\binom{N}{u}$ is the multinomial coefficient.


Suppose that M is the binomial distribution of example 5.
It suffices to maximize the logarithm of $L(t \mid u)$,
which is

$$C + \sum_{i=0}^{n} u_i\bigl(i \log t + (n - i)\log(1 - t)\bigr),$$

where C is a constant. By calculus, we have

$$0 = \frac{1}{t}\sum_{i=0}^{n} i\,u_i - \frac{1}{1 - t}\sum_{i=0}^{n} (n - i)\,u_i.$$

Solving, we obtain that

$$t = \frac{1}{n}\sum_{i=0}^{n} i\,\frac{u_i}{N} \qquad (6)$$

maximizes the likelihood. If $\bar{u} := u/N \in \Delta^n$ is the
point corresponding to our data, then (6) is the normalized
algebraic moment map $(1/n)\pi_\mathcal{A}$ of example 3
applied to $\bar{u}$. For a general toric model $X^+_{\mathcal{A},\omega} \subset \Delta^n$,
likelihood is maximized at the parameter t satisfying
$\pi_\mathcal{A} \circ \beta_{\mathcal{A},\omega}(t) = \pi_\mathcal{A}(\bar{u})$. An algebraic formula exists
for the parameter that maximizes likelihood exactly
when $\pi_\mathcal{A}$ and $\beta_{\mathcal{A},\omega}$ are inverses.
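The closed form (6) can be compared with a brute-force search. The sketch below computes the estimate from a vector of counts and checks that no grid value of t gives a larger log-likelihood. The counts are hypothetical data and the function names are assumptions; the constant C is dropped because it does not affect the maximizer.

```python
from math import log

def log_likelihood(t, counts):
    """Log-likelihood of the binomial model, omitting the constant C."""
    n = len(counts) - 1
    return sum(u * (i * log(t) + (n - i) * log(1 - t))
               for i, u in enumerate(counts))

def maximum_likelihood_estimate(counts):
    """Formula (6): t = (1/n) * sum_i i * u_i / N."""
    n = len(counts) - 1
    N = sum(counts)
    return sum(i * u for i, u in enumerate(counts)) / (n * N)

# Hypothetical data: u_i = number of trials in which i tails were seen
# in n = 3 flips, over N = 20 trials.
counts = [2, 5, 9, 4]
t_hat = maximum_likelihood_estimate(counts)

grid = [g / 100 for g in range(1, 100)]
best_on_grid = max(grid, key=lambda t: log_likelihood(t, counts))
```

For these counts the estimate is 35/60, and the best grid value is the grid point nearest to it.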

Suppose that we have data u as a vector of counts
as before and a model $M \subset \Delta^n$ and we wish to test
the null hypothesis that the data u come from a distribution
in M. Fisher's exact test uses a score function
$\Delta^n \to \mathbb{R}_{\geq 0}$ that is zero exactly on M and computes
how likely it is for data v to have a higher score than
u, when v is generated from the same probability
distribution as u. This requires that we sample from the
probability distribution of such v.

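For a toric model $X^+_{\mathcal{A},\omega}$, this is a probability distribution
on the set of possible data with the same sufficient
statistics:

$$\mathcal{F}_u := \{v \mid \mathcal{A}u = \mathcal{A}v\}.$$

For a parameter t, this distribution is

$$L(v \mid v \in \mathcal{F}_u, t) = \frac{\binom{N}{v}\omega^v t^{\mathcal{A}v}}{\sum_{w \in \mathcal{F}_u}\binom{N}{w}\omega^w t^{\mathcal{A}w}} = \frac{\binom{N}{v}\omega^v}{\sum_{w \in \mathcal{F}_u}\binom{N}{w}\omega^w}, \qquad (7)$$

as $\mathcal{A}v = \mathcal{A}w$ for $v, w \in \mathcal{F}_u$.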

This sampling may be accomplished using a random
walk on the fiber $\mathcal{F}_u$ with stationary distribution (7).
This requires a connected graph on $\mathcal{F}_u$. Remarkably,
any Gröbner basis for the ideal of the toric variety $X_\mathcal{A}$
gives such a graph.
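For the model of independence, such a walk can be sketched as follows. The moves add +1 and -1 in the pattern of a 2 x 2 minor, matching the binomials of example 4, and a Metropolis acceptance step targets the distribution (7) with all weights equal to 1, so that a table v has probability proportional to $1/\prod v_{ij}!$. The use of the Metropolis rule, the starting table, the seed, and the function names are assumptions made for illustration.

```python
import random
from math import factorial

def fiber_weight(table):
    """Unnormalized probability of a table: 1 / (product of entry factorials)."""
    weight = 1.0
    for row in table:
        for value in row:
            weight /= factorial(value)
    return weight

def random_move(table, rng):
    """Add +1, -1, -1, +1 on a random 2 x 2 minor; the margins are unchanged."""
    i, k = rng.sample(range(len(table)), 2)
    j, l = rng.sample(range(len(table[0])), 2)
    new = [row[:] for row in table]
    new[i][j] += 1
    new[k][l] += 1
    new[i][l] -= 1
    new[k][j] -= 1
    return new

def margins(table):
    """Row sums and column sums: the sufficient statistic A u."""
    return [sum(row) for row in table], [sum(col) for col in zip(*table)]

def metropolis_walk(table, steps, seed=0):
    """Random walk on the fiber of `table`; returns the list of visited tables."""
    rng = random.Random(seed)
    current = [row[:] for row in table]
    visited = []
    for _ in range(steps):
        proposal = random_move(current, rng)
        if min(min(row) for row in proposal) >= 0:  # stay inside the fiber
            ratio = fiber_weight(proposal) / fiber_weight(current)
            if rng.random() < min(1.0, ratio):
                current = proposal
        visited.append([row[:] for row in current])
    return visited

start = [[3, 1, 2], [0, 4, 2]]      # hypothetical 2 x 3 table of counts
tables = metropolis_walk(start, steps=200, seed=1)
```

Every visited table has the same row and column sums as the starting table, so the walk never leaves the fiber.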

3.4 Tensor Rank

The fundamental invariant of an m x n matrix is its 
rank. The set of all matrices of rank at most r is defined 
by the vanishing of the determinants of all (r + 1) x 
(r+ 1) submatrices. From this perspective, the simplest 
matrices are those of rank one, and the rank of a matrix 
A is the minimal number of rank-one matrices that sum 
to A. 

An m x n matrix A is a linear map $V_2 = \mathbb{C}^n \to V_1 =
\mathbb{C}^m$. If it has rank one, it is the composition

$$V_2 \xrightarrow{\ v_2\ } \mathbb{C} \hookrightarrow V_1$$

of a linear function $v_2$ on $V_2$ ($v_2 \in V_2^*$) and an inclusion
given by $1 \mapsto v_1 \in V_1$. Thus, $A = v_1 \otimes v_2 \in V_1 \otimes V_2^*$, as
this tensor space is naturally the space of linear maps
$V_2 \to V_1$. A tensor of the form $v_1 \otimes v_2$ has rank one,
and the set of rank-one tensors forms the Segre variety
of example 4.

The singular value decomposition [II.32] writes a
matrix A as a sum of rank-one matrices of the form

$$A = \sum_{i=1}^{\mathrm{rank}(A)} \sigma_i\, v_{1,i} \otimes v_{2,i}, \qquad (8)$$

where $\{v_{1,i}\}$ and $\{v_{2,i}\}$ are orthonormal and $\sigma_1 \geq
\cdots \geq \sigma_{\mathrm{rank}(A)} > 0$ are the singular values of A. Often,
only the relatively few terms of (8) with largest singular
values are significant, and the rest are noise. Letting
$A_{\mathrm{lr}}$ be the sum of terms with large singular values and
$A_{\mathrm{noise}}$ be the sum of the rest, then A is the sum of the
low-rank matrix $A_{\mathrm{lr}}$ plus noise $A_{\mathrm{noise}}$.

A k-way tensor (k-way table) is an element of the
tensor space $V_1 \otimes \cdots \otimes V_k$, where each $V_i$ is a
finite-dimensional vector space. A rank-one tensor has the
form $v_1 \otimes \cdots \otimes v_k$, where $v_i \in V_i$. These form a
toric variety, and the rank of a tensor v is the minimal
number of rank-one tensors that sum to v.

The (closure of) the set of rank-r tensors is the rth
secant variety. When k = 2 (matrices), the set of determinants
of all (r + 1) x (r + 1) submatrices solves the
implicitization problem for the rth secant variety. For
k > 2 there is not yet a solution to the implicitization
problem for the rth secant variety.

Tensors are more complicated than matrices. Some
tensors of rank greater than r lie in the rth secant variety,
and these may be approximated by low-rank (rank-r)
tensors. Algorithms for tensor decomposition generalize
singular value decomposition. Their goal is often
an expression of the form $v = v_{\mathrm{lr}} + v_{\mathrm{noise}}$ for a tensor
v as the sum of a low-rank tensor $v_{\mathrm{lr}}$ plus noise $v_{\mathrm{noise}}$.


Some mixture models in algebraic statistics are
secant varieties. Consider an inhomogeneous population
in which the fraction $\theta_i$ obeys a probability
distribution $p^{(i)}$ from a model M. The distribution of
data collected from this population then behaves in the
same way as the convex combination

$$\theta_1 p^{(1)} + \theta_2 p^{(2)} + \cdots + \theta_r p^{(r)},$$

which is a point on the rth secant variety of M.

Theoretical and practical problems in complexity 
may be reduced to knowing the rank of specific tensors. 
Matrix multiplication gives a nice example of this. 

Let $A = (a_{ij})$ and $B = (b_{ij})$ be 2 x 2 matrices. In the
usual multiplication algorithm, C = AB is

$$c_{ij} = a_{i1}b_{1j} + a_{i2}b_{2j}, \qquad i, j = 1, 2. \qquad (9)$$

This involves eight multiplications. For n x n matrices,
the algorithm uses $n^3$ multiplications.

Strassen discovered an alternative that requires only
seven multiplications (see the formulas in algorithms
[I.4 §4]). If A and B are 2k x 2k matrices with k x k blocks
$a_{ij}$ and $b_{ij}$, then these formulas apply and enable the
computation of AB using $7k^3$ multiplications. Recursive
application of this idea enables the multiplication
of n x n matrices using only $n^{\log_2 7} \approx n^{2.81}$ multiplications.
This method is used in practice to multiply large
matrices.
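The seven products can be written out explicitly. The formulas below are the commonly stated form of Strassen's scheme, supplied here as an assumption since the text only cites them; the function names are illustrative. Each of `m1` to `m7` uses exactly one multiplication of entries, which could equally be k x k blocks.

```python
def strassen_2x2(A, B):
    """Multiply 2 x 2 matrices with seven multiplications (m1..m7)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def naive_2x2(A, B):
    """Usual algorithm (9): eight multiplications."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
product_strassen = strassen_2x2(A, B)
product_naive = naive_2x2(A, B)
```

The saving of one multiplication per 2 x 2 step, at the cost of extra additions, is what compounds under recursion into the exponent $\log_2 7$.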

We interpret Strassen's algorithm in terms of tensor
rank. The formula (9) for C = AB is a tensor
$\mu \in V \otimes V^* \otimes V^*$, where $V = M_{2 \times 2}(\mathbb{C})$. Each
multiplication is a rank-one tensor, and (9) exhibits $\mu$ as a
sum of eight rank-one tensors, so $\mu$ has rank at most
eight. Strassen's algorithm shows that $\mu$ has rank at
most seven. We now know that the rank of any tensor
in $V \otimes V^* \otimes V^*$ is at most seven, which shows how
Strassen's algorithm could have been anticipated.

The fundamental open question about the complexity
of multiplying n x n matrices is to determine the
rank $r_n$ of the multiplication tensor. Currently, we
have bounds only for $r_n$: we know that $r_n \geq O(n^2)$,
as matrices have $n^2$ entries, and improvements to
the idea behind Strassen's algorithm show that
$r_n < O(n^{2.3728639})$.

3.5 The Hardy-Weinberg Equilibrium 

We close with a simple application to Mendelian genet- 
ics. 

Suppose that a gene exists in a population in two 
variants (alleles) a and b. Individuals will have one 
of three genotypes, aa, ab, or bb, and their distribution
$p = (p_{aa}, p_{ab}, p_{bb})$ is a point in the probability
2-simplex. A fundamental and originally controversial
question in the early days of Mendelian genetics was the 
following: which distributions are possible in a popu- 
lation at equilibrium? (Assuming no evolutionary pres- 
sures, equidistribution of the alleles among the sexes, 
etc.) 

The proportions $q_a$ and $q_b$ of alleles a and b in the
population are

$$q_a = p_{aa} + \tfrac{1}{2}p_{ab}, \qquad q_b = \tfrac{1}{2}p_{ab} + p_{bb}, \qquad (10)$$

and the assumption of equilibrium is equivalent to

$$p_{aa} = q_a^2, \qquad p_{ab} = 2q_aq_b, \qquad p_{bb} = q_b^2. \qquad (11)$$


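If

$$\mathcal{A} = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 1 & 2 \end{pmatrix},$$

with columns corresponding to the genotypes aa, ab, and bb,
then (10) is $(q_a, q_b) = \tfrac{1}{2}\pi_\mathcal{A}(p_{aa}, p_{ab}, p_{bb})$, the
normalized algebraic moment map of examples 3 and 5
applied to $(p_{aa}, p_{ab}, p_{bb})$. Similarly, the assignment
$q \mapsto p$ of (11) is the parametrization $\beta$ of the
translated quadratic moment curve of example 3 given by
the Bernstein polynomials.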

Since $\tfrac{1}{2}\pi_\mathcal{A} \circ \beta(q) = q$, the population is at equilibrium
if and only if the distribution $(p_{aa}, p_{ab}, p_{bb})$ of alleles
lies on the translated quadratic moment curve, that is,
if and only if it is a point in the binomial distribution,
which we reproduce here:

[Figure: the curve of binomial distributions inside the simplex, n = 2.]



This is called the Hardy-Weinberg equilibrium after its 
two independent discoverers. 
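The equilibrium criterion is easy to test numerically. The sketch below implements (10) and (11) in exact rational arithmetic and declares a genotype distribution to be at equilibrium when applying (10) and then (11) returns it unchanged. The function names and the sample distributions are illustrative assumptions.

```python
from fractions import Fraction

def allele_proportions(p_aa, p_ab, p_bb):
    """Equation (10): proportions (q_a, q_b) of the alleles a and b."""
    return (p_aa + p_ab / 2, p_ab / 2 + p_bb)

def equilibrium_distribution(q_a, q_b):
    """Equation (11): genotype distribution determined by (q_a, q_b)."""
    return (q_a**2, 2 * q_a * q_b, q_b**2)

def is_at_equilibrium(p):
    """True if p lies on the curve, i.e. (11) applied to (10) returns p."""
    return equilibrium_distribution(*allele_proportions(*p)) == tuple(p)

q = (Fraction(3, 10), Fraction(7, 10))
on_curve = equilibrium_distribution(*q)
off_curve = (Fraction(1, 2), Fraction(0), Fraction(1, 2))
```

The distribution `on_curve` is (9/100, 42/100, 49/100) and passes the test, while `off_curve`, a population with no heterozygotes, has the same allele proportions as (1/4, 1/2, 1/4) but does not lie on the curve.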

The Hardy in question is the great English mathe- 
matician G. H. Hardy, who was known for his disdain 
for applied mathematics, and this contribution came 
early in his career, in 1908. He was later famous for his 
work in number theory, a subject that he extolled for its 
purity and uselessness. As we all now know, Hardy was 
mistaken on this last point, for number theory underlies
our modern digital world, from the security of financial
transactions via cryptography to using error-correcting
codes to ensure the integrity of digitally transmitted
documents, such as the one you have now finished
reading.

IV.40 General Relativity and Cosmology

George F. R. Ellis 


General relativity theory is currently the best classi- 
cal theory of gravity. It introduced a major new theme 
into applied mathematics by treating geometry as a 
dynamic variable, determined by the Einstein field
equations [III.10] (EFEs). After outlining the basic concepts
and structure of the theory, I will illustrate its
nature by showing how the EFEs play out in two of its 
main applications: the nature of black holes, and the 
dynamical and observational properties of cosmology. 

1 The Basic Structure: Physics in 
a Dynamic Space-Time 

General relativity extended special relativity (SR) by 
introducing two related new concepts. 

The first was that space-time could be dynamic. Not 
only was it not flat— so consideration of coordinate 
freedom became an essential part of the analysis— but 
it curved in response to the matter that it contained, 
via the EFEs (6). Consequently, as well as its dynam- 
ics, the boundary of space-time needs careful consid- 
eration, along with global causal relations and global 
topology. 

Second, gravity is not a force like any other known 
force: it is inextricably entwined with inertia and can 
be transformed to zero by a change of reference frame. 
There is therefore no frame-independent gravitational 
force as such. Rather, its essential nature is encoded 
in space-time curvature, generating tidal forces and 
relative motions. 

1.1 Riemannian Geometry and Physics


The geometry embodied in general relativity theory
is four-dimensional Riemannian geometry, determined
by a symmetric metric tensor $g_{ij}(x^k)$. The basic mathematical
tool needed to investigate this geometry is tensor
calculus. This generalizes vector calculus, where
vector fields are quantities whose components have
one free index, $X^i(x^m)$, to similar quantities whose
components have an arbitrary number of indices,
$T^{ij}{}_{k \cdots l}(x^m)$, some "up" and some "down," where in
the case of general relativity theory the indices i, j, k
range over 0, 1, 2, 3 (there are four dimensions).

General coordinate transformations are allowed, so
physical relations are best described through tensor
equations relating tensors with the same kinds of
indices, e.g., $T^{ij}{}_{k \cdots l}(x^m) = S^{ij}{}_{k \cdots l}(x^m)$. When
one changes coordinates (denoted here by primed
and unprimed indices), tensors transform in a linear
way:

$$T^{a'b'}{}_{c'd'} = T^{ab}{}_{cd}\,\Lambda^{a'}{}_{a}\,\Lambda^{b'}{}_{b}\,\Lambda^{c}{}_{c'}\,\Lambda^{d}{}_{d'} \qquad (1)$$

for suitable inverse transformation matrices $\Lambda^{b'}{}_{b}$, $\Lambda^{c}{}_{c'}$,
where we use the Einstein summation convention (i.e.,
repeated indices are summed over all index values).
This leads to the key feature that tensor equations
that are true in one coordinate system will be true in
any coordinate system. One can form linear combinations,
products, and contractions of tensors, all preserved
under transformations (1). (More details of tensors
and their manipulations are described in tensors
and manifolds [II.33].) One can thereby define symmetries
in the indices: they can be symmetric (denoted
by round brackets), antisymmetric (denoted by square
brackets), or trace free. Symmetries are preserved when
one changes coordinates, and they therefore define
physically meaningful aspects of variables.


1.1.1 Special Relativity Applies Locally at Each Point 


The way in which SR applies near each point, with four
dimensions (one time, three space), is determined by
the metric tensor $g_{ab}(x^c) = g_{ba}(x^c)$. This determines
distances along curves $x^a(\lambda)$ in space-time using the
fundamental relation

$$L = \int \sqrt{|ds^2|} = \int \sqrt{\left| g_{ab}\,\frac{dx^a}{d\lambda}\,\frac{dx^b}{d\lambda} \right|}\; d\lambda,$$

where the infinitesimal squared distance "$ds^2$" can be
positive or negative, as follows.


• $ds^2 < 0$ means time-like curves (traced out by particles
moving at less than the speed of light); in this
case, $ds^2 = -d\tau^2 < 0$ determines proper time¹ $\tau$
along this curve, measured by perfect clocks.

• $ds^2 = 0$ means null curves, representing motion at
the speed of light.

• $ds^2 > 0$ means this is not a possible particle
path, as it implies motion at a speed greater than
the speed of light; in this case, $ds^2 = +dl^2 > 0$
determines spatial distance l along the curve.

To see how this metric geometry works, we choose Car- 
tesian-like local coordinates 

$(x^0, x^1, x^2, x^3) = (t, x, y, z)$
such that, at the point of interest,

$g_{ab} = \mathrm{diag}(-1, 1, 1, 1)$,

with spatial distance given by $dr^2 = dx^2 + dy^2 + dz^2$.
Along a time-like curve

$x^a(\lambda) = (t(\lambda), x(\lambda), y(\lambda), z(\lambda))$

moving at speed $v = dr/dt$ relative to these coordinates,
we then have

$$d\tau^2 = -ds^2 = dt^2(1 - v^2) \;\Rightarrow\; \frac{dt}{d\tau} = \frac{1}{\sqrt{1 - v^2}},$$

the standard time-dilation factor of SR. Setting the
curve parameter $\lambda$ to be $\tau$, the 4-velocity is $u^a =
dx^a/d\tau \Rightarrow g_{ab}u^a u^b = -1$. Changing the coordinates
by a speed $v$ in the $x^1$-direction sets the nontrivial
components of the transformation matrix to
$A^{0'}{}_{0} = A^{1'}{}_{1} = \cosh\beta$, $A^{0'}{}_{1} = A^{1'}{}_{0} = -\sinh\beta$
and gives the standard
Lorentz transformation results for spatial distances
and time with $\tanh\beta = v$ (see invariants and
conservation laws [II.21] and tensors and manifolds
[II.33] for more on this area).
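The time-dilation factor and the invariance of the interval under a boost can be checked numerically. The following is a minimal sketch in units with $c = 1$; the function names `gamma`, `boost`, and `interval` are illustrative choices, and the boost uses the common sign convention $t' = \gamma(t - vx)$, which is an assumption rather than something fixed by the text.

```python
import math

def gamma(v):
    """Time-dilation factor dt/dtau = 1/sqrt(1 - v^2), in units with c = 1."""
    if not -1.0 < v < 1.0:
        raise ValueError("speed must satisfy |v| < 1 in units with c = 1")
    return 1.0 / math.sqrt(1.0 - v * v)

def boost(t, x, v):
    """Lorentz boost along x^1 with rapidity beta, where tanh(beta) = v."""
    beta = math.atanh(v)
    t_new = math.cosh(beta) * t - math.sinh(beta) * x
    x_new = -math.sinh(beta) * t + math.cosh(beta) * x
    return t_new, x_new

def interval(t, x):
    """Squared interval ds^2 = -t^2 + x^2 of a displacement from the origin."""
    return -t * t + x * x

if __name__ == "__main__":
    v = 0.6
    print("dt/dtau at v = 0.6:", gamma(v))
    t2, x2 = boost(2.0, 1.0, v)
    # The two printed intervals should agree up to rounding error.
    print("interval before and after:", interval(2.0, 1.0), interval(t2, x2))
```

Because $\cosh^2\beta - \sinh^2\beta = 1$, the interval is preserved for every rapidity, which is the two-dimensional content of the statement that tensor equations hold in all coordinate systems.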

Space-time has one time dimension (represented by
$g_{00} = -1$) and three spatial dimensions ($g_{11} = g_{22} =
g_{33} = 1$). It unifies objects that are separately defined
in Newtonian theory as four-dimensional entities, and
it shows how they relate to each other when relative
motion takes place. For example, the electric and magnetic
fields are unified in a skew tensor $F_{ab} = F_{[ab]}$
such that $E_a = F_{ab}u^b$, $H_a = \tfrac{1}{2}\eta_{adef}u^f F^{de}$, where $\eta_{abcd}$
is the totally skew-symmetric volume tensor. The way
in which these fields relate to each other when relative
motion takes place follows from the tensor transformation
law (1). The inverse metric $g^{mn}$, defined by $g^{mn}g_{nk} = \delta^m_k$,
is used to raise indices, $T^{ab} = g^{ac}g^{bd}T_{cd}$, while $g_{ab}$
lowers them.


1. We use units such that the speed of light c = 1. 
1.1.2 Covariant Derivatives and Connections

A key question is: how does one define vectors to be 
mutually parallel at points P and Q that are distant 
from each other? In a curved space-time this ques- 
tion has no absolute meaning; it can be defined only 
in terms of parallel transport along a curve joining 
the points. This is defined via a covariant derivative 
operator (denoted by a semicolon) that acts on all ten- 
sor fields and gives the covariant derivative along any 
curve (denoted by $\delta$), so that if $dx^a/d\lambda = X^a$ is the
tangent vector to the curve $x^a(\lambda)$, $\delta Y^a/\delta\lambda := Y^a{}_{;b}X^b$
is the covariant derivative of a vector field $Y^a$ along
these curves, which vanishes if and only if $Y^a$ is parallel
transported along $x^a(\lambda)$ (i.e., as $Y^a$ is translated along
the curve, it is kept parallel to its previous direction at
each infinitesimal step). It is determined by a geometric
structure called a connection $\Gamma$, which specifies which
vectors are parallel at neighboring points. Its components
$\Gamma^a{}_{bc}$ represent the $a$-component of the covariant
derivative of the $c$-basis vector in the direction of the
$b$-basis vector. Then, for a vector field $Y^a$,

$$Y^a{}_{;b} = Y^a{}_{,b} + \Gamma^a{}_{bc}Y^c, \qquad (2)$$

the first term $Y^a{}_{,b} := \partial Y^a/\partial x^b$ being the apparent
derivative relative to the basis vectors, and the second
term being due to the rate of change of the basis vectors
along the curve.

In particular, $\delta X^a/\delta\lambda := X^a{}_{;b}X^b$ vanishes if and only
if the curve direction is parallel transported along itself;
its direction is unchanging, so it is a geodesic (the closest
one can get to a straight line in a curved space-time).
The associated curve parameter $\lambda$ is defined up
to affine transformations $\lambda' = a\lambda + b$ ($a$, $b$ constants)
and is therefore called an affine parameter. Geodesics
represent the motion of matter moving subject only
to the effects of gravity and inertia (time-like curves:
$X^aX_a = -1$) and the paths of light rays (null curves:
$X^aX_a = 0$).

The covariant derivative extends to arbitrary tensors 
by assuming that 

(a) the covariant derivative of a function is just the
partial derivative, $f_{;c} = f_{,c} := \partial f/\partial x^c$, and

(b) it is linear, obeys the Leibniz rule, and commutes
with contractions.

All local physical equations should be written in terms
of covariant derivatives, e.g., Maxwell’s equations in a
curved space-time take the form $F_{[ab;c]} = 0$, $F^{ab}{}_{;b} = J^a$,
where $J^a$ is the 4-current.


1.1.3 Christoffel Relations 

The connection $\Gamma$ is determined by assuming (i) that
it is torsion free, $\Gamma^a{}_{bc} = \Gamma^a{}_{cb}$, and (ii) that the metric
tensor is parallel propagated along arbitrary curves,
$g_{ab;c} = 0$, which means that magnitudes are preserved
under parallel propagation. Together these requirements
determine the connection uniquely, linking parallelism
to metric properties. The connection components
are given by the Christoffel relations

$$\Gamma^m{}_{ab} = \tfrac{1}{2} g^{mn}(g_{an,b} + g_{bn,a} - g_{ab,n}). \qquad (3)$$

The connection is therefore defined by derivatives of 
the metric. 

This relation leads to the key result that geodesics 
are extremal curves in space-time (i.e., path length is 
maximized or minimized along them). 
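As an illustration of how the Christoffel relations turn metric derivatives into connection components, the sketch below evaluates them numerically for the unit 2-sphere, $ds^2 = d\theta^2 + \sin^2\theta\, d\phi^2$. The choice of example metric, the central-difference step `h`, and all function names are assumptions made for this sketch; a symbolic algebra package would normally be used instead.

```python
import math

def sphere_metric(x):
    """Metric g_ab of the unit 2-sphere in coordinates x = (theta, phi)."""
    theta = x[0]
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]

def inverse_2x2(g):
    """Inverse of a 2x2 matrix, used here for the inverse metric g^ab."""
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    return [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]

def metric_derivative(metric, x, c, h=1e-5):
    """Central-difference estimate of g_{ab,c} at the point x."""
    xp, xm = list(x), list(x)
    xp[c] += h
    xm[c] -= h
    gp, gm = metric(xp), metric(xm)
    n = len(x)
    return [[(gp[a][b] - gm[a][b]) / (2 * h) for b in range(n)]
            for a in range(n)]

def christoffel(metric, x):
    """gamma[m][a][b] = 1/2 g^{mn} (g_{an,b} + g_{bn,a} - g_{ab,n})."""
    n = len(x)
    g_inv = inverse_2x2(metric(x))
    # dg[c][a][b] holds the partial derivative g_{ab,c}
    dg = [metric_derivative(metric, x, c) for c in range(n)]
    gamma = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for m in range(n):
        for a in range(n):
            for b in range(n):
                gamma[m][a][b] = 0.5 * sum(
                    g_inv[m][k] * (dg[b][a][k] + dg[a][b][k] - dg[k][a][b])
                    for k in range(n))
    return gamma

if __name__ == "__main__":
    G = christoffel(sphere_metric, [1.0, 0.5])
    # Known closed forms: Gamma^theta_{phi phi} = -sin(theta) cos(theta),
    # Gamma^phi_{theta phi} = cot(theta).
    print(G[0][1][1], -math.sin(1.0) * math.cos(1.0))
    print(G[1][0][1], math.cos(1.0) / math.sin(1.0))
```

The result is symmetric in its two lower indices, reflecting the torsion-free assumption built into the relations.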

1.2 Space-Time Curvature and Field Equations 

Covariant derivatives do not commute in general, leading
to the concept of space-time curvature. For any
vector field $X^a$, the Ricci identity

$$X^a{}_{;bc} - X^a{}_{;cb} = R_d{}^a{}_{bc}X^d \qquad (4)$$

determines this noncommutativity, where $R^a{}_{dbc}$ is the
Riemann curvature tensor with symmetries

$$R_{abcd} = R_{[ab][cd]} = R_{cdab}, \qquad R_{a[bcd]} = 0.$$
This leads to holonomy: parallel transport of a vector 
around a closed loop from a point P back to P causes a 
change in the vector there (a rotation or Lorentz trans- 
formation because parallel transfer preserves magni- 
tudes). 

The Ricci tensor and the Ricci scalar are the first and
second contractions of this tensor, defined by

$$R_{bf} = R_f{}^a{}_{ba} \;\Rightarrow\; R_{bf} = R_{fb}, \qquad R = g^{bf}R_{bf}.$$

By (4) and (2), the Ricci tensor is given in terms of the
connection by

$$R_{bf} = \Gamma^a{}_{bf,a} - \Gamma^a{}_{af,b} + \Gamma^e{}_{bf}\Gamma^a{}_{ae} - \Gamma^e{}_{af}\Gamma^a{}_{be}.$$

On using the Christoffel relations (3), these are second-order
differential expressions in terms of the metric
tensor.

The curvature tensor obeys an important set of
integrability conditions, namely, the Bianchi identities

$$R_{ab[cd;e]} = 0$$

(roughly, the curl of the curvature tensor vanishes),
implying the key divergence relations

$$(R^{ab} - \tfrac{1}{2}Rg^{ab})_{;b} = 0 \qquad (5)$$

on contracting twice.
1.2.1 Field Equations

The matter present in space-time determines the space-time
curvature through the EFEs

$$R_{ab} - \tfrac{1}{2}Rg_{ab} + \Lambda g_{ab} = \kappa T_{ab} \;\Rightarrow\; T^{ab}{}_{;b} = 0, \qquad (6)$$

where $T_{ab}$ is the energy-momentum-stress tensor for
all matter present, $\kappa$ is the gravitational constant, and
$\Lambda$ is the cosmological constant. The implication in
(6) follows from the contracted Bianchi identities (5):
if the EFEs are satisfied, energy and momentum are
necessarily conserved.

Given suitable equations of state for the matter, these
equations determine the dynamical evolution of the
space-time, which through energy-momentum conservation
relations determines the evolution of the matter
in it. The commonly used model is $T_{ab} = (\rho + p)u_au_b +
pg_{ab}$, characterizing a “perfect fluid” with energy density
$\rho$ and pressure $p$, which are related by an equation
of state $p = p(\rho)$. If $p = 0$, we have the simplest case:
pressure-free matter (e.g., “cold dark matter”). Whatever
these relations may be, we usually require that various
energy conditions hold (e.g., $\rho > 0$, $(\rho + p) > 0$)
and additionally demand that the isentropic speed of
sound $c_s^2 = (\partial p/\partial\rho)_s$ obeys $0 \le c_s^2 \le 1$, as required for
local stability of matter (the lower bound) and causality
(the upper bound).

The field equations are second-order hyperbolic
equations for the metric tensor, with the matter tensor
as the source. A convenient and widely used form
is given by the ADM formalism (as described by Anninos),
which is used, for example, in numerical relativity
[V.15 §2.3] (the ADM formalism is named for its
authors: Arnowitt, Deser, and Misner).

1.3 The Geodesic Deviation Equation and 
Tidal Forces 

The geodesic deviation equation (GDE) determines
the change in the deviation vectors (relative position
vectors) linking a congruence of time-like geodesics.
Consider the normalized tangent vector field $V^a :=
dx^a(\tau)/d\tau$ for such a congruence $x^a(\tau, w)$, with
curves labeled by parameter $w$. Then $V^aV_a = -1$,
$\delta V^a/\delta\tau = V^a{}_{;b}V^b = 0$. A deviation vector $\eta^a :=
dx^a(w)/dw$ can be thought of as linking pairs of neighboring
geodesics in the congruence; it commutes with
$V^a$, so $\delta\eta^a/\delta\tau = V^a{}_{;b}\eta^b$. Choosing the deviation vectors
to be orthogonal to $V^a$: $\eta^aV_a = 0$, by (4) the GDE
takes the form

$$\frac{\delta^2\eta^a}{\delta\tau^2} = -R^a{}_{bcd}V^b\eta^cV^d, \qquad (7)$$

showing how curvature causes relative acceleration of
matter, i.e., tidal forces.

In this equation, the Ricci tensor is determined point
by point through the EFEs, but this is only part of the
curvature tensor. The rest of the curvature is given by
the Weyl tensor, which is defined by

$$C_{ijkl} := R_{ijkl} - \tfrac{1}{2}(R_{ik}g_{jl} + R_{jl}g_{ik} - R_{il}g_{jk} - R_{jk}g_{il}) + \tfrac{1}{6}R(g_{ik}g_{jl} - g_{il}g_{jk}), \qquad (8)$$

implying that it shares the symmetries of the Riemann
tensor but is also trace free: $C^a{}_{bad} = 0$. This tensor
represents the free gravitational field; that is, it is the
part of the curvature that is not determined pointwise
by the matter, thus enabling nonlocal effects such as
tidal forces and gravitational waves. As with the
electromagnetic field, it can be decomposed into electric
and magnetic parts,
$E_{ac} = C_{abcd}u^bu^d$, $H_{ac} = \tfrac{1}{2}\eta_{abef}u^b C^{ef}{}_{cd}u^d$; through
the GDE, these fields affect the relative motion of matter,
causing tidal effects. Because of the Bianchi identities,
they obey Maxwell-like equations, resulting in a
wave equation underlying the existence of gravitational
waves.

1.4 Types of Solutions and Generic Properties 

Although the EFEs are complicated nonlinear equations
for the metric tensor, numerous exact solutions are
known (and are described well in Stephani et al.’s book
Exact Solutions of Einstein’s Field Equations). These are
usually determined by imposing symmetries on the
metric, often reducing hyperbolic equations to ordinary
differential equations, and then solving for the free
metric functions. Symmetries are generated by Killing
vector fields $\xi^a(x^i)$; that is, vector fields that obey
Killing’s equation

$$\xi_{a;b} + \xi_{b;a} = 0$$

(the metric is dragged into itself invariantly along the
integral curves of the vector fields). They obey the relation
$\xi_{b;cd} = \xi^a R_{adcb}$, meaning that the initial data for
them at some point are $\xi_a|_0$, $(\xi_{[a;b]})|_0$. They form a Lie
algebra generating the symmetry group of a space-time;
the isotropy group of a point P is generated by those
Killing vectors for which $\xi_a|_0 = 0$ (they leave the point
fixed).

The most important symmetries in practice are 
spherical symmetry and spatial homogeneity, which 
form the bases of the following two sections of this article,
respectively. A space-time is stationary if it admits
a time-like Killing vector field and is static if this field
is additionally irrotational: $\xi_{[a;b}\xi_{c]} = 0$.
As well as finding exact and approximate solutions
of the EFEs, one can find important generic relations 
such as the Raychaudhuri equation. This equation 
underlies generic singularity theorems. One can also 
prove generic existence and uniqueness theorems for 
vacuum- and matter-filled space-times. There has also 
been huge progress in finding numerical solutions of 
the EFEs, based on the ADM formalism. 


2 The Schwarzschild Solution and Black Holes 

To model the exterior field of the sun or of any spherical
star, we look for a solution that is

(1) a vacuum solution (i.e., $T_{ab} = 0$), so that $R_{ab} = 0$ and
the curvature is purely due to the Weyl tensor (this will be
a good approximation outside some radius $r_S$ representing
the surface of the central star in a solar
system); and

(2) a spherically symmetric solution, as is true to high
approximation in the solar system (here we ignore
the rotation of the sun as well as the gravitational
fields of the planets).

We choose coordinates for which the metric is

$$ds^2 = -A(r, t)\, dt^2 + B(r, t)\, dr^2 + r^2 d\Omega^2,$$

where $d\Omega^2 = d\theta^2 + \sin^2\theta\, d\phi^2$ is the metric of a unit
2-sphere. To work out the field equations, we must

(i) calculate the connection components determined
by this metric,

(ii) determine, from these, the Ricci tensor components,
and

(iii) set these components to zero, and solve for $A(r, t)$,
$B(r, t)$.


Using the boundary conditions of asymptotic flatness,
$\{r \to \infty\} \Rightarrow \{A(r, t) \to 1, B(r, t) \to 1\}$, and determining
an integration constant by comparing with the
Newtonian solution for a star of mass $m$ ($= GM/c^2$ in
physical units), one gets

$$ds^2 = -\left(1 - \frac{2m}{r}\right) dt^2 + \left(1 - \frac{2m}{r}\right)^{-1} dr^2 + r^2 d\Omega^2. \qquad (9)$$

This is the Schwarzschild solution. It is an exact solution
of the EFEs, valid for $r > r_S$, where $r_S$ is the coordinate
radius of the surface of the central massive object; we
require that $r_S > 2m$, where $2m$ is the Schwarzschild
radius of the object and $m$ is its mass in geometrical
units. For the Earth $2m \approx 8.8$ mm, and for the sun $2m \approx
2.96$ km.
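The conversion from a mass in kilograms to a length $m = GM/c^2$ is a one-line computation. The sketch below uses approximate values of $G$, $c$, and the masses of the Earth and the sun; these constants and the function name `geometric_mass` are assumptions supplied for illustration and are not taken from the text.

```python
G = 6.67430e-11          # gravitational constant, m^3 kg^-1 s^-2 (approximate)
C = 299_792_458.0        # speed of light, m/s

EARTH_MASS_KG = 5.972e24  # approximate
SUN_MASS_KG = 1.989e30    # approximate

def geometric_mass(mass_kg):
    """Mass in geometrical units (a length in metres): m = G M / c^2."""
    return G * mass_kg / C ** 2

def schwarzschild_radius(mass_kg):
    """Coordinate radius r = 2m at which the metric (9) becomes singular."""
    return 2.0 * geometric_mass(mass_kg)

if __name__ == "__main__":
    print("Earth: 2m in mm =", 1e3 * schwarzschild_radius(EARTH_MASS_KG))
    print("Sun:   2m in km =", 1e-3 * schwarzschild_radius(SUN_MASS_KG))
```

With these inputs the Earth comes out at a little under 9 mm and the sun at a little under 3 km, so both bodies are vastly larger than their Schwarzschild radii and the exterior solution never reaches $r = 2m$.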


A remarkable result is hidden in the analysis above: 
A and B are in fact functions only of r. Consequently, 
a spherically symmetric vacuum exterior solution is 
necessarily static. This is Birkhoff’s theorem. 

Overall, this shows that the Schwarzschild solution 
is the valid exterior solution for every spherical object, 
no matter how it is evolving. It can be static, col- 
lapsing, expanding, pulsating; provided it is spheri- 
cally symmetric, the exterior solution is always the 
Schwarzschild solution! This expresses the fact that 
general relativity does not allow monopole or dipole 
gravitational radiation, so spherical pulsations cannot 
radiate their mass away as energy. 

The nature of particle orbits (time-like geodesics) follows
from this metric, allowing circular orbits for any
radius greater than $r = 3m$. The metric also determines
the light ray paths: the radial null rays in the solution
(9) have $ds^2 = 0 = d\theta^2 = d\phi^2$, giving

$$\frac{dt}{dr} = \pm\frac{1}{1 - 2m/r} \;\Leftrightarrow\; t = \pm r^* + \text{const.}, \qquad (10)$$

where

$$dr^* = \frac{dr}{1 - 2m/r} \;\Leftrightarrow\; r^* = r + 2m \ln\left(\frac{r}{2m} - 1\right).$$

This is the equation of the local null cones, showing
how light is bent by this gravitational field. In addition,
outgoing light experiences redshifting depending
on the radii $r_e$ and $r_o$, where the source and receiver are
located. The redshift $z$ measured for an object emitting
light of wavelength $\lambda_e$ that is observed with wavelength
$\lambda_o$ is given by $1 + z := \lambda_o/\lambda_e$. Wavelength $\lambda$ is related
to period $\Delta\tau$ by $\lambda = c\,\Delta\tau$, with proper time $\tau$ along a
world line $\{r = \text{const.}\}$ determined by the metric (9),
so that $\Delta\tau = (1 - 2m/r)^{1/2}\Delta t$.
Because $\Delta t_o = \Delta t_e$ follows from the null ray equation
(10), the redshift $z$ is

$$1 + z := \frac{\lambda_o}{\lambda_e} = \frac{\Delta\tau_o}{\Delta\tau_e} = \left(\frac{1 - 2m/r_o}{1 - 2m/r_e}\right)^{1/2}. \qquad (11)$$

This is gravitational redshift, caused by the change in
potential energy as photons climb out of the potential
well due to the central mass.
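The redshift formula is easy to evaluate for static emitters and observers outside $r = 2m$. The following sketch, in geometrical units, uses a hypothetical function name `redshift` and rejects radii at or inside $r = 2m$, where static world lines do not exist.

```python
import math

def redshift(r_emit, r_obs, m):
    """Gravitational redshift z between static world lines at r_emit and r_obs.

    Implements 1 + z = sqrt((1 - 2m/r_obs) / (1 - 2m/r_emit)).
    """
    for r in (r_emit, r_obs):
        if r <= 2.0 * m:
            raise ValueError("static emitters and observers require r > 2m")
    return math.sqrt((1.0 - 2.0 * m / r_obs) / (1.0 - 2.0 * m / r_emit)) - 1.0

if __name__ == "__main__":
    m = 1.0
    far_away = 1.0e12
    for r_e in (10.0, 3.0, 2.1, 2.001, 2.000001):
        print(r_e, redshift(r_e, far_away, m))
```

The printed values grow without bound as the emitter approaches $r = 2m$, while an emitter and observer at the same radius see no shift at all.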


2.1 The “Singularity” at r = 2m 

Something goes wrong with the metric at $r = 2m$. What
happens there? The following all seem to be the case.

(1) Singular metric. The metric components are singu- 
lar at r = 2m. 

(2) Radial infall. Considering the radial infall of a test
object (with nonzero rest mass), the proper time
taken to fall from $r = r_0 > 2m$ to $r = 2m$ is finite.
However, the coordinate time is given by $dt/d\tau =
(1 - 2m/r_0)^{1/2}/(1 - 2m/r)$, which diverges as $r \to
2m$. This suggests that the object never actually reaches that
radial value, whereas the object measures a finite
proper time for the trip.

(3) Infinite redshift. The redshift formula (11) shows
that $z \to \infty$ as $r_e \to 2m$.

(4) Unattainable surface. On radial null geodesics (10),
$dt/dr \to \pm\infty$ as $r \to 2m$. Accordingly, light rays from
the outside region ($r > 2m$) cannot reach the surface
$r = 2m$; rather, they become asymptotic to it as
they approach it. The same will hold true for time-like
geodesics.

(5) Geodesic incompleteness. These time-like and null
geodesics are incomplete; that is, they cannot be
extended to arbitrarily large values of their affine
parameter in the region $2m < r < \infty$. In particular,
for time-like geodesics the proper time along the
geodesics from $r_0$ to $2m$ is finite. This shows that
the geodesic, moving inward as $t$ increases, cannot
be extended to infinite values of its affine parameter
(namely, proper time).

(6) Nonstatic interior. Going to the interior part (0 < 
r < 2m), the solution changes from being static 
to being spatially homogeneous but evolving with 
time. This is because the essential metric depend- 
ence is still with the coordinate r , which has changed 
from being a coordinate measuring spatial distances 
to one measuring time changes. The nature of the 
space-time symmetry therefore changes completely 
for r < 2m. 

(7) A physical singularity? To try to see if the singularity
at $r = 2m$ is a physical singularity or a coordinate
singularity, we look at scalars (because they
are coordinate-independent quantities) constructed
from mathematical objects that describe the curvature.
Because the solution is a vacuum solution
($R_{ab} = 0$), both $R = R^a{}_a$ and $R_{ab}R^{ab}$ vanish. The
simplest nonzero scalar is the Kretschmann scalar

$$R_{abcd}R^{abcd} = 48m^2/r^6.$$

This is finite at $r = 2m$ but diverges as $r \to 0$.
This suggests (but does not prove) that $r = 2m$ is
a coordinate singularity, that is, there is no problem
with the space-time but rather the coordinates break
down there, and so we can get rid of the singularity
by choosing different coordinates. We prove that
this supposition is correct by making extensions of
the solution across the surface $r = 2m$. One can do
this by attaching coordinates to either time-like or
null geodesics that cross this surface.

Figure 1 Radial null geodesics and local light cones in the
Schwarzschild solution in Eddington-Finkelstein coordinates
(reproduced from Hawking and Ellis (1973), with permission).
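Several of the observations listed above can be made concrete with closed-form expressions. The sketch below, in geometrical units, contrasts the finite proper time of radial infall from rest with the logarithmically divergent coordinate time of an ingoing light ray, and evaluates the Kretschmann scalar. The cycloid formula for the infall proper time is the standard closed-form solution and is not derived in the text; all function names are illustrative.

```python
import math

def tortoise(r, m):
    """Tortoise coordinate r* = r + 2m ln(r/2m - 1), defined for r > 2m."""
    return r + 2.0 * m * math.log(r / (2.0 * m) - 1.0)

def null_coordinate_time(r_start, r_end, m):
    """Coordinate time for an ingoing radial light ray (t = -r* + const.)."""
    return tortoise(r_start, m) - tortoise(r_end, m)

def infall_proper_time(r0, r, m):
    """Proper time to fall radially from rest at r0 down to r (0 < r <= r0).

    Standard cycloid solution: r = (r0/2)(1 + cos eta),
    tau = sqrt(r0^3 / (8m)) (eta + sin eta).
    """
    eta = math.acos(2.0 * r / r0 - 1.0)
    return math.sqrt(r0 ** 3 / (8.0 * m)) * (eta + math.sin(eta))

def kretschmann(r, m):
    """Curvature invariant R_abcd R^abcd = 48 m^2 / r^6."""
    return 48.0 * m * m / r ** 6

if __name__ == "__main__":
    m, r0 = 1.0, 10.0
    print("proper time to reach r = 2m:", infall_proper_time(r0, 2.0 * m, m))
    for eps in (1e-3, 1e-6, 1e-9):
        print("light ray, coordinate time to r = 2m +", eps, ":",
              null_coordinate_time(r0, 2.0 * m + eps, m))
    print("Kretschmann scalar at r = 2m:", kretschmann(2.0 * m, m))
```

Each factor of $10^{-3}$ in the remaining distance to $r = 2m$ adds about $2m \ln 10^3$ to the coordinate time, whereas the proper time and the curvature invariant remain finite there.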

2.2 Schwarzschild Null Coordinates 

Defining the coordinates $v_+$, $v_-$ by $v_\pm = t \pm r^*$, then
$dv_\pm = dt \pm dr/(1 - 2m/r)$; the ingoing null geodesics
are $\{v_+ = \text{const.}\}$ and the outgoing ones are $\{v_- =
\text{const.}\}$. We change to coordinates $(v, r, \theta, \phi)$, where
$v_+ = v$. The metric is then

$$ds^2 = -\left(1 - \frac{2m}{r}\right) dv^2 + 2\, dv\, dr + r^2 d\Omega^2. \qquad (12)$$

This is the Eddington-Finkelstein form of the metric.
The transformation has succeeded in getting rid of the
singularity at $r = 2m$. The coordinate transformation
(which is singular at $r = 2m$) extends the original
space-time region (denoted region I) defined by
$2m < r < \infty$ to a new region (denoted region II) defined
by $0 < r < 2m$. It is an analytic extension across the
surface $r = 2m$ of the outside region I to the inside
region II.

Plotting these local null cones and light rays in the 
space with coordinates ( t*,r ), we get the Eddington- 
Finkelstein diagram (figure 1). We can observe the 
following features in the figure. 

(1) The surfaces of constant r are vertical lines in this 
diagram. The surfaces of constant t are nearly flat at 
large distances but bend down and never cross the sur- 
face r = 2m. In fact, t diverges at this surface; and it is 
this bad behavior of the t-coordinate that is responsible 
for the coordinate singularity at r = 2m. This is why 
the coordinate time diverges for a freely falling particle 
that crosses this surface. 
(2) The null cones tilt over, the inner ray always being
at 45° inward, but the outer one (pointing outward for 
r > 2m) becomes vertical at r = 2m and points inward 
for r < 2m. The surface r = 2m is therefore a null sur- 
face (a light ray emitted outward at r = 2m stays at 
that distance from the center forever). Because of this, 
it is a trapping surface: particles that have fallen in and 
crossed this surface from the outside region I to the 
inside II can never get out again. Once in region II their 
future is inevitably to be crushed by divergent tidal 
forces near the singularity at r = 0. 

(3) Conversely, $r = 2m$ is an event horizon, hiding its
interior from the view of outside observers. If we consider
an observer who is static at $r = r_1 > 2m$, her
world line is a vertical line in this diagram. Her past 
light cone never reaches inside r = 2m, so no sig- 
nal from that region can reach her. This space-time 
may therefore reasonably be called a black hole, for no 
radiation emitted by the inside region II can reach the 
outside region I. 

(4) If the outside observer drops a probe into the cen- 
ter, then it crosses the event horizon r = 2m in a finite 
proper time but takes an infinite coordinate time t to 
get there because t diverges there. If it emits pulses 
at regular intervals (say, every second), these will be 
received by the outside observer at longer and longer 
time intervals. If the probe crosses the event horizon 
at 12:00 according to its internal clock and sends out a 
radio signal at that time, the signal will never reach the 
outside observer; it stays at r = 2m forever. Every sig- 
nal sent before then will (eventually) reach the outside 
observer and every signal sent afterward will fall into 
the central singularity. The infinite slowing down of 
the received signals as the probe approaches the event 
horizon will result in the redshift in received signals
diverging as $r_e \to 2m$.
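The tilting of the null cones described in this list can be read off directly from the metric (12). Setting $ds^2 = 0$ for radial rays gives either $dv = 0$ (the ingoing ray) or $dr/dv = \tfrac{1}{2}(1 - 2m/r)$ (the other ray). The sketch below, with the illustrative function name `outgoing_dr_dv`, shows the sign change of that second slope at $r = 2m$.

```python
def outgoing_dr_dv(r, m):
    """Slope dr/dv of the non-ingoing radial null ray in the metric (12).

    From -(1 - 2m/r) dv^2 + 2 dv dr = 0 with dv != 0.
    """
    if r <= 0.0:
        raise ValueError("r must be positive")
    return 0.5 * (1.0 - 2.0 * m / r)

if __name__ == "__main__":
    m = 1.0
    for r in (4.0, 2.5, 2.0, 1.5, 0.5):
        slope = outgoing_dr_dv(r, m)
        if slope > 0:
            direction = "moves outward"
        elif slope == 0:
            direction = "stays at r = 2m"
        else:
            direction = "moves inward"
        print(r, slope, direction)
```

Outside $r = 2m$ the ray gains radius, at $r = 2m$ it hovers, and inside it is dragged inward, which is the trapping behaviour that makes the surface an event horizon.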

This analysis shows convincingly that r = 2m is a 
null surface (the event horizon) at which the original 
coordinates go wrong; the space-time can be extended 
across this null surface by a change to new coordin- 
ates. However, there is a problem with what we have 
so far: namely, the original solution is time symmetric. 
The extension is not, as is obvious from the Eddington- 
Finkelstein diagram. 

2.2.1 The Time-Reversed Extension 

We can make another similar Eddington-Finkelstein 
extension in which we choose the other direction 


of time for the extension by using the other null 
coordinate. 

To be more precise, upon changing from $(t, r, \theta, \phi)$
to coordinates $(w, r, \theta, \phi)$, where $v_- = w$, the metric
now takes the form

$$ds^2 = -\left(1 - \frac{2m}{r}\right) dw^2 - 2\, dw\, dr + r^2 d\Omega^2,$$

which is the time-reverse of (12). This is also an Eddington-Finkelstein
form of the metric, now extending the
original space-time region I, defined by $2m < r < \infty$, to
a further new region II', defined by $0 < r < 2m$. This
leads to a time-reversed Eddington-Finkelstein picture 
of the local light cones. This time it is the ingoing null 
geodesics that are badly behaved at r = 2m (the out- 
going ones cross this surface with no trouble). The sur- 
face r = 2m is again a null cone, but this time it rep- 
resents the surrounding horizon of a white hole: sig- 
nals can come out of it but not go into it. The exterior 
observer can receive messages from region II' but can 
never send signals there. The surface r = 2m is now 
the same as $t = -\infty$. This divergence is the reason the
original metric went wrong at this surface. 

How do we know that region II is different from
region II'? In the original metric, both the past and the
future inward-pointing radial null geodesics through 
each event q in region I were incomplete. The first 
extension completed the future-ingoing null geodesics 
but not the past-ingoing ones; the second extension 
completed the past-ingoing null geodesics but not the 
future-ingoing ones. 

2.3 Kruskal-Szekeres Coordinates 

The question now is whether we can make both exten- 
sions simultaneously. The time asymmetry of each 
extension is because we used only one null coordinate 
in each case. To obtain a time-symmetric extension, we 
must use both null coordinates. Indeed, if we change 
to coordinates $(v, w, \theta, \phi)$, we obtain the double null
form of the metric:

$$ds^2 = -\left(1 - \frac{2m}{r}\right) dv\, dw + r^2 d\Omega^2,$$

where $\tfrac{1}{2}(v - w) = r + 2m\ln((r/2m) - 1)$ defines
$r(v, w)$ (both quantities being equal to $r^*$), and $t =
\tfrac{1}{2}(v + w)$. This is a time-symmetric double null form
but is singular at $r = 2m$. However, if we rescale the
null coordinates, we can attain what we want. Defining
$V = e^{v/4m}$, $W = -e^{-w/4m}$, we obtain a new set of
double null coordinates for the Schwarzschild solution.
Figure 2 The Kruskal diagram, representing the maximal
extension of the Schwarzschild solution in conformally flat 
coordinates (reproduced from Hawking and Ellis (1973), 
with permission). 

The metric becomes

$$ds^2 = -\frac{32m^3}{r}\, e^{-r/2m}\, dV\, dW + r^2 d\Omega^2.$$

This metric form is regular at $r = 2m$ and gives us
the extension we want. The light cones are given by
$V = \text{const.}$, $W = \text{const.}$ It is convenient to define
the associated time and radial coordinates by
$x' = \tfrac{1}{2}(V - W)$, $t' = \tfrac{1}{2}(V + W)$. These are the
Kruskal-Szekeres coordinates, giving the Kruskal diagram in
figure 2, the complete time-symmetric extension of the
Schwarzschild solution with light rays at ±45°.
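The map from Schwarzschild coordinates to Kruskal-Szekeres coordinates can be sketched for the exterior region $r > 2m$ as follows. The function names are illustrative, and the code covers region I only; the other regions require different sign choices for $V$ and $W$, which are not attempted here.

```python
import math

def tortoise(r, m):
    """Tortoise coordinate r* = r + 2m ln(r/2m - 1), defined for r > 2m."""
    return r + 2.0 * m * math.log(r / (2.0 * m) - 1.0)

def kruskal(t, r, m):
    """Kruskal-Szekeres coordinates (x', t') of a point (t, r) in region I.

    Uses v = t + r*, w = t - r*, V = exp(v/4m), W = -exp(-w/4m),
    x' = (V - W)/2, t' = (V + W)/2.
    """
    r_star = tortoise(r, m)
    V = math.exp((t + r_star) / (4.0 * m))
    W = -math.exp(-(t - r_star) / (4.0 * m))
    return 0.5 * (V - W), 0.5 * (V + W)

def hyperbola_constant(r, m):
    """Value of x'^2 - t'^2 = -V W = exp(r/2m) (r/2m - 1) on a surface r = const."""
    return math.exp(r / (2.0 * m)) * (r / (2.0 * m) - 1.0)

if __name__ == "__main__":
    m, r = 1.0, 3.0
    for t in (-3.0, 0.0, 5.0):
        x_k, t_k = kruskal(t, r, m)
        # The last two columns should agree: r = const. is a hyperbola.
        print(t, x_k, t_k, x_k ** 2 - t_k ** 2, hyperbola_constant(r, m))
```

The quantity $x'^2 - t'^2$ depends only on $r$ and tends to zero as $r \to 2m$, so the coordinate singularity of the original chart is mapped to the pair of null lines through the origin of the diagram.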

It is important to remember that this is a cross section
of the full space-time; in fact, each point represents
a 2-sphere of area $4\pi r^2$ of the full space-time
(we have suppressed the coordinates $(\theta, \phi)$ in order to
draw this diagram). The whole solution is time symmetric,
as desired. The most important new feature is that
there is a new region I' in addition to the three regions
I, II, and II' already identified. Let us see why this is so.

The region I in this diagram corresponds to the
same region I in both the Eddington-Finkelstein diagrams.
The region $t' > -x'$ (bounded on the left by
the line at -45° through the origin) corresponds to
the first extension to region II, completing the future-ingoing
null geodesics. This part of the Kruskal diagram
corresponds point by point to the first Eddington-Finkelstein
diagram; in particular, the vertical null
geodesics at $r = 2m$ correspond to the null geodesic
at +45° through the origin in the Kruskal diagram.
The region $t' < x'$ (bounded on the left by the line
at +45° through the origin) corresponds to the second
extension to region II', completing the past-ingoing
null geodesics (moving in the opposite direction to
the future-ingoing ones). This part of the Kruskal diagram
corresponds point by point to a time-reversed
Eddington-Finkelstein diagram.

Now consider a point in region II. The past-outgoing 
(i.e., moving to the right) null geodesics cross r = 2m 
to the asymptotically flat region I. The past-ingoing (i.e., 
moving to the left) null geodesics are completely sym- 
metric with them; they must cross r = 2 m to an asymp- 
totically flat region I' that is identical to region I. Simi- 
larly, for points in II', the outgoing future-directed null 
geodesics cross r = 2 m to region I, and the ingoing 
future-directed null geodesics must cross r = 2m to 
an identical region I". What is perhaps not immedi- 
ately obvious is that this is the same region as I'. The 
following features then arise. 

(1) The surfaces of constant $r$ are given by $x'^2 - t'^2 =
\text{const.}$ and therefore correspond to the hyperbolae at
constant distance from the origin in flat space-time.
They are space-like for 0 < r < 2m and time-like for 
r > 2m (with two surfaces occurring for each value 
of r). There are two singularities at r = 0: one in the 
past (t' < 0) and one in the future (t' > 0). The 
surfaces r = 2m are the two intersecting null sur- 
faces through the origin. The surfaces of constant t are 
the straight lines through the origin. This coordinate 
diverges at both surfaces r = 2m. 

(2) The null surface segments {r = 2m, t' > 0} 
(obviously representing motion at the speed of light) 
are trapping surfaces: particles that have fallen in and 
crossed this surface from either outside region to the 
inside region II can never get out again; they inevitably 
fall into the future singularity at {r = 0, t' > 0}, where 
they are crushed by infinite tidal forces. Each of these 
segments is also an event horizon, hiding its interior 
from the view of outside observers. They bound the 
black hole region II of the space-time, from which no 
light or other radiation that is emitted can reach the 
outside. The details of dropping in a probe from the 
outside can be followed in this diagram, revealing again 
the same effects as discussed above. Thus we have (vac- 
uum) black holes, bounded by event horizons — from 
which light cannot escape. 

(3) There are two curvature singularities, each of which 
has a space-like nature, one in the past and one in the 
future. The complete solution has both a white hole and 
a black hole singularity (the former emitting particles 
and radiation into the space-time, the latter receiving 
them). 


It is particularly important to note here that, unlike 
in the electromagnetic or Newtonian gravity cases, we 
do not find a singular time-like worldline at the center 
of the Schwarzschild solution, which would represent a 
particle generating the solution. Here, the strong gravitational
field for $r < 2m$ profoundly alters the nature
of the singularity from what we first expected.

(4) There is also an unexpected global topology, with
two asymptotically flat spaces back to back, joined by
a neck called a wormhole. To see this, consider the surface
$\{t' = 0\}$; the area of the 2-spheres $\{r = \text{const.}\}$,
which diverges as $x' \to \infty$, decreases to a minimum
value $A_* = 4\pi(2m)^2$ at $x' = 0$ and then diverges
again as $x' \to -\infty$. However, one cannot communicate
between the two asymptotically flat regions through
the wormhole because only space-like curves can pass
through.

(5) The nature of the space-time symmetry changes 
at the surfaces r = 2m: here, there is a transition 
from a static to an evolving (Kantowski-Sachs) uni- 
verse. The event horizon is therefore also a Killing hori- 
zon, generated by null Killing vectors that continue to 
be space-like on one side and time-like on the other. 

(6) This solution is indeed the maximal extension of the 
initial Schwarzschild space-time: all geodesics in it are 
either complete (that is, they go to infinity) or they run 
into one of the singularities where r = 0. Hence, no 
further extensions are possible (all curves end up at a 
singularity or at infinity). 

2.4 Gravitational Collapse to Black Holes 

I have dealt with the maximal Schwarzschild solution 
at length because it is such a good example of general 
relativity. The subtleties of coordinates and their limits, 
the interpretation of time-like and null geodesics, and 
the examination of global topology are all involved in 
understanding all of this complex structure hidden in 
the seemingly innocuous Schwarzschild metric (9). But 
is it just a mathematical solution with no relevance to 
the real universe? What is its relation to astrophysics? 

Black holes play an important role in high-energy 
astrophysics: they occur at the endpoint of the lifetime 
of massive stars; they are at the center of many galax- 
ies, where they are surrounded by accretion disks; and 
they are the powerhouse for high-energy processes in 
quasistellar objects. The collapse of a spherical object 
to a black hole starts off in a regular space-time region 
with no horizons; these develop as matter coalesces
and the gravitational field intensifies, causing future-directed
light rays to be trapped. There is no singularity
in the past, but one occurs in the future, hidden inside 
the event horizon. There is no wormhole through to 
another space-time region because the infalling matter 
cuts it off. 

Diagrams of the resulting space-times are given in 
Hawking and Ellis (1973) and in Ellis and Williams 
(2000). Rotation will complicate matters considerably 
(we get a Kerr black hole instead of a Schwarzschild 
solution). The role of black holes in astrophysics is 
discussed in depth by Begelman and Rees (2009). 

3 Cosmology 

In the previous example the curvature was pure Weyl
curvature (because $R_{ab} = 0$); space-time is unaffected
by matter properties because it is a vacuum solution. In
the next example there is, by contrast, pure Ricci curvature;
there is no free gravitational field ($C_{abcd} = 0$).
The evolution of space-time is determined by its matter
content.

Gravity governs the universe in the large. On a large 
enough scale, the observed universe can be represented 
well by a Robertson-Walker metric, which is spatially 
homogeneous (no spatial point is preferred over any 
other) and isotropic (no direction in the sky is preferred 
over any other). Comoving coordinates can be chosen 
so that the metric takes the form 

$$ds^2 = -dt^2 + S^2(t)\,d\sigma^2, \qquad u^a = \delta^a_0 \quad (a = 0, 1, 2, 3), \tag{13}$$

where $S(t)$ is the time-dependent scale factor, and the
worldlines with tangent vector $u^a = dx^a/dt$ represent
the histories of fundamental observers. The space sections
$\{t = \text{const.}\}$ are surfaces of homogeneity and
have maximal symmetry: they are 3-spaces of constant
curvature $K = k/S^2(t)$, where $k$ is the sign of
$K$. The metric $d\sigma^2$ is a 3-space of normalized constant
curvature $k$; coordinates $(r, \theta, \phi)$ can be chosen such
that

$$d\sigma^2 = dr^2 + f^2(r)(d\theta^2 + \sin^2\theta\, d\phi^2), \tag{14}$$

where $f(r) = \{\sin r, r, \sinh r\}$ if $k = \{+1, 0, -1\}$,
respectively.

The metric is time-dependent, with distances between
all fundamental observers scaling as $S(t)$ and
the rate of expansion at time $t$ characterized by the
Hubble parameter $H(t) = \dot{S}/S$. To determine the metric’s
evolution in time, one applies the EFEs to the
metric (13), (14). Because of local isotropy, the matter





tensor $T_{ab}$ necessarily takes a perfect fluid form relative
to the preferred worldlines with tangent vector
$u^a$. The integrability conditions (5) for the EFEs are the
energy-density conservation equations, which reduce to

$$T^{ab}{}_{;b} = 0 \iff \dot{\rho} + (\rho + p)\,3\dot{S}/S = 0. \tag{15}$$

This becomes determinate when a suitable equation of
state relates the pressure $p$ to the energy density $\rho$.
Baryons have $p_b = 0$ and radiation has $p_r = \rho_r/3$, $\rho_r = aT_r^4$,
which by (15) implies that $\rho_b \propto S^{-3}$, $\rho_r \propto S^{-4}$,
$T_r \propto S^{-1}$. Radiation dominates at early times (the hot
big bang era), matter dominates at late times; at very 
early times (the inflationary era), a scalar field may have 
been dynamically dominant. 

The scale factor $S(t)$ generically obeys the Raychaudhuri
equation:

$$3\ddot{S}/S = -\tfrac{1}{2}\kappa(\rho + 3p) + \Lambda, \tag{16}$$


where $\kappa$ is the gravitational constant and $\Lambda$ the cosmological
constant. This shows that the active gravitational
mass density of the matter and fields present is
$\rho_{\rm grav} := \rho + 3p$, summed over all matter present. For
ordinary matter this will be positive, so ordinary matter
will tend to cause the universe to decelerate ($\ddot{S} < 0$).
A positive cosmological constant on its own will cause
an accelerating expansion ($\ddot{S} > 0$). When matter and a
cosmological constant are both present, either result
may occur depending on which effect is dominant. The
first integral of equations (15), (16) when $\dot{S} \neq 0$ is the
Friedmann equation:


$$\frac{\dot{S}^2}{S^2} = \frac{\kappa\rho}{3} + \frac{\Lambda}{3} - \frac{k}{S^2}. \tag{17}$$


Models with a Robertson-Walker geometry with met- 
ric (13), (14) and dynamics governed by equations 
(15), (16), (17) are called Friedmann-Lemaître (FL) universes.

The simplest solution is the Einstein-de Sitter uni- 
verse, with flat spatial sections and baryonic matter 
content, and no cosmological constant. We then have 

$$\{k = 0,\ p = 0,\ \Lambda = 0\} \Rightarrow S(t) = t^{2/3}.$$
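The Einstein-de Sitter solution can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes pressure-free matter with $\rho \propto S^{-3}$, so that the Friedmann equation (17) with $k = 0$ and $\Lambda = 0$ reduces to $\dot{S}^2 = C/S$, and it fixes the constant at $C = 4/9$ (an illustrative normalization chosen so that $S(1) = 1$ lies on the curve $S = t^{2/3}$). The function names are hypothetical.

```python
import math

def friedmann_rhs(S, C=4.0 / 9.0):
    """dS/dt from (17) with k = 0, Lambda = 0 and dust, where rho is proportional to S**-3."""
    return math.sqrt(C / S)

def integrate_scale_factor(t_start, S_start, t_end, steps=10000):
    """Classical fourth-order Runge-Kutta integration of dS/dt = friedmann_rhs(S)."""
    h = (t_end - t_start) / steps
    S = S_start
    for _ in range(steps):
        k1 = friedmann_rhs(S)
        k2 = friedmann_rhs(S + 0.5 * h * k1)
        k3 = friedmann_rhs(S + 0.5 * h * k2)
        k4 = friedmann_rhs(S + h * k3)
        S += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return S

# Start on the curve S = t**(2/3) at t = 1 and integrate forward to t = 8.
S_numeric = integrate_scale_factor(1.0, 1.0, 8.0)
S_exact = 8.0 ** (2.0 / 3.0)
print(S_numeric, S_exact)
```

The numerical value at $t = 8$ should agree closely with $8^{2/3} = 4$.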


The FL models are the standard models of modern 
cosmology, and they are surprisingly effective in view 
of their extreme geometrical simplicity. One of their 
great strengths is their explanatory role in terms of 
making explicit the way the local gravitational effect 
of matter and radiation determines the evolution of 
the universe as a whole, with this in turn forming 
the dynamic background for local physics (includ- 
ing determining the evolution of matter and radiation 
themselves). 


3.1 An Initial Singularity? 

The universe is presently expanding; following it back 
into the past, it was ever smaller. A key issue is whether 
it had a beginning or not. The Raychaudhuri equation 
(16) leads directly to the following result. 

The Friedmann-Lemaître universe singularity theorem.
In an FL universe with $\Lambda \le 0$ and $\rho + 3p > 0$ at all
times, at any instant $t_0$ when $H_0 = (\dot{S}/S)_0 > 0$, there
is a finite time $t_*$ such that $S(t) \to 0$ as $t \to t_*$, where
$t_0 - (1/H_0) < t_* < t_0$. If $\rho + p > 0$, the universe starts
at a space-time singularity there: $\{t \to t_*\} \Rightarrow \{\rho \to \infty\}$.

This is not merely a start to matter — it is a start to 
space, to time, to physics itself. It is the most dramatic 
event in the history of the universe; it is the start of the 
existence of everything. The underlying physical fea- 
ture is the nonlinear nature of the EFEs: going back into 
the past, the more the universe contracts, the higher the 
active gravitational density, causing it to contract even 
more. The pressure $p$ that one might have hoped would
help stave off the collapse makes it even worse because
$p$ enters algebraically into the Raychaudhuri equation
(16) with the same sign as the energy density $\rho$.

This conclusion can, in principle, be avoided by intro- 
ducing a cosmological constant, but in practice, this 
cannot work because its value is too small. However, 
energy-violating matter components such as a scalar 
field can avoid this conclusion if they dominate at early 
enough times. In the case of a single scalar field $\phi$
with space-like surfaces of constant density, on choosing
$u^a$ orthogonal to these surfaces, the stress tensor
has a perfect fluid form with $\rho = \tfrac{1}{2}\dot{\phi}^2 + V(\phi)$,
$p = \tfrac{1}{2}\dot{\phi}^2 - V(\phi)$, and so $\rho + 3p = 2\dot{\phi}^2 - 2V(\phi)$. The
slow-rolling case is $\dot{\phi}^2 \ll V(\phi)$, leading to $\rho + p =
\dot{\phi}^2 \simeq 0 \Rightarrow \rho + 3p \simeq -2\rho < 0$. Quantum fields can in
principle therefore avoid the conclusion that there was 
a start to the universe. The jury is still out as to whether 
or not this was the case. 

3.2 Observational Relations 

To be good models of the real universe, cosmological 
models must predict the right results for astronom- 
ical observations. To determine light propagation in 
a Robertson-Walker geometry, it suffices to consider 
only radial null geodesics: $\{ds^2 = 0,\ d\theta = 0 = d\phi\}$ (by
the symmetries of the model, these are equivalent to
generic geodesics). From the Robertson-Walker metric
(13), we then find that for light emitted at time $t_e$ and
received at time $t_0$, the comoving radial coordinate distance
$u(t_0, t_e) := r_0 - r_e$ between comoving emitters
and receivers is given by


$$u(t_0, t_e) = \int_{t_e}^{t_0} \frac{dt}{S(t)} = \int_{S_e}^{S_0} \frac{dS}{S\dot{S}}, \tag{18}$$

with $\dot{S}$ given by the Friedmann equation (17). The function
$u(t_0, t_e)$ therefore encodes information on both
spatial curvature and the space-time matter content.

The key quantities related to cosmological obser- 
vations are redshift, area distance (or apparent size), 
and local volume, corresponding to some increment in 
distance (determining number counts). From (18), the 
cosmological redshift $z_c$ is given by

$$1 + z_c = \frac{S(t_0)}{S(t_e)}. \tag{19}$$


The same ratio of observed to emitted light holds for 
all wavelengths, a key identifying property of redshift. 

The apparent angular size $\alpha$ of an object at redshift
$z_c$ and of linear size $l$ is given by

$$l/\alpha = f(u)S(t_e) = f(u)(1 + z)^{-1}S(t_0). \tag{20}$$


Measures of apparent sizes as functions of redshift will
therefore determine $f(u)$ if the source physical size is
known. The flux of radiation $F$ measured from a point
source of luminosity $L$ emitting radiation isotropically
is given by the fraction of radiant energy emitted by the
source in a unit of time that is received by the telescope:

$$F = \frac{L}{4\pi f^2(u)S^2(t_0)(1 + z)^2} \tag{21}$$

(the two redshift factors account firstly for the time 
dilation between observer and source, and secondly for 
loss of energy due to redshifting of photons). Measures 
of apparent luminosity $m$ as a function of redshift will
therefore also determine $f(u)$ if the source intrinsic
luminosity is known.
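As an illustration of how the redshift, area distance, and flux relations combine, the sketch below evaluates them for the Einstein-de Sitter case $S(t) = t^{2/3}$ with flat space sections, so that $f(u) = u$. The closed form used for $u$ follows from integrating $dt/S(t)$ between $t_e$ and $t_0$ and eliminating $t_e$ through $1 + z = S(t_0)/S(t_e)$; the function names, the unit luminosity, and the choice $t_0 = 1$ are assumptions made only for this example.

```python
import math

def comoving_distance(z, t0=1.0):
    """u(t0, te) for S(t) = t**(2/3), with te eliminated via 1 + z = S(t0)/S(te)."""
    return 3.0 * t0 ** (1.0 / 3.0) * (1.0 - (1.0 + z) ** -0.5)

def area_distance(z, t0=1.0):
    """l / alpha = f(u) S(te), with f(u) = u for k = 0."""
    S_e = t0 ** (2.0 / 3.0) / (1.0 + z)
    return comoving_distance(z, t0) * S_e

def flux(z, luminosity=1.0, t0=1.0):
    """F = L / (4 pi f(u)^2 S(t0)^2 (1 + z)^2) for an isotropic point source."""
    S_0 = t0 ** (2.0 / 3.0)
    u = comoving_distance(z, t0)
    return luminosity / (4.0 * math.pi * u ** 2 * S_0 ** 2 * (1.0 + z) ** 2)

# Scan a grid of redshifts for the turning point of the area distance.
redshifts = [0.01 * i for i in range(1, 501)]
z_turn = max(redshifts, key=area_distance)
print(z_turn, area_distance(z_turn), flux(1.0))
```

In this model the area distance has a maximum near $z = 1.25$, so more distant objects of a fixed physical size subtend larger angles beyond that redshift, while the flux decreases monotonically.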

The number of objects in a solid angle $d\Omega$ for a distance
increment $du$ (characterized by increments $dz$,
$dm$ in the observable variables $z$ and $m$) is given by

$$dN = n(t_e)S^3(t_e)f^2(u)\,du\,d\Omega, \tag{22}$$


where $n(t_e)$ is the number density of objects at the time
of emission. The observed total number $N$ of objects
measured in a survey is given by integrating from the
observer to the survey limit: in terms of the radial coordinate
$r_e$ of the source (which can be related to redshifts
or magnitudes), $N = \int_{r_0}^{r_e} dN$. If the number of
objects is conserved, $n(t_e) = n(t_0)(1 + z)^3$, and we
find from (22) that

$$N(t_e) = n(t_0)S^3(t_0)\,d\Omega \int_{r_0}^{r_e} f^2(u)\,du. \tag{23}$$


A measure of numbers as a function of redshift or other 
distance indicators will therefore determine the num- 
ber density of objects, which is then related to their 
mass and so to the energy density of these objects. 

As $f(u)$ is determined by the Friedmann equation,
the above equations enable us to determine observa- 
tional relations between observable variables for any 
specific cosmological model and therefore to observa- 
tionally test the model. 

3.3 Causal and Visual Horizons 

A fundamental feature affecting the formation of struc- 
ture and our observational situation is the limits that 
arise because causal influences cannot propagate at 
speeds greater than the speed of light. The region that 
can causally influence us is therefore bounded by our 
past null cone. Combined with the finite age of the uni- 
verse, this leads to the existence of particle horizons 
that limit the part of the universe with which we can 
have had causal connection. 

A particle horizon is, by definition, composed of the 
limiting worldlines of the furthest matter that ever 
intersects our past null cone. This is the limit of mat- 
ter that we can have had any kind of causal contact 
with since the start of the universe, which by (18) is 
characterized by the comoving radial coordinate value 

$$u_{\rm ph} = \int_0^{t_0} \frac{dt}{S(t)}. \tag{24}$$

The present physical distance to the matter constituting
the horizon is

$$d_{\rm ph} = S(t_0)u_{\rm ph}.$$

The key question is whether the integral (24) converges
or diverges as we go to the limit of the initial singularity
where $S \to 0$. Horizons will exist in standard FL cosmologies
for all ordinary matter and radiation because $u_{\rm ph}$
will be finite in those cases; for example, in the Einstein-de Sitter
universe, $u_{\rm ph} = 3t_0^{1/3}$, $d_{\rm ph} = 3t_0$. We will then
have seen only a fraction of what exists, unless we live 
in a universe with spatially compact sections so small 
that light has indeed had time to traverse the whole uni- 
verse since its start; this will not be true for universes 
with the standard simply connected topology. 
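The convergence question can be explored numerically. The sketch below is an illustration under stated assumptions rather than anything from the original text: it evaluates the horizon integral from a small cutoff time up to $t_0 = 1$, using the substitution $t = e^x$ and the trapezoidal rule. It compares the Einstein-de Sitter scale factor $S = t^{2/3}$ with a purely hypothetical linear expansion $S = t$, for which the integral grows without bound as the cutoff shrinks. All function names are invented for the example.

```python
import math

def horizon_integral(scale_factor, t_min, t0, n=20000):
    """Trapezoidal estimate of the integral of dt / S(t) from t_min to t0, in x = ln t."""
    a, b = math.log(t_min), math.log(t0)
    h = (b - a) / n

    def integrand(x):
        t = math.exp(x)
        return t / scale_factor(t)  # dt = t dx

    total = 0.5 * (integrand(a) + integrand(b))
    for i in range(1, n):
        total += integrand(a + i * h)
    return total * h

def einstein_de_sitter(t):
    return t ** (2.0 / 3.0)

def linear_expansion(t):
    # Hypothetical comparison case, not a model discussed in the text.
    return t

u_eds = horizon_integral(einstein_de_sitter, 1e-12, 1.0)
u_lin = horizon_integral(linear_expansion, 1e-12, 1.0)
print(u_eds, u_lin)
```

With the cutoff at $10^{-12}$ the Einstein-de Sitter value is already within about $3 \times 10^{-4}$ of the limit $3t_0^{1/3} = 3$, whereas the linear case returns $\ln(10^{12})$ and keeps growing as the cutoff is reduced.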

Penrose’s powerful use of conformal methods gives a 
very clear geometrical picture of the nature of horizons. 
One switches to coordinates where the null cones are 
at ±45°; the initial singularity (the start of the universe) 
is then represented as a boundary to space-time. If this
is a space-like surface, then there exist galaxies beyond
our causal horizon; we can receive no signals from them
whatever (figure 3).

Figure 3 Particle horizons in a world model (reproduced
from Hawking and Ellis (1973), with permission). “O” is the
observer (at the origin); the horizontal line at the bottom is
the start of the universe. Here and now is the event “P.”

The importance of horizons is twofold: they under- 
lie causal limitations that are relevant to the origin of 
structure and uniformity, and they represent absolute 
limits on what is testable in the universe. Many present- 
day speculations about the superhorizon structure of 
the universe (e.g., chaotic inflationary theory) are not 
observationally testable because one can obtain no def- 
inite information whatever about what lies beyond the 
visual horizon. This is one of the major limits to be 
taken into account in our attempts to test the veracity 
of cosmological models. 

3.4 Structure Formation 

A large amount of work in cosmology is concerned 
with structure formation: how galaxies come to exist 
by gravitational attraction acting on primordial pertur- 
bations. This raises a key technical issue, the choice 
of perturbation gauge, which is best solved by using 
a gauge-free perturbation formalism. When the pertur- 
bations become nonlinear, numerical simulations are 
needed to study the details of structure formation. 


4 Issues 

This article has emphasized the significance of geom- 
etry and topology in general relativity theory: a fasci- 
nating study of the interplay of geometry, analysis, and 
physics. Further topics I have not mentioned include 
how one derives the Newtonian limit of the general rela- 
tivity equations; gravitational lensing; the emission and 
detection of gravitational radiation; the key issue of the 
nature of dark energy in cosmology, which has led to 
the acceleration of the universe in recent times; and the 
issue of dark matter— nonbaryonic matter that domi- 
nates structure formation and the gravitational dynam- 
ics of clusters of galaxies. A key requirement is to test 
the EFEs in every possible way, to try alternative theo- 
ries of gravitation to see if any of them work better than 
Einstein's theory. So far it has withstood these tests; it 
is the best gravitational theory available. 

V.1 The Mathematics of Adaptation
(Or the Ten Avatars of Vishnu)

David Krakauer and Daniel N. Rockmore

1 Adaptation and Selection 

In a celebrated letter to his sister and theologian 
Simone Weil, dated March 1940, André Weil (writing
from a military prison in Paris) expounded on the pow- 
erful yet transient role that analogy plays in math- 
ematics as a precursor to unification. He does this 
through another layer of analogy, that of the avatars 
of Vishnu, each of which reveals different facets of
the deity. Weil had in mind various connections— yet 
to be supported by theorem and proof— that he was 
finding between problems in number theory and alge- 
braic geometry, connections that today are encoded 
in one of the grandest current efforts in mathemat- 
ics: the pursuit of the Langlands program, a Vishnu of 
mathematics. The value of analogy in pure mathemat- 
ics in promoting identification of common foundations 
extends into its dealing with physical reality (even this 
aspect is evident in the Langlands program, realized 
through recent connections made with quantum field 
theory). This was already understood by Poincaré when
he wrote that “the mathematical facts worthy of being 
studied are those which, by their analogy with other 
facts, are capable of leading us to the knowledge of a 
physical law.” 

The pursuit of a “biological law” (or even agreement 
on what would define one) is proving to be some- 
thing of a challenge. But adaptation may be a con- 
text that supports such an achievement. Loosely con- 
strued, it is the collection of dynamics yielding evolved 
or learned traits that contribute positively to survival 
and propagation. Adaptation assumes a bewildering 
array of “avatars,” spanning evolution by natural selec- 
tion, reinforcement learning, Bayesian inference, and 


supervised machine learning, of which neural networks 
are perhaps the most explicitly adaptive. In all of 
these instances adaptation implies some element of 
(1) design or optimization, (2) differential selection or 
success, and (3) historical correlation between evolved 
trait and environment that grows stronger through 
time. Each of these elements has inspired mathemat- 
ical descriptions, and it is only in the past decade that 
we have become aware that several fields concerned 
with the formal analysis of adaptation are in fact deeply 
analogous (an awareness that has been engendered by 
mathematics) and that these analogies suggest a form 
of unification that might be codified as a “biological 
law.” 

It is very difficult to discuss adaptation without 
considering natural selection, and in many treatments 
they are effectively synonymous. Stephen Jay Gould 
bemoaned the ambiguity whereby the word adapta- 
tion can be interpreted as both noun and verb. The 
noun adaptation refers to the constellation of prop- 
erties that confer some locally optimal character on 
an agent, whereas the verb adaptation describes the 
process by which agents adapt through time to their 
environments. It is the latter active process that has 
attracted the greater mathematical effort and includes 
the often counterintuitive properties that are found in 
population dynamics. Adaptation at equilibrium (the 
noun form) simply requires that optimality be demon- 
strated, often through variational techniques derived 
from engineering. Examples include the optical prop- 
erties of the lens, the excitability properties of neu- 
ral membranes, and the kinematic properties of limbs 
and joints. They all fall into this equilibrium class of 
adaptive explanation and are as diverse in their mathe- 
matical treatments as material constraints demand. In 
this article, however, we focus on the genuinely novel 
dynamical features of the adaptive process, the extraor- 
dinary diversity of systems in which they continue to 
be realized, and the associated mathematical ideas. 
The context is generally one of populations evolving 
in distribution, and in this broad review we touch on
the techniques of information geometry, evolutionary 
game theory, and mixtures of discrete dynamics and 
probability theory. 

2 Selection Systems 

2.1 Continuous Selection for Diploid Species 

A historically accurate place to start exploring adap- 
tation is with the continuous selection equation for 
diploid genomes from population genetics. In this set- 
ting we are interested in the time evolution of the dis- 
tribution of genomic structures in the population as 
influenced by both endogenous and exogenous factors. 

Consider a single coding site in a genome, which can
assume multiple, discrete genetic forms or alleles, $A_i$.
For a diploid organism (one whose genetic profile is
encoded in two copies of each gene), each site will be
represented by a pair of values $(A_i, A_j)$. The time evolution
of a genome in terms of the frequency in the
population of the various alleles is determined by the 
coefficients of its net growth rate in competition with a 
population of genomes. This growth rate is set by the 
so-called Malthusian fitness parameter. 

Let us denote the frequency of allele $i$ by $x_i \in [0,1]$
and collect these frequencies in the vector $x$, which then
has nonnegative elements that sum to 1.

Adaptive evolution proceeds according to 

$$\dot{x}_i = x_i((Mx)_i - x^T M x), \tag{1}$$

where $M$ is the matrix of real-valued Malthusian parameters
for the set of $n$ genomes. A positive derivative
reflects an increasing frequency of an allele and is associated
with a higher than average value for a given type
in the $M$ matrix. Rewritten, with $m_i(x) = (Mx)_i$, this
gives the so-called replicator equation

$$\dot{x}_i = x_i(m_i(x) - \bar{m}(x)),$$

with $\bar{m}(x) = \sum_i x_i m_i(x)$. This is a differential equation
on the invariant simplex $S_n$ whose points represent
all possible distributions of alleles, and it has deep correspondences
across a range of fields. We shall come
back to the importance of the replicator equation once 
we have described a few more properties of the more 
general selection equation. 
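To make the replicator dynamics concrete, the following sketch integrates them with a forward-Euler scheme, which is a crude method chosen only for brevity. The $2 \times 2$ symmetric matrix is an illustrative assumption, not taken from the text; for it the mean fitness $x^T M x = 1 + 2x_1x_2$ is maximized at the uniform distribution.

```python
def replicator_rhs(x, M):
    """Right-hand side x_i * ((M x)_i - x^T M x) of the selection equation."""
    n = len(x)
    Mx = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    mean_fitness = sum(x[i] * Mx[i] for i in range(n))
    return [x[i] * (Mx[i] - mean_fitness) for i in range(n)]

def simulate(x0, M, dt=0.01, steps=5000):
    """Forward-Euler integration of the replicator equation from x0."""
    x = list(x0)
    for _ in range(steps):
        dx = replicator_rhs(x, M)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x

# Illustrative symmetric Malthusian matrix (an assumption for this example).
M = [[1.0, 2.0],
     [2.0, 1.0]]
x_final = simulate([0.9, 0.1], M)
print(x_final, sum(x_final))
```

Because the components of the right-hand side sum to zero, the trajectory stays on the simplex, and for this matrix it approaches the interior equilibrium $(1/2, 1/2)$.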

The dynamics of (1) is determined by the properties
of its stable equilibria. These are conventionally
analyzed through the (weighted) incidence matrix of
an undirected graph, which can serve as the matrix of
Malthusian parameters defined by

$$m_{ij} = \begin{cases} 1 & \text{if } i \text{ and } j \text{ are joined } (i \neq j), \\ 0 & \text{if } i \text{ and } j \text{ are not joined}, \\ \tfrac{1}{2} & \text{if } i = j. \end{cases}$$


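Consider the subsimplex with index set $I = \{1, \ldots, k\}$, taken here to be
a set of pairwise joined vertices so that $m_{ij} = 1$ for distinct $i, j \in I$.
The barycenter of the face $S_n(I)$ is defined as $p = (1/k, \ldots, 1/k,
0, \ldots, 0)$ (the average of extremal coordinates). The
mean fitness restricted to this face is

$$x^T M x = 1 - \tfrac{1}{2}\sum_{i \in I} x_i^2.$$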


Over the set of possible frequency vectors (having only
positive entries with sum 1) this takes its maximal value
$1 - 1/(2k)$ at $p$, which shows that $p$ is asymptotically
stable within $S_n(I)$. If $i \notin I$, then there is some $j \in I$
with $m_{ij} = 0$. Hence,

$$(Mp)_i \le p^T M p.$$


Thus, $p$ is not only saturated (maximal) but also asymptotically
stable in $S_n$. Furthermore, for any system with
n - i alleles there are at most 2' stable equilibria. 
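The graph construction can be checked on a small case. The sketch below assumes a path graph on three vertices (indexed from 0 in the code) and takes the pair $\{0, 1\}$, which is joined, as the index set with $k = 2$; the graph and the function names are illustrative choices, not from the text.

```python
def malthusian_from_graph(n, edges):
    """m_ij = 1 if i and j are joined, 0 if not joined, and 1/2 on the diagonal."""
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        M[i][i] = 0.5
    for i, j in edges:
        M[i][j] = M[j][i] = 1.0
    return M

def mat_vec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

def mean_fitness(M, x):
    return sum(xi * yi for xi, yi in zip(x, mat_vec(M, x)))

M = malthusian_from_graph(3, [(0, 1), (1, 2)])  # path graph 0 - 1 - 2
p = [0.5, 0.5, 0.0]                              # barycenter of the face for {0, 1}
fitness_at_p = mean_fitness(M, p)                # compare with 1 - 1/(2k) for k = 2
payoff_outside = mat_vec(M, p)[2]                # (Mp)_i for the vertex outside the set
print(fitness_at_p, payoff_outside)
```

Here the mean fitness at the barycenter equals $1 - 1/(2k) = 0.75$, and the allele outside the set earns only $0.5$ against $p$, so it cannot invade.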

In this context, adaptation is a description of the 
changing frequencies of competing “species” or geno- 
types whose stable fixed points are governed by the val- 
ues of a Malthusian matrix. This matrix is assumed to 
encode all of those factors that contribute to both mor- 
tality and increased viability or replicative potential. 

A natural way to think about the process of adapta- 
tion is in terms of a flow on an appropriate manifold or 
the setting of information geometry. This framework 
was first introduced to population dynamics and the 
study of adaptation by Shahshahani. It is most easily 
treated for the symmetric case of the replicator equation.
If we let $\dot{x}_i = m_i(x) = \partial V/\partial x_i$ be a Euclidean gradient
vector field on $\mathbb{R}^n$, then the replicator equation is
a Shahshahani gradient on the interior of $S_n$ with the
same potential function $V$. To be specific, the rate of change
of $V$ is equal to the variance of the values of $m_i(x)$ of
the replicator equation:

$$\dot{V}(x) = \sum_{i=1}^{n} x_i (m_i(x) - \bar{m}(x))^2.$$

This result is referred to as Fisher's fundamental the- 
orem of natural selection, wherein the change in the 
potential — or, obversely, the increase in the mean popu- 
lation fitness — is proportional to the variance in fitness. 
Hence, populations will always increase their adapted- 
ness assuming that there remains latent variability in a 
population. 
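The variance identity can be verified numerically. The sketch below assumes a symmetric matrix $M$, for which $V(x) = \tfrac{1}{2}x^T M x$ has gradient $Mx$, and compares a finite-difference estimate of the rate of change of $V$ along the replicator flow with the variance of the fitness values. The matrix, the starting point, and the function names are illustrative assumptions.

```python
def fitness(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

def mean_fitness(M, x):
    return sum(xi * mi for xi, mi in zip(x, fitness(M, x)))

def potential(M, x):
    """V(x) = x^T M x / 2, whose gradient is M x when M is symmetric."""
    return 0.5 * mean_fitness(M, x)

def fitness_variance(M, x):
    m, m_bar = fitness(M, x), mean_fitness(M, x)
    return sum(xi * (mi - m_bar) ** 2 for xi, mi in zip(x, m))

def replicator_velocity(M, x):
    m, m_bar = fitness(M, x), mean_fitness(M, x)
    return [xi * (mi - m_bar) for xi, mi in zip(x, m)]

M = [[1.0, 2.0, 0.0],
     [2.0, 1.0, 3.0],
     [0.0, 3.0, 1.0]]   # illustrative symmetric matrix
x = [0.2, 0.5, 0.3]
h = 1e-6                # small time step for the finite difference
x_next = [xi + h * vi for xi, vi in zip(x, replicator_velocity(M, x))]
dV_dt = (potential(M, x_next) - potential(M, x)) / h
variance = fitness_variance(M, x)
print(dV_dt, variance)
```

The two numbers agree to within the finite-difference error, and both vanish only when every type present has the same fitness.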

2.2 Information and Adaptation: Game Dynamics

We have so far considered diploid population genet- 
ics, but this framework can as easily be thought of as 
describing haploid (single-genome, or “asexual,” speciation)
game dynamics. In this case the entries $m_{ij}$ of
the Malthusian matrix $M$ describe payoffs associated
with competition between an agent $A_i$ and an agent $A_j$
that meet with a probability $x_i x_j$. In the case of payoffs
that are linear in the frequencies, the stable fixed
point of the dynamics is given by a distribution $\hat{x}$ in
$S_n$. This is called an evolutionarily stable state (ESS) of
the replicator equation if $\hat{x}^T M x > x^T M x$ for all $x \neq \hat{x}$ in the local
neighborhood of $\hat{x}$.

The ESS framework allows us to define an alternative 
potential function, or Lyapunov function for adapta- 
tion, through the use of information-theoretic concepts. 
We will assume that we are dealing with vector quanti- 
ties at steady state, and for frugality we will drop the 
bold notation for vectors. One can think of the poten- 
tial as “potential information” (a connection hinted at in 
the Shahshahani connection mentioned above) through 
use of divergence functions. In particular, let 

$$V(x) = \sum_i \hat{x}_i \log \hat{x}_i - \sum_i \hat{x}_i \log x_i. \tag{2}$$

This is known as the relative entropy [IV.36 §8] or
Kullback-Leibler divergence between the two distributions
and is also denoted as $D_{\rm KL}(\hat{x} \| x)$. Note that it is
not symmetric in the arguments.

In a statistical setting it is thought of as the information
that is lost when approximating $\hat{x}$ with $x$, and
it quantifies in log units the additional information
required to perfectly match the “optimal,” “true,” or
desired distribution. Differentiating with respect to time
yields

$$\dot{V}(x) = -\sum_i \hat{x}_i \frac{\dot{x}_i}{x_i}.$$

Recall that the replicator equation has the form $\dot{x}_i =
x_i(m_i(x) - \bar{m}(x))$, which upon substitution into the
energy function gives us

$$\dot{V}(x) = -\sum_i \hat{x}_i m_i(x) + \sum_i \hat{x}_i \bar{m}(x).$$

Since the dynamics are restricted to the simplex,
$\sum_i \hat{x}_i = 1$, and given the definition of the mean $\bar{m}(x)$,
we obtain

$$\dot{V}(x) = -(\hat{x}^T M x - x^T M x) < 0$$

as long as the true distribution $\hat{x}$ is an ESS. For an ESS,
the replicator equation will cause a system to adapt
toward the fitness function as encoded in $\hat{x}$, and it


will do so by extracting all information present in the 
“strategic environment” and storing this information in 
the population of agents. 
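The Lyapunov property can be observed in a small simulation. The sketch below uses an illustrative two-strategy payoff matrix (an assumption, of hawk-dove type) whose ESS is $\hat{x} = (1/2, 1/2)$, since for it $\hat{x}^T M x - x^T M x = 2(x_1 - \tfrac{1}{2})^2$. It records the Kullback-Leibler divergence from the ESS along a forward-Euler trajectory of the replicator equation.

```python
import math

def replicator_step(x, M, dt):
    """One forward-Euler step of the replicator equation."""
    n = len(x)
    m = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    m_bar = sum(xi * mi for xi, mi in zip(x, m))
    return [xi + dt * xi * (mi - m_bar) for xi, mi in zip(x, m)]

def kl_divergence(x_hat, x):
    """Relative entropy of x_hat with respect to x, in natural log units."""
    return sum(a * math.log(a / b) for a, b in zip(x_hat, x) if a > 0.0)

M = [[0.0, 3.0],
     [1.0, 2.0]]        # illustrative payoffs with an interior ESS
x_hat = [0.5, 0.5]
x = [0.9, 0.1]
history = [kl_divergence(x_hat, x)]
for _ in range(2000):
    x = replicator_step(x, M, 0.01)
    history.append(kl_divergence(x_hat, x))
print(history[0], history[-1])
```

The recorded divergence decreases at every step and tends to zero as the population approaches the ESS.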

We can also make the connection to information 
theory another way, through an identity with mutual 
information: 

$$I(X; \hat{X}) = D_{\rm KL}(p(x, \hat{x}) \| p(x)p(\hat{x})),$$

where $D_{\rm KL}$ is defined as in (2). As populations evolve
they increase the information that they share with their 
environments, moving away from statistical indepen- 
dence. 

One immediate value of an explicitly informational 
approach is that it can accommodate adaptive dynam- 
ics that do not reduce variance but tend to increase 
it. For example, the rock-paper-scissors game has a 
Nash equilibrium given by the uniform distribution 
over strategies. Fisher’s fundamental theorem is vio- 
lated because fitness is now determined by the popula- 
tion composition of strategies, and not by an invariant 
fitness landscape. 

3 Evolution, Optimization, 
and Natural Gradients 

In the previous discussion we have focused on adap- 
tation in biological systems. The form of the equa- 
tions of motion has a natural basis in the interac- 
tion between replication and competition. The adaptive 
process was thought of as a variational, i.e., energy- 
minimizing, dynamic that could be described in terms 
of divergence measures on statistical manifolds. We 
can, however, arrive at the same structures by consid- 
ering nonlinear optimization. In this form, rather than 
emphasize biological properties we emphasize a form 
of hill-climbing algorithm suitable for dealing with non- 
convex optimization or parameter estimation problems 
based on the so-called natural gradient of information 
geometry. 

Standard gradient-based optimization assumes a 
locally differentiable function with well-behaved first 
and second derivatives and a search space that is 
isotropic in the sense that any departure from the opti- 
mum leads to an equivalent reduction in the gradient. 
Discrete-time gradient descent therefore updates solutions
(which are still population frequencies in evolutionary
systems) $X(t)$ through a recursive algorithm:

$$X(t + 1) = X(t) - k\nabla m(X(t)).$$

These methods work reasonably well for single- 
peaked functions in Euclidean spaces. However, for 
many real-world problems, objective functions are
many-peaked, nonisotropic, and live in non-Euclidean 
spaces. This broader setting means that gradient meth- 
ods search space according to the Riemannian struc- 
ture of probabilistic parameter spaces and employ the 
Fisher metric as the natural measure of distance. This 
provides information about the curvature of a statisti- 
cal manifold and promotes more efficient adaptation to 
the optimum along an appropriate geodesic. 

The updating rule for natural gradients includes 
additional geodesic information, 

$$X(t + 1) = X(t) - kF_X^{-1}(t)\nabla m(X(t)),$$


encoded in the Fisher metric tensor or Fisher informa- 
tion matrix F. It is derived directly from the infinites- 
imal Kullback-Leibler divergence (shown above to be 
a suitable Lyapunov function for adaptation) between 
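probability distributions $x$ and $y$:

$$D_{\rm KL}(x \| y) = \int x \ln\left(\frac{x}{y}\right) dx.$$

This divergence is asymmetric. A true distance can be
defined locally. When we expand $D_{\rm KL}$ to a second-order
Taylor polynomial,

$$D_{\rm KL}(x \| y) = -\int x \ln\left[1 + \left(\frac{y}{x} - 1\right)\right] dx \approx -\int x\left(\frac{y}{x} - 1\right) dx + \frac{1}{2}\int x\left(\frac{y}{x} - 1\right)^2 dx,$$

we see that the first-order term tends to zero for $y =
x(1 + \epsilon)$. This leaves us with

$$D_{\rm KL}(x \| y) \approx \frac{1}{2}\int x\left(\frac{y}{x} - 1\right)^2 dx.$$

Note that $\ln(y) - \ln(x) = \ln(1 + \epsilon) \approx \epsilon$, and that
$\ln(y) - \ln(x)$ is simply the derivative of $\ln(x)$ in the
$\epsilon$-direction, $\partial_\epsilon \ln x$. Hence,

$$D_{\rm KL}(x \| y) \approx \frac{1}{2}\int x(\ln(y) - \ln(x))^2\, dx \approx \frac{1}{2}\int x\,(\partial_\epsilon \ln x)(\partial_\epsilon \ln x)\, dx.$$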

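We are interested in some finite distance between
proximate distributions induced by some $\epsilon$-directional
perturbation, $y = x(1 + v^i\epsilon_i)$:

$$F_{ij}(y) = \frac{1}{2}\int y\,\frac{\partial \ln y}{\partial \epsilon_i}\frac{\partial \ln y}{\partial \epsilon_j}\, dy.$$

By convention, using the definition of partial information,
$q = -\ln(y)$, the Fisher metric can be written
as an expectation value of a quadratic form:

$$F_{ij}(y) = \frac{1}{2}\int \frac{\partial^2 q(y)}{\partial \epsilon_i \partial \epsilon_j}\, y\, dy = \frac{1}{2}\left\langle \frac{\partial^2 q(y)}{\partial \epsilon_i \partial \epsilon_j} \right\rangle,$$

where the expectation value of $g(y)$ is

$$\langle g(y) \rangle = \int g(y)\, y\, dy.$$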

Technically, the $F_{ij}(y)$ values are the elements of a
positive-definite matrix or Riemannian metric tensor. 
This tensor captures the curvature of a manifold in 
N-dimensional space. In Euclidean space this matrix 
reduces to the identity matrix. Hence, any nonlinear 
optimization that makes use of the true geometry of a 
“fitness” function will respect the dynamics of the repli- 
cator equation describing trajectories that minimize 
distances on statistical manifolds. 
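The link between the natural gradient and the replicator equation can be seen in the simplest case of two types. The sketch below is an illustration under stated assumptions: the population is described by a single frequency $\theta$, the Fisher information of the corresponding Bernoulli distribution is $1/(\theta(1 - \theta))$, and the two types have fixed fitnesses $f_1$ and $f_0$ (illustrative values). The step is written as an ascent on mean fitness, so its sign is opposite to the descent convention used in the update rules above.

```python
F1, F0 = 2.0, 1.0   # illustrative fixed fitnesses of the two types

def mean_fitness_gradient(theta):
    """d/dtheta of the mean fitness theta * F1 + (1 - theta) * F0."""
    return F1 - F0

def fisher_information(theta):
    """Fisher information of a Bernoulli distribution with parameter theta."""
    return 1.0 / (theta * (1.0 - theta))

def plain_gradient_step(theta, k):
    return theta + k * mean_fitness_gradient(theta)

def natural_gradient_step(theta, k):
    return theta + k * mean_fitness_gradient(theta) / fisher_information(theta)

def replicator_euler_step(theta, k):
    m_bar = theta * F1 + (1.0 - theta) * F0
    return theta + k * theta * (F1 - m_bar)

theta_nat = theta_plain = 0.1
for _ in range(100):
    theta_nat = natural_gradient_step(theta_nat, 0.1)
    theta_plain = plain_gradient_step(theta_plain, 0.1)
print(theta_nat, theta_plain)
```

The natural-gradient step coincides with an Euler step of the replicator equation and keeps $\theta$ inside the unit interval, whereas the plain gradient step ignores the geometry of the simplex and leaves it.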

Adaptive dynamics are, therefore, precisely those tra- 
jectories on statistical manifolds that can be said to 
be maximizing the extraction of information from the 
environment. 


4 Discrete and Stochastic Considerations 


The logic of the continuous systems extends to the dis- 
crete case, but the delays intrinsic to maps introduce 
the possibility of periodic and chaotic behavior. The 
discrete replicator equation, or replicator mapping, has 
the form 

$$x_i(t + 1) = x_i(t) + x_i(t)((Mx(t))_i - x(t)^T M x(t)).$$


Even the low-dimensional, two-player discrete game 
exhibits a full range of complex dynamical behavior. If 
we consider the parametrized payoff matrix 


we can capture a range of so-called evolutionary dilem- 
mas including the “hawk-dove game” (T > 1, S > 0), 
the “stag hunt game” (T < 1, S > 0), and the “pris- 
oner’s dilemma” (T > 1, S < 0). As a general rule, small 
or negative values of S and T induce periodic solu- 
tions, whereas large values of S and T produce chaotic 
solutions. In the constrained case where T = S = A, 
systematically increasing the value of A leads to the 
period-doubling route to chaos. 

In population genetics, the game selection model is 
typically written using the equivalent map 


$$x_i(t) = x_i(t - 1)\,\frac{\sum_j m_{ij} x_j(t - 1)}{\sum_{p,q} m_{pq}\, x_p(t - 1)\, x_q(t - 1)}.$$


Discreteness is important for adaptation. These re- 
sults illustrate some of the challenges of evolving reli- 
able strategies when fitness is frequency dependent. 
Deterministic chaos will severely limit the long-term 
stability of a lineage. 
When models incorporate stochasticity, even greater
instability ensues. The deterministic results apply most 
strongly in the limit of large populations overcoming 
the phenomenon of “neutral drift.” If the difference in 
fitness values or payoff values between strategies is less
than $O(1/N)$, then the dynamics are described by an $N$-dimensional
random walk with the frequencies 0 and 1
acting as absorbing states. In small populations, sam- 
pling drift limits adaptation in the sense that we can 
no longer assume that the dynamics will minimize an 
appropriate energy function. The smaller the popula- 
tion, the greater the variance in the Malthusian param- 
eters or payoff differences required in order to increase 
fitness. Stochasticity is treated formally in both the 
“neutral” and “near-neutral” theories of evolution. 
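Neutral drift toward an absorbing state can be illustrated with a short simulation. The sketch below assumes Wright-Fisher resampling for two equally fit types, which is one common model of sampling drift and is not named in the text; the population size, seed, and function name are arbitrary choices for the example.

```python
import random

def wright_fisher_neutral(N, x0, rng, max_generations=100000):
    """Resample N individuals each generation with no fitness difference.

    Returns the final frequency of the focal type and the number of
    generations simulated before absorption (or before the cap is hit).
    """
    count = round(x0 * N)
    generation = 0
    while 0 < count < N and generation < max_generations:
        p = count / N
        count = sum(1 for _ in range(N) if rng.random() < p)
        generation += 1
    return count / N, generation

rng = random.Random(0)  # fixed seed so the run is reproducible
final_freq, generations = wright_fisher_neutral(N=20, x0=0.5, rng=rng)
print(final_freq, generations)
```

Even though neither type has an advantage, in a population this small one of them is lost after a modest number of generations, and the frequency ends at 0 or 1.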

4.1 Innovation Dynamics

The systems considered in the previous sections do 
not attempt to explain the “origin” of adaptations but 
instead focus on the efficacy of selection in promoting 
adaptation within an invariant population. In order to 
explore the origin of adaptive novelty, we need to intro- 
duce an intrinsic source of variability. In evolutionary 
theory this is accomplished by means of mutation and 
recombination operators. The natural extension of the 
replication equation to include mutation leads to the 
quasispecies equation 

$$\dot{x}_i = \sum_j q_{ij}\,m_j\,x_j - x_i\,\bar{m}(x).$$

The operator $Q$ is a doubly stochastic matrix of transition
probabilities encoding mutations from a strain
$x_j \to x_i$. Geometrically, this operator ensures that the
boundary of the simplex $S_n$ is no longer invariant: trajectories
can flow from outside of the positive orthant
into $S_n$ in such a way that new strains might emerge
from existing strains that were not present in the initial
population distribution.

In much the same way that neutral drift can eliminate 
adaptation, excessive mutation can abrogate hill climb- 
ing, replacing selection with diffusion over the simplex 
defined by the mutation kernel Q. This is known as the 
“error threshold.” If each strain is encoded by a binary
sequence of length $L$, then assuming a uniform error
rate in transmission across each string, the transition
probabilities can be written in binomial form as

$$q_{ij} = p^{d_{ij}}(1-p)^{L-d_{ij}},$$

where $d_{ij}$ is the Hamming distance between strains
and $p$ is the per bit per generation transition probability.
For any choice of fitness function, the regime
$p > 1/L$ will completely “flatten” the landscape, eliminating
adaptation altogether. This result serves to illustrate
the costs and benefits of mutational transformation.
If mutation is too low, there is a risk of extinction 
arising from the inability to adapt to a novel environ- 
ment, whereas if mutation is too high, there is a risk of 
extinction arising from the complete loss of the adap- 
tive capability through the dissipation of phylogenetic 
memory. 
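
The binomial mutation kernel can be tabulated directly for short sequences. The sketch below assumes strains are all binary strings of a given length; the function names `hamming` and `mutation_kernel` and the values $L = 3$, $p = 0.1$ are chosen purely for illustration.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def mutation_kernel(length, p):
    """Matrix q[i][j] = p**d_ij * (1 - p)**(length - d_ij) over all binary strains."""
    strains = list(product((0, 1), repeat=length))
    kernel = []
    for a in strains:
        row = []
        for b in strains:
            d = hamming(a, b)
            row.append(p ** d * (1 - p) ** (length - d))
        kernel.append(row)
    return kernel

Q = mutation_kernel(length=3, p=0.1)
row_sums = [sum(row) for row in Q]
column_sums = [sum(col) for col in zip(*Q)]
print(row_sums[0], column_sums[0])
```

Because the Hamming distance is symmetric, the matrix is symmetric, so its rows and columns both sum to 1 (up to rounding), consistent with $Q$ being doubly stochastic.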

Recombination is formally more cumbersome but 
plays a role in modulating the rate of evolution within 
a population of recombining strains. For low rates 
of mutation, recombination tends to reduce popula- 
tion variability, leading to increased stability. For high 
rates of mutation that remain below the error thresh- 
old, recombination can tip the population over the 
error threshold and destabilize a population. Hence, 
rather like mutation, recombination is a double-edged 
adaptive sword. 

5 Adaptation Is Algorithmic 
Information Acquisition 

We have shown how adaptation represented through a 
symmetric form of the continuous selection equation 
(the replicator equation) can be thought of as a nat- 
ural gradient that minimizes a potential representing 
information. Adaptive systems are precisely those that 
extract information from their environments (taken to 
include other agents) so as to minimize uncertainty. 
This is made explicit through the fact that the KL diver- 
gence between the joint distribution of a pair of ran- 
dom variables and the product distribution of these 
variables is equal to the mutual information between 
these variables. Adaptation is therefore measured by 
the degree of departure from independence of organ- 
isms and their environments, or from each other. In 
this framework, adaptation continues (driving up the 
mutual information) until there is no further functional 
information to be acquired and, thus, no latent depen- 
dencies remaining to be discovered. This is unlikely to 
occur in reality as environments are rarely static over 
long intervals of time. The very general nature of adap- 
tation suggests that it might manifest in a variety of 
forms, and this is exactly what we find: adaptive dynam- 
ics possesses a number of “avatars,” each of which 
is a framework for obtaining information through the 
iteration of a simple growth and culling process. 
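
The identity between mutual information and the KL divergence from the product of marginals can be checked numerically. This is a minimal sketch for a discrete joint distribution stored as a nested list; the name `mutual_information` and the two example distributions are illustrative assumptions.

```python
import math

def mutual_information(joint):
    """KL divergence (in nats) between joint[x][y] and the product of its marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # organism and environment unrelated
dependent = [[0.5, 0.0], [0.0, 0.5]]        # organism state mirrors environment
print(mutual_information(independent), mutual_information(dependent))
```

The independent case gives zero, while the perfectly matched case gives $\ln 2$, the largest value available to a pair of binary variables.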

In this section we present a few of the mathematical 
avatars of replicator dynamics in order to demonstrate 
that adaptation is a ubiquitous nonlinear optimization
procedure across a range of fields, many of which
appear to be unrelated. This reduction to a canonical 
adaptive form constitutes evidence of functional equiv- 
alence with respect to information extraction in the ser- 
vice of increased prevalence, which should rightly be 
thought of as adaptive evolution. 


5.1 Bayesian Inference 


In 1763, Bayes’s posthumous paper “An essay towards
solving a problem in the doctrine of chances” was published
in the Philosophical Transactions of the Royal
Society. The paper presented a novel method for cal- 
culating probabilities with minimal or no knowledge of 
an event and indicated how repeated experiments can 
lead toward the confirmation of a conclusion. 

Bayes’s rule [V.11 §1] is the engine behind what
we now call Bayesian updating and, when appropri- 
ately written, it is revealed to be a natural mathe- 
matical encoding of adaptation. In other words, it is 
an information-maximizing dynamic induced on the 
simplex. In dynamical terms it is a discrete particle 
system that through repeated iteration converges on 
an effective estimate of a true underlying probability 
distribution. 

Whereas we think of adaptation in terms of the differ- 
ential success of organisms, Bayesian updating treats 
adaptation as the differential success of hypotheses. 

The basic updating equation is

$$P(X)_t = P(X)_{t-1}\,\frac{L(X)_{t-1}}{\langle L\rangle_{t-1}},$$

with the change in the concentration of the probability
mass $P(X)$ delivered around the highest likelihood
$L(X)$ values:

$$\Delta P(X)_t = \frac{L(X)_{t-1}}{\langle L\rangle_{t-1}}\,P(X)_{t-1} - P(X)_{t-1}
= \frac{P(X)_{t-1}}{\langle L\rangle_{t-1}}\bigl(L(X)_{t-1} - \langle L\rangle_{t-1}\bigr).$$

Note that the evolution of the probability distribution
in continuous time reduces to the simplest form of the
differential replicator equation:

$$\dot{x} = x\,(L_x - \langle L\rangle).$$

Bayes therefore constructed an adaptive avatar avant
la lettre of Darwin’s theory. A method for estimating
the probability of a hypothesis through trial and error
is the prequel to the differential replication of natural
selection.
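
The correspondence between Bayesian updating and discrete replicator dynamics can be sketched for a finite set of hypotheses. The code assumes the normalizing constant is the mean likelihood under the prior, $\langle L\rangle = \sum_X P(X)L(X)$; the function names and the numerical prior and likelihood values are hypothetical.

```python
def bayes_update(prior, likelihood):
    """Posterior P_t = P_{t-1} * L / <L>, with <L> the prior-weighted mean likelihood."""
    mean_l = sum(p * l for p, l in zip(prior, likelihood))
    return [p * l / mean_l for p, l in zip(prior, likelihood)]

def replicator_increment(prior, likelihood):
    """Change in probability mass written in replicator form: P (L - <L>) / <L>."""
    mean_l = sum(p * l for p, l in zip(prior, likelihood))
    return [p * (l - mean_l) / mean_l for p, l in zip(prior, likelihood)]

prior = [0.5, 0.3, 0.2]        # three competing hypotheses
likelihood = [0.1, 0.4, 0.7]   # likelihood of the observed data under each
posterior = bayes_update(prior, likelihood)
increment = replicator_increment(prior, likelihood)
print(posterior)
print([q - p for p, q in zip(prior, posterior)], increment)
```

The posterior minus the prior coincides with the replicator increment: hypotheses with above-average likelihood gain probability mass, and the rest lose it.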

5.2 Imitation 

In much the same way that Bayesian updating is an 
avatar of adaptation, learning through imitation simi- 
larly qualifies. Imitation is a cognitive strategy for arriv- 
ing at a pattern of behavior based on sampling a pop- 
ulation of model behaviors. After sufficient time has 
elapsed, the most prevalent behavior will be the one 
most frequently imitated. 

Assume that each individual $i$ varies in its strategy
choices $s_i$. Think of this as a vector specifying a mixed
strategy defining how individuals should interact. Interactions
between strategies $s_i$ and $s_j$ give rise to pairwise
payoffs or rewards $r_{ij}$. Reward information is used
to decide which strategy to adopt in subsequent interactions.
In particular, differences in payoffs lead to the
adoption of more successful strategies. The rate of imitation
leading strategy $j$ to imitate a strategy $i$, $s_j \to s_i$,
can therefore be written in matrix form:

$$f_{ij} = [(Rg)_i - (Rg)_j]_+,$$

where R is the matrix of rewards and g the vector 
of genotypic or strategic frequencies. Notice that this 
function is only nonzero in the positive half-space, so 
lower payoff strategies are not imitated. The population 
of players will evolve in time according to an imitation 
learning dynamics that is easily seen to reduce to the 
game dynamical form of the replicator equation: 

$$\dot{g}_i = g_i \sum_j (f_{ij} - f_{ji})\,g_j
= g_i \sum_j [(Rg)_i - (Rg)_j]\,g_j
= g_i\,[(Rg)_i - g \cdot Rg].$$
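
The reduction of imitation dynamics to the replicator equation can be verified numerically. In this sketch the reward matrix `R` and the frequency vector `g` are arbitrary illustrative values, and the helper names are assumptions; the only fact used beyond the formulas above is that the frequencies sum to 1.

```python
def matvec(R, g):
    """Matrix-vector product (R g)_i."""
    return [sum(r * x for r, x in zip(row, g)) for row in R]

def imitation_rhs(R, g):
    """g_i * sum_j (f_ij - f_ji) g_j with f_ij = max((Rg)_i - (Rg)_j, 0)."""
    payoff = matvec(R, g)
    n = len(g)
    f = [[max(payoff[i] - payoff[j], 0.0) for j in range(n)] for i in range(n)]
    return [g[i] * sum((f[i][j] - f[j][i]) * g[j] for j in range(n)) for i in range(n)]

def replicator_rhs(R, g):
    """g_i * ((Rg)_i - g . Rg)."""
    payoff = matvec(R, g)
    mean_payoff = sum(gi * pi for gi, pi in zip(g, payoff))
    return [gi * (pi - mean_payoff) for gi, pi in zip(g, payoff)]

R = [[1.0, 0.2, 0.5], [0.3, 0.8, 0.1], [0.6, 0.4, 0.9]]
g = [0.5, 0.3, 0.2]
print(imitation_rhs(R, g))
print(replicator_rhs(R, g))
```

The two right-hand sides agree because $[a]_+ - [-a]_+ = a$, so the positive-part truncation cancels once imitation in both directions is accounted for.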


5.3 Reinforcement Learning 

Skinner introduced the linear operator model of rein- 
forcement to explain schedules of behavioral extinction 
and acquisition, and this framework has been enor- 
mously influential, giving rise to a very large class of 
learning rules. Any behavior that is correct is rewarded, 
while incorrect behavior is punished. Over time, rein- 
forcement seeks to make correct behavior probable and 
incorrect behavior improbable. Consider a set of behaviors
$X_i$, each associated with a probability $x_i$. For each
behavior there is a reward $r_i$. The incremental change
in the probability of any action, with a learning rate
parameter $\alpha$, can be written as

$$\delta x_k = \alpha r_i(\delta_{ik} - x_k),$$

where $\delta_{ik}$ is the Kronecker delta [I.2 §2, table 3].
This gives the change in the probability of behavior $k$,
assuming that a single action $i$ was performed. The average
change of behavior takes into account the probability
of an action $x_i$:

$$\Delta x_k = \sum_i x_i\,\delta x_k.$$

Combining these two equations we deduce that

$$\Delta x_k = \sum_i x_i[\alpha r_i(\delta_{ik} - x_k)]
= \alpha \sum_i x_i r_i \delta_{ik} - \alpha x_k \sum_i x_i r_i
= \alpha x_k r_k - \alpha x_k \sum_i x_i r_i.$$

The term $\sum_i x_i r_i$ is simply the mean reward, $\langle r\rangle$, and
the change in behavior in continuous time is given by
the simple replicator equation

$$\dot{x}_k = x_k(r_k - \langle r\rangle).$$

As with Bayesian updating and imitation learn- 
ing, reinforcement learning is an algorithm for iter- 
atively forming an accurate estimate of an underly- 
ing probability distribution imposed through a learning 
environment. 
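
The averaging step in the linear operator model can be checked with a short computation. The probabilities, rewards, and learning rate below are hypothetical, as are the function names; the sketch simply evaluates the average update term by term and compares it with the closed replicator-like form $\alpha x_k(r_k - \langle r\rangle)$.

```python
def average_change(x, r, alpha):
    """Average of alpha * r_i * (delta_ik - x_k) over actions i drawn with probability x_i."""
    n = len(x)
    change = [0.0] * n
    for i in range(n):
        for k in range(n):
            kronecker = 1.0 if i == k else 0.0
            change[k] += x[i] * alpha * r[i] * (kronecker - x[k])
    return change

def replicator_form(x, r, alpha):
    """Closed form alpha * x_k * (r_k - <r>) with <r> the mean reward."""
    mean_reward = sum(xi * ri for xi, ri in zip(x, r))
    return [alpha * xk * (rk - mean_reward) for xk, rk in zip(x, r)]

x = [0.2, 0.5, 0.3]   # current behavior probabilities
r = [1.0, 0.0, 2.0]   # reward attached to each behavior
print(average_change(x, r, alpha=0.1))
print(replicator_form(x, r, alpha=0.1))
```

The changes sum to zero, so the updated values remain a probability distribution, and behaviors with above-average reward become more probable.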

6 Final Remarks 

In all that precedes we have focused on a very gen- 
eral context of adaptation and the related mathemat- 
ics. This was the most expansive framing of adaptation 
as a form of optimization subject to the constraints 
of probability. When presented in this way we observe 
that adaptation is not unique to biology but is a ubiqui- 
tous phenomenon in nature. One might go so far as to 
assert that the property of adaptation is the defining 
signature of life in the universe and that the mathe- 
matics of adaptation represents a form of algorithmic 
biological law. 

Many have explored the mathematics of complex 
adaptive systems, placing an emphasis on questions of 
“emergence.” These are systems for which, as Nobel
laureate Phil Anderson famously wrote, “more is different”
or “the whole is greater than the sum of the
parts,” a descriptor that suggests the variety of nonlinear
phenomena they comprise. The formal relationship
between emergence and adaptive dynamics remains
somewhat unclear.

Familiar examples of such systems run the gamut
from Ising models to economies. They are often char- 
acterized by hierarchical and multiscale structure as 
well as hysteresis. They have generally emerged over 
time and space, most through processes related to 
those that we have discussed, namely feedback and 
adaptation. 

The mathematical framework presented above is a
necessary component in the analysis of all such systems,
but not all analyses consider the process of evolution
explicitly; some instead prefer simulation approaches such
as agent-based models and genetic or evolutionary
programming. These involve the disciplines of programming
and consideration of more sophisticated
distributed algorithms.

Further Reading 

The bibliography below contains some excellent gen- 
eral resources on the ideas presented in this entry. 

The exposition of section 2.1 follows closely the 
exposition presented in Hofbauer and Sigmund (1998). 
Section 5 follows the contents of the review by Krakauer 
(2011), which places the study of selection mechanics in 
the larger context of the study of complex elaborations 
of Maxwell’s demon. 

V.2 Sport

Nicola Parolini and Alfio Quarteroni 


1 Introduction 

The increasing importance of mathematical model- 
ing is due to its improved ability to describe com- 
plex phenomena, predict the behavior of those phe- 
nomena, and, possibly, control their evolution. These 
improvements have been made possible by theoretical 
advances as well as advances in algorithms and com- 
puter hardware. In recent years mathematical models 
have received a lot of attention in the world of sport, 
where any (legal) practice that can improve perfor- 
mance is very welcome. By adopting appropriate math- 
ematical models and efficient numerical methods for 
their solution, the level of accuracy required for opti- 
mizing the performance of an athlete or a team can be 
guaranteed. 

Computational fluid dynamics (CFD) is the branch of 
computational mechanics that uses numerical meth- 
ods to simulate problems that involve fluid flows. In 
the past two decades, CFD has become a key design 
tool for Formula 1 cars. However, F1 is not the only
sport in which mathematical/numerical modeling has
been applied. In this article we discuss the results of
three research projects (one concerning sailing, one
rowing, and one swimming) undertaken by the authors
in recent years, with the objective of highlighting the 
role that mathematics and scientific computing can 
play in these fields. 

2 Model Identification 

In every physical process, phenomena of different 
types (mechanical, chemical, electrodynamical) act at 
the same time and interact with each other. How- 
ever, in a practical modeling approach it is often 
possible to limit the analysis to a specific physical 
aspect. In the case of fluid dynamics, a mathematical
model based on continuum mechanics principles,
given by the Navier-Stokes equations, is usually ade- 
quate. Even in this framework, further model identifica- 
tion efforts are required. Indeed, specific flow regimes 
(laminar/turbulent, compressible/incompressible, sin- 
gle/multiphase) demand different specializations of 
the original mathematical model. Moreover, a suitable
definition of the limited space/time computational
domain and suitable choices for the initial and boundary
conditions are required to obtain a well-posed
mathematical problem.

The hydrodynamics of a boat (or a swimmer) is well
described by an incompressible, turbulent, two-phase
flow model. Denoting by $\Omega$ the three-dimensional computational
domain around the boat (or the swimmer)
that is occupied by either air or water, the Navier-Stokes
equations [III.23] read as follows. For all $\mathbf{x} \in \Omega$
and $0 < t < T$,

$$\frac{\partial\rho}{\partial t} + \nabla\cdot(\rho\mathbf{u}) = 0, \tag{1}$$
$$\frac{\partial(\rho\mathbf{u})}{\partial t} + \nabla\cdot(\rho\mathbf{u}\otimes\mathbf{u}) - \nabla\cdot\tau(\mathbf{u},p) = \rho\mathbf{g}, \tag{2}$$
$$\nabla\cdot\mathbf{u} = 0, \tag{3}$$

where $\rho$ is the (variable) density, $\mathbf{u}$ the velocity, $p$
the pressure, $\mathbf{g} = (0,0,g)^T$ the gravitational acceleration,
and $\tau(\mathbf{u},p) = \mu(\nabla\mathbf{u} + \nabla\mathbf{u}^T) - pI$ denotes the
stress tensor with $\mu$ indicating the (variable) viscosity.
Equation (1) prescribes mass conservation, equation (2)
the balance of momentum, and equation (3)
the incompressibility constraint. As mentioned above,
equations (1)-(3) should be complemented with suitable
initial and boundary conditions.

Different approaches are available to simulate the 
free-surface dynamics of the water-air interface. In 
interface tracking methods, the interface is explic- 
itly reconstructed, while interface capturing techniques 
usually identify the interface as an (implicitly defined) 
isosurface of an auxiliary characteristic function. The 
flow solutions in air and water should satisfy the
interface conditions

$$\mathbf{u}_a = \mathbf{u}_w,$$
$$\tau_a(\mathbf{u}_a,p_a)\cdot\mathbf{n} = \tau_w(\mathbf{u}_w,p_w)\cdot\mathbf{n} + \kappa\sigma\mathbf{n},$$

which, respectively, enforce continuity of velocities and
equilibrium of forces on the interface. The surface tension
contribution $\kappa\sigma\mathbf{n}$, which depends on the coefficient
$\sigma$, the local curvature $\kappa$, and the interface normal
$\mathbf{n}$, is usually negligible in the applications considered
in this work.

For the simulation of turbulent flows, which are com- 
monly encountered in hydrodynamics applications, a 
turbulence model is usually adopted. These models 
are able to include the main effect of the turbulent 
nature of the flow without capturing all the space- 
and timescales that characterize the flow. They usually
result in an additional turbulent viscosity $\mu_T$ that is
added to the physical viscosity $\mu$ in the stress tensor $\tau$.
That is,

$$\tau = (\mu + \mu_T)(\nabla\mathbf{u} + \nabla\mathbf{u}^T) - pI.$$

The system of partial differential equations (1)-(3)
can be discretized using different numerical schemes.
For instance, the spatial discretization may be based
on finite-difference, finite-volume, or finite-element
methods. Here, the finite-volume method is considered.
The domain is subdivided into small control volumes
(forming a so-called computational grid) and the differential
equations (in their integral form) are imposed
locally on each control volume. In this way one can formulate
a discrete version of the problem (1)-(3) that,
at the end, leads to the solution of a (usually large) 
linear system. The dimension of the linear system is 
strictly related to the dimension of the computational 
grid. In turn, the size of the grid usually depends on the 
flow regime. In particular, with high-speed and multi- 
phase flows with complex flow features (such as sep- 
aration or laminar-turbulent transition), the size of 
the computational grid can easily exceed several mil- 
lion elements. The solution of the associated large lin- 
ear system is carried out with state-of-the-art iterative 
methods (for example, multigrid methods) on parallel 
architectures. 
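
The path from control volumes to a linear system can be illustrated on a much simpler problem than the Navier-Stokes equations. The sketch below is an assumption-laden stand-in: it applies a cell-centered finite-volume balance to the one-dimensional steady diffusion problem $-u'' = f$ on $(0,1)$ with $u(0) = u(1) = 0$, and solves the resulting tridiagonal system directly rather than with a multigrid method. All function names are illustrative.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system (lower[0] and upper[-1] are unused)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def finite_volume_diffusion(n_cells, source=1.0):
    """Cell-centered finite volumes for -u'' = source on (0, 1), u(0) = u(1) = 0."""
    h = 1.0 / n_cells
    # Flux balance on each cell: (2 u_i - u_{i-1} - u_{i+1}) / h = source * h.
    lower = [-1.0 / h] * n_cells
    upper = [-1.0 / h] * n_cells
    diag = [2.0 / h] * n_cells
    # Boundary faces lie h/2 from the first and last cell centers.
    diag[0] = diag[-1] = 3.0 / h
    rhs = [source * h] * n_cells
    centers = [(i + 0.5) * h for i in range(n_cells)]
    return centers, solve_tridiagonal(lower, diag, upper, rhs)

centers, u = finite_volume_diffusion(50)
exact = [x * (1.0 - x) / 2.0 for x in centers]
print(max(abs(a - b) for a, b in zip(u, exact)))
```

The number of unknowns equals the number of cells, which is the sense in which the size of the linear system is tied to the size of the grid.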

Numerical models based on the discrete solution of
the Navier-Stokes equations usually deliver a high
level of accuracy, but at a high computational cost.
This is particularly true when viscous (and turbulent)
effects play an important role in the simulated flows.
In some cases (when one is mostly interested in reconstructing
the shape of the free surface around a boat,
for example) reduced-order models [II.26] based
on the Euler equations or even on potential flow
theory can be adopted. Reduced-order models are
often used for preliminary analysis or multiquery problems
(arising, for example, in genetic optimization).
In other cases, the computational reduction strategy 
may be based on a geometrical multiscale approach, 
where the complete (and computationally expensive) 
three-dimensional model is only used in a very lim- 
ited portion of the original domain and is coupled 
with spatial (two-dimensional/one-dimensional/zero- 
dimensional) reduced models that account for the rest. 
The correct identification of the most suitable model 
for a specific problem should be based on a com- 
promise between the level of accuracy achieved by 
the model in estimating the objective function of the 
analysis at hand and its computational cost. 


3 Numerical Models for Fluid 
Dynamics in Sport 

3.1 The America’s Cup 

The America’s Cup is a highly competitive sailing yacht 
race in which even the smallest details in the design 
of the boats’ different components can make a big dif- 
ference to the final result. To analyze and optimize 
a boat's performance, the aerodynamic and hydro- 
dynamic flow around the whole boat should be sim- 
ulated, taking into account the variability of wind and 
waves, and the presence of the opposing boat and its 
maneuvering, as well as the interaction between fluids 
(water and air) and the boat’s structural components 
(its hull, appendages, sails, and mast). 

The main objective for the designers is to select 
shapes that minimize water resistance on the hull and 
appendages and maximize the thrust produced by the 
sails. Numerical models allow different situations to be 
simulated, thus reducing the costs (in terms of both 
time and money) of running experiments in a towing 
tank or a wind tunnel. 

To evaluate whether a new design idea is potentially 
advantageous, a velocity prediction program is often 
used to estimate the boat speed and attitude for a 
range of prescribed wind conditions and sailing angles.
These numerical predictions are obtained by model- 
ing the balance between the aerodynamic and hydro- 
dynamic forces acting on the boat. The accuracy of 
speed and attitude predictions relies on the accuracy of 
the force estimation, which is usually obtained by combining
experimental measurements in a towing tank,
analytical models, and accurate CFD simulations.

In order to perform the latter, any new shape is 
reproduced in a computer-aided design model (usu- 
ally defined by several hundred nonuniform rational 
B-spline (NURBS) surfaces for the whole boat). Based 
on this geometry, a computational grid can be gener- 
ated. This step is usually crucial for obtaining accu- 
rate predictions. In particular, for free-surface hydro- 
dynamic simulation it is often necessary to resort to 
block-structured grids in order to capture the strong 
gradients in the wall boundary layer (figure 1) and to 
reconstruct the free-surface geometry accurately. 

A detailed simulation of the complex flow around the 
appendages (keel, bulb, and winglets) is displayed in 
figure 2, where the typical recirculation associated with 
the lift generated by the keel in upwind sailing can be 
observed. 

Figure 1 Detail of the grid in the boundary layer.



As mentioned, the air and water interact with the boat 
and can change its configuration. In this respect, two 
classes of problems can be considered: 

• the rigid-body motion of the boat under the action 
of the fluid forces; 

• the fluid-structure interaction between the flexible 
sails and the wind. 

Different numerical schemes for the coupling between 
the flow solution and a structural model have been 
defined and are described in the following sections. 
Notice that in both cases, the Navier-Stokes equa- 
tions have to be reformulated in the so-called arbi- 
trary Lagrangian-Eulerian (ALE) formulation to deal 
with moving-domain problems. 


3.1.1 Boat Dynamics

The attitude of the boat advancing in calm water or 
on a wavy sea, as well as its dynamics, may strongly 
influence its performance. For this reason, a numerical 
tool for yacht design should be able to predict not only 
the performance of the boat in a steady configuration 
but also the boat’s motion and its interaction with the 
flow field. 

This can be achieved by coupling the flow solver with 
a six-degrees-of-freedom dynamical system describing 
the rigid-body motion of the boat. Consider an inertial
reference system $(O, X, Y, Z)$, moving forward with
the mean boat speed, and a body-fixed reference system
$(G, x, y, z)$, centered in the boat center of mass $G$,
which translates and rotates with the boat. The boat
dynamics is computed by integrating the equations
of variation of linear and angular momentum in the
inertial reference system, which are

$$m\ddot{\mathbf{X}}_G = \mathbf{F},$$
$$T I T^{-1}\dot{\boldsymbol{\Omega}} + \boldsymbol{\Omega}\times T I T^{-1}\boldsymbol{\Omega} = \mathbf{M}_G,$$

with $m$ denoting the boat mass; $\ddot{\mathbf{X}}_G$ the linear acceleration;
$\mathbf{F}$ the force acting on the boat; $\dot{\boldsymbol{\Omega}}$ and $\boldsymbol{\Omega}$ the
angular acceleration and velocity, respectively; and $\mathbf{M}_G$
the moment with respect to $G$ acting on the boat. The
tensor of inertia $I$ is computed in the body-fixed reference
system, and the rotation matrix $T$ between the
body-fixed and inertial reference systems is required to
write the problem in the inertial reference system.

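The forces and moments acting on the boat are in this
case rather simple and include the flow force ($\mathbf{F}_{\text{flow}}$) and
moment ($\mathbf{M}_{\text{flow}}$) and the gravitational force:

$$\mathbf{F} = \mathbf{F}_{\text{flow}} + m\mathbf{g}, \qquad \mathbf{M}_G = \mathbf{M}_{\text{flow}}. \tag{4}$$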

Using this model it was possible to simulate the 
dynamics of the boat in different conditions, analyzing 
both calm water and wavy sea scenarios. In figure 3, 
an example of the free-surface distribution of a hull 
advancing in regular waves is shown. 

3.1.2 Wind-Sails Interaction 

Sails are flexible structures that deform under the 
action of the wind. The pressure field acting on the sail 
changes its geometry and this, in turn, alters the flow 
field. 

In mathematical terms, this problem can be defined
as a coupled system that comprises a fluid problem
$\mathcal{F}$ defined on the moving domain $\Omega_F(t)$ surrounding
the sail, a structural problem $\mathcal{S}$ defined on the moving
domain $\Omega_S(t)$, and a mesh motion model $\mathcal{M}$. The
structural model is usually based on a second-order
elastodynamic equation depending on the sail deformation.
Without entering into a detailed formulation
of the structural and mesh motion models, we focus
on the coupled nature of the problem.

Figure 3 Free-surface simulation in wavy conditions.

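In abstract form, the coupled problem can be formulated
as

$$\begin{aligned}
\mathcal{F}(\mathbf{u}, p, \mathbf{w}) &= 0 && \text{in } \Omega_F(t),\\
\mathcal{S}(\mathbf{d}) &= 0 && \text{in } \Omega_S(t),\\
\mathcal{M}(\boldsymbol{\eta}) &= 0 && \text{in } \Omega^0,\\
\mathbf{u} &= \dot{\mathbf{d}} && \text{on } \Gamma(t),\\
\sigma_F(\mathbf{u}, p)\,\mathbf{n}_F &= \sigma_S(\mathbf{d})\,\mathbf{n}_S && \text{on } \Gamma(t),\\
\boldsymbol{\eta} &= \mathbf{d} && \text{on } \Gamma^0,
\end{aligned} \tag{5}$$
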


where the three (fluid, structure, and mesh motion)
problems are coupled through three conditions over
the interface $\Gamma(t)$ stating the continuity of velocity,
the equilibrium of forces, and the geometric continuity,
respectively. In (5), $\mathbf{d}$ is the structural displacement, $\boldsymbol{\eta}$
the mesh displacement, and $\Omega^0$ and $\Gamma^0$ denote a reference
fluid domain and interface, respectively. The
fluid mesh motion velocity $\mathbf{w}$, needed in the arbitrary
Lagrangian-Eulerian formulation of the Navier-Stokes
equations, depends on the mesh displacement $\boldsymbol{\eta}$ as
$\mathbf{w} = \dot{\boldsymbol{\eta}}$.

Figure 4 An example of a transient
fluid-structure interaction solution.

For the solution of problem (5), different fluid-structure
interaction schemes can be devised. In monolithic
schemes a global system is assembled and solved for
all the unknowns of the problem simultaneously. A different
approach, usually preferred when one wants to
exploit already existing fluid and structural solvers, is
provided by partitioned schemes in which the fluid,
structure, and mesh motion problems are solved iteratively.
For the case at hand, a strongly coupled partitioned
scheme has been devised and this scheme guarantees
that at each time step equilibrium is reached
between the different subproblems. To compute the
solution at time $t^{n+1}$, a subiteration between the three
subproblems is required. The structural solution is first
computed by solving

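$$\mathcal{S}(\mathbf{d}_{k+1}^{n+1}) = 0 \quad \text{in } (\Omega_S)^{n+1},$$
$$\sigma_S(\mathbf{d}_{k+1}^{n+1})\,(\mathbf{n}_S)_k^{n+1} = \sigma_F(\mathbf{u},p)_k^{n+1}\,(\mathbf{n}_F)_k^{n+1} \quad \text{on } \Gamma,$$

followed by solution of the mesh motion update

$$\mathcal{M}(\boldsymbol{\eta}_{k+1}^{n+1}) = 0 \quad \text{in } \Omega^0,$$
$$\boldsymbol{\eta}_{k+1}^{n+1} = \alpha\,\mathbf{d}_{k+1}^{n+1} + (1-\alpha)\,\mathbf{d}_k^{n+1} \quad \text{on } \Gamma^0,$$

and finally undertaking the flow solution

$$\mathcal{F}(\mathbf{u}_{k+1}^{n+1}, p_{k+1}^{n+1}, \mathbf{w}_{k+1}^{n+1}) = 0 \quad \text{in } (\Omega_F)_{k+1}^{n+1},$$
$$\mathbf{u}_{k+1}^{n+1} = \alpha\,\dot{\mathbf{d}}_{k+1}^{n+1} + (1-\alpha)\,\dot{\mathbf{d}}_k^{n+1} \quad \text{on } \Gamma_{k+1}^{n+1},$$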

where a relaxed value of the structure displacement at 
the current iteration k + 1 and the previous subiteration 
k is used. The iteration over the index k is stopped when 
a suitable convergence criterion is fulfilled. 
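
The role of relaxation in a strongly coupled partitioned scheme can be seen in a deliberately tiny model. The sketch below is not the sail solver: it assumes a scalar "structure" (a linear spring) and a scalar "fluid" whose load decreases linearly with displacement, and every name and parameter value is hypothetical. Only the pattern is retained: structure solve, relaxed interface update, fluid solve, repeated until the subiterations converge.

```python
def coupled_subiterations(load0=1.0, stiffness=1.0, feedback=3.0,
                          alpha=0.4, tol=1e-10, max_iter=200):
    """Relaxed fixed-point subiterations for one time step of a toy coupled problem.

    Returns the converged displacement and the number of subiterations used.
    """
    d, load = 0.0, load0  # initial guess: undeformed shape and its load
    for k in range(1, max_iter + 1):
        d_structure = load / stiffness                  # "structure" solve
        d_relaxed = alpha * d_structure + (1 - alpha) * d  # relaxed interface value
        load = load0 - feedback * d_relaxed             # "fluid" solve on new shape
        if abs(d_relaxed - d) < tol:                    # convergence criterion
            return d_relaxed, k
        d = d_relaxed
    raise RuntimeError("subiterations did not converge")

displacement, iterations = coupled_subiterations()
print(displacement, iterations)
```

With these values the equilibrium is `load0 / (stiffness + feedback) = 0.25`. Setting `alpha=1.0` (no relaxation) makes the same iteration diverge, which is why a relaxed value of the displacement is used when the coupling is strong.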

Several fluid-structure interaction simulations of 
sails have been carried out using this model, which 
has proved its ability in predicting both steady fly- 
ing shapes and sail dynamics under different trimming 
conditions. An example of a transient fluid-structure 
interaction simulation can be seen in figure 4, where the 
shape of the sail at several time instants is displayed. 
The flow pattern around mainsail and gennaker in a 
downwind sail configuration is shown in figure 5. 

Figure 5 Streamlines around mainsail and gennaker.

3.2 Olympic Rowing

Can numerical models developed for the America’s Cup
be adopted in other hydrodynamics applications in
sport? Consider, for instance, the prediction of a rowing
boat’s performance. In principle, the physical problem
is very similar to one of those already described:
namely, a free-surface flow in a moving-domain framework.
What characterizes the rowing action, however,
is the motion of the rowers on board. Indeed, a men’s
quadruple scull weighs around 50 kg and has four
100 kg rowers on board who move with a cadence of
around 40 strokes per minute. This motion has a strong
effect on the global performance of the boat that should
be captured by the simulation tool.

To account for the motion of the rowers, the forces
and moments acting on the boat, which were simple
in the rigid-body dynamics of the sailing boat (see (4)),
become more involved in this case and should account
for the additional forces that the rowers exert on the
boat through the oars, the seats, and the footboards:

$$\mathbf{F} = \sum_{j=1}^{n}\mathbf{F}_{o_j} + \sum_{j=1}^{n}\mathbf{F}_{s_j} + \sum_{j=1}^{n}\mathbf{F}_{f_j} + \mathbf{F}_{\text{flow}} + m\mathbf{g},$$
$$\mathbf{M}_G = \sum_{j=1}^{n}(\mathbf{X}_{o_j} - G)\times\mathbf{F}_{o_j} + \sum_{j=1}^{n}(\mathbf{X}_{s_j} - G)\times\mathbf{F}_{s_j} + \sum_{j=1}^{n}(\mathbf{X}_{f_j} - G)\times\mathbf{F}_{f_j} + \mathbf{M}_{\text{flow}}, \tag{6}$$


where $\mathbf{F}_{o_j}$, $\mathbf{F}_{s_j}$, and $\mathbf{F}_{f_j}$ denote the forces exerted by
the $j$th rower on his/her oarlock, seat, and footboard,
respectively, and $\mathbf{X}_{o_j}$, $\mathbf{X}_{s_j}$, and $\mathbf{X}_{f_j}$ are the corresponding
application points. To compute these additional
forcing terms it is necessary to reconstruct the kinematics
of the rowers’ action. A schematic of the boat/
rowers/oars system is given in figure 6. An accurate
prediction of these forces is crucial for the performance
analysis of the system. These periodic forcing terms, in
fact, induce significant secondary motions of the boat that
generate an additional wave radiation effect, increasing
the global energy dissipation of the system. Indeed, the
quality of a professional rower is evaluated not only by
the force he/she is able to exert during the active
phase of the stroke but also by his/her ability to perform
a well-balanced stroke cycle that minimizes the
secondary motions.

Figure 6 Sketch of the boat/rowers/oars system.

Figure 7 Free-surface flow around a rowing boat.
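
The bookkeeping in (6) amounts to summing contact forces and their moments about the center of mass. A minimal sketch follows, assuming every oarlock, seat, and footboard load is supplied as an (application point, force) pair; the function names, the sign convention for gravity, and all numerical values are hypothetical and serve only to show the structure of the sums.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def add(*vectors):
    """Componentwise sum of any number of 3-vectors."""
    return tuple(sum(c) for c in zip(*vectors))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def boat_loads(center_of_mass, contact_loads, flow_force, flow_moment, mass, gravity):
    """Total force and moment about the center of mass, following the structure of (6).

    contact_loads is a list of (application_point, force) pairs, one for each
    oarlock, seat, and footboard of every rower.
    """
    weight = tuple(mass * g for g in gravity)
    force = add(flow_force, weight, *(f for _, f in contact_loads))
    moment = add(flow_moment,
                 *(cross(sub(x, center_of_mass), f) for x, f in contact_loads))
    return force, moment

# One rower with made-up loads at the oarlock, seat, and footboard.
loads = [((0.0, 0.8, 0.2), (300.0, 0.0, 0.0)),
         ((-0.3, 0.0, 0.1), (0.0, 0.0, -700.0)),
         ((0.4, 0.0, 0.1), (-250.0, 0.0, -100.0))]
F, M = boat_loads((0.0, 0.0, 0.0), loads, (-40.0, 0.0, 1300.0), (0.0, 0.0, 0.0),
                  mass=50.0, gravity=(0.0, 0.0, -9.81))
print(F, M)
```

In a full simulation the contact loads would vary periodically through the stroke cycle, which is what drives the secondary motions discussed above.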

The hydrodynamic forcing terms $\mathbf{F}_{\text{flow}}$ and $\mathbf{M}_{\text{flow}}$ in
system (6) can be computed using either a complete
Navier-Stokes flow model or a reduced hydrodynamic
model based on a linear potential flow model. The for- 
mer was used, for example, to investigate the perfor- 
mance of different boat designs, simulating the free- 
surface flow around them (see figure 7) and the rolling 
stability of the system (see figure 8); using the latter 
approach, full race simulations could be carried out in 
(almost) real time. 

Figure 8 Rolling motion of a rowing boat.

3.3 Swimming

In recent years, competition swimming has been unsettled
by the introduction of high-tech swimsuits. These
products resulted from scientific research on the development
of low-drag fabrics and new techniques of fabric
assembly. Thanks to the improved performance
guaranteed by these new swimsuits, in a couple of years
(2008-9) more than 130 swimming world records were
broken. In 2010 the International Swimming Federation
(FINA) decided to ban the use of these products
in official competitions.

In this context, the authors have (in collaboration 
with a leading swimsuit manufacturer) been involved in 
a research activity to estimate the potential gain asso- 
ciated with a new technique of fabric assembly based 
on a thermo-bonding system in which all the standard 
sewing was completely removed from the swimsuit sur- 
face. This new swimsuit was approved by FINA before 
the Beijing 2008 Olympic Games. 

In order to quantify the advantages of removing the 
sewing, its impact on the flow around the swimmer 
needed to be analyzed. A complete simulation includ- 
ing the geometrical details of the swimmer’s body as 
well as the sewing on the swimsuit is unaffordable, due 
to the different length scales involved (see figure 9). 
Model reduction was therefore mandatory. At first, the 
flow around the sewing was analyzed considering a 
local computational domain limited to a small patch 
of fabric surrounding the sewing. 

The sewing geometry was reconstructed in detail
to generate a computer-aided design model. Based
on this geometry, an unstructured computational grid
was generated and the flow around the sewing was
simulated considering different flow velocities and orientations
(see figure 10). The presence of the sewing
strongly affects the flow in its boundary layer, and this
results in an additional drag component.

Figure 10 Local flow around the
sewing at different incidences.

Full-body simulations (see figure 11) were then car- 
ried out to estimate, for all the sewing distributed over 
the swimsuit, the local velocity and orientation, so that 
the results of the small-scale simulation could be inte- 
grated to obtain a global value of the drag component 
associated with the presence of the sewing. 

To move from an estimate of the drag reduction to 
an estimate of the reduction in race time, a race model 
was developed, based on Newton's second law:

$$m \frac{\mathrm{d}^2 x}{\mathrm{d}t^2}(t) = P(t) - D(t), \qquad x'(0) = 0, \quad x(0) = 0,$$

where $m$ and $x$ are the mass and position of the swimmer,
respectively. The resistance $D(t)$ can be estimated
based on the results of the CFD simulation described
above, while the propulsion $P(t)$ could also be computed
through ad hoc simulations of the stroke action
(see figure 12).
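The race model above can be explored numerically. The following is only a minimal sketch under stated assumptions, not the authors' model: the propulsion is taken to be a constant force, the resistance is taken to be quadratic in the speed, and every numerical value (mass, force, drag coefficient) is hypothetical and chosen purely for illustration.

```python
def race_time(distance, mass, propulsion, drag_coeff, dt=1e-3):
    """Integrate m x'' = P - D with x(0) = 0, x'(0) = 0 until x reaches `distance`.

    Assumptions (not from the text): P is a constant force in newtons and
    D = drag_coeff * v**2. Time stepping is semi-implicit Euler.
    """
    x, v, t = 0.0, 0.0, 0.0
    while x < distance:
        a = (propulsion - drag_coeff * v * v) / mass  # Newton's second law
        v += a * dt
        x += v * dt
        t += dt
    return t


# Hypothetical swimmer and a hypothetical 1% reduction in the drag coefficient.
baseline = race_time(distance=50.0, mass=80.0, propulsion=120.0, drag_coeff=30.0)
reduced = race_time(distance=50.0, mass=80.0, propulsion=120.0, drag_coeff=29.7)
print(f"estimated time gain over 50 m: {baseline - reduced:.3f} s")
```

In the study described in the text, $D(t)$ and $P(t)$ come from CFD simulations rather than from closed-form laws, so a sketch of this kind only illustrates how a drag reduction is converted into a time gain.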
Figure 11 Full-body simulation of the flow around a swimmer.



Table 1 Time gains in seconds for different freestyle races.

          Gliding   Length   Race
50 m      0.011     0.073    0.073
100 m     0.011     0.077    0.154
200 m     0.012     0.083    0.332
400 m     0.013     0.087    0.696


Our analysis made it possible to quantify the poten- 
tial gain in terms of race time associated with the new 
swimsuit design, which is the most useful metric of the 
performance improvement. The time gains for different 
race lengths are reported in table 1, where the gains in


the gliding phase, over one length and over the whole 
race, are specified. These relevant race time improve- 
ments estimated by numerical simulations were con- 
firmed by an experimental campaign carried out in 
the University of Liege’s water tank by the biomechan- 
ics research group of the University of Reims (France) 
under the supervision of Redha Taiar. 
V.3 Inerters

Malcolm C. Smith 


1 Introduction 

A new ideal mechanical modeling element, coined the 
“inerter,” was introduced by the author in 2002 as a 
component that was needed for the solution of the 
following mathematical question (although it is not 
purely mathematical: it also touches on physics and 
engineering). 

[P] What is the most general linear passive mechanical 
impedance function that can be realized physically? 
An analogy with electrical circuits suggested that a
solution ought to be straightforward. But this turned 
out not to be the case, and a new mechanical modeling 
element was needed. Embodiments of inerter devices 
were built and tested at the University of Cambridge’s 
engineering department. The technology was subse- 
quently developed by a Formula 1 team, and the inerter 
is now a standard component of many racing cars. The 
story of this development from mathematical theory to 
engineering practice is described below. 

2 Network Analogies 

There are two analogies between electrical and mechanical
networks that are in common use. The older of these uses
the correspondences force ↔ voltage and velocity ↔ current.
An alternative analogy, usually attributed to Firestone (1933),
makes use of the correspondences

force ↔ current,
velocity ↔ voltage,
mechanical ground ↔ electrical ground.

Although neither analogy can be said to be correct or 
incorrect, the force-current analogy has certain advan- 
tages. Firstly, from a topological point of view, network 
graphs are identical in the electrical and mechanical 
domains when the force-current analogy is used. This 
means that any electrical network has a mechanical 
analogue, and vice versa, whereas in the older force-voltage
analogy only networks with planar graphs have
an analogue in the other domain, since the dual graph 
is involved in finding the analogue. Secondly, in the 
force-current analogy, mechanical ground (namely, a 
fixed point in an inertial frame) corresponds to elec- 
trical ground (namely, a datum or reference voltage). 
Thirdly, and more fundamentally, the force-current 
analogy is underpinned by the notion of through and 
across variables. A through variable (such as force or 
current) involves a single measurement point and nor- 
mally requires the system to be severed at that point 
to make the measurement. In contrast, an across vari- 
able (such as velocity or voltage) is the difference in an 
absolute quantity between two points and can in princi- 
ple be measured without breaking into the system. The 
notion of through and across variables allows analo- 
gies to be developed with other domains (e.g., acoustic, 
thermal, fluid) in a systematic manner. 

It is important to point out, since both force × velocity
and current × voltage have the units of power, that


both analogies are power preserving. Questions such as 
[P] that relate to passivity (meaning no internal power 
source) can therefore be correctly mapped from one 
domain to another. 

3 The Inerter 

The standard correspondences between circuit ele- 
ments in the force-current analogy are as follows: 

spring ↔ inductor,
damper ↔ resistor,
mass ↔ capacitor.

A well-known fact, which is only rarely emphasized, 
is that only five of the above six elements are gen- 
uine two-terminal devices. For mechanical devices, the 
terminals are the attachment points. Both the spring 
and the damper have two independently movable ter- 
minals. However, the mass element has only one attach- 
ment point: the center of mass. Its behavior is described 
by Newton’s second law, which involves the accelera- 
tion of the mass relative to a fixed point in an inertial 
frame. This gives rise to the interpretation that one ter- 
minal of the mass is the ground and the other termi- 
nal is the position of the center of mass itself; effec- 
tively, that the mass is analogous to a grounded capac- 
itor. This means that not every electrical circuit com- 
prising inductors, capacitors, and resistors will have a 
spring-mass-damper analogue. 

There is a further problem regarding the mass ele- 
ment in relation to [P]. Any mechanism that is con- 
structed to realize a given impedance function is typ- 
ically intended for connection between other masses, 
and must therefore have a small mass in relation 
to those. Any construction involving mass elements 
would need to ensure that the total mass employed 
could be kept as small as desired. 

The above considerations were the motivation for the 
following question posed in Smith (2002): is it possi- 
ble to construct a two-terminal mechanical device with 
the property that the equal and opposite force applied 
at the terminals is proportional to the relative accel- 
eration between them? It was shown that devices of 
small mass could be constructed that approximate this 
behavior. Accordingly, the proposal was made to define 
an ideal modeling element as follows. 

Definition 1. The (ideal) inerter is a mechanical two- 
terminal device with the property that the equal and 
opposite force applied at the terminals is proportional 
to the relative acceleration between them. That is, 
Figure 1 A free-body diagram of a one-port (two-terminal)
mechanical element or network with force-velocity pair
$(F, v)$, where $v = v_2 - v_1$.
Figure 3 A one-port electrical network.

Figure 2 Schematic of a mechanical model of an inerter
employing a rack, pinion, gears, and flywheel.

$F = b(\dot{v}_2 - \dot{v}_1)$ in the notation of figure 1. The constant
of proportionality $b$ is called the inertance and
has units of kilograms.

As is the case for any modeling element, a distinc- 
tion needs to be made between the ideal element (as 
defined above) and practical devices that approximate 
it. The manner in which practical realizations deviate 
from ideal behavior may determine whether a defini- 
tion is sensible at all. In the case of the inerter, it has 
already been noted that realizations are needed with 
sufficiently small mass, independent of the inertance 
b. Another requirement is that the “available travel” 
(the relative displacement between the terminals) of 
the device can be specified independently of the iner- 
tance. One method of realization that can satisfy all 
these requirements makes use of a plunger sliding in a 
housing that drives a flywheel through a rack, pinion, 
and gears (see figure 2). 

The circuit symbols of the six basic electrical and
mechanical elements, with the inerter replacing the
mass, are shown in table 1. The symbol chosen for the
inerter represents a flywheel. The table also shows
the defining (differential) equation for each element as
well as the admittance function $Y(s) = Z(s)^{-1}$, where
$Z(s)$ is the impedance function. Taking Laplace transforms
of the defining equation, assuming zero initial
conditions, the impedance for the electrical elements
is defined by $Z(s) = v(s)/i(s)$, and similarly for the
mechanical elements.
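The admittances listed in table 1 can be combined numerically. The sketch below assumes the standard composition rules of the force-current analogy, which are not spelled out in the text: elements in parallel share the same terminals, so their admittances add, while for elements in series the impedances add. All function names are hypothetical.

```python
def spring(k):
    """Admittance Y(s) = k/s of a spring with stiffness k."""
    return lambda s: k / s


def damper(c):
    """Admittance Y(s) = c of a damper with constant c."""
    return lambda s: c


def inerter(b):
    """Admittance Y(s) = b*s of an ideal inerter with inertance b."""
    return lambda s: b * s


def parallel(*elements):
    """Parallel connection: admittances add."""
    return lambda s: sum(Y(s) for Y in elements)


def series(*elements):
    """Series connection: impedances (reciprocal admittances) add."""
    return lambda s: 1.0 / sum(1.0 / Y(s) for Y in elements)


# A spring, damper, and inerter in parallel, evaluated at s = j*omega.
Y = parallel(spring(4.0), damper(3.0), inerter(0.5))
omega = 2.0
print(Y(1j * omega))
```

Representing each element as a function of the complex variable $s$ keeps the sketch close to the table: the same closures describe the electrical elements once $k$, $c$, and $b$ are replaced by $1/L$, $1/R$, and $C$.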

4 Passivity and Electrical Network Synthesis 

Let us consider the problem in the electrical domain
that is analogous to [P]. Figure 3 shows an electrical
network with two external terminals (a “one-port”)
across which there is a voltage $v(t)$ and a corresponding
current flow $i(t)$ through the network. The driving-point
impedance of the network is defined by $Z(s) = v(s)/i(s)$.
The network is defined to be passive if it contains
no internal power source. More formally, we can
define the network to be passive if, for all admissible $v$
and $i$ that are square integrable on $(-\infty, T]$,

$$\int_{-\infty}^{T} v(t)\, i(t)\, \mathrm{d}t \geq 0.$$

In his seminal 1931 paper, Brune introduced the notion
of a positive-real function: for a rational function with
real coefficients, $Z(s)$ is defined to be positive-real if
$Z(s)$ is analytic and $\mathrm{Re}(Z(s)) \geq 0$ whenever $\mathrm{Re}(s) > 0$.
Equivalently, $Z(s)$ is positive-real if and only if $Z(s)$ is
analytic in $\mathrm{Re}(s) > 0$, $\mathrm{Re}(Z(j\omega)) \geq 0$ for all $\omega$ at which
$Z(j\omega)$ is finite,¹ and any poles of $Z(s)$ on the imaginary
axis or at infinity are simple and have a positive
residue. Brune showed that a necessary condition for
the network to be passive is that $Z(s)$ is positive-real.
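The first form of the positive-real condition lends itself to a quick numerical experiment. The function below is a hypothetical helper, not a proof technique: it only samples points in the open right half-plane and reports whether any of them violates $\mathrm{Re}(Z(s)) \geq 0$, so a `True` result is merely evidence, whereas a `False` result is a genuine counterexample (up to rounding).

```python
import cmath
import math
import random


def looks_positive_real(Z, samples=2000, seed=0, tol=1e-9):
    """Sampled check of Re(Z(s)) >= 0 for Re(s) > 0 (necessary, not sufficient).

    Points s = r*exp(j*theta) are drawn with log-uniform radius r and an
    angle theta strictly inside (-pi/2, pi/2).
    """
    rng = random.Random(seed)
    for _ in range(samples):
        r = 10.0 ** rng.uniform(-3.0, 3.0)
        theta = rng.uniform(-0.499 * math.pi, 0.499 * math.pi)
        s = cmath.rect(r, theta)
        if Z(s).real < -tol:
            return False
    return True


# k/s + c + b*s with positive constants passes the sampled check.
print(looks_positive_real(lambda s: 2.0 / s + 3.0 + 0.5 * s))
# s**2 has a double pole at infinity and fails it.
print(looks_positive_real(lambda s: s * s))
```

Because a function is positive-real exactly when its reciprocal is, the same check can be applied to either an impedance or an admittance.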

Brune also showed the converse: for any (rational)
positive-real function $Z(s)$ a network can be constructed
(synthesized) comprising resistors, capacitors,
inductors, and mutual inductances whose driving-point
impedance is equal to $Z(s)$. In Brune's construction,
mutual inductances were required with a coupling
coefficient equal to 1, but that is difficult to achieve
in practice. This led to the question of whether coupled
coils (transformers) could be dispensed with altogether.
This was settled in the affirmative in a famous
1949 paper of Bott and Duffin.


1. In the electrical engineering convention, $j = \sqrt{-1}$.
Table 1 Circuit symbols and correspondences with defining equations and admittance $Y(s)$.

Mechanical                                              Electrical
Spring    $dF/dt = k(v_2 - v_1)$    $Y(s) = k/s$        Inductor    $di/dt = (1/L)(v_2 - v_1)$    $Y(s) = 1/(Ls)$
Damper    $F = c(v_2 - v_1)$        $Y(s) = c$          Resistor    $i = (1/R)(v_2 - v_1)$        $Y(s) = 1/R$
Inerter   $F = b\,d(v_2 - v_1)/dt$  $Y(s) = bs$         Capacitor   $i = C\,d(v_2 - v_1)/dt$      $Y(s) = Cs$

The Bott-Duffin construction certainly settled an 
important question in network synthesis. However, it 
also raised further questions since the construction 
appears, for some functions, to be rather lavish in 
the number of elements used. For example, the con- 
struction uses six energy storage elements for some 
biquadratic functions (ratios of two quadratics) when 
it might seem that two should suffice. The question of 
minimality remained unresolved when interest in pas- 
sive circuit synthesis started to decline in the late 1960s 
due to the increasing prevalence of active circuits. Some 
of these issues are now being revived because of appli- 
cations in the mechanical domain, where efficiency of 
realization is an important issue. 

Let us now return to the problem [P] in the mechanical
domain. All the ingredients are now in place to provide
a solution. Defining the mechanical driving-point
impedance by $Z(s) = v(s)/F(s)$, using the theorem
of Bott and Duffin, the force-current analogy, and the
ideal inerter, we see that $Z(s)$ can be realized as the
impedance of a passive mechanical network if and only
if $Z(s)$ is positive-real.

5 Vehicle Suspensions 

A simple model for studying vehicle suspensions is the
quarter-car vehicle model described by the equations

$$m_s \ddot{z}_s = u - F_s,$$
$$m_u \ddot{z}_u = -u + k_t (z_r - z_u),$$

where $m_s$ is the sprung mass, $m_u$ is the unsprung mass,
$k_t$ is the tire (vertical) spring stiffness, $u$ is an active or
passive control force, $F_s$ is a load disturbance input,
$z_r$ is the vertical displacement of the road input, and $z_s$
and $z_u$ are the vertical displacements of the sprung and
unsprung masses. A conventional passive suspension
consists of a spring and a damper in parallel, which
means that $u = k_s(z_u - z_s) + c_s(\dot{z}_u - \dot{z}_s)$, where $k_s$
and $c_s$ are the spring and damper constants, respectively.
In the most general linear passive suspension,
$u(s) = Q(s)\, s\, (z_u(s) - z_s(s))$ in the Laplace domain, where
$Q(s)$ is a general positive-real admittance function, as
depicted in figure 4. After a suitable positive-real
admittance is found, circuit synthesis techniques come
into play to find a network of springs, dampers, and
inerters to realize $Q(s)$. Alternatively, the parameters of
simple circuits can be optimized directly; for example,
a spring, damper, and inerter in parallel.
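As an illustration of the last option, the quarter-car equations can be integrated in time with a spring, damper, and inerter in parallel, so that $u = k_s(z_u - z_s) + c_s(\dot{z}_u - \dot{z}_s) + b(\ddot{z}_u - \ddot{z}_s)$. The following is a sketch only: the parameter values are hypothetical, the integrator is a simple semi-implicit Euler scheme, and the function names are not taken from any library. Because the inerter force depends on the accelerations, the two equations of motion are solved as a 2 × 2 linear system at each step.

```python
def accelerations(z_s, z_u, v_s, v_u, z_r, F_s, p):
    """Solve the quarter-car equations for the two accelerations.

    With u = f + b*(a_u - a_s) and f the spring-damper force, the model reads
        (m_s + b)*a_s - b*a_u = f - F_s
        -b*a_s + (m_u + b)*a_u = -f + k_t*(z_r - z_u)
    """
    f = p["k_s"] * (z_u - z_s) + p["c_s"] * (v_u - v_s)
    r1 = f - F_s
    r2 = -f + p["k_t"] * (z_r - z_u)
    A = p["m_s"] + p["b"]
    B = p["m_u"] + p["b"]
    det = A * B - p["b"] ** 2  # always positive for positive masses
    a_s = (B * r1 + p["b"] * r2) / det
    a_u = (p["b"] * r1 + A * r2) / det
    return a_s, a_u


def simulate_step(p, z_r=0.01, t_end=10.0, dt=1e-4):
    """Response to a step in road height z_r, starting from rest, with F_s = 0."""
    z_s = z_u = v_s = v_u = 0.0
    for _ in range(round(t_end / dt)):
        a_s, a_u = accelerations(z_s, z_u, v_s, v_u, z_r, 0.0, p)
        v_s += a_s * dt
        v_u += a_u * dt
        z_s += v_s * dt
        z_u += v_u * dt
    return z_s, z_u


# Hypothetical parameter values (SI units), for illustration only.
params = dict(m_s=250.0, m_u=35.0, k_t=150000.0, k_s=60000.0, c_s=3000.0, b=50.0)
print(simulate_step(params))
```

Setting `b` to zero recovers the conventional spring-and-damper suspension, which makes it straightforward to compare the two designs on the same road input.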

When designing an active or passive suspension 
force law u, a number of different performance cri- 
teria are relevant. Some of these criteria are related 
to the response to road undulations and come under 
the category of “ride” performance. A common (simple) 
assumption is that z r is integrated white noise (Brown- 
ian motion). A second set of criteria relate to “han- 
dling” performance, which is the response to driver 
inputs, such as braking, accelerating, or cornering. In 
the quarter-car model these are sometimes approxi- 
mated very crudely by the response to deterministic 
loads $F_s$. Variables of relevance include $z_s$ (comfort)
Figure 4 The quarter-car vehicle model.

and $k_t(z_r - z_u)$ (tire grip). Performance measures can
typically be improved by using a general linear pas- 
sive suspension instead of a conventional spring and 
damper. 

As an interesting aside, and an illustration of the
force-current analogy, figure 5 shows the equivalent
electrical circuit for the quarter-car model. It is always 
amusing to point out to vehicle dynamicists that the 
electrical circuit highlights the interpretation that the 
sprung mass and the unsprung mass are connected to 
the ground, but ... the tire is not: it is connected to the 
road (here modeled as a velocity source)! 

6 The Inerter in Formula 1 Racing 

For racing cars, a very important performance measure 
is “mechanical grip,” which refers to the tire load fluc- 
tuations in response to unevenness of the road surface. 
Increased grip corresponds to decreased fluctuations. 
For stiffly sprung suspensions (invariably the case in 
racing cars), mechanical grip can be improved by the 
simple expedient of placing an inerter in parallel with 
the conventional spring and damper. 

The story of the development and use of the inerter in 
Formula 1 racing has been recounted a number of times 
in the popular press and in magazines. The inerter was 
developed by McLaren Racing, under license from the 
University of Cambridge and a cloak of confidentiality. 



Figure 5 The equivalent electrical circuit 
for the quarter-car model. 


It was raced for the first time by Kimi Raikkonen 
at the 2005 Spanish Grand Prix, where he achieved 
McLaren's first victory of the season. During develop- 
ment, McLaren invented an internal decoy name for the 
inerter (the “J-damper”) to make it difficult for person- 
nel who might leave to join another team to make a con- 
nection with the technical literature on the inerter that 
was being published. This strategy succeeded in spec- 
tacular fashion during the 2007 Formula 1 “spy scan- 
dal,” when a drawing of the McLaren J-damper came 
into the hands of the Renault engineering team. Renault
attempted, unsuccessfully, to have the device banned on the
basis of an erroneous interpretation of it.
The Federation Internationale de l’Automobile World 
Motor Sport Council convened in Monaco on Decem- 
ber 6, 2007, to investigate a spying charge brought by 
McLaren. The council found Renault to be in breach 
of the sporting code but issued no penalty. In Para- 
graph 8.7 of the World Motor Sport Council Decision, 
the council reasoned that 

the fact that Renault fundamentally misunderstood the 
operation of the system suggests that the “J-damper” 
drawing did not reveal to Renault enough about the 
system for the championship to have been affected. 

During the hearing, neither the World Motor Sport 
Council nor McLaren made public what the J-damper 
was. Thereafter, speculation increased on Web sites and 
blogs about the function and purpose of the device. 
Finally, the truth was discovered by Autosport mag- 
azine, which revealed the Cambridge connection and 
that the J-damper was an inerter. Soon afterward, the 
University of Cambridge signed a licensing agreement 





Figure 6 A ballscrew inerter made in the University of 
Cambridge engineering department in 2003, with flywheel 
removed. 


with Penske Racing Shocks to allow inerters to be sup- 
plied to other customers within Formula 1 and else- 
where. The use of inerters has now spread beyond 
the Formula 1 grid to IndyCars and several other 
motorsport formulas. A typical method of construction 
makes use of a ballscrew, as shown in figure 6. 
V.4 Mathematical Biomechanics

Oliver E. Jensen 


1 Physical Forces in Biology 

Biological organisms must extract resources from their 
environment in order to survive and reproduce. Ani- 
mals need oxygen, water, and the chemical energy 
stored in food. Plants require a supply of carbon diox- 
ide, water, and light. Different species have developed 


diverse mechanisms for harvesting these essential 
nutrients. For example, tiny single-celled microorgan- 
isms such as the bacterium Escherichia coli swim 
along gradients of nutrient, whereas large multicellular 
organisms such as humans have internal energy trans- 
port systems of great geometric complexity. Mathemat- 
ics has an important role to play in understanding the 
relationship between the form and function of these 
remarkable natural systems, and in explaining how 
organisms have adapted to the physical constraints of 
the world they inhabit. 

A nutrient in molecular form can be taken up by an 
organism by crossing a cell membrane, generally via 
molecular diffusion. While diffusion over such short 
distances is rapid, it is generally not an effective trans- 
port mechanism over long distances because diffusion 
time increases in proportion to the square of distance. 
Organs performing gas exchange (such as lungs, gills, 
or leaves) therefore exploit fluid flow (in which nutri- 
ents are carried by a moving gas or liquid) for long- 
distance transport, coupling this with an exchange sur- 
face that presents a large area for short-range diffusion. 
Similarly, most organisms that are larger than just a 
few cells generate internal flows to distribute nutrients 
around their bodies. The airways of our lungs, which 
have a surface area nearly as large as a tennis court but 
a volume of just a few liters, are therefore entwined 
with an elaborate network of blood vessels, which can 
deliver oxygen rapidly to other organs. In plants, light is 
harvested by maximizing the exposure of leaves to the 
rays of the sun, while nutrients generated through pho- 
tosynthesis in leaves are carried to flowers and roots by 
an extensive internal vascular system. Naturally, many 
organisms will also move themselves to regions rich in 
nutrients using strategies (swimming, flying, running, 
etc.) that are appropriate to their environment. 
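The quadratic scaling of diffusion time with distance mentioned above can be made concrete with the usual dimensional estimate $t \sim L^2/D$. In the sketch below the diffusivity is an assumed representative value for a small molecule such as oxygen in water (of order $10^{-9}\,\mathrm{m^2\,s^{-1}}$); it is not taken from the text, and order-one prefactors are ignored.

```python
def diffusion_time(distance_m, diffusivity_m2_per_s):
    """Order-of-magnitude diffusion time t ~ L**2 / D, in seconds."""
    return distance_m ** 2 / diffusivity_m2_per_s


D_ASSUMED = 2e-9  # m^2/s, assumed representative value (not from the text)

for label, length in [("cell scale, 10 microns", 1e-5),
                      ("tissue scale, 1 mm", 1e-3),
                      ("body scale, 1 m", 1.0)]:
    print(f"{label}: {diffusion_time(length, D_ASSUMED):.3g} s")
```

With this assumed diffusivity the estimate grows from a fraction of a second at the scale of a cell to many years at the scale of a body, which is why long-distance transport relies on flow rather than diffusion.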

Mathematical models of any of these systems allow 
one to encode and quantify physical laws and con- 
straints. This may be done simply at the level of kine- 
matics or transport, i.e., understanding motion in time 
and space or expressing conservation of mass; alterna- 
tively, models may involve analysis of physical forces 
through relationships such as Newton’s laws. Often, 
mechanical models are coupled to descriptions of bio- 
logical processes, such as cell signaling pathways, gene 
regulatory networks, or transport of hormones, allow- 
ing a quantitative “systems-level” description of a cell, 
organ, or organism to be developed. Such approaches 
can be essential to understand the full complexity of 
biological control mechanisms. A wealth of theoretical 
developments from other areas of science and engineering
can be exploited in developing such models.
However, biomechanical problems can have a flavor of 
their own, and they raise some particular mathematical 
challenges. 

These become clear through a comparison with a 
“classical” field such as fluid mechanics. The governing 
equations describing the motion of most fluids were 
established in the first half of the nineteenth century 
and are firmly grounded on Boltzmann’s kinetic theory. 
The Navier-Stokes equations [III.23] are remarkable
for many reasons: they can describe fluid motions over 
at least ten orders of magnitude (from submicron to 
planetary scales); they apply equally well to distinct 
phases of matter (liquids and gases); and in many situations
it is sufficient to describe fluids with just a
handful of physical parameters (a density, a viscosity, 
and possibly a bulk modulus), which can be reduced 
to an even smaller number of parameters (a Reynolds 
number, a Mach number) through nondimensional- 
ization [II.9]. Fluid flows arising in very different sit- 
uations can therefore have essentially identical mathe- 
matical descriptions, often enabling experiments to be 
performed at comfortable bench-top scales involving 
nonintrusive measurements. While the field continues 
to raise profound mathematical questions (such as the 
origin and structure of turbulence [V.21]), the topic 
as a whole is built on secure theoretical foundations. 

It can therefore be disconcerting to face some of 
the awkward realities of the biological world, where 
features such as universality, scale independence, effi- 
cient parametrization, and nonintrusive experimenta- 
tion can be illusory. Some specific challenges are as 
follows. 

• Nature’s remarkable biodiversity is reflected by 
a proliferation of different mathematical models, 
which often have to be adapted to describe a spe- 
cific organ, organism, or process. A model might 
even be tailored to an individual, as is the case in 
the growing field of personalized medicine. How- 
ever, models can sometimes be connected by uni- 
fying principles (such as the concept of natural 
selection driving optimization of certain structural 
features in a given environment). 

• Biological materials are typically highly heteroge- 
neous and have distinct levels of spatial and tem- 
poral organization spanning many orders of magni- 
tude (from molecules and cells up to whole popula- 
tions of organisms), requiring different mathemat- 


ical descriptions at each level of the hierarchy. For 
example, the beating of our hearts over most of a 
century relies on the cooperative action of numer- 
ous molecular ion channels in heart muscle cells, 
flickering open and shut within milliseconds. 

• Experimental measurements of key parameters can 
be exceptionally challenging, given that many bio- 
logical materials function properly only in their 
natural living environment. Biological data gath- 
ered in vitro (for example, on tissue cultured in 
a petri dish) can misrepresent the true in vivo 
situation. Parameter uncertainty represents a pro- 
found difficulty in developing genuinely predictive 
computational (in silico) models. 

• Biological processes do not generally operate in iso- 
lation: they are coupled to numerous others, mon- 
itoring and responding to their environment. For 
example, plant and animal cells have a variety of 
subcellular mechanisms of mechanotransduction, 
whereby stimuli such as stress or strain have direct 
biological consequences (such as protein produc- 
tion or gene expression). 

• Unlike the relatively passive structures that domi- 
nate much of engineering science, biological mate- 
rials generate forces (via osmotic pumps in plants, 
or via molecular motors, driven by chemical en- 
ergy stored in adenosine triphosphate; arrays of 
motors can be organized to form powerful mus- 
cles). Organisms change their form through growth 
(the development of new tissues) or remodeling 
(the turnover of existing tissues in response to a 
changing environment). Examples of remodeling 
include the manner in which our arteries gradually 
harden as we age or the way a tree on a windswept 
hillside slowly adopts a bent shape. 

This brief survey presents a few examples illustrat- 
ing different applications of biomechanical modeling, 
all with a multiscale flavor. This is an endeavor that 
draws together research from numerous disciplines 
and should not be regarded as a purely mathematical 
exercise. However, mathematics has a key role to play 
in getting to grips with biological complexity. 

2 Constitutive Modeling 

Classical continuum mechanics is based on a num- 
ber of simplifying assumptions. Under the so-called 
continuum hypothesis, one considers a representative 
volume element of material that is much larger than 
the underlying microstructure but much smaller than 
the size of the object that is undergoing deformation.
The separation between the microscopic and macro- 
scopic length scales allows the discrete nature of the 
microstructure to be approximated by smoothly vary- 
ing scalar, vector, or tensor fields; physical relation- 
ships are then expressed by partial differential equa- 
tions (PDEs) or integral equations at the macroscopic 
level. While this is a very effective strategy for homo- 
geneous materials (such as water or steel), for which 
the granularity of the microstructure appears only at 
molecular length scales, it is less obviously appropriate 
for heterogeneous biological materials. 

In describing a given material it is often necessary 
to make a constitutive assumption: namely, a choice
over how to model the material properties, typically 
expressed as a relationship between the tensor fields 
describing stress, strain, and strain rate. This assump- 
tion has to be tested against observation and may 
require careful judgement that takes into account the 
question that the model seeks to address, the quality of 
available experimental data, and so on. Nevertheless, a 
number of canonical model choices are available. The 
material might be classified as elastic, viscous, visco- 
elastic, viscoplastic, etc., depending, for example, on 
whether deformations are reversible once any loading 
is removed (as in the elastic case). One might then consider
whether a linear stress/strain/strain rate relationship
is sufficient (this is appealing because of the small
number of material parameters and the analytic and 
computational tractability) or whether a more complex 
nonlinear relationship is required. 

The Chinese-American bioengineer Y. C. Fung has 
been prominent among those pioneering the transla- 
tion of continuum mechanics to biological materials, 
particularly human tissues of interest in the biomedi- 
cal area. As he and many others have shown, “off-the- 
shelf” constitutive models can be usefully deployed to 
describe materials such as tendon, bone, or soft tis- 
sue. By fitting such models to experimental data, the 
corresponding mechanical parameters (Young’s modu- 
lus, viscosity, etc.) can be estimated, allowing predictive 
models to be developed. However, in many cases stan- 
dard models turn out to be inappropriate; we discuss 
below the exotic properties of blood in small blood ves- 
sels, where a description as a Newtonian homogeneous 
fluid fails. 

A particularly challenging area has been in character- 
izing the mechanical properties of individual cells. Ani- 
mal cells have a phospholipid membrane that encloses 
a cytoplasm and a nucleus. The cytoplasm is a crowded 


soup of organelles and proteins, among which the 
components of the cytoskeleton (which includes fibers 
(actin), rods (microtubules), and molecular motors 
(myosin, dynein, etc.)) are of particular mechanical 
importance. Processes such as cell motility and division 
transcend any classical engineering description, and 
instead biomechanical models have been developed 
that take a “bottom-up” description of the evolving 
microstructure (accounting, for example, for dynamic 
actin polymerization and depolymerization). Among 
animal cells, the red blood cell is relatively simple 
in mechanical terms, because its properties are dom- 
inated by its membrane, which can be described as a 
passive structure that resists area changes, and which 
has measurable resistance to in-plane shear and bend- 
ing. Plant cells, too, are typically dominated mechan- 
ically by the properties of their stiff cell walls rather 
than their cytoskeleton. 

The properties of a multicellular tissue will be deter- 
mined by the collective properties of a population of 
adherent cells. If the cells form a roughly periodic array, 
then the asymptotic technique of homogenization 
[II.17] can be used to derive a tissue-level description
(e.g., in terms of PDEs) by careful averaging of proper- 
ties at the cell level. This multiscale method works well 
for strictly periodic structures and can be applied (with 
some care) to disordered arrays. However, when there 
is sufficient heterogeneity at the microscale, the valid- 
ity of a macroscale approximation (which assumes that 
properties vary over distances that are very long com- 
pared with individual cells) may become questionable. 
A variety of computational methods (cell-center mod- 
els, vertex-based models, cellular Potts models, etc.) 
can instead be used to resolve the granular nature of 
multicellular tissues. 

3 Blood Flow 

Despite significant medical advances over recent de- 
cades, cardiovascular disease (heart attack, stroke, and 
related conditions) remains a leading cause of death in 
Europe and North America. There is therefore intense 
interest in the pump and plumbing of our blood-supply 
system and the mechanisms by which it fails. Mathe- 
matical modeling has contributed significantly to this 
effort. 

Blood is a suspension of cells in plasma. The ma- 
jority of cells are erythrocytes (red blood cells): small 
highly deformable capsules containing hemoglobin. 
These cells are efficient carriers of oxygen from lungs 





to tissues. While additional cells are present in blood, 
many of which have important roles in fighting infec- 
tion, the rheology of blood is determined primarily by 
the collective properties of the red cells. 

Red blood cells have a disk-like shape and a diame- 
ter of only 8 microns but are present in very high con- 
centrations (typically 45% by volume). In large blood 
vessels, the suspension can therefore be well approxi- 
mated as a homogeneous liquid, and a description as an 
incompressible “Newtonian” fluid (that is, one characterized
by a constant viscosity $\mu$ and a constant density)
is effective. Poiseuille’s exact solution of the Navier-Stokes
equations describes steady flow along a uniform
pipe of radius $a$ and length $L$, via

$$u(r) = \frac{(a^2 - r^2)\,\Delta p}{4 \mu L}, \qquad Q = \frac{\pi a^4\, \Delta p}{8 \mu L}, \tag{1}$$

where $u(r)$ is the axial velocity component, $r$ is the
radial coordinate, and $\Delta p$ is the pressure difference
between the inlet and the outlet of the pipe. Equation (1)
shows that the axial flow rate $Q$ (the volume of
fluid passing a fixed location per unit time) driven by
a fixed $\Delta p$ is proportional to $a^4$, implying that narrow
vessels present a very high flow resistance. The shear
stress (tangential force per unit area) exerted by the
flow on the pipe wall is $-\mu u'(a) = 4\mu Q/(\pi a^3)$. While
Poiseuille himself had a keen interest in understanding 
blood flow, few blood vessels are sufficiently long and 
straight, or have sufficiently steady flow within them, 
for his solution (1) to be directly relevant. For exam- 
ple, coronary arteries, feeding oxygenated blood to the 
muscle of the heart, have tortuous geometries and are 
either embedded in, or sit upon, muscle that under- 
goes vigorous periodic contractions. Vessel curvature, 
torsion, branching, nonuniformity, wall deformation, 
and spatial movement all lead to significant deviations 
from (1). 
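
The scalings in (1) are easy to check numerically. The following
Python sketch evaluates the velocity profile, the flow rate, and the
wall shear stress of Poiseuille flow. The function names and the
parameter values (viscosity, pipe length, pressure difference) are
illustrative assumptions and are not taken from the text.

```python
import math


def poiseuille_velocity(r, a, length, dp, mu):
    """Axial velocity u(r) = (a^2 - r^2) dp / (4 mu L), as in equation (1)."""
    return (a**2 - r**2) * dp / (4.0 * mu * length)


def poiseuille_flow_rate(a, length, dp, mu):
    """Volume flow rate Q = pi a^4 dp / (8 mu L), as in equation (1)."""
    return math.pi * a**4 * dp / (8.0 * mu * length)


def wall_shear_stress(a, flow_rate, mu):
    """Wall shear stress -mu u'(a) = 4 mu Q / (pi a^3)."""
    return 4.0 * mu * flow_rate / (math.pi * a**3)


def integrated_flow_rate(a, length, dp, mu, n=10_000):
    """Midpoint-rule estimate of Q as the integral of 2 pi r u(r) over 0 <= r <= a."""
    h = a / n
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * h
        total += 2.0 * math.pi * r * poiseuille_velocity(r, a, length, dp, mu) * h
    return total


# Illustrative (assumed) values in SI units.
MU = 3.5e-3     # viscosity, Pa s
LENGTH = 0.1    # pipe length, m
DP = 100.0      # pressure difference, Pa

if __name__ == "__main__":
    for radius in (1.0e-3, 2.0e-3):
        q = poiseuille_flow_rate(radius, LENGTH, DP, MU)
        tau = wall_shear_stress(radius, q, MU)
        print(f"a = {radius:.1e} m: Q = {q:.3e} m^3/s, wall shear = {tau:.3e} Pa")
```

Doubling the radius at fixed pressure difference multiplies the flow
rate by 16, which is the $a^4$ dependence noted above.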

This observation's biomedical significance emerged 
in the 1980s when it was first appreciated how the 
endothelial cells that line blood vessels respond to 
stresses arising from blood flow (and in particular the 
shear stress, the component parallel to the vessel wall). 
Through mechanotransduction, cells align themselves 
with the flow direction; furthermore, regions of abnor- 
mally low and oscillatory wall shear stress are sites at 
which the arterial disease atherosclerosis tends to orig- 
inate. In a coronary artery, this can be a precursor to a 
stenosis (a blockage), which can in turn cause angina 
and possibly an infarction (a heart attack). Since flow 
patterns are intimately associated with vessel shape, 
geometry becomes a direct risk factor for disease. 


Figure 1 (a), (b) Streamlines of steady flow in an expanding
channel, mimicking flow past an obstruction in an artery;
the flow is from left to right. (a) The symmetric state S.
(b) One of two asymmetric states A±, for which a large
recirculation region exists on one or other wall. (c) S loses
stability to A± via a so-called pitchfork bifurcation [IV.21 §2]
as the flow strength increases (the dashed line indicates an
unstable solution). (d) A small geometric imperfection will
bias the system, preferentially connecting S to one of the
asymmetric states.

Arterial flows are characterized by moderately high 
Reynolds numbers, implying that inertia dominates vis- 
cosity (friction) but without normally undergoing tran- 
sition to full turbulence. The associated solutions of 
the Navier-Stokes equations nevertheless exhibit fea- 
tures characteristic of a nonlinear system, such as 
nonuniqueness and instability. Small changes in geom- 
etry (or other external factors) can therefore lead to 
large changes in the internal flow properties, as illustrated
in figure 1. If modeling predictions are to help
clinical decision making, it therefore becomes neces- 
sary to obtain detailed geometric information about an 
individual's arterial geometry, from X-ray tomography 
or magnetic resonance imaging, and use this as a geo- 
metric template on which to perform computational 
fluid dynamics. Even then, the robustness of predic- 
tions must be considered carefully given the natural 
variability of input conditions and geometric parame- 
ters, such as occurs between states of vigorous exercise 
and sleep. 

In smaller blood vessels, of diameters below a few
hundred microns, the discrete nature of the suspension
becomes important and the description of blood as a
Newtonian fluid breaks down. For practical purposes, a
continuum description may still be useful but a number
of unusual features emerge. First, it is necessary to
consider the transport of red cells in addition to that of
the suspension as a whole. This requires a distinction
to be drawn between tube hematocrit $H_T$ (the volume
fraction of red cells in a tube flow) and discharge hematocrit
$H_D$ (the cross-sectionally averaged cell concentration
weighted by the axial flow speed $u(r)$). These
quantities differ because cells move away from the vessel
wall into faster-moving parts of the flow through
hydrodynamic interactions that are influenced by cell
deformability. This has some important consequences
beyond increasing $H_D$ relative to $H_T$: the peripheral cell-free
layer lubricates the core and lowers the effective
viscosity of the suspension (estimated experimentally
by measuring $\pi a^4 \Delta p/(8LQ)$, following (1)); and, at vessel
bifurcations, plasma rather than cells may be preferentially
drawn into side branches down which there
is a weak flow (so-called plasma skimming).
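
The difference between the two hematocrits can be illustrated with a
toy calculation. The Python sketch below is a rough illustration under
strong assumptions that are not made in the text: the velocity profile
is taken to be parabolic, as in (1), and the cells are taken to be
uniformly concentrated in a central core, with a cell-free layer next
to the wall. The function name and its parameters are hypothetical.

```python
def hematocrits(core_fraction, core_hematocrit, n=10_000):
    """Return (tube hematocrit, discharge hematocrit) for a toy tube flow.

    Assumptions (not from the text): velocity proportional to 1 - s^2,
    where s = r/a, and a uniform cell concentration core_hematocrit
    for s < core_fraction, with no cells outside the core.
    """
    h = 1.0 / n
    cells = 0.0       # integral of concentration over the cross section
    cell_flux = 0.0   # integral of concentration times velocity
    area = 0.0        # integral of 1 over the cross section
    flux = 0.0        # integral of velocity over the cross section
    for i in range(n):
        s = (i + 0.5) * h
        weight = 2.0 * s * h          # area element; factors of pi a^2 cancel
        u = 1.0 - s * s
        conc = core_hematocrit if s < core_fraction else 0.0
        cells += conc * weight
        cell_flux += conc * u * weight
        area += weight
        flux += u * weight
    return cells / area, cell_flux / flux


if __name__ == "__main__":
    tube, discharge = hematocrits(core_fraction=0.9, core_hematocrit=0.5)
    print(f"tube hematocrit = {tube:.4f}, discharge hematocrit = {discharge:.4f}")
```

For this toy profile the ratio of discharge to tube hematocrit is
$2 - \lambda^2$, where $\lambda$ is the core radius divided by the tube
radius, so the discharge hematocrit exceeds the tube hematocrit
whenever a cell-free layer is present. Real velocity profiles in small
vessels are blunted, so the numbers are indicative only.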

The reduction in effective viscosity is of great impor- 
tance in ensuring that blood can pass through the 
smallest capillaries without requiring enormous pres- 
sure drops to push them through and without generat- 
ing huge shear stresses on endothelial cells. Red cells 
are sufficiently deformable that they can enter capillar- 
ies as small as 3 microns in diameter. Effects such as 
the dependence of effective viscosity on hematocrit and 
tube diameter have been carefully measured empiri- 
cally, but for many years these resisted a full theoretical 
explanation. However, the problem of simulating the 
motion of multiple deformable cells within a tube has 
recently become computationally tractable, enabling 
“bottom-up” predictions to be made of the properties 
of whole blood starting from the properties of its com- 
ponents. This is an important precursor to understand- 
ing interactions with other cells, such as neutrophils 
that adhere transiently to the walls of blood vessels as 
part of the inflammatory response and platelets that 
form aggregates during the clotting process following 
a wound. 

4 Tissue Growth 

Growth of biological tissues takes a variety of forms. 
Soft tissues, such as those in a developing animal 
embryo, are formed of clumps or sheets of cells. The 
cells can expand (by increasing their volume), divide 
(duplicating genetic material in the cell nucleus and 
introducing new cell boundaries), and reorganize (by 
changing their neighbors). The process of proliferation 
(increase of cell number) can be accompanied by dif- 
ferentiation (in biological terminology, this describes 
the specification of the cell's type and that of its 
progeny). These processes are often exquisitely con- 
trolled in order that the embryo develops the right 
organs in the right place at the right time. Growth is 
therefore intrinsically coupled to a host of compet- 
ing biological processes that “pattern” the developing 


organism. In addition to biochemical signals, cells have 
the capacity to respond to mechanical signals, enabling 
growth-induced stresses to feed back on further devel- 
opment. The growing organism is necessarily shaped 
by the physical forces it experiences, as was famously 
recognized a century ago by D’Arcy Thompson. 

Even when fully grown, a biological organism oper- 
ates in a state of homeostasis, with its tissues under- 
going continual replacement and renewal. This is par- 
ticularly striking in epithelia, the mucosal interface 
between the body and the external environment. For 
example, in the face of an onslaught of toxins, new 
cells in the lining of the gut are continually produced 
and removed. This rapid cell turnover predisposes such 
tissues to mutations that may lead to cancer. Tissues 
with structural properties, such as bone, muscle, or the 
walls of blood vessels, also undergo turnover, but at 
a slower rate and in a manner that responds to the 
biomechanical environment. Remodeling of bone has 
been described by Wolff’s “law” of 1872, whereby the 
architecture of the trabecular matrix that constitutes 
bone aligns with the principal axes of the local stress 
tensor. 

The structural rigidity of plant cells is provided pri- 
marily by a stiff cell wall. The primary cell wall is made 
of a pectin matrix, reinforced by fibers made of cel- 
lulose. When the fibers are uniformly aligned within 
the wall, it becomes mechanically anisotropic, being 
stiff in the direction parallel to the fibers but relatively 
soft in the perpendicular direction. A green plant is 
able to support itself against gravity and wind stresses 
by exploiting osmosis. By concentrating intracellular 
solutes, the cells draw water into intracellular vacuoles, 
pressurizing them (to levels comparable to a car tire) 
and generating tensions in the stiff cell walls. Then, by 
ensuring that cells adhere tightly to their neighbors, the 
rigidity of individual cells is conferred on the tissue as a 
whole. Even then, rapid growth is still possible. By reg- 
ulated softening of cell walls, cells can elongate (in the 
direction orthogonal to the embedded fibers), driven 
by the high intracellular pressures. In order to grow to 
large heights, or to survive hostile environments, plants 
may develop a stiffer secondary cell wall, reinforced by
lignin. Growth of a woody structure (such as thickening
of a tree trunk or branch) occurs near its periphery
in a thin layer beneath the bark known as the cambium.
A horizontal branch can remodel itself to support
its increasing weight by laying down new material
known as reaction wood, either in tension within its
upper surface or in compression on its lower surface. 





These are manifestations of residual stress, a com- 
mon feature of growing tissues. This is revealed when 
a piece of tissue is excised. A segment of artery, sliced 
axially along one wall, will spring open from an O- 
shaped to a C-shaped cross section. Similarly, a trans- 
verse slice across the stem of a green plant may cause 
the inner tissues to swell more than the periphery. One 
cause of such behavior is differential growth, for exam- 
ple, the inner part of the stem seeks to elongate more 
than the outer tissue. However, in the intact tissue, 
such differences are hidden by the requirement that 
the tissue remains continuous. 

To model growth and residual stress at a continuum 
level, a popular formulation involves decomposition of 
the deformation gradient tensor F into the product 
of two tensors as $F = AG$. (Here, the deformation is
described via the map $X \to x$, where $X$ labels points
of the material in a reference configuration, while $x$
denotes the location of such points in space; $F$ satisfies
$F_{ij} = \partial x_i/\partial X_j$.) Growth is prescribed through the
tensor G, which maps the reference configuration to 
an intermediate state. This state may violate compati- 
bility conditions (e.g., two different parts of the body 
might map to the same location under G). However, a 
further deformation (A) is then required, mapping the 
intermediate state to the final state that accommodates 
both growth and the physical constraints placed on the 
body. This deformation might then be chosen to satisfy 
conditions of nonlinear elasticity. Growth that evolves 
in time can then be described by allowing G to vary with 
time. 

While residual stress may be hard to observe or mea- 
sure without an invasive procedure, differential growth 
can often have a striking effect on the morphology of 
growing structures, particularly when they are thin in 
one or more directions. Elongated plant organs such 
as roots or shoots display a variety of tropisms (bend- 
ing or twisting responses arising from signals such as 
light, gravity, or a nutrient). This can be achieved by 
ensuring that the cells on one side of the organ elon- 
gate slightly faster than those on the other, generating 
a bend. This enables a root penetrating a hard soil to 
navigate into the soft gaps between stones, as part of 
a thigmotropic (touch-sensitive) response, for example. 
While plant growth generally involves irreversible (vis- 
cous) deformations of plant cell walls, reversible defor- 
mations can also arise; in hygromorphic structures, 
adjacent layers of material elongate or contract differ- 
ently in response to changes in humidity. This allows 


pine cones or pollen-bearing anthers to open or close 
within minutes. 

Sheet-like structures such as leaves provide dramatic 
demonstrations of how nonuniform in-plane growth 
can generate exotic shapes. For example, the folds at 
the edge of a lettuce leaf can exhibit a cascade of 
wrinkles: just as in a pleated curtain, crinkly short- 
wavelength folds at the top give way to smoother 
longer-wavelength folds at the bottom. Thin objects are 
much easier to bend or twist than to stretch. Thus a 
leaf that undergoes growth primarily near its edge will 
tend to adopt configurations in which stretching is min- 
imized, which is achieved through out-of-plane wrin- 
kling. The separation of length scales in a thin sheet 
(with thickness much less than width) allows the sheet 
to be described using the Föppl-von Kármán equations
of shell theory (an eighth-order nonlinear PDE system), 
modified to account for nonuniform growth. Instead of 
the tensor G, the growth pattern at the edge of a leaf 
can be prescribed through a non-Euclidean metric of 
the form 

$$ds^2 = (1 + g(y))^2\,dx^2 + dy^2,$$

where x measures distance along the leaf edge in an
initially undeformed planar configuration, and y measures
distance normal to the leaf edge, the leaf lying
in $y \geq 0$. Here, $g(y)$ is assumed to be nonnegative
and to approach zero as y increases away from the
leaf edge at $y = 0$. The function g therefore defines
the Gaussian curvature K associated with the metric
($K \approx -g''(y) \leq 0$, for small g); recall that, because
$K \neq 0$, both principal curvatures of the surface must be
nonzero. However, from Gauss’s Theorema Egregium, 
K is invariant when the surface undergoes isometric 
deformations. The shape of a leaf, which may exhibit a 
self-similar cascade of wrinkles, can therefore be inter- 
preted as an embedding of the surface in three dimen- 
sions in which there is no (or at least minimal) stretch- 
ing. The out-of-plane deformation relieves the residual 
stress that would accumulate were the leaf confined to 
a plane. 
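
The link between the growth function and the curvature can be checked
numerically. For a metric of the form $ds^2 = f(y)^2\,dx^2 + dy^2$ the
Gaussian curvature is $K = -f''(y)/f(y)$, which with $f = 1 + g$
reduces to $K \approx -g''(y)$ when g is small. The Python sketch
below evaluates this with a finite difference. The exponential growth
profile and its parameters are assumptions chosen only for
illustration.

```python
import math


def growth(y, g0=0.01, ell=1.0):
    """Assumed growth profile g(y) = g0 exp(-y / ell).

    It is nonnegative and decays away from the leaf edge at y = 0.
    """
    return g0 * math.exp(-y / ell)


def gaussian_curvature(y, g=growth, h=1.0e-3):
    """Gaussian curvature K = -f''(y) / f(y) with f = 1 + g.

    This is the curvature of the metric ds^2 = f(y)^2 dx^2 + dy^2;
    f'' is estimated with a central difference of step h.
    """
    def f(t):
        return 1.0 + g(t)

    f_yy = (f(y + h) - 2.0 * f(y) + f(y - h)) / (h * h)
    return -f_yy / f(y)


if __name__ == "__main__":
    for y in (0.5, 1.0, 2.0):
        # With ell = 1 the second derivative g''(y) equals g(y).
        print(f"y = {y}: K = {gaussian_curvature(y):.6e}, -g'' = {-growth(y):.6e}")
```

The computed curvature is negative everywhere for this profile, which
is why the surface cannot remain flat without stretching.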

5 Swimming Microorganisms 

Single-celled microorganisms constitute a major pro- 
portion of global biomass. They occupy diverse habi- 
tats, from algae in deep oceans to acid-tolerant bacteria 
in our stomachs. For such organisms, locomotion can 
be essential in securing nutrients, given the limitations 
of diffusion as a transport mechanism. For organisms 
that are just a few microns in length, propulsion in a
liquid environment is a battle against friction, inertia
being negligible at very small length scales (the rele- 
vant Reynolds number [IV.28 §3.2] is very small). This 
demands specific swimming strategies. The cells are 
equipped with protruding appendages (cilia or flagel- 
lae). These may be driven in a wavelike fashion (such 
as synchronized waves in arrays of beating cilia on the 
alga Volvox), as a simple “breaststroke” (by a pair of 
flagellae on the alga Chlamydomonas nivalis), or as a 
helical propeller (Escherichia coli uses a bundle of flagellae
that are driven by rotary motors embedded in
its cell membrane). To generate a net forward motion, 
the appendages must be driven in nonreversible paths. 
This is a consequence of the linearity of the Stokes 
equations (the Navier-Stokes equations without the 
nonlinear inertial terms), from which it can be proved 
that reversible boundary motions will generate cyclic 
motion with no net drift. 

The flagellar corkscrew of Escherichia coli satisfies
this constraint admirably. However, the bacterium
must also be able to direct itself toward a nutrient. It 
achieves this by interspersing “runs” with “tumbles,” 
during which it reverses the rotation of one of the 
motors driving the flagellar bundle. These uncoil and 
the bacterium rotates to a new random orientation 
prior to the next run. The bacterium can bias its motion 
by prolonging the duration of runs in a favored direc- 
tion. (Given the difficulty of measuring a spatial gra- 
dient over a short body length, the microswimmers 
are likely to measure concentration differences as they 
move from one region to another.) Thus Escherichia 
coli undertakes chemotaxis (seeking out a nutrient) 
via a form of biased random walk [II.25]. Models
of such processes must often be based on empiri- 
cal descriptions of observed biological behavior rather 
than established physical principles. 
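
A run-and-tumble strategy of this kind can be mimicked by a very
simple simulation. The Python sketch below is a one-dimensional
caricature, not a model taken from the text: the swimmer moves one
step per time unit, tumbles with a probability that depends on whether
it is heading up or down the gradient, and picks a new direction at
random after each tumble. The function name and the tumble
probabilities are assumptions.

```python
import random


def run_and_tumble(n_steps, p_tumble_up, p_tumble_down, seed=0):
    """Final position of a one-dimensional run-and-tumble walker.

    p_tumble_up is the tumble probability per step while moving in the
    favored (+) direction; p_tumble_down applies while moving in the
    (-) direction. Prolonged runs in the favored direction correspond
    to p_tumble_up < p_tumble_down.
    """
    rng = random.Random(seed)
    x = 0
    direction = rng.choice((-1, 1))
    for _ in range(n_steps):
        x += direction
        p = p_tumble_up if direction > 0 else p_tumble_down
        if rng.random() < p:
            # A tumble: choose a new direction at random.
            direction = rng.choice((-1, 1))
    return x


if __name__ == "__main__":
    print("biased walk:  ", run_and_tumble(20_000, 0.1, 0.5, seed=1))
    print("unbiased walk:", run_and_tumble(20_000, 0.3, 0.3, seed=1))
```

With unequal tumble probabilities the walker drifts steadily in the
favored direction even though each new orientation is random, which is
the essence of the biased random walk.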

Other species follow gradients in light (phototaxis) 
or oxygen (oxytaxis). In a dense suspension of the 
bacterium Bacillus subtilis beneath an air-liquid inter- 
face, accumulation of bacteria in the oxygen-rich region 
immediately beneath the interface will increase the
density of the suspension relative to deeper regions 
of the liquid. The resulting heavy-over-light arrange- 
ment is unstable and the excess weight of the cell-rich 
liquid drives an instability in which dense plumes of 
cells will fall downward. This continual overturning of 
liquid is known as bioconvection and has many analo- 
gies with thermal or solutal convection in a liquid. It is a 
striking example of the generation of a large-scale pat- 
tern via the collective motion of individual microscopic 


swimmers. Mathematical models describing this pro- 
cess can involve coupled advection-diffusion PDEs for 
two scalar fields (the number of cells per unit volume 
n(x, t) and the oxygen concentration c(x, t)) coupled 
to the Navier-Stokes equation for the average fluid 
velocity $u(x, t)$, subject to the incompressibility constraint
$\nabla \cdot u = 0$. The cells are advected by the flow
and move chemotactically toward the oxygen or light 
source; the buoyancy of the cells acts as a body force 
on the fluid, driving convective motions. 

Algae such as Chlamydomonas nivalis exhibit gyro- 
taxis. These cells are bottom-heavy, and so experience 
a gravitational torque that aligns them head up. Thus, 
when they swim, propelled by their two flagellae, they 
move upward on average. However, in a region of shear 
(e.g., a vertical flow that varies horizontally), the cells 
experience a viscous torque that will rotate them away 
from the vertical. This leads to some striking phenom- 
ena, such as accumulation at the center of a pipe in 
a downward Poiseuille flow. Again, collective biocon- 
vective phenomena can then emerge. In this case, con- 
tinuum models track the mean swimming direction, 
represented by a unit vector p(x, t). A Fokker-Planck 
equation for the probability density Q (x,p,t) (a dis- 
tribution over position and orientation) can be used 
to account for reorientation of cells by hydrodynamic 
torques and rotational diffusion, which is coupled to 
transport equations for n and u. 

Escherichia coli propel themselves from the rear, with 
the propulsive force being balanced by the resistive 
drag toward the front of the body. When viewed from 
afar, the generated flow can be approximated by that 
of a force dipole (point forces directed in opposite out- 
ward directions) in an arrangement known as a pusher 
(figure 2). The two forces, pushing fluid away from 
the body parallel to its axis, draw fluid inward from 
the sides. In contrast, Chlamydomonas nivalis is pro- 
pelled by flagellae at the front of the body, with the 
drag arising behind. Viewed from afar, these are therefore
pullers; the force dipole pushes fluid out sideways
from the body. In a straining flow that tends to orient
the swimmer in a particular direction, pushers will
generate a flow that reinforces their orientation but 
pullers will do the opposite. A suspension of pushers 
is therefore more likely to exhibit large-scale collective 
motion known as bacterial turbulence, driven by the 
chemical energy that powers the swimming organisms. 
This has motivated the development of model equa- 
tions that draw on earlier studies of flocking behavior, 
including extensions of the Navier-Stokes equations

Figure 2 Two classes of swimming microorganisms: (a) a 
pusher and (b) a puller. Short arrows show the force dipole, 
solid arrows show the flow induced by swimming, and 
dashed arrows show an external straining flow that orients 
the swimmer. 

that incorporate additional nonlinear and higher-order 
terms that capture a variety of spatiotemporal patterns.

6 Coda 

This brief summary of mathematical modeling in 
biomechanics has necessarily been selective. The in- 
creasingly vigorous engagement of mathematicians 
with biologists, bioengineers, and biophysicists is yield- 
ing significant advances in many underexplored areas, 
generating novel mathematical questions alongside 
new insights into biomechanical processes. Interac- 
tions between mathematics and computation have been 
necessary and effective, particularly in modeling the 
interaction of diverse processes spanning disparate 
length scales. However, major challenges remain in 
learning how to cope with sparse data, natural variability,
and disorder; these challenges will require increasing
use of statistical and probabilistic approaches.

1 Introduction

Mathematics and physiology have enjoyed a long his- 
tory of interaction. One aspect in which mathematics 
has proved useful is in the search for general principles 
in physiology, which involves organizing and describ- 
ing the large amount of data available in more compre- 
hensible ways. Mathematics is also helpful in the search 
for emergent properties, that is, in the identification of 
features of a collection of components that are not fea- 
tures of the individual components that make up the 
collection. This article considers a small selected set 
of mathematical models in physiology, showing how 
physiological problems can be formulated and stud- 
ied mathematically and how such models give rise to 
interesting and challenging mathematical questions. 

2 Cellular Physiology 

The cell is the basic structural and functional unit of a 
living organism. Collectively, cells perform numerous 
functions to sustain life. Those functions are accom- 
plished through the biochemical reactions that take 
place within the cell. 

2.1 Biochemical Reactions 

Chemical reactions are “governed” by the law of mass 
action, which describes the rate at which chemicals 
interact to form products. Suppose that two chemicals 
A and B react to form C: 

$$A + B \xrightarrow{k} C.$$

The rate of formation of the product C is proportional
to the product of the concentrations of A and B, as well
as the rate constant k, i.e.,

$$\frac{d[C]}{dt} = k[A][B]. \tag{1}$$

For a reversible reaction

$$A + B \underset{k_-}{\overset{k_+}{\rightleftharpoons}} C,$$

where $k_+$ and $k_-$ denote the forward and reverse rate
constants of reaction, the rate of change of [A] is given
by

$$\frac{d[A]}{dt} = k_-[C] - k_+[A][B].$$

If $k_- = 0$, then the above reaction reduces to (1) with
$k = k_+$ and $d[A]/dt = -d[C]/dt$ (from the reaction,
the rate of consumption of A is equal to the rate of
production of C).
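
As a concrete illustration of the law of mass action, the Python
sketch below integrates the reversible reaction with a forward-Euler
scheme. The function name, the time step, and the rate constants and
initial concentrations in the usage example are assumptions made only
for illustration.

```python
def simulate_reversible(a0, b0, c0, k_plus, k_minus, t_end, dt=1.0e-3):
    """Forward-Euler integration of A + B <-> C under mass action.

    Returns the concentrations ([A], [B], [C]) at time t_end.
    """
    a, b, c = a0, b0, c0
    for _ in range(round(t_end / dt)):
        # Net rate of formation of C: forward minus reverse reaction.
        rate = k_plus * a * b - k_minus * c
        a -= rate * dt
        b -= rate * dt
        c += rate * dt
    return a, b, c


if __name__ == "__main__":
    a, b, c = simulate_reversible(1.0, 2.0, 0.0, k_plus=1.0, k_minus=0.5, t_end=20.0)
    print(f"[A] = {a:.4f}, [B] = {b:.4f}, [C] = {c:.4f}")
    print(f"forward rate = {1.0 * a * b:.6f}, reverse rate = {0.5 * c:.6f}")
```

At long times the forward and reverse rates balance, and the sum
[A] + [C] stays constant throughout because each molecule of A
consumed produces one molecule of C.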

Some important biochemical reactions are catalyzed 
by enzymes, many of which are proteins that help con- 
vert substrates into products but themselves emerge 
from the reactions unchanged. Consider the following 
simple reaction scheme, where an enzyme E converts 
a substrate A into the product B through a two-step 
process: 

$$A + E \underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}} C \xrightarrow{k_2} B + E.$$

First, E combines with A to form a complex C, which 
then breaks down into the product B and releases E. 
This model is known as Michaelis-Menten kinetics. 

We will determine the reaction rate, given by the rate 
at which the product is formed (i.e., d[B]/dt), by per- 
forming an equilibrium approximation. Let us assume 
that the substrate A is in instantaneous equilibrium 
with the complex C; thus, 

$$k_1[A][E] = k_{-1}[C]. \tag{2}$$

Note that the total amount of enzyme is unchanged,
i.e., $[E] + [C] = E_0$, where $E_0$ is a constant. Substituting
this relation into (2) and rearranging yields

$$[C] = \frac{E_0[A]}{K_m + [A]},$$

where $K_m = k_{-1}/k_1$ is known as the Michaelis constant,
which is the substrate concentration at which the reaction
rate is half of its maximum. The overall reaction
rate is given by

$$\frac{d[B]}{dt} = k_2[C] = \frac{V_{\max}[A]}{K_m + [A]},$$

where $V_{\max} = k_2 E_0$ is the maximum reaction rate.
The above relation describes Michaelis-Menten kinetics,
which is one of the simplest and best-known models
of enzyme kinetics, named after biochemist Leonor
Michaelis (1875-1949) and Canadian physician Maud
Menten (1879-1960). A sample reaction curve is shown
in figure 1.
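
The rate law is simple to evaluate directly. The Python sketch below
uses the parameter values quoted in the caption of figure 1
($V_{\max} = 3.5$ and $K_m = 0.5$); the function names and the
individual rate constants in the consistency check are assumptions
chosen so that they reproduce those two values.

```python
def complex_concentration(a, e0, k1, k_minus1):
    """Equilibrium approximation [C] = E0 [A] / (Km + [A]), with Km = k_{-1} / k_1."""
    k_m = k_minus1 / k1
    return e0 * a / (k_m + a)


def michaelis_menten_rate(a, v_max, k_m):
    """Reaction rate d[B]/dt = Vmax [A] / (Km + [A])."""
    return v_max * a / (k_m + a)


V_MAX = 3.5   # maximum reaction rate, from the caption of figure 1
K_M = 0.5     # Michaelis constant, from the caption of figure 1

if __name__ == "__main__":
    for a in (0.1, 0.5, 2.0, 10.0):
        print(f"[A] = {a:5.1f}: rate = {michaelis_menten_rate(a, V_MAX, K_M):.4f}")
```

At a substrate concentration equal to $K_m$ the rate is exactly half of
$V_{\max}$, and it approaches $V_{\max}$ as the substrate concentration
grows.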


2.2 Membrane Ion Channels 

All animal cells are surrounded by a membrane com- 
posed of a lipid bilayer with proteins embedded in it. 



Figure 1 An example Michaelis-Menten reaction rate curve,
showing reaction rate as a function of substrate concentration,
with parameters $V_{\max} = 3.5$ and $K_m = 0.5$.


Found on most cellular membranes are ion channels, 
which are macromolecular pores that passively carry 
specific ions across the membrane. One of the driving
forces for transport is the ionic concentration gradient.
Another is the transmembrane potential difference, 
which is the difference in electrical potential between 
the interior and exterior of the cell. 

Ion channels are characterized by their current- 
voltage relationship, the parameters of which depend 
on the biophysical properties of the channel. The cur- 
rent I across a population of N channels can be written 
as 

$$I = N P_0(V, t)\, i(V, t),$$

where $P_0$ is the fraction of open channels at time t
($0 \leq P_0 \leq 1$), and i is the current across a single
open channel. Note that the above expression assumes
that both are functions of the transmembrane potential
difference V. Below we describe common models
for $i(V, t)$ and, subsequently, for $P_0(V, t)$.

The two most common models of current-voltage 
relationships are the modified form of Ohm’s law and 
the Goldman-Hodgkin-Katz equation. The modified 
Ohm’s law states that the flow of ions S across the 
membrane is a linear function of the difference between 
the membrane potential and the equilibrium (or Nernst)
potential E of the solute S. It can be written as

$$i_S = g_S(V - E),$$

where $g_S$ is the channel conductance, which is not necessarily
constant. Indeed, it can vary with concentration,
voltage, or other factors.

As an example, the conductance of inward rectify- 
ing potassium channels (Kir), which play an important 
role in maintaining the resting membrane potential, is 
a function of V. The activity of Kir channels is also 
modulated by the extracellular K$^+$ concentration ($C_K^e$).
Experimentally, the Kir conductance is often found to
increase with the square root of $C_K^e$. Thus, the current
across a Kir channel, denoted by $i_{\mathrm{Kir}}$, can be expressed
as

$$i_{\mathrm{Kir}} = g_{\mathrm{Kir}}(V - E_K), \qquad g_{\mathrm{Kir}} = \frac{\bar{g}_{\mathrm{Kir}}\sqrt{C_K^e}}{1 + \exp((V - V_{\mathrm{Kir}})/k_{\mathrm{Kir}})},$$

where $\bar{g}_{\mathrm{Kir}}$ is a constant, $V_{\mathrm{Kir}}$ is the half-activation
potential, and $k_{\mathrm{Kir}}$ is a slope. The dependence of $g_{\mathrm{Kir}}$
on the membrane potential stems from the presence 
of gating charges. 

Other channels are better represented by the Gold- 
man-Hodgkin-Katz current equation, which can be 
obtained by integrating the Nernst-Planck equation 
assuming a constant electric field across the membrane.
The Nernst-Planck equation gives the flux of ion
S when both concentration and electrical gradients are
present:

$$J_S = -D_S\left(\nabla C_S + \frac{z_S F}{RT}\, C_S \nabla\psi\right).$$

The first term corresponds to Fick’s law, a constitutive
equation that describes diffusion driven by a concentration
gradient; $D_S$ is the diffusivity of S. The second
term corresponds to diffusion due to the electric field;
the electric force exerted on a charged solute increases
linearly with the electric field ($E = -\nabla\psi$). To derive the
Goldman-Hodgkin-Katz equation for the simple one-dimensional
case, we assume that the electrical field is
constant across the membrane (i.e., $d\psi/dx = -V/L_m$,
where $L_m$ denotes the length of the membrane) and
obtain

$$J_S = -D_S\left(\frac{dC_S}{dx} - \frac{\xi}{L_m} C_S\right),$$

where $\xi = z_S F V/RT$. Since $J_S$ is constant, the above
equation can be transformed into

$$\frac{dC_S}{dx} - \frac{\xi}{L_m} C_S = -\frac{J_S}{D_S}$$

and then integrated to yield $C_S(x)$:

$$C_S(x) = a\exp\left(\frac{\xi x}{L_m}\right) + \frac{J_S L_m}{\xi D_S}.$$

To determine the integration constant a and the
unknown $J_S$, we use the two boundary conditions

$$C_S(0) = a + \frac{J_S L_m}{\xi D_S} = C_S^i,$$

$$C_S(L_m) = a\exp(\xi) + \frac{J_S L_m}{\xi D_S} = C_S^e,$$

where $C_S^i$ and $C_S^e$ denote the solute concentrations
on the two sides of the membrane. These boundary
conditions yield the Goldman-Hodgkin-Katz current
equation

$$a = \frac{C_S^i - C_S^e}{1 - \exp(\xi)},$$

$$J_S = \frac{D_S}{L_m}\,\frac{z_S F V}{RT}\,\frac{C_S^i - C_S^e\exp(-z_S F V/RT)}{1 - \exp(-z_S F V/RT)},$$

where the definition of $\xi$ has been applied.

From the Goldman-Hodgkin-Katz equation one can
obtain the Nernst potential, which yields zero flux when
the transmembrane concentration gradient is nonzero,
by setting $J_S$ to zero and solving for V:

$$V = \frac{RT}{z_S F}\ln\left(\frac{C_S^e}{C_S^i}\right).$$
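
The relation between the Goldman-Hodgkin-Katz flux and the Nernst
potential can be verified numerically. The Python sketch below
implements the standard forms
$J_S = (D_S/L_m)\,\xi\,(C^i - C^e e^{-\xi})/(1 - e^{-\xi})$, with
$\xi = z_S F V/RT$, and $V = (RT/z_S F)\ln(C^e/C^i)$. The function
names, the temperature of 310 K, and the potassium concentrations in
the usage example are assumptions chosen for illustration; the labels
"in" and "out" for the two sides of the membrane are also an assumed
convention.

```python
import math

R = 8.314462618    # gas constant, J / (mol K)
F = 96485.33212    # Faraday constant, C / mol


def ghk_flux(v, c_in, c_out, z, diffusivity, l_m, temperature=310.0):
    """Goldman-Hodgkin-Katz flux across a membrane of length l_m.

    v is the transmembrane potential in volts; a positive flux is
    directed from the c_in side to the c_out side.
    """
    xi = z * F * v / (R * temperature)
    if abs(xi) < 1.0e-9:
        # Limit of vanishing potential: pure Fickian diffusion.
        return diffusivity / l_m * (c_in - c_out)
    return (diffusivity / l_m) * xi * (c_in - c_out * math.exp(-xi)) / (1.0 - math.exp(-xi))


def nernst_potential(c_in, c_out, z, temperature=310.0):
    """Potential (in volts) at which the Goldman-Hodgkin-Katz flux vanishes."""
    return R * temperature / (z * F) * math.log(c_out / c_in)


if __name__ == "__main__":
    # Assumed, typical-looking potassium concentrations in mM.
    e_k = nernst_potential(c_in=140.0, c_out=5.0, z=1)
    print(f"Nernst potential = {1000.0 * e_k:.1f} mV")
    print(f"flux at that potential = {ghk_flux(e_k, 140.0, 5.0, 1, 1.0, 1.0):.2e}")
```

Evaluating the flux at the Nernst potential returns zero up to
rounding error, while at zero potential the flux reduces to the
Fickian value proportional to the concentration difference.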



As noted above, the current flowing across a popula- 
tion of channels is proportional to the number of open 
channels ($N P_0$). In excitable tissues such as smooth
muscles, channels open in response to stimuli, such as
adenosine-5'-triphosphate (a nucleoside triphosphate
used in cells as a coenzyme, often called the “molecular
unit of currency” of intracellular energy transfer),
Ca$^{2+}$ concentration, or voltage. The simplest channel
model assumes that the channel is either in the closed 
state C or in the open state O,

$$C \underset{\beta}{\overset{\alpha}{\rightleftharpoons}} O,$$

where $\alpha$ and $\beta$ denote the rates of conversion from one
state to the other. If n is the fraction of channels in the
open state, then $1 - n$ is the fraction of channels in the
closed state and we have

$$\frac{dn}{dt} = \alpha(1 - n) - \beta n = \alpha - (\alpha + \beta)n.$$

The characteristic time of this equation is $\tau = 1/(\alpha + \beta)$,
and the steady-state value of n is $n_\infty = \alpha/(\alpha + \beta)$. The
equation can be rewritten in terms of $\tau$:

$$\frac{dn}{dt} = \frac{n_\infty - n}{\tau}.$$

In general, $n_\infty$ and $\tau$ are voltage or concentration
dependent, so the above equation cannot be solved analytically.
Voltage-sensitive channels respond to electric
potential variations; their voltage sensors consist of
“gating” charges that move when V is altered, thereby
resulting in a conformational change. For these channels,
$n_\infty$ can be described by a Boltzmann distribution:

$$n_\infty = \frac{1}{1 + k_0\exp(bFV/RT)},$$

where b is a constant related to the number of gating
charges on the channel and the distance over which 
they move during a conformational change. 
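
When the two rates are held constant the gating equation does have a
closed-form solution,
$n(t) = n_\infty + (n(0) - n_\infty)\exp(-t/\tau)$, which provides a
convenient check on a numerical integrator before it is applied to
voltage-dependent rates. The Python sketch below compares a
forward-Euler solution with this expression. The function names, the
time step, and the rate values in the usage example are assumptions.

```python
import math


def gating_parameters(alpha, beta):
    """Return (n_inf, tau) for the two-state channel model."""
    return alpha / (alpha + beta), 1.0 / (alpha + beta)


def n_exact(t, n0, alpha, beta):
    """Closed-form solution for constant rates alpha and beta."""
    n_inf, tau = gating_parameters(alpha, beta)
    return n_inf + (n0 - n_inf) * math.exp(-t / tau)


def n_euler(t_end, n0, alpha, beta, dt=1.0e-4):
    """Forward-Euler integration of dn/dt = alpha (1 - n) - beta n."""
    n = n0
    for _ in range(round(t_end / dt)):
        n += dt * (alpha * (1.0 - n) - beta * n)
    return n


if __name__ == "__main__":
    alpha, beta = 2.0, 3.0   # assumed constant opening and closing rates
    n_inf, tau = gating_parameters(alpha, beta)
    print(f"n_inf = {n_inf:.3f}, tau = {tau:.3f}")
    print(f"Euler: {n_euler(1.0, 0.0, alpha, beta):.6f}")
    print(f"exact: {n_exact(1.0, 0.0, alpha, beta):.6f}")
```

The same Euler loop can be reused with rates that are recomputed from
the voltage at every step, which is the situation in which no
closed-form solution is available.
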

3 Excitability


Membrane potential is used as a signal in cells such 
as muscle cells and neurons. Some of these cells are 
excitable. If a sufficiently strong current is applied, 
the membrane potential goes through a large excur- 
sion known as an action potential before eventually 
returning to rest. We will study one example of cellular 
excitability: the Morris-Lecar model. 

The Morris-Lecar model is a two-dimensional “re- 
duced” excitation model (reduced from the four-di- 
mensional Hodgkin-Huxley model) that is applicable to 
systems having two noninactivating voltage-sensitive 
conductances. The model grew out of an experimen- 
tal study of the excitability of the giant muscle fiber 
of the huge Pacific barnacle, Balanus nubilus. Synaptic 
depolarization of these muscle cells leads to the opening
of Ca$^{2+}$ channels, allowing external Ca$^{2+}$ ions to
enter deep into the cell interior to activate muscle contraction.
The Morris-Lecar model describes this membrane
with two conductances, Ca$^{2+}$ and K$^+$, the interplay
of which yields qualitative phase portrait changes
with small changes in experimental parameters, such
as the relative densities of Ca$^{2+}$ and K$^+$ channels or
the relative relaxation times of the conducting systems.
This simple model is capable of simulating the entire 
panoply of (two-dimensional) oscillation phenomena 
that have been observed experimentally. Parameter 
maps in phase space can then be drawn to identify and 
classify the parametric regions having different types 
of stability. 

The model equations describe the membrane potential
($v$) and the fraction of open potassium channels
($n$):

$$C\frac{dv}{dt} = -g_L(v - V_L) - g_{Ca}\, m_\infty(v)(v - V_{Ca}) - g_K n(v - V_K), \tag{4}$$

$$\frac{dn}{dt} = \phi \cosh\left(\frac{v - V_3}{2V_4}\right)(n_\infty(v) - n). \tag{5}$$

The first term on the right-hand side of (4) is the leak
current, where $V_L$ denotes the associated Nernst reversal
potential. The second term is the calcium current,
where $g_{Ca}$ is the maximum whole-cell membrane conductance
for the calcium current. This calcium current
arises from the voltage-operated calcium channels,
where $m_\infty$ represents the fraction of open channel
states at equilibrium. Based on experimental data, $m_\infty$
is described as a function of membrane potential $v$ by

$$m_\infty(v) = 0.5\left(1 + \tanh\left(\frac{v - V_1}{V_2}\right)\right),$$


Table 1 Model parameters used in the bifurcation study.

    $C$         20 pF       $V_1$     -1.2 mV
    $g_L$       2 nS        $V_2$     18 mV
    $g_{Ca}$    4 nS        $V_K$     -85 mV
    $g_K$       8 nS        $V_3$     12
    $V_L$       -60 mV      $V_4$     17
    $V_{Ca}$    120 mV      $\phi$    0.06666667


where $V_1$ is the voltage at which half of the channels are
open and $V_2$ determines the spread of the distribution
of open calcium channels at steady state. For very negative
$v$, $\tanh((v - V_1)/V_2) \to -1$ and $m_\infty \to 0$, which
implies that almost all calcium channels are closed. For
large (more positive) $v$, $\tanh((v - V_1)/V_2) \to 1$ and
$m_\infty \to 1$, which implies that most calcium channels are
now open.

The third term in (4), $-g_K n(v - V_K)$, represents
the transmembrane potassium current induced by the
opening of potassium channels. In (5), $n_\infty$ denotes
the fraction of open K$^+$ channels at steady state; this
fraction depends on the membrane potential $v$ through

$$n_\infty(v) = 0.5\left(1 + \tanh\left(\frac{v - V_3}{V_4}\right)\right),$$

which has a form similar to the equilibrium distribution
of open Ca$^{2+}$ channel states ($m_\infty$). The potential
$V_3$ determines the voltage at which half of the potassium
channels are open, and $V_4$ is a measure of the
spread of the distribution of $n_\infty$.
Note also that both the potassium current (which we
denote by $I_K = -g_K n(v - V_K)$) and the calcium current
(denoted by $I_{Ca} = -g_{Ca} m_\infty(v)(v - V_{Ca})$) depend on
the membrane potential $v$, and both currents in turn
change the membrane potential. Parameters for this
model are given in table 1.

The model predicts drastically different behaviors
depending on the initial conditions. When the membrane
potential $v$ is initialized to $-20$ mV and $n$ is
initialized to 0, the voltage and the currents $I_K$ and
$I_{Ca}$ decay to rest (see figure 2). Qualitatively different
behaviors are predicted with $v(0) = -10$ mV: the voltage
rises substantially before decaying to rest (see figure 3(a)).
To gain insight, one may consider the currents,
which are shown in figure 3(b). The magnitude
of both currents increases before decaying to rest. An
interesting feature of this system is that the two currents
go in opposite directions: $I_K$ is outward-directed
and thus hyperpolarizes the cell, whereas $I_{Ca}$ is
inward-directed and depolarizes the cell. Because the calcium
channels are voltage gated, once the voltage crosses a
threshold there is a large inward-directed calcium current.
(This is therefore an example of an excitable cell.)
The current causes the voltage to increase, and eventually
the current slows down and reaches its peak. The
large outward potassium current then comes into play,
repolarizing the cell.

Figure 2 Solution to the Morris-Lecar model with $v$ initialized
to $-20$ mV: (a) voltage; (b) K$^+$ and Ca$^{2+}$ currents. The
voltage and current both decay to zero.
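
The two initial conditions discussed above can be explored with a short simulation. The sketch below integrates equations (4) and (5) by forward Euler using the values of table 1. It is an illustration rather than the method used to produce the figures: the time step, the simulation length, and the units assumed for $V_3$, $V_4$ (mV) and $\phi$ (1/ms) are choices made here, and the function names are hypothetical.

```python
import math

# Parameters from table 1 (pF, nS, mV, so that time is in ms).
# Assumption: V3 and V4 are in mV and PHI is in 1/ms.
C = 20.0
G_L, G_CA, G_K = 2.0, 4.0, 8.0
V_L, V_CA, V_K = -60.0, 120.0, -85.0
V1, V2, V3, V4 = -1.2, 18.0, 12.0, 17.0
PHI = 0.06666667

def m_inf(v):
    """Equilibrium fraction of open calcium channels."""
    return 0.5 * (1.0 + math.tanh((v - V1) / V2))

def n_inf(v):
    """Steady-state fraction of open potassium channels."""
    return 0.5 * (1.0 + math.tanh((v - V3) / V4))

def rhs(v, n):
    """Right-hand sides of equations (4) and (5): returns (dv/dt, dn/dt)."""
    dv = (-G_L * (v - V_L)
          - G_CA * m_inf(v) * (v - V_CA)
          - G_K * n * (v - V_K)) / C
    dn = PHI * math.cosh((v - V3) / (2.0 * V4)) * (n_inf(v) - n)
    return dv, dn

def simulate(v0, n0=0.0, t_end=200.0, dt=0.01):
    """Forward-Euler integration; returns the list of voltages."""
    v, n = v0, n0
    trace = [v]
    for _ in range(int(round(t_end / dt))):
        dv, dn = rhs(v, n)
        v, n = v + dt * dv, n + dt * dn
        trace.append(v)
    return trace

sub = simulate(-20.0)    # starts below threshold
supra = simulate(-10.0)  # starts above threshold
print("peak voltage, v(0) = -20 mV:", max(sub))
print("peak voltage, v(0) = -10 mV:", max(supra))
```

Comparing the two peak voltages shows the threshold behavior: from $-20$ mV the voltage never exceeds its starting value, whereas from $-10$ mV it first rises.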

To understand this threshold behavior for excitable
cells, one may study the nullclines of the system, which
are the curves where $dv/dt = 0$ and $dn/dt = 0$. The nullclines
are depicted in figure 4. The first thing to note
is that the two nullclines intersect at three equilibrium
points. If the system is initialized at those $v$ and $n$ values,
then the system will remain at equilibrium. The two
nullclines divide the $v$-$n$ space into several regions,
in which $v$ and $n$ are increasing or decreasing. The
two previous simulations that produce qualitatively
different behaviors correspond to initial $v$ and $n$ values
that lie in different regions. With $v(0) = -20$ mV
and $n(0) = 0$ (indicated by the asterisk in figure 4),
the system lies in the region where $dv/dt < 0$ and
$dn/dt > 0$, so the voltage decays to rest (see figure 2(a)),
whereas the gating variable $n$, after an initial increase,
also decays to 0. With $v(0) = -10$ mV (indicated by the
open circle in figure 4), however, the system lies in the
region where $dv/dt > 0$ and $dn/dt > 0$, so the voltage
rises to a peak before decaying to rest (see figure 3(a)).

Figure 3 Solution to the Morris-Lecar model with $v$ initialized
to $-10$ mV: (a) voltage; (b) K$^+$ and Ca$^{2+}$ currents. The
solution exhibits an initial bump (or crest) before decaying
to rest.
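
The equilibrium points are the intersections of the two nullclines. Since $dn/dt = 0$ exactly when $n = n_\infty(v)$, they can be located by substituting $n = n_\infty(v)$ into the right-hand side of (4) and searching for sign changes in $v$. The sketch below does this with a grid scan followed by bisection, using the values of table 1; the scan range, the grid spacing, and the function names are assumptions made for this illustration.

```python
import math

# Parameters from table 1 (nS and mV); V3 and V4 are assumed to be in mV.
G_L, G_CA, G_K = 2.0, 4.0, 8.0
V_L, V_CA, V_K = -60.0, 120.0, -85.0
V1, V2, V3, V4 = -1.2, 18.0, 12.0, 17.0

def m_inf(v):
    return 0.5 * (1.0 + math.tanh((v - V1) / V2))

def n_inf(v):
    return 0.5 * (1.0 + math.tanh((v - V3) / V4))

def residual(v):
    """C * dv/dt evaluated on the n-nullcline n = n_inf(v)."""
    return (-G_L * (v - V_L)
            - G_CA * m_inf(v) * (v - V_CA)
            - G_K * n_inf(v) * (v - V_K))

def find_equilibria(v_min=-80.0, v_max=60.0, step=0.5):
    """Scan for sign changes of the residual and refine each by bisection."""
    points = []
    for k in range(int(round((v_max - v_min) / step))):
        a = v_min + k * step
        b = a + step
        fa, fb = residual(a), residual(b)
        if fa * fb < 0.0:
            for _ in range(60):
                mid = 0.5 * (a + b)
                fm = residual(mid)
                if fa * fm <= 0.0:
                    b = mid
                else:
                    a, fa = mid, fm
            v_eq = 0.5 * (a + b)
            points.append((v_eq, n_inf(v_eq)))
    return points

equilibria = find_equilibria()
for v_eq, n_eq in equilibria:
    print("v = %.2f mV, n = %.4f" % (v_eq, n_eq))
```

A scan of this kind only finds roots at which the residual changes sign, which is sufficient here because the nullclines cross transversally.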

4 Kidneys 

A group of cells that have similar function or structure
may join together to form tissues. A group of tissues
may then collectively form an organ. A collection of
organs then forms organ systems, an example of which
is the excretory system.

Figure 4 The $v$- and $n$-nullclines divide the $v$-$n$ space
into regions where $v$ and $n$ are increasing or decreasing.

Figure 5 Schematic of a model glomerular capillary.

The kidneys are part of the excretory system. They 
serve a number of essential regulatory roles. Be- 
sides their well-known function as filters, removing 
metabolic wastes and toxins from the blood and excret- 
ing them through urine, the kidneys also serve other 
essential functions. Through a number of regulatory 
mechanisms, the kidneys help maintain the body’s 
water balance, electrolyte balance, and acid-base bal- 
ance. Additionally, the kidneys produce or activate hor- 
mones that are involved in erythrogenesis, calcium 
metabolism, and the regulation of blood flow. 

4.1 Glomerular Filtration

The first step in the formation of urine is the filtration 
of blood by glomerular capillaries. The filtrate is col- 
lected into a Bowman’s capsule and subsequently flows 
through the tubular system, where it undergoes major 
changes in volume and composition. 

To model the glomerular fluid filtration process, 
the glomerular capillaries are idealized as a network 
of identical, parallel capillaries with homogeneous 


properties along their entire length. The capillaries are
represented as rigid cylinders of radius $r$ and length $L$
(figure 5). Let P denote the plasma compartment and
let $Q_p$ denote the volumetric rate of plasma flow in a
capillary. At a given position $x$ along the capillary, a
portion of the flow can be reabsorbed into (or secreted
from) the surrounding medium via a transversal flux
across the capillary wall. Assuming that the capillaries
are rigid, there is no accumulation of fluid within the
plasma compartment. Thus, for a single capillary, the
conservation of fluid can be written as

$$\frac{dQ_p}{dx} = -2\pi r J_v,$$

where $J_v$ is the plasma volume flux, expressed as flow
per unit of capillary area and taken to be positive for
outward-directed flux. Fluid flow across a semipermeable
membrane such as the capillary wall is driven by
the hydrostatic, oncotic, and osmotic pressure differences:

$$J_v = L_p\Bigl(\Delta P - \Delta\Pi - RT\sum_S \sigma_S\, \Delta C_S\Bigr),$$

where $L_p$ is the hydraulic conductivity of the membrane;
$\Delta P$ is the transmembrane difference in hydrostatic
pressure; $\Delta\Pi$ is the oncotic pressure, due to the
contribution of nonpenetrating solutes such as proteins;
$R$ and $T$ are the gas constant and absolute temperature,
respectively; $\sigma_S$ is the osmotic reflective coefficient
of the membrane to solute S; and $\Delta C_S$ is the
transmembrane concentration gradient (outside minus
inside) of solute S.

To apply the above equation, which was derived for
a single capillary, to one glomerulus, or to all the
glomeruli of one or two kidneys, one writes

$$\frac{dQ_p}{dx} = -\frac{S_p}{L} J_v,$$

where $S_p$ denotes capillary surface area, $L$ is the capillary
length, and the definition of plasma flow rate ($Q_p$)
must be adjusted accordingly.

To determine water flux $J_v$, one needs to track the
concentrations of small solutes and protein. To that
end, consider the conservation of a solute S in plasma:

$$\frac{d(Q_p C_p)}{dx} = -2\pi r J_s,$$

where $J_s$ is the plasma flux of species S, expressed as
a molar flow per unit of capillary area. Note that the
product $Q_p C_p$ gives the plasma flow rate of solute S, in
moles per unit time.
The transmembrane flux of an uncharged solute is
given by the Kedem-Katchalsky equation

$$J_s = J_v(1 - \sigma_S)\bar{C}_S + P_S\, \Delta C_S,$$

where the first term represents the contribution from
advection and the second term that from diffusion. $P_S$
is the permeability of the membrane to solute S, and $\bar{C}_S$
is an average membrane concentration, which in dilute
solutions is given by

$$\bar{C}_S = \frac{\Delta C_S}{\Delta \ln C_S}.$$

Some proteins such as albumin are highly concentrated
in plasma and thus exert a significant oncotic
pressure, which opposes fluid filtration across the capillary
wall. In healthy kidneys the fraction of plasma
proteins that are filtered along the capillaries is negligible.
Thus, if $C_{pr}$ denotes the plasma concentration of
proteins, we have

$$\frac{d(Q_p C_{pr})}{dx} = 0.$$

In summary, the set of equations that forms the basis
of models of glomerular filtration consists of equations
that describe conservation of fluid and of solutes (small
solutes and proteins). The boundary conditions are
specified at the afferent end of the capillary:

$$Q_p(x = 0) = Q_A, \qquad C_{pr}(x = 0) = C_{pr}^{A},$$

where $Q_A$ equals the afferent arteriolar plasma flow and
$C_{pr}^{A}$ equals the total protein concentration in the afferent
plasma. A similar boundary condition is imposed
for solute S.
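
One consequence of the protein conservation equation and the afferent boundary conditions can be worked out directly: because no protein is filtered, the protein concentration rises along the capillary in inverse proportion to the remaining plasma flow. In the derivation below, the symbol FF (the fraction of the afferent plasma flow that has been filtered by the end of the capillary) is a label introduced here for convenience, and $C_{pr}^{A}$ denotes the afferent protein concentration.

```latex
% FF is notation introduced for this illustration.
\begin{align*}
\frac{d(Q_p C_{pr})}{dx} = 0
  &\;\Longrightarrow\; Q_p(x)\, C_{pr}(x) = Q_A\, C_{pr}^{A}, \\
C_{pr}(x) &= \frac{Q_A}{Q_p(x)}\, C_{pr}^{A}, \\
C_{pr}(L) &= \frac{C_{pr}^{A}}{1 - \mathrm{FF}},
  \qquad \mathrm{FF} = \frac{Q_A - Q_p(L)}{Q_A}.
\end{align*}
```

Since the oncotic pressure increases with $C_{pr}$, the driving force for filtration decreases along the capillary as fluid leaves it.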

4.2 Urinary Concentration 

During water deprivation, the kidney of a mammal can 
conserve water by producing urine that is more con- 
centrated than blood plasma. This hypertonic urine 
is produced when water is reabsorbed, in excess of 
solutes, from the nephrons and into the renal vascu- 
lature, thereby concentrating the tubular fluid, which 
eventually emerges as urine. Here we will develop a 
model of the concentrating mechanism in an impor- 
tant segment of the renal tubular system, the loop of 
Henle. The loop of Henle is a hairpin-like tubule that 
lies mostly in the medulla and consists of a descending 
limb and an ascending limb (see figure 6). 

The model represents a loop with a descending limb
and an ascending limb. The two limbs are assumed to
be in direct contact with each other. We make the simplifying
assumption that the descending limb is water
impermeable but infinitely permeable to solute. This
results in the conservation equations

$$\frac{\partial}{\partial x} Q_{DL}(x) = 0, \qquad \frac{\partial}{\partial x}\bigl(Q_{DL}(x)\, C_{DL}(x)\bigr) = -2\pi r_{DL} J_{DL,s}.$$

The notation is similar to the glomerular filtration
model, with the subscript “DL” denoting the descending
limb.

Figure 6 Countercurrent multiplication by NaCl transfer
from an ascending flow to a descending flow; the descending
flow is progressively concentrated by NaCl addition.
(Labels in the figure: isotonic fluid; dilute fluid.)

We assume that the ascending limb is water impermeable,
and that the solute is pumped out of the ascending
limb at a fixed rate $A$. (A more realistic description
of this active transport would be the Michaelis-Menten
kinetics discussed in section 2.1. Simplistic, fixed-rate
active transport is assumed here to facilitate the analysis.)
Additionally, we assume that all of that solute goes
into the descending limb. Thus, $2\pi r_{DL} J_{DL,s} = -A$ and
$2\pi r_{AL} J_{AL,s} = A$. The conservation equations for the
ascending limb (denoted “AL”) are

$$\frac{\partial}{\partial x} Q_{AL}(x) = 0, \qquad \frac{\partial}{\partial x}\bigl(Q_{AL}(x)\, C_{AL}(x)\bigr) = -A.$$

Because the descending and ascending limbs are assumed
to be contiguous, at the loop bend ($x = L$) we
have

$$Q_{DL}(L) = -Q_{AL}(L), \qquad C_{DL}(L) = C_{AL}(L).$$
Finally, to complete the system, boundary conditions
are imposed at the entrance to the descending limb:

$$Q_{DL}(0) = Q_0, \qquad C_{DL}(0) = C_0.$$

To determine $C_{DL}(x)$ and $C_{AL}(x)$, we note that, since
the entire loop is water impermeable, $Q_{DL}(x) =
-Q_{AL}(x) = Q_0$. Thus,

$$Q_0 \frac{\partial}{\partial x} C_{DL}(x) = A,$$

which can be integrated to yield

$$C_{DL}(x) = C_0 + \frac{A}{Q_0}\, x.$$

To compute $C_{AL}(x)$, we evaluate $C_{DL}$ at $x = L$ and use
that as the initial condition for the ordinary differential
equation for $C_{AL}(x)$ to get

$$C_{AL}(x) = C_{DL}(L) - \frac{A}{Q_0}(L - x) = C_{DL}(x).$$

The simple model illustrates the principle of countercurrent
multiplication, by which a transfer of solute
from one tubule to another (called a “single” effect)
augments (“multiplies,” or reinforces) the axial osmolality
gradient in the parallel flow. Thus, a small transverse
osmolality difference (a small single effect)
is multiplied into a much larger osmolality difference
along the axis of tubular flow. To summarize, the model
predicts that the concentrations along both limbs are
the same at any given $x$, and that solute concentration
increases linearly along $x$. Thus, the longer the loop,
the higher the loop bend concentration.
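
The closed-form profiles of this simple loop model are easy to evaluate. The sketch below codes the expressions for $C_{DL}(x)$ and $C_{AL}(x)$ and checks the two predictions just stated; the numerical values of $C_0$, $A$, $Q_0$, and $L$ are hypothetical and in arbitrary consistent units, and the function names are introduced here.

```python
def c_descending(x, c0, pump_rate, q0):
    """C_DL(x) = C_0 + (A / Q_0) x."""
    return c0 + pump_rate * x / q0

def c_ascending(x, length, c0, pump_rate, q0):
    """C_AL(x) = C_DL(L) - (A / Q_0)(L - x)."""
    bend = c_descending(length, c0, pump_rate, q0)
    return bend - pump_rate * (length - x) / q0

# Hypothetical values in arbitrary consistent units.
C0, A, Q0 = 300.0, 50.0, 10.0

def bend_concentration(length):
    """Concentration at the loop bend, x = L."""
    return c_descending(length, C0, A, Q0)

for length in (1.0, 2.0, 4.0):
    print(length, bend_concentration(length))
```

Doubling the loop length doubles the rise in concentration above $C_0$ at the bend, which is the linear dependence on $L$ noted in the text.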

Further Reading 

Due to space constraints, this article focuses on cellular
transport and the kidneys. Mathematical models
of other aspects of the kidney can be found in Layton
and Edwards (2013). Mathematical models have been
formulated for other organ systems, including the
circulatory system, the digestive system, the endocrine
system, the lymphatic system, the muscular system,
the nervous system, the reproductive system, and the
respiratory system. For these models, see Keener and
Sneyd (2009).

Keener, J., and J. Sneyd. 2009. Mathematical Physiology,
  Volume I: Cellular Physiology; Volume II: Systems
  Physiology. New York: Springer.
Layton, A., and A. Edwards. 2013. Mathematical Modeling of
  Renal Physiology. New York: Springer.


V.6 Cardiac Modeling 

Alexander V. Panfilov 


1 The Heart, Waves, and Arrhythmias 

The main physiological function of the heart is mechan- 
ical: it pumps blood through the body. The pumping is 
controlled by electrical excitation waves, which propa- 
gate through the heart and initiate cardiac contraction. 
Anatomically, the human heart consists of four cham- 
bers. The two lower chambers, which are called ventri- 
cles, have thick (1-1.5 cm) walls, and it is contraction 
of the ventricles that pushes blood through the body. 
The upper two chambers, called the atria, have thin 
walls (about 0.3 cm thick). The atria collect the blood, 
and their contraction delivers it to the ventricles. Under 
normal conditions, excitation of the heart starts at the 
sinoatrial node located in the right atrium (figure 1). 
The cells in the sinoatrial node are oscillatory, and they 
periodically initiate excitation waves. Thereafter, the 
wave propagates through the two upper chambers of 
the heart (the atria), causing atrial contraction. After 
some delay at the atrioventricular (AV) node, the exci- 
tation enters the ventricles and initiates the main event: 
ventricular contraction. 

Abnormal excitation of the heart results in cardiac 
arrhythmias, which have various manifestations. They 
might involve just one extra heartbeat initiated by a 
wave from another location, or they could manifest 
as an abnormally fast heart rate called tachycardia, 
which can occur in the atria or the ventricles. In some 
cases the excitation becomes spatially disorganized, 
which results in failure of contraction. If this occurs in 
the ventricles, cardiac arrest and sudden cardiac death 
result. Prediction and management of cardiac arrhyth- 
mias are therefore two of the greatest problems in mod- 
ern cardiology. Sudden cardiac death due to arrhyth- 
mias is one of the largest causes of death in the indus- 
trialized world, accounting for approximately one in 
every ten deaths. Cardiac arrhythmias have important 
consequences in the pharmaceutical industry too: more 
than half of all drug withdrawals in recent years have 
been the result of the drugs in question potentiating 
the onset of arrhythmias or causing other cardiac side 
effects. 

Mechanisms of the most dangerous cardiac arrhythmias
are directly related to wave propagation. The
properties of waves in the heart are quite different from
those of most other types of nonlinear waves, the main
difference being the presence of refractoriness. After
excitation, a cardiac cell requires some time, called the
refractory period, to recover its properties. During that
refractory time it cannot be excited again, and vortices
may appear in the heart as a result. Let us consider a
wave propagating along an obstacle in cardiac tissue:
around veins or arteries entering the heart, for example.
If the travel time around the obstacle is longer than
the refractory period, a sustained rotation (figure 2(a))
will take place. Such rotation produces periodic excitation
of the heart with a frequency much higher than the
frequency of the sinus node, and tachycardia occurs as
a result.

Figure 1 Schematic of the heart’s conduction pathway.
(Label in the figure: sinoatrial node.)

Figure 2 Schematic of (a) a wave propagating around
an obstacle and (b), (c) spiral wave formation.

It has also been shown that, because of refractoriness,
such rotations can occur without any
anatomical obstacles. Refractory tissue is unexcitable,
and the wave can propagate around it as it would do
around anatomical obstacles (figure 2(b)). The only difference
is that, as soon as the refractory period ends,
the wave can enter this region (figure 2(c)). The wavefront
therefore normally follows the refractory tail of
the wave closely, and the period of rotation is very
short. Such vortices are usually called spiral waves
because in large media they take the shape of rotating
spirals (figure 3(a)). Other interesting types of dynamics
include a breakup of spiral waves into complex
spatiotemporal chaos (figure 3(b)). If this occurs in
the heart, excitation becomes spatially disorganized,
resulting in failure of organized cardiac contraction
and sudden cardiac death.

Figure 3 (a) Spiral wave and (b) complex excitation pattern
after the process of spiral breakup in an anatomical model
of human ventricles. (Figure created by I. V. Kazbanov and
A. V. Panfilov.)

Overall, we can say that the sources of abnormal exci- 
tation of the heart are not located in a single region but 
are the result of wave circulation involving millions of 
cardiac cells. Thus, if we want to understand the mech- 
anisms of cardiac arrhythmias, we need to know how 
changes at the level of the individual building block 
of the heart (i.e., the single cell) will manifest them- 
selves at the whole-organ level. It is in answering pre- 
cisely these questions that mathematical modeling can 
be useful. Let us consider the main principles behind 
cardiac models. 

1.1 Modeling Cardiac Cells 

In the resting state, there is a potential difference of
about -90 mV between the two sides of the membrane
in cardiac cells. During excitation this voltage
rapidly increases to about +10 mV before slowly
returning to its resting value. A typical
cardiac action potential is shown in figure 4(a). It has
a sharp upstroke that lasts only 1-2 ms and then a
repolarization phase of about 300 ms.

The voltage across the cardiac membrane changes as
a result of the complex time dynamics of ionic currents
through the membrane. The currents are conveyed by
selective ionic channels that are permeable to various
ions, the most important of which are Na$^+$, K$^+$, and
Ca$^{2+}$. The process of excitation of the cardiac cell can
be described by the following system:

$$C_m \frac{dV_m}{dt} = -I_{Na} - I_K - I_{Ca} + \cdots, \tag{1}$$

$$I_* = G_*\, g_i^{\alpha} g_j^{\beta} (V_m - V_*), \tag{2}$$

$$\frac{dg_i}{dt} = \frac{g_i^{\infty}(V_m) - g_i}{\tau_i(V_m)}, \tag{3}$$

where the first equation describes the changes in transmembrane
voltage $V_m = V_{in} - V_{out}$ as a result of the
dynamics of various ionic currents. Each of the currents
typically depends on $V_m$ and time. In most cases,
time dependency is given by an exponential relaxation
equation for gating variables $g_i$. For example, a hypothetical
current $I_*$ that conveys ion $*$ and has a maximal
conductivity of $G_* = \text{const}$ is represented by
expression (2). The current is zero at $V_m = V_*$, where
$V_*$ is the so-called Nernst potential for ion $*$, which
can be easily computed from the concentration of specific
ions outside and inside the cardiac cell. The time
dynamics of this current are governed by two gating
variables $g_i$, $g_j$ raised to the powers $\alpha$, $\beta$. The variables
$g_i$, $g_j$ approach their voltage-dependent steady-state
values $g_i^{\infty}(V_m)$ with characteristic time $\tau_i(V_m)$ (3).
All the parameters and functions here are chosen to
fit experimentally measured properties of the specific
ionic current. Most of the ionic currents have one or
two gating variables, with $\alpha = \beta = 1$.

Figure 4 (a) An action potential in the heart and (b) a
schematic representation of a cardiac cell. $I_{Na}$, $I_{Ca}$, etc.,
denote different types of ionic channels; arrows indicate
current direction.
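
The Nernst potential mentioned above follows from the standard formula $V_* = (RT/zF)\ln(c_{out}/c_{in})$, where $z$ is the valence of the ion. The sketch below evaluates it and the current expression (2). The potassium concentrations and the temperature are illustrative values assumed here, not taken from the article, and the function names are hypothetical.

```python
import math

R = 8.314462618   # gas constant, J/(mol K)
F = 96485.33212   # Faraday constant, C/mol

def nernst_potential_mV(c_out, c_in, valence, temperature_K=310.0):
    """Nernst potential (mV) from outside and inside concentrations."""
    return 1000.0 * R * temperature_K / (valence * F) * math.log(c_out / c_in)

def ionic_current(G, g_i, g_j, V_m, V_rev, alpha=1, beta=1):
    """Expression (2): I = G * g_i**alpha * g_j**beta * (V_m - V_rev)."""
    return G * g_i ** alpha * g_j ** beta * (V_m - V_rev)

# Assumed, illustrative potassium concentrations (mM) at body temperature.
V_K = nernst_potential_mV(c_out=5.4, c_in=140.0, valence=1)
print("Nernst potential for K+ (mV):", V_K)
print("current at V_m = V_K:", ionic_current(1.0, 0.5, 1.0, V_K, V_K))
```

With these assumed concentrations the potassium Nernst potential lies close to the resting potential quoted in the text, and the current vanishes at $V_m = V_*$ as stated.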

The systems expressed by (l)-(3) are called detailed 
ionic models as they describe in detail the underly- 
ing biophysical mechanism of cardiac excitation and 


are based on direct measurement of the properties 
of cardiac cells. Such models may have from as few 
as 4 to more than 100 equations and contain hun- 
dreds of parameters. These models are mainly solved 
numerically and can be applied to detailed studies 
of drug actions, mutations, and other processes on 
cardiac cells. Another class of cardiac models is low- 
dimensional phenomenological models, which have 
just two or three differential equations. These are 
mainly used for generic studies of waves in the heart, 
which can be numerical as well as analytical. 

1.2 Tissue and Whole-Organ Models 

Wave propagation is a result of the successive excitation
of cardiac cells, each of which is described by (1). In
cardiac tissue, the excitable cells are connected to each
other via resistors called gap junctions, and equations
can be obtained by representing tissue as a resistive
network. One of the most widely used models involves
the monodomain equations:

$$C_m \frac{\partial V_m}{\partial t} = \operatorname{div}(D \nabla V_m) - I_{ion}, \tag{4}$$

where $D$ is the diffusion tensor. The idea behind this
equation is straightforward. The total current through
the membrane has not only an ionic current $I_{ion}$
but also diffusive currents from the cell’s neighbors
($\operatorname{div}(D \nabla V_m)$). To understand why the diffusive current
has such a form, consider a hypothetical cell that takes
the form of a long cylinder. The current along the axis
of the cylinder is proportional to the gradient of the
voltage, $D \nabla V_m$ (Ohm's law). The divergence of this current
is a membrane current, i.e., a current that goes
through the surface of the cylinder, and it therefore has
to be added to $I_{ion}$. As cardiac tissue consists of fibers,
its resistance depends on the direction of these fibers, 
and this is accounted for by the diffusivity (conductiv- 
ity) tensor D. Recent studies have revealed that fibers 
in the heart are organized in myocardial sheets, giv- 
ing resistivity in three main directions: along the fibers, 
across the fibers inside the sheets, and across the 
sheets. The measured velocities of these three direc- 
tions have the approximate ratio 1 : ^ ^ . The diffusive 
matrix can be directly calculated from the orientations 
of the fibers and the sheets. The main idea is that in a 
coordinate system that is locally aligned with the fibers 
and sheets, the matrix D is diagonal, with eigenvalues 
accounting for unequal resistivity along given direc- 
tions. In a global coordinate system, D can be found 
by simple matrix transformation. 
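
The matrix transformation can be sketched as follows. If $f$, $s$, and $n$ are orthonormal unit vectors along the fiber, across the fiber within the sheet, and normal to the sheet, then $D = d_f\, f f^{T} + d_s\, s s^{T} + d_n\, n n^{T}$ has these vectors as eigenvectors. The diffusivity values and the fiber angle below are hypothetical, chosen only to illustrate the construction, and the function names are introduced here.

```python
import math

def outer(u, v):
    """Outer product of two 3-vectors as a 3x3 nested list."""
    return [[ui * vj for vj in v] for ui in u]

def diffusion_tensor(fiber, sheet, normal, d_f, d_s, d_n):
    """D = d_f f f^T + d_s s s^T + d_n n n^T in global coordinates."""
    D = [[0.0] * 3 for _ in range(3)]
    for axis, d in ((fiber, d_f), (sheet, d_s), (normal, d_n)):
        P = outer(axis, axis)
        for i in range(3):
            for j in range(3):
                D[i][j] += d * P[i][j]
    return D

def matvec(D, u):
    """Matrix-vector product for a 3x3 nested list."""
    return [sum(D[i][j] * u[j] for j in range(3)) for i in range(3)]

# Hypothetical example: fibers rotated by 30 degrees in the x-y plane.
theta = math.radians(30.0)
f = [math.cos(theta), math.sin(theta), 0.0]
s = [-math.sin(theta), math.cos(theta), 0.0]
n = [0.0, 0.0, 1.0]
D = diffusion_tensor(f, s, n, d_f=1.0, d_s=0.4, d_n=0.1)  # illustrative values
print(D)
```

In a whole-heart model this construction would be repeated at every grid point, using the locally measured fiber and sheet orientations.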
Modeling of excitation in the whole heart mostly
involves finding a solution of (4) in the domain representing
the heart shape. To solve this equation we will
also need to specify the anisotropy of the tissue at each
point (via the matrix $D$) and the types of cardiac cells
via models of cardiac cells ($I_{ion}$). In addition, boundary
conditions also need to be imposed. Most of these data
are easy to obtain, and the shape of the heart can be
obtained from computerized tomography [IV.4 §8]
(CT) (see also [VII.19]) or magnetic resonance imaging
[VII.10 §4.1] (MRI) scanning procedures, which are
routinely used in clinical practice. However, obtaining
$D$ is much more difficult. Currently, it can be measured
only on explanted hearts using direct histological measurements
or by means of diffusion tensor MRI, which
measures fiber directions by probing the diffusion rates
of water in the tissue.

2 Analytical Methods in Cardiac Modeling 

When it comes to the analytical study of cardiac models,
the possibilities are very limited. Most of the results
in this area deal with low-dimensional models, for
which the main properties of solutions can be studied
in the phase plane of the system. Another important
direction of (semi)analytical research is the investigation
of dynamical instabilities, which may result
in the generation of spiral waves and their decay into
complex turbulent patterns. One of the most interesting
results was obtained in studies of the so-called
alternans instability, which is simply an instability that
occurs as a result of a period-doubling bifurcation for
a one-dimensional discrete map describing a periodically
forced cardiac cell. Such studies have predicted
the necessary conditions for onset of the instability
that leads to spiral breakup. These conditions were
connected to measurable characteristics of cardiac tissue:
the dependency of the duration of a cardiac pulse
on the period of stimulation of cardiac cells. This
hypothesis was tested not only in theoretical studies
but also in experimental research, with the latter
demonstrating that a decrease in the slope of the
dependency can prevent the onset of electrical turbulence
in the heart. Another recent development is a new
way of describing anisotropy in the heart as a Rieman- 
nian manifold, whose metric is defined by the arrival 
of excitation at a given point. This viewpoint allows 
analytical equations to be obtained for wave veloc- 
ity and spiral wave dynamics in two-dimensional and 


three-dimensional anisotropic tissue via properties of 
curvature tensors used in Riemannian geometry, such 
as the Ricci curvature tensor. 
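
The alternans instability mentioned above can be illustrated with a one-dimensional map. In the sketch below, a cell paced with period $T$ has pulse duration $a_{k+1} = f(T - a_k)$, where $T - a_k$ is the recovery interval preceding beat $k+1$. The exponential form of $f$ and all of its parameter values are hypothetical, chosen only so that the slope of $f$ exceeds one at short recovery intervals; they are not fitted to any cardiac data.

```python
import math

def restitution(recovery, a_max=300.0, drop=200.0, tau=50.0):
    """Hypothetical pulse duration (ms) as a function of recovery time (ms)."""
    return a_max - drop * math.exp(-recovery / tau)

def restitution_slope(recovery, drop=200.0, tau=50.0):
    """Derivative of the hypothetical restitution curve."""
    return (drop / tau) * math.exp(-recovery / tau)

def pace(period, a0=200.0, beats=400):
    """Iterate a_{k+1} = f(period - a_k) and return all pulse durations."""
    a = a0
    durations = []
    for _ in range(beats):
        a = restitution(period - a)
        durations.append(a)
    return durations

def alternans_amplitude(durations):
    """Beat-to-beat difference after transients have died out."""
    return abs(durations[-1] - durations[-2])

slow = pace(400.0)   # long pacing period
fast = pace(310.0)   # short pacing period
print("amplitude at T = 400 ms:", alternans_amplitude(slow))
print("amplitude at T = 310 ms:", alternans_amplitude(fast))
print("slope at the T = 400 ms steady state:",
      restitution_slope(400.0 - slow[-1]))
```

At the longer period the iteration settles to a single pulse duration where the slope of $f$ is below one. At the shorter period the steady state loses stability through a period-doubling bifurcation and long and short pulses alternate.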


3 Numerical Approaches 

Numerical approaches are the most important tool
in cardiac modeling. The monodomain equations that
describe cardiac excitation belong to the class of
parabolic partial differential equations, and the numerical
solution of these equations is straightforward. Most
often, simulations of (4) are performed using explicit
finite-difference methods, such as the Euler integration
scheme. Equation (5) gives an example of this approach
for the two-dimensional isotropic case, which updates
the value of the voltage at each point $(i, j)$ on a
two-dimensional grid at time $t + h_t$ from the values of
variables at time $t$:

$$V_{i,j}^{t+h_t} = V_{i,j}^{t} - \frac{h_t}{C_m} I_{i,j}^{t} + \frac{h_t D}{C_m}\, \frac{V_{i+1,j}^{t} + V_{i-1,j}^{t} + V_{i,j+1}^{t} + V_{i,j-1}^{t} - 4V_{i,j}^{t}}{h_s^2}. \tag{5}$$

Here, $h_s$ and $h_t$ are the spatial and time integration
steps, and $I_{i,j}$ gives the value of the ionic current at a
given point. Typical integration steps for cardiac models
are $h_s \sim 0.2$ mm and $h_t \sim 0.01$ ms. Due to the
big difference in timescales for the upstroke and
repolarization phases (figure 4), many attempts have been
made to develop algorithms that are adaptive in time
and/or space. For whole-heart simulations, a big challenge
is the proper representation of boundary conditions
on domains of complex shape. To this end, as
well as explicit finite-difference methods, more complex
numerical methods (such as finite-volume or
finite-element methods) have been used. Note that for most
existing ionic models there is no longer any need to
copy equations from the original literature: most of the
models are present in Auckland University’s CellML
database (www.cellml.org).
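
The update (5) can be sketched in a few lines. The version below treats the edges of the grid as no-flux boundaries by mirroring the edge values, which is an assumption made here; the values of $D$ and $C_m$ are hypothetical, and the ionic current is set to zero so that only the diffusive part is exercised. For the diffusive part, the explicit scheme is stable when $h_t D/(C_m h_s^2) \le 1/4$.

```python
def euler_step(V, I_ion, D, C_m, h_s, h_t):
    """One explicit Euler update of (5) on a 2D grid (nested lists).

    Assumption: no-flux boundaries, implemented by reusing the edge
    value in place of a missing neighbor.
    """
    rows, cols = len(V), len(V[0])
    new = [row[:] for row in V]
    for i in range(rows):
        for j in range(cols):
            up = V[i - 1][j] if i > 0 else V[i][j]
            down = V[i + 1][j] if i < rows - 1 else V[i][j]
            left = V[i][j - 1] if j > 0 else V[i][j]
            right = V[i][j + 1] if j < cols - 1 else V[i][j]
            laplacian = (up + down + left + right - 4.0 * V[i][j]) / h_s ** 2
            new[i][j] = V[i][j] + h_t / C_m * (D * laplacian - I_ion[i][j])
    return new

# Hypothetical setup: a single raised point on a 5 x 5 grid, no ionic current.
size = 5
V0 = [[0.0] * size for _ in range(size)]
V0[2][2] = 100.0
zero_current = [[0.0] * size for _ in range(size)]
D, C_m, h_s, h_t = 0.1, 1.0, 0.2, 0.01   # illustrative values
assert h_t * D / (C_m * h_s ** 2) <= 0.25  # stability of the diffusive part
V1 = euler_step(V0, zero_current, D, C_m, h_s, h_t)
print(V1[2][2], V1[2][1])
```

In a full simulation the ionic current at each grid point would be supplied by a cell model of the form (1)-(3), updated alongside the voltage at every time step.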


4 Applications 

Working with a whole-heart computer model is similar 
to experimental or clinical work, with the researcher 
having the same tools as experimentalists. For example, 
as in a real experiment we can put electrodes onto the 
virtual heart and initiate new waves just by adjusting 
the voltage at given points. 

Modeling can be applied to practically important 
problems. For example, it turns out that most cardiac 
drugs are blockers of certain ionic channels. We can
model the effects of these drugs by changing the
conductance of the ionic channels $G_*$ in (2) and then studying
how the changes affect the process of cardiac excitation.
Similarly, many forms of inherited cardiac disease
occur as a result of a mutation in the genes coding for
ionic channels. For example, LQT1 syndrome is due to a
mutation that modifies (decreases) the potassium current,
while LQT3 syndrome is the result of a mutation
that modifies (increases) the sodium current.

Thus, if one modifies the properties of ionic chan- 
nels to reproduce the effect of a given mutation, the 
effect of this mutation on cell excitation and the onset 
of arrhythmias can be studied. Furthermore, it is pos- 
sible to study the onset of arrhythmias due to various 
types of cardiac disease, such as ischemia or fibrosis, 
by changing the parameters of the model and by intro- 
ducing new types of cells. Compared with real experi- 
mental research, modeling opens up many more possi- 
bilities for modification of the heart's properties, and 
it can also be used to study three-dimensional wave 
propagation in the heart. 

There is a great deal of interest in applications that 
elucidate the mechanisms of the formation of spiral 
waves. Another important task is to understand and 
identify instabilities (bifurcations) that are responsi- 
ble for the deterioration of a single spiral wave into 
spatiotemporal chaos. 

Computational modeling also has potential applica- 
tions in clinical interventions: cardiac resynchroniza- 
tion therapy and cardiac ablation, for example. In resyn- 
chronization therapy, several electrodes are placed on 
a patient's heart and a cardiologist then adjusts the 
delays of excitation of these electrodes in order to 
optimize the heart’s pumping function. As this proce- 
dure is mainly used for patients who have suffered a 
heart attack, their hearts have abnormal properties and 
this procedure should thus be optimized on a patient- 
specific basis. Anatomically accurate modeling is one 
of the important tools needed for such optimization. 
Cardiac ablation is a procedure used by cardiologists 
to disrupt the pathological pathways along which the 
wave circulates in the heart during an arrhythmia. Here, 
too, modeling can help to identify such pathways and 
guide cardiologists during ablation. Although direct 
clinical applications of modeling are still under devel- 
opment, it is widely believed that they will become 
clinical tools during the next decade. 
V.7 Chemical Reactions

Martin Feinberg 


1 Introduction 

Chemical reactions underlie a vast spectrum of natu- 
ral phenomena and, as a result, play an indispensable 
role in many branches of science and engineering. An 
understanding of atmospheric chemistry, cell biology, 
or the efficient production of energy from fossil fuels 
requires, in one way or another, a framework for think- 
ing systematically about how chemical reactions serve 
to convert certain molecules into others. 

Chemical reactions can be studied at various concep- 
tual levels. At a very fundamental level, for example, 
one is concerned with ways in which chemical bonds 
are broken and created within and between individual 
molecules during the occurrence of a single chemical 
reaction. 

In this article our concerns will be different. In the 
biological cell and in the atmosphere there are a large 
number of distinct chemical species, and these are 
involved in a great variety of chemical reactions. A par- 
ticular species might be consumed in several reactions 
and produced in several others. Thus, the reactions can 
be intricately coupled, and, as a result, the dynamics of 
the species population might be quite complex. It is this 
dynamics that will be our focus. 

2 Some Primitives 

It will be useful to lay out some primitive ideas upon 
which the mathematics of chemical reaction systems 
can be built. 
2.1 Species

We begin by supposing that a chemical mixture we 
might wish to study comprises a fixed set of chemical 
species, which we will denote by $A_1, A_2, \dots, A_N$. Thus,
$A_1$ might be carbon monoxide, $A_2$ might be oxygen,
and so on. In applications, there is some judgment
required when selecting the species set, for one might
choose to ignore molecular entities that are deemed
inconsequential to the task at hand. At each place in
a mixture and at each instant, we associate with every
species $A_L$ a molar concentration $c_L$. In rough terms, $c_L$
is the local number of molecules of $A_L$ per unit volume,
divided by Avogadro’s number, approximately $6 \times 10^{23}$.
In this way, with each place in the mixture and with
each instant we can associate a local composition vector
$c = [c_1, c_2, \dots, c_N]$ in the standard vector space of
$N$-tuples, $\mathbb{R}^N$. The mathematics of chemical reactions
is, for the most part, aimed at a description of how 
the composition vector varies with time and spatial 
position. 


2.2 Reactions 


For the mixture under study we suppose that the vari- 
ous species interact through the occurrence of a fixed 
set of chemical reactions that constitute a reaction net- 
work. The network is often displayed in a reaction 
diagram such as the one shown below: 


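\[
\begin{array}{c}
A_1 \;\underset{\beta}{\overset{\alpha}{\rightleftarrows}}\; 2A_2 \\[8pt]
A_1 + A_3 \;\underset{\delta}{\overset{\gamma}{\rightleftarrows}}\; A_4 \\[2pt]
{\scriptstyle \xi}\nwarrow \qquad \swarrow{\scriptstyle \varepsilon} \\[2pt]
A_2 + A_5
\end{array}
\tag{1}
\]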


The diagram is meant to suggest that a molecule 
of $A_1$ can decompose to form two molecules of $A_2$,
that two molecules of $A_2$ can combine to form one
molecule of $A_1$, that a molecule of $A_1$ can combine with
a molecule of $A_3$ to form a molecule of $A_4$, and so on.
(The Greek letters alongside the reaction arrows will 
play a role in the next section.) 

Here again, there is a need for some judgment about 
which reactions to include. Certain reactions might be 
deemed to occur so slowly as to be safely ignored for 
the purposes of the analysis. 


2.3 Reaction Rate Functions 

The local occurrence rates of individual reactions are 
presumed to be given by reaction rate functions (one
for each reaction in the network at hand) that indicate
how the individual reaction rates depend on the local
mixture composition vector and the local temperature.
Thus, in our example the function $\mathcal{K}_{A_1 \to 2A_2}(\cdot,\cdot)$ tells us
how the local occurrence rate per unit volume of reaction
$A_1 \to 2A_2$ depends on local mixture conditions; in
particular, $\mathcal{K}_{A_1 \to 2A_2}(c, T)$ is the local molar occurrence
rate per unit volume when the local composition vector
is $c$ and the local temperature is $T$. (Roughly speaking,
the local molar occurrence rate per unit volume is
the number of times the reaction occurs per unit time 
per unit volume divided by Avogadro’s number.) Thus, 
reaction rate functions take nonnegative values. 

These functions are presumed, in general, to be 
smooth and to have certain natural relationships with 
the particular reactions they describe. For example, it 
is often supposed, and we shall suppose here, that,
for a reaction such as $A_1 + A_3 \to A_4$, $\mathcal{K}_{A_1 + A_3 \to A_4}(c, T)$
takes a strictly positive value at composition $c$ if and
only if both $A_1$ and $A_3$ are actually present: that is, if
and only if the composition vector $c$ is such that both
$c_1$ and $c_3$ are positive.

A specification of a reaction rate function for each of 
the reactions in a network is called a kinetics for the 
network. By a kinetic system we will mean a reaction 
network together with a kinetics. 

2.3.1 Mass Action Kinetics

By a mass action kinetics for a reaction network we 
mean a kinetics in which the individual reaction rate 
functions have a special, and very natural, form. 

For a reaction such as $A_1 \to 2A_2$ it is generally supposed
that the local occurrence rate is simply proportional
to $c_1$, the local concentration of $A_1$; after all,
the more molecules of $A_1$ there are locally, the more
occurrences of the reaction there will be. Thus, it is
presumed that $\mathcal{K}_{A_1 \to 2A_2}(c, T) = \alpha(T) c_1$, where $\alpha(T)$
is called the (positive) rate constant for the reaction
$A_1 \to 2A_2$.

For the reaction $2A_2 \to A_1$ the situation is different.
In this case, two molecules of $A_2$ must meet if
the reaction is to proceed at all, and the probability
of such a meeting is taken to be proportional to
$(c_2)^2$. With this as motivation, the reaction rate function
is presumed to have the form
$\mathcal{K}_{2A_2 \to A_1}(c, T) = \beta(T)(c_2)^2$, where, again, $\beta(T)$ is the rate “constant”
for the reaction $2A_2 \to A_1$. Similarly, we would have
$\mathcal{K}_{A_1 + A_3 \to A_4}(c, T) = \gamma(T) c_1 c_3$, and so on.

When the kinetics is mass action, it is common, as in
(1), to decorate the reaction diagram with symbols for 
the various rate constants alongside the corresponding 
reaction arrows. 
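
As a minimal sketch of the mass action form described above, the following Python snippet evaluates three of the rate functions of network (1) at one fixed temperature. The function name `mass_action_rate`, the encoding of a reaction by its reactant counts, and all numerical values are illustrative assumptions rather than anything prescribed by the text.

```python
from math import prod

def mass_action_rate(k, reactant_counts, c):
    """Rate constant k times the product of c_L ** (molecules of A_L consumed)."""
    return k * prod(c_L ** m for c_L, m in zip(c, reactant_counts))

# Hypothetical composition [c1, ..., c5] and rate constants at one fixed T.
c = [2.0, 3.0, 0.5, 1.0, 4.0]
alpha, beta, gamma = 1.5, 0.2, 4.0

rate_A1_to_2A2 = mass_action_rate(alpha, [1, 0, 0, 0, 0], c)    # alpha * c1
rate_2A2_to_A1 = mass_action_rate(beta, [0, 2, 0, 0, 0], c)     # beta * c2**2
rate_A1A3_to_A4 = mass_action_rate(gamma, [1, 0, 1, 0, 0], c)   # gamma * c1 * c3
```

Because Python evaluates `0.0 ** 0` as `1.0`, a species that is absent but not a reactant does not affect the rate, while an absent reactant makes the rate zero, in line with the positivity convention adopted in section 2.3.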


2.3.2 Other Kinetics


Reaction networks deemed to describe chemistry at the 
fine level of so-called elementary reactions are usu- 
ally presumed to be governed by mass action kinet- 
ics. Sometimes, however, coarser modeling is called 
into play, and the “reactions” considered are actually 
intelligent but nevertheless approximate descriptions 
of what is actually happening. A model might speak in 
terms of an “overall” reaction that describes, in effect, 
the result of a sequence of more elementary chemical 
events. 

For example, a model might invoke an overall reaction
such as $A_1 \to A_2$ that is, in reality, mediated by an
enzyme, present at very low concentration, that acts
through an elementary reaction sequence such as

\[ A_1 + E \rightleftarrows A_1E \to A_2E \to A_2 + E. \]


The idea here is that $A_1$ binds reversibly to the enzyme
$E$ to form $A_1E$, whereupon the bound $A_1$ is rapidly transformed
by the enzyme into $A_2$. The newly formed $A_2$
then unbinds, leaving behind naked enzyme, which is 
then free to rework its magic. 

This sequence of elementary steps is presumed to 
be governed by mass action kinetics, which, through a 
suitable approximation procedure, can be made to yield 
a pseudokinetics for the pseudoreaction $A_1 \to A_2$. Such
a kinetics might take a Michaelis-Menten form:

\[ \mathcal{K}_{A_1 \to A_2}(c, T) = \frac{k(T)\, c_1}{k'(T) + c_1}. \]

Although the (roughly constant) concentration of the 
enzyme might influence, in a latent way, values of 
the parameters (in particular k), the approximate rate 
function for the pseudoreaction $A_1 \to A_2$ is written
explicitly only in terms of the concentration of $A_1$.
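
To make the shape of the Michaelis-Menten rate concrete, here is a small Python sketch at one fixed temperature. The function name and the parameter values are assumptions chosen only for illustration; the formula is the one displayed above.

```python
def michaelis_menten_rate(c1, k, k_prime):
    """Pseudoreaction rate k * c1 / (k' + c1) at one fixed temperature."""
    return k * c1 / (k_prime + c1)

k, k_prime = 2.0, 0.5   # hypothetical parameter values

low = michaelis_menten_rate(1e-6, k, k_prime)      # close to (k / k') * c1
half = michaelis_menten_rate(k_prime, k, k_prime)  # equals k / 2 when c1 = k'
high = michaelis_menten_rate(1e6, k, k_prime)      # approaches k from below
```

At low concentration the rate is nearly proportional to $c_1$, as in mass action kinetics, whereas at high concentration it saturates near $k$; this saturation is what distinguishes the pseudokinetics from a mass action rate.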


3 The Species-Formation-Rate
Function for a Kinetic System

Given a reaction network endowed with a kinetics, we
are in a position to calculate for each species its local
net molar production rate per unit volume when the
local composition vector is $c$ and the local temperature
is $T$.

For our sample reaction network, let us focus for the
moment on species $A_1$. Whenever the reaction $A_1 \to 2A_2$
occurs, we lose one molecule of $A_1$, and the local
occurrence rate per unit volume of that reaction is
$\mathcal{K}_{A_1 \to 2A_2}(c, T)$. Whenever the reaction $2A_2 \to A_1$ occurs
we gain a molecule of $A_1$, and the local occurrence rate
per unit volume of that reaction is $\mathcal{K}_{2A_2 \to A_1}(c, T)$. After
also taking into account the contributions of the reactions
$A_1 + A_3 \to A_4$, $A_4 \to A_1 + A_3$, and $A_2 + A_5 \to A_1 + A_3$
to the production of species $A_1$, we can calculate the net
molar production rate per unit volume of $A_1$, denoted
$r_1(c, T)$, as follows:

\[
\begin{aligned}
r_1(c, T) = {} & -\mathcal{K}_{A_1 \to 2A_2}(c, T) + \mathcal{K}_{2A_2 \to A_1}(c, T) \\
& - \mathcal{K}_{A_1 + A_3 \to A_4}(c, T) + \mathcal{K}_{A_4 \to A_1 + A_3}(c, T) \\
& + \mathcal{K}_{A_2 + A_5 \to A_1 + A_3}(c, T).
\end{aligned}
\]

For species $A_2$, we can proceed in the same way, but
we need to take cognizance of the fact that with each
occurrence of the reaction $A_1 \to 2A_2$ there is a gain of
two molecules of $A_2$, and with each occurrence of the
reverse reaction there is a loss of two molecules of $A_2$.
Thus we write

\[
\begin{aligned}
r_2(c, T) = {} & 2\mathcal{K}_{A_1 \to 2A_2}(c, T) - 2\mathcal{K}_{2A_2 \to A_1}(c, T) \\
& + \mathcal{K}_{A_4 \to A_2 + A_5}(c, T) - \mathcal{K}_{A_2 + A_5 \to A_1 + A_3}(c, T).
\end{aligned}
\]

In this way we can formulate, for a kinetic system
with $N$ species, the $N$ functions $r_L(\cdot,\cdot)$, $L = 1, 2, \dots, N$.
For each $L$, $r_L(\cdot,\cdot)$ is just the sum, over all reactions, of
the individual reaction rate functions, each multiplied
by the net number of molecules of species $A_L$ produced
with each occurrence of the corresponding reaction.

For use later on, we shall find it convenient to form
the vectorial species-formation-rate function $r(\cdot,\cdot)$ for
a kinetic system, which is defined by

\[ r(c, T) := [r_1(c, T), r_2(c, T), \dots, r_N(c, T)] \in \mathbb{R}^N. \]

4 How Kinetic Systems Give
Rise to Differential Equations

There are various physicochemical settings in which 
a kinetic system might exert itself, and there are cor- 
responding differences in how the governing differ- 
ential equations are formulated. In each setting, how- 
ever, the presence of chemical reactions is manifested 
through the species-formation-rate function for the 
kinetic system at hand. 

Our focus here will be almost entirely on what 
chemical engineers call the spatially homogeneous 
(well-stirred) fixed-volume batch reactor, for that is the
setting in which chemical reactions exert themselves in
the most pristine way, uncomplicated by composition
changes resulting from bulk or diffusive transport of
matter.

4.1 The Homogeneous Batch Reactor

Imagine that a mixture fills a vessel of fixed volume $V$
and that the mixture is stirred so effectively that its
composition and temperature are always independent
of spatial position. Imagine also that heat can be added
to and removed from the vessel by an operator in any
desired way and that, in particular, the temperature
variation with time can be controlled to an exquisite
degree. One possibility is that the temperature can be
maintained at a fixed value for all time, in which case
the operation is isothermal.

We suppose that the chemistry is well-modeled, as
before, by a kinetic system with $N$ species, $A_1, A_2, \dots, A_N$.
In this case, the spatially independent composition
vector at time $t$ is $c(t) = [c_1(t), c_2(t), \dots, c_N(t)] \in \mathbb{R}^N$.
In fact, our interest is in the temporal variation of the
mixture composition.

If we denote by $n_L(t)$ the number of moles of species
$A_L$ in the reactor at time $t$ (that is, the number of
molecules of $A_L$ divided by Avogadro’s number) and
if we recall that $r_L(c(t), T(t))$ is the net rate of production
per unit volume of $A_L$ due to the occurrence of all
reactions in the underlying reaction network, then we
can write $\dot{n}_L(t) = V r_L(c(t), T(t))$. (An overdot always
indicates time differentiation.) Dividing by $V$ and recalling
that $c_L = n_L / V$ with $V$ fixed, we get the
system of $N$ scalar differential equations

\[ \dot{c}_L = r_L(c(t), T(t)), \quad L = 1, 2, \dots, N, \tag{2} \]

or, equivalently, the single vector equation

\[ \dot{c}(t) = r(c(t), T(t)). \tag{3} \]

A possibility that we have not considered is that the
reactor is adiabatic, which is to say that the vessel is
perfectly insulated. In this case, energetic considerations
must be brought into play through an additional
differential equation for the temperature. In particular,
one must take account of the fact that the occurrence of
chemical reactions might, by itself, serve to transiently
raise or lower the mixture temperature.

It is instructive to write out the isothermal batch reactor
differential equations (2) for reaction network (1)
when the kinetics is mass action:

\[
\begin{aligned}
\dot{c}_1 &= -\alpha c_1 + \beta (c_2)^2 - \gamma c_1 c_3 + \delta c_4 + \xi c_2 c_5, \\
\dot{c}_2 &= 2\alpha c_1 - 2\beta (c_2)^2 + \varepsilon c_4 - \xi c_2 c_5, \\
\dot{c}_3 &= -\gamma c_1 c_3 + \delta c_4 + \xi c_2 c_5, \\
\dot{c}_4 &= \gamma c_1 c_3 - (\delta + \varepsilon) c_4, \\
\dot{c}_5 &= \varepsilon c_4 - \xi c_2 c_5.
\end{aligned}
\tag{4}
\]


Note that this is a system of coupled polynomial dif- 
ferential equations in five dependent variables and in 
which six parameters (rate constants) appear. 
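
Although system (4) resists analysis in closed form, it is easy to explore numerically. The sketch below integrates the five mass action equations for network (1) with a classical fourth-order Runge-Kutta step. The rate constants, the initial composition, the step size, and the helper names `rhs` and `rk4_step` are all assumptions made for illustration. Adding the last three equations of (4) shows that $c_3 + c_4 + c_5$ does not change in time, which gives a simple check on the integration.

```python
def rhs(c, k):
    """Right-hand side of the isothermal mass action system (4)."""
    c1, c2, c3, c4, c5 = c
    a, b, g, d, e, x = k   # alpha, beta, gamma, delta, epsilon, xi
    return [
        -a*c1 + b*c2**2 - g*c1*c3 + d*c4 + x*c2*c5,
        2*a*c1 - 2*b*c2**2 + e*c4 - x*c2*c5,
        -g*c1*c3 + d*c4 + x*c2*c5,
        g*c1*c3 - (d + e)*c4,
        e*c4 - x*c2*c5,
    ]

def rk4_step(c, k, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    def shifted(v, s):
        return [ci + s*vi for ci, vi in zip(c, v)]
    k1 = rhs(c, k)
    k2 = rhs(shifted(k1, h/2), k)
    k3 = rhs(shifted(k2, h/2), k)
    k4 = rhs(shifted(k3, h), k)
    return [ci + h/6*(p + 2*q + 2*r + s)
            for ci, p, q, r, s in zip(c, k1, k2, k3, k4)]

rates = (1.0, 0.5, 2.0, 1.0, 0.7, 1.5)   # hypothetical rate constants
c0 = [1.0, 0.5, 1.0, 0.0, 0.2]           # hypothetical initial composition
c = list(c0)
for _ in range(2000):                    # integrate to t = 20 with h = 0.01
    c = rk4_step(c, rates, 0.01)

total0 = c0[2] + c0[3] + c0[4]           # c3 + c4 + c5 at t = 0
total = c[2] + c[3] + c[4]               # the same sum at t = 20
```

Comparing `total` with `total0` confirms, up to rounding error, that the computed trajectory respects this linear invariant.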

Polynomial differential equations are notoriously dif- 
ficult to study. In fact, one of David Hilbert’s famous 
problems (the sixteenth) posed in 1900 at the Interna- 
tional Congress of Mathematicians in Paris was about 
polynomial differential equations in just two variables. 
It remains unsolved. 

4.2 A Few Words about Other Physicochemical 
Settings 

In the batch reactor, composition changes result solely 
from the occurrence of chemical reactions. In other set- 
tings, composition variations in time or in spatial posi- 
tion might result from a coupling of reactions with bulk 
or diffusive transport of matter. 

A continuous-flow stirred-tank reactor is very much 
like a batch reactor, but fresh mixture is added to the 
reactor continuously at a fixed flow rate, while mix- 
ture is simultaneously removed from the reactor at that 
same flow rate. When the mixture density can be pre- 
sumed to be independent of composition and temper- 
ature, the governing vector differential equation takes 
the form 

\[ \dot{c}(t) = \frac{1}{\theta}\,(c^F - c(t)) + r(c(t), T(t)). \]

Here $c^F \in \mathbb{R}^N$ is the feed composition, and $\theta$ is the reactor
residence time, which is the reactor volume divided
by the volumetric flow rate. 

In still other settings, there might be continuous vari- 
ation in the composition not only in time but also in 
spatial position. In such instances, mixtures are usually 
modeled by systems of reaction-diffusion equations, 
about which there is a large literature. 

Finally, we note that certain systems open to trans- 
port of matter are sometimes modeled as if they were 
batch reactors by adding pseudoreactions of the form 
$0 \to A_L$ or $A_L \to 0$ to account for the supply and removal
of species $A_L$.
5 Seeing Things Geometrically

We have already pointed out that, even for a relatively 
simple reaction network such as (1), the corresponding 
differential equations can be difficult to study. What we 
have not yet exploited, however, is the fact that these 
equations bear an intimate structural relationship to 
the reaction network from which they derive, a relation- 
ship that imposes severe geometric constraints on the 
way that composition trajectories can travel through 
$\mathbb{R}^N$. An understanding of that geometry can help
considerably in studying the dynamics to which the
equations give rise and in posing the right questions at the
outset. 

To begin, we will need to make more mathematically 
transparent the way in which the species-formation-rate
function $r(\cdot,\cdot)$ relates structurally to the
underlying reaction network.

5.1 Complexes and Complex Vectors 

By the complexes of a reaction network we mean the 
objects that sit at the heads and tails of the reaction 
arrows. Thus, for network (1) the complexes are $A_1$,
$2A_2$, $A_1 + A_3$, $A_4$, and $A_2 + A_5$.

With each complex in an $N$-species network we can
associate a vector of $\mathbb{R}^N$ in a natural way. This is best
explained in terms of our sample network (1), for which
$N = 5$: with the complex $A_1$ we associate the complex
vector $y_1 := [1, 0, 0, 0, 0] \in \mathbb{R}^5$; with $2A_2$ we associate
the vector $y_2 := [0, 2, 0, 0, 0]$; with $A_1 + A_3$ we associate
the vector $y_3 := [1, 0, 1, 0, 0]$; with $A_4$ we associate the
vector $y_4 := [0, 0, 0, 1, 0]$; and with $A_2 + A_5$ we associate
the vector $y_5 := [0, 1, 0, 0, 1]$.

5.2 Reactions and Reaction Vectors 

Note that we have numbered the complexes in an arbi- 
trary way. Hereafter, we write $y_i \to y_j$ (or sometimes
just $i \to j$) to indicate the reaction whereby the $i$th
complex reacts to the $j$th complex.

In fact, we will associate with each reaction $y_i \to y_j$ in
a network the corresponding reaction vector $y_j - y_i \in \mathbb{R}^N$,
with $N$ again denoting the number of species. In
our example, the reaction vector corresponding to the
reaction $A_1 \to 2A_2$ is $[-1, 2, 0, 0, 0]$. The reaction vector
corresponding to $A_2 + A_5 \to A_1 + A_3$ is $[1, -1, 1, 0, -1]$.
Note that the $L$th component of the reaction vector for a
particular reaction is just the net number of molecules
of species $A_L$ produced with each occurrence of that
reaction. 
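
The bookkeeping of complex vectors and reaction vectors is mechanical, and a short Python sketch may make it concrete. The dictionary encoding of a complex and the names `complex_vector`, `reaction_vector`, and `REACTIONS` are illustrative assumptions; the complexes, their numbering, and the reactions are those of network (1).

```python
SPECIES = ["A1", "A2", "A3", "A4", "A5"]

def complex_vector(complex_):
    """Map a complex, given as {species: count}, to its vector in R^N."""
    return [complex_.get(s, 0) for s in SPECIES]

# Complexes of network (1), numbered as in the text.
y = {
    1: complex_vector({"A1": 1}),
    2: complex_vector({"A2": 2}),
    3: complex_vector({"A1": 1, "A3": 1}),
    4: complex_vector({"A4": 1}),
    5: complex_vector({"A2": 1, "A5": 1}),
}

# Reactions i -> j of network (1), as pairs of complex numbers.
REACTIONS = [(1, 2), (2, 1), (3, 4), (4, 3), (4, 5), (5, 3)]

def reaction_vector(i, j):
    """Reaction vector y_j - y_i for the reaction y_i -> y_j."""
    return [b - a for a, b in zip(y[i], y[j])]

reaction_vectors = {(i, j): reaction_vector(i, j) for i, j in REACTIONS}
```

The entries for `(1, 2)` and `(5, 3)` reproduce the two reaction vectors worked out in the text.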


5.3 The Species-Formation-Rate Function Revisited 

Consider a kinetic system, for which we denote by $\mathcal{R}$
the set of reactions and by $\{\mathcal{K}_{i \to j}\}_{i \to j \in \mathcal{R}}$ the
corresponding set of reaction rate functions. We claim
that the (vector) species-formation-rate function can be
written in terms of the reaction vectors $\{y_j - y_i\}_{i \to j \in \mathcal{R}}$
as

\[ r(c, T) = \sum_{i \to j \in \mathcal{R}} \mathcal{K}_{i \to j}(c, T)\,(y_j - y_i) \tag{5} \]

for all $c$ and $T$. To see that this is so, it helps to inspect
the component of (5) corresponding to species $A_L$:

\[ r_L(c, T) = \sum_{i \to j \in \mathcal{R}} \mathcal{K}_{i \to j}(c, T)\,(y_{jL} - y_{iL}). \tag{6} \]

Recall how the species formation rate is calculated for
species $A_L$: by summing, over all reactions, the individual
reaction rate functions, each multiplied by the net
number of molecules of $A_L$ produced when the corresponding
reaction occurs once. This is precisely what
is done in (6).
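
Formula (5) translates almost directly into code. The sketch below assembles the species-formation-rate vector for network (1) from its complex vectors, assuming mass action kinetics at one fixed temperature. The rate constant values, the composition, and the names `rate` and `species_formation_rate` are hypothetical.

```python
from math import prod

# Complex vectors y_1..y_5 of network (1) and its reactions i -> j, each with
# a hypothetical mass action rate constant at one fixed temperature.
Y = {1: [1, 0, 0, 0, 0], 2: [0, 2, 0, 0, 0], 3: [1, 0, 1, 0, 0],
     4: [0, 0, 0, 1, 0], 5: [0, 1, 0, 0, 1]}
RATE_CONSTANTS = {(1, 2): 1.0, (2, 1): 0.5, (3, 4): 2.0,
                  (4, 3): 1.0, (4, 5): 0.7, (5, 3): 1.5}

def rate(i, j, c):
    """Mass action rate of y_i -> y_j: constant times product of c_L ** y_iL."""
    return RATE_CONSTANTS[(i, j)] * prod(cL ** m for cL, m in zip(c, Y[i]))

def species_formation_rate(c):
    """Equation (5): sum over reactions of rate times (y_j - y_i)."""
    r = [0.0] * len(c)
    for (i, j) in RATE_CONSTANTS:
        k_ij = rate(i, j, c)
        for L in range(len(c)):
            r[L] += k_ij * (Y[j][L] - Y[i][L])
    return r

c = [1.0, 0.5, 1.0, 0.0, 0.2]   # hypothetical composition
r = species_formation_rate(c)
```

Written this way, the same few lines serve any network: only the complex vectors and the list of reactions change.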

5.4 What the Vector Formulation Tells Us 

A glance at (5) indicates that, no matter what the local
mixture composition and temperature are, the species-
formation-rate vector is invariably a linear combination
(in fact, a nonnegative linear combination) of the reac-
tion vectors for the operative chemical reaction network.
This tells us that the species-formation-rate vector is
constrained to point only in certain directions in $\mathbb{R}^N$; it
can point only along the linear subspace of $\mathbb{R}^N$ spanned
by the reaction vectors. 

This subspace is called the stoichiometric subspace for the
reaction network, and we denote it by the symbol $S$:

\[ S := \operatorname{span}\{\, y_j - y_i \in \mathbb{R}^N : y_i \to y_j \in \mathcal{R} \,\}. \]

(Stoichiometry is the part of elementary chemistry 
that takes account of conservation of various entities 
(e.g., mass, charge, atoms of various kinds) that are 
conserved even in the presence of change caused by 
chemical reactions.) 

We will have interest in the dimension of the stoi- 
chiometric subspace. This is identical to the rank of the 
network’s reaction vector set: that is, the number of
vectors in the largest linearly independent set that can
be formed from the network’s reaction vectors. This
number, denoted $s$, is called the rank of the network:

\[ s := \dim S = \operatorname{rank}\{\, y_j - y_i \in \mathbb{R}^N : y_i \to y_j \in \mathcal{R} \,\}. \]
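
The rank of a network can be computed by ordinary Gaussian elimination on its reaction vectors. The following sketch does so in exact rational arithmetic for network (1); the function name `rank` and the choice of Python's `fractions` module are assumptions made for illustration.

```python
from fractions import Fraction

def rank(vectors):
    """Rank of a list of integer vectors, by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    found = 0                                   # number of pivots so far
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((r for r in range(found, len(rows))
                      if rows[r][col] != 0), None)
        if pivot is None:
            continue                            # no pivot in this column
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(found + 1, len(rows)):
            factor = rows[r][col] / rows[found][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found

REACTION_VECTORS = [
    [-1, 2, 0, 0, 0],    # A1 -> 2A2
    [1, -2, 0, 0, 0],    # 2A2 -> A1
    [-1, 0, -1, 1, 0],   # A1 + A3 -> A4
    [1, 0, 1, -1, 0],    # A4 -> A1 + A3
    [0, 1, 0, -1, 1],    # A4 -> A2 + A5
    [1, -1, 1, 0, -1],   # A2 + A5 -> A1 + A3
]
s = rank(REACTION_VECTORS)
```

For network (1) the six reaction vectors span only a three-dimensional subspace of $\mathbb{R}^5$, so `s` is 3.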

In language we now have available, we can say that,
for any composition and temperature, $r(c, T)$ is constrained
to point only along the stoichiometric subspace.
Provided that $S$ is indeed smaller than $\mathbb{R}^N$ (i.e.,
$s < N$), this has important dynamical implications.
We will examine these implications soon, but first we
should see that, for networks that respect conservation
of mass, $S$ must be smaller than $\mathbb{R}^N$.

5.5 An Elementary Property of Mass-Conserving 
Reaction Networks 

Suppose that, for our sample network (1), $M_1, M_2, \dots, M_5$
are the molecular weights of the five species $A_1, A_2, \dots, A_5$.
Although a chemical transformation takes place
when the reaction $A_1 \to 2A_2$ occurs, we nevertheless
expect the reaction to conserve mass. That is, we expect
that the total molecular weight of molecules on the
reactant side of the arrow will be the same as the total
molecular weight of molecules on the product side.
More precisely, if the reaction does indeed respect mass
conservation, we expect that $M_1 = 2M_2$. Similarly, if
the reaction $A_1 + A_3 \to A_4$ is consistent with mass
conservation, then we should have $M_1 + M_3 = M_4$.

More generally, if $y_i \to y_j$ is a reaction in a network
with $N$ species, then the reaction will be consistent with
mass conservation only if

\[ \sum_{L=1}^{N} y_{iL} M_L = \sum_{L=1}^{N} y_{jL} M_L, \]

where, as in our example, $M_1, M_2, \dots, M_N$ are the molecular
weights of the species. Letting $M = [M_1, M_2, \dots, M_N]$
be the vector of molecular weights, we can express
this condition in vector terms through the standard
scalar product in $\mathbb{R}^N$:

\[ y_i \cdot M = y_j \cdot M. \]

If all reactions of the network respect mass conservation,
then we have

\[ (y_j - y_i) \cdot M = 0, \quad \forall\, y_i \to y_j \in \mathcal{R}. \]

This is to say that $M$ is orthogonal to each of the reaction
vectors and, therefore, to their span. Because molecular
weights are positive, $M$ is not the zero vector, whereupon $S$
is indeed smaller than $\mathbb{R}^N$.
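
The orthogonality condition is easy to test for a candidate vector of molecular weights. In the sketch below the weights `M = [2, 1, 1, 3, 2]` are hypothetical values, chosen only because they satisfy $M_1 = 2M_2$, $M_1 + M_3 = M_4$, and $M_4 = M_2 + M_5$; the helper names are likewise assumptions.

```python
REACTION_VECTORS = [
    [-1, 2, 0, 0, 0],    # A1 -> 2A2
    [1, -2, 0, 0, 0],    # 2A2 -> A1
    [-1, 0, -1, 1, 0],   # A1 + A3 -> A4
    [1, 0, 1, -1, 0],    # A4 -> A1 + A3
    [0, 1, 0, -1, 1],    # A4 -> A2 + A5
    [1, -1, 1, 0, -1],   # A2 + A5 -> A1 + A3
]

def dot(u, v):
    """Standard scalar product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def conserves_mass(reaction_vectors, weights):
    """True if the weight vector is orthogonal to every reaction vector."""
    return all(dot(v, weights) == 0 for v in reaction_vectors)

M = [2, 1, 1, 3, 2]                        # hypothetical molecular weights
ok = conserves_mass(REACTION_VECTORS, M)
bad = conserves_mass(REACTION_VECTORS, [1, 1, 1, 1, 1])
```

Here `ok` is true, whereas equal weights for all five species fail the test because $A_1 \to 2A_2$ would then double the mass.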

5.6 The Batch Reactor Viewed Geometrically 

It is in the batch reactor setting that we can see most 
clearly the implications of the fact that the species- 
formation-rate function takes values in the stoichio- 
metric subspace for the underlying reaction network. 


Recall that the governing vector differential equation
is $\dot{c}(t) = r(c(t), T(t))$. Here $r(\cdot,\cdot)$ is the
species-formation-rate function for the operative kinetic
system. For that system we will suppose that the stoichiometric
subspace is smaller than $\mathbb{R}^N$. To be concrete, we
will also suppose that the temperature variation with
time is controlled by an external agent. (The temperature
might be maintained constant, but that is not
essential to what we will say.)

Our interest will be in understanding how the composition
vector moves around the nonnegative orthant
of $\mathbb{R}^N$ if an initial composition $c(0) = c^0 \in \mathbb{R}^N$ is specified.
By means of the following analogy we can see why
the motion is very highly constrained by the reaction 
network itself, no matter what the kinetics might be. 

5.6.1 A Bug Analogy

Imagine that a friendly bug is sitting at the very top of 
the doorknob in your bedroom. At time t = 0 the bug 
begins flying around, and we are interested in predict- 
ing where in your bedroom the bug will be a minute 
later. (It is winter. The doors and windows are closed. 
We are assuming that the bug suffers no violence.) 
Without further information, it is hard to say anything 
meaningful in advance about the bug’s position, even 
qualitatively. 

But suppose that, for some deep entomological rea- 
son, the bug chooses to have its velocity vector always 
point along your bedroom floor. Then it is intuitively 
clear that, for all $t \ge 0$, the bug must be on the plane
parallel to the floor that passes through the top of the 
doorknob. True, we do not know precisely where the 
bug will be on that plane, but we do know a lot about 
where he or she cannot be. 

5.6.2 Stoichiometric Compatibility

For stoichiometric rather than entomological reasons, 
the batch reactor behaves in essentially the same way. 
Because $r(\cdot,\cdot)$ invariably takes values in $S$, the composition
“velocity” $\dot{c}$ must always point along $S$. It is at
least intuitively clear, then, that for all $t \ge 0$, $c(t)$ must
lie in the translate of $S$ containing $c^0$.

In figure 1 we depict schematically the situation for 
a kinetic system based on a three-species network with 
reactions $A_1 \rightleftarrows A_2$ and $A_1 + A_2 \rightleftarrows A_3$. The stoichiometric
subspace is two dimensional: it is a plane through
the origin that sits behind the nonnegative orthant,
$\overline{\mathbb{R}}_+^3$. We denote by $c^0 + S$ the translate of the stoichiometric
subspace containing the initial composition $c^0$.
Thus, we expect the resulting composition trajectory
to reside entirely within the intersection of $c^0 + S$ with
$\overline{\mathbb{R}}_+^3$. That intersection appears as a shaded triangle in
figure 1.

The set of compositions within the triangle is an 
example of what we shall call a stoichiometric compat- 
ibility class. A composition outside the triangle is inac- 
cessible from a composition within the triangle; the two 
compositions are not stoichiometrically compatible. 

More generally, for a reaction network with stoichiometric
subspace $S \subset \mathbb{R}^N$ we say that two compositions
$c$ and $c'$ are stoichiometrically compatible if $c' - c$ is
a member of $S$. Thus, the set $\overline{\mathbb{R}}_+^N$ of all possible
compositions (the nonnegative orthant of $\mathbb{R}^N$) can be
partitioned into stoichiometric compatibility classes: that is,
subsets of compositions that are mutually stoichiometrically
compatible. As with our example, each of these is a convex
set of the form $(c^0 + S) \cap \overline{\mathbb{R}}_+^N$ for some vector
$c^0 \in \overline{\mathbb{R}}_+^N$. A batch reactor composition trajectory
that begins within a given
stoichiometric compatibility class will never leave it.
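
For the three-species example, stoichiometric compatibility can be checked with a single scalar product. The sketch below rests on an observation that is specific to that example and is an assumption of this illustration: the vector `NORMAL = [1, 1, 2]` is orthogonal to both reaction vectors, and since $S$ is a plane in $\mathbb{R}^3$, a vector lies in $S$ exactly when it is orthogonal to `NORMAL`. The function and variable names are hypothetical.

```python
from fractions import Fraction

# Reaction vectors for A1 <-> A2 and A1 + A2 <-> A3 (forward directions only;
# the reverse reactions contribute the negatives and do not enlarge the span).
REACTION_VECTORS = [[-1, 1, 0], [-1, -1, 1]]

# Orthogonal to both reaction vectors; in R^3 this characterizes the plane S.
NORMAL = [1, 1, 2]

def dot(u, v):
    """Exact scalar product using rational arithmetic."""
    return sum(Fraction(a) * Fraction(b) for a, b in zip(u, v))

def compatible(c, c_prime):
    """True if both compositions are nonnegative and c' - c lies in S."""
    nonneg = all(x >= 0 for x in c) and all(x >= 0 for x in c_prime)
    diff = [b - a for a, b in zip(c, c_prime)]
    return nonneg and dot(diff, NORMAL) == 0

c0 = [1, 1, 0]
same_class = compatible(c0, [0, 0, 1])    # one A1 and one A2 became one A3
other_class = compatible(c0, [1, 1, 1])   # extra A3 with nothing consumed
```

In general a network with rank $s$ in $\mathbb{R}^N$ has $N - s$ independent vectors orthogonal to $S$, and the same test would be applied to each of them.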

6 A Little Chemical Reaction Network Theory 

Although we have discovered, at least for the batch 
reactor, that composition trajectories are severely con- 
strained geometrically, the deepest questions are about 
what might happen within the various stoichiometric 
compatibility classes. 


Indeed, when questions are posed about the capacity
of a given kinetic system to admit, for example,
multiple equilibria, one is generally asking about
the possibility of two or more equilibria that are
stoichiometrically compatible with each other: that is,
that reside within the same stoichiometric compati- 
bility class. Similarly, when one asks about the sta- 
bility of a given equilibrium, one is generally inter- 
ested in stability relative to initial conditions that are 
stoichiometrically compatible with it. 

If we think back to the mass action differential equa- 
tions (4) that derived from our relatively simple net- 
work example, it is easy to see that these questions can 
be extremely difficult, and answers might depend on 
parameter values (e.g., rate constants). Nevertheless, in 
the early 1970s a body of theory began to emerge that, 
especially in the mass action case, draws firm connec- 
tions between qualitative properties of the differential 
equations for a kinetic system and the structure of the 
underlying network of chemical reactions. We need just 
a little more vocabulary to state, as an important exam- 
ple, an early reaction network theory result: the defi- 
ciency zero theorem, due to F. Horn, R. Jackson, and 
M. Feinberg. 

6.1 Some More Vocabulary 

It should be noted that, in our display of reaction network
(1), each complex was written just once, and then
arrows were drawn to indicate how the complexes are
connected by reactions. Formulated this way, the so-called
standard reaction diagram becomes a directed
graph, with the complexes serving as vertices and the
reaction arrows serving as directed edges. Note that
the diagram in (1) has two connected components: one
containing the complexes $A_1$ and $2A_2$, and the other
containing the complexes $A_1 + A_3$, $A_4$, and $A_2 + A_5$. In
the language of chemical reaction network theory the 
(not necessarily strong) components of the standard 
reaction diagram are called the linkage classes of the 
network. 

By a reversible reaction network a chemist usually 
means one in which each reaction is accompanied by 
its reverse. By a weakly reversible reaction network we 
mean one for which, in the standard reaction diagram, 
each reaction arrow is contained in a directed cycle. 
Every reversible network is weakly reversible. Network 
(1) is weakly reversible but not reversible.
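
Both notions of reversibility are properties of the directed graph alone, so they can be checked without any reference to kinetics. The sketch below tests them for network (1), with complexes numbered 1 to 5 in the order $A_1$, $2A_2$, $A_1 + A_3$, $A_4$, $A_2 + A_5$; the function names and the edge-list encoding are assumptions made for illustration.

```python
def reachable(start, edges):
    """Set of vertices reachable from start along directed edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def is_reversible(edges):
    """True if every arrow i -> j is accompanied by its reverse j -> i."""
    return all((j, i) in edges for i, j in edges)

def is_weakly_reversible(edges):
    """True if every arrow i -> j lies on a directed cycle (j reaches i)."""
    return all(i in reachable(j, edges) for i, j in edges)

NETWORK_1 = [(1, 2), (2, 1), (3, 4), (4, 3), (4, 5), (5, 3)]
weakly_reversible = is_weakly_reversible(NETWORK_1)
reversible = is_reversible(NETWORK_1)
```

The arrow from complex 4 to complex 5 has no reverse, but it lies on the cycle 4, 5, 3, 4, which is why the network is weakly reversible without being reversible.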

The deficiency of a reaction network is a nonnega- 
tive integer index with which reaction networks can be 
classified. It is defined by

deficiency := #complexes - #linkage classes - rank. 

The deficiency is not a measure of a network’s size. A 
deficiency zero network can have hundreds of species 
and hundreds of reactions. 
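
Computing the deficiency requires only counting. The sketch below finds the linkage classes of network (1) with a small union-find structure and then applies the definition. The string labels for complexes and the function names are illustrative assumptions, and the rank is supplied as an input, using the value three stated for this network in section 6.2.

```python
def linkage_classes(complexes, reactions):
    """Connected components of the reaction diagram, ignoring arrow direction."""
    parent = {y: y for y in complexes}

    def find(y):
        while parent[y] != y:
            parent[y] = parent[parent[y]]   # path halving
            y = parent[y]
        return y

    for i, j in reactions:
        parent[find(i)] = find(j)
    classes = {}
    for y in complexes:
        classes.setdefault(find(y), set()).add(y)
    return list(classes.values())

def deficiency(complexes, reactions, rank):
    """Number of complexes minus number of linkage classes minus rank."""
    return len(complexes) - len(linkage_classes(complexes, reactions)) - rank

COMPLEXES = ["A1", "2A2", "A1+A3", "A4", "A2+A5"]
REACTIONS = [("A1", "2A2"), ("2A2", "A1"), ("A1+A3", "A4"),
             ("A4", "A1+A3"), ("A4", "A2+A5"), ("A2+A5", "A1+A3")]

classes = linkage_classes(COMPLEXES, REACTIONS)
delta = deficiency(COMPLEXES, REACTIONS, rank=3)   # rank taken from the text
```

With five complexes, two linkage classes, and rank three, `delta` is zero.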

6.2 The Deficiency Zero Theorem 

In what follows, it will be understood that we are considering
the isothermal version of (3), $\dot{c} = r(c)$. When
we speak of a positive stoichiometric compatibility class
we mean a (nonempty) intersection of a translate of
the stoichiometric subspace with the strictly positive
orthant of $\mathbb{R}^N$ (i.e., the interior of the nonnegative orthant $\overline{\mathbb{R}}_+^N$).

It is well known that mass action systems can admit, 
within a positive stoichiometric compatibility class, 
unstable equilibria, multiple equilibria, and cyclic com- 
position trajectories. The following theorem tells us 
that we should not expect such phenomena when the 
underlying reaction network has a deficiency of zero, 
no matter how complicated it might be. 

Theorem 1. For any reaction network of deficiency 
zero the following statements hold true. 

(i) If the network is not weakly reversible, then for 
any kinetics, not necessarily mass action, the cor- 
responding differential equations cannot admit an 
equilibrium in which all species concentrations are 
positive, nor can they admit a cyclic composition 
trajectory that passes through a composition in 
which all species concentrations are positive. 

(ii) If the network is weakly reversible, then, when 
the kinetics is mass action (but regardless of the 
positive values the rate constants take), each pos- 
itive stoichiometric compatibility class contains 
precisely one equilibrium, and it is asymptotically 
stable. Moreover, there is no nontrivial cyclic com- 
position trajectory along which all species concen- 
trations are positive. 

Proof of the theorem is a little complicated. Its sec- 
ond part involves a Lyapunov function suggested by 
classical thermodynamics. 

Recall the system of isothermal mass action differen- 
tial equations (4) for our sample network (1). Without 
the help of an overarching theory it would be difficult 
to determine, even for specified values of the rate con- 
stants, the presence, for example, of multiple stoichio- 
metrically compatible positive equilibria or a positive 
cyclic composition trajectory. 


Note, however, that the network has five complexes, 
two linkage classes, and rank three. Thus, its deficiency 
is zero. Moreover, the network is weakly reversible. The 
theorem tells us immediately that, regardless of the rate 
constant values, the dynamics must be of the relatively 
dull, stable kind that the theorem describes. 

For an introduction to more recent and very dif- 
ferent reaction network theory results, see the article 
by Craciun et al., which is more graph-theoretical in
spirit and which has an emphasis on enzyme-driven 
chemistry. 
V.8 Divergent Series: Taming the Tails

Michael V. Berry and Christopher J. Howls

1 Introduction 

By the seventeenth century, in the field that became
the theory of convergent series, it was beginning to 
be understood how a sum of infinitely many terms 
could be finite; this is now a fully developed and largely 
standard element of every mathematician's education. 
Contrasting with those series, we have the theory of 
series that do not converge, especially those in which 
the terms first get smaller but then increase factorially: 
this is the class of “asymptotic series,” encountered fre- 
quently in applications, with which this article is mainly 
concerned. Although now a vibrant area of research, 
the development of the theory of divergent series has 
been tortuous and has often been accompanied by controversy.
As a pedagogical device to explain the subtle
concepts involved, we will focus on the contributions
of individuals and describe how the ideas developed 
during several (overlapping) historical epochs, often 
driven by applications (ranging from wave physics to 
number theory). This article complements perturba- 
tion theory and asymptotics [IV.5] elsewhere in this 
volume. 

2 The Classical Period 

In 1747 the Reverend Thomas Bayes (better known for 
his theorem in probability theory) sent a letter to Mr. 
John Canton (a Fellow of the Royal Society); it was 
published posthumously in 1763. Bayes demonstrated 
that the series now known as Stirling’s expansion for 
log(z!), “asserted by some eminent mathematicians,” 
does not converge. Arguing from the recurrence rela- 
tion relating successive terms of the series, he showed 
that the coefficients “increase at a greater rate than 
what can be compensated by an increase of the pow- 
ers of z, though z represent a number ever so large.” 
We would now say that this expansion of the facto- 
rial function is a factorially divergent asymptotic series. 
The explicit form of the series, written formally as an 
equality, is 

\[
\log(z!) = \left(z + \tfrac{1}{2}\right)\log z + \log\sqrt{2\pi} - z
  + \frac{1}{2\pi^{2} z}\sum_{r=0}^{\infty} (-1)^{r}\,\frac{a_r}{(2\pi z)^{2r}},
\]

where

\[
a_r = (2r)!\,\sum_{n=1}^{\infty} \frac{1}{n^{2r+2}}.
\]

Bayes claimed that Stirling's series “can never prop- 
erly express any quantity at all” and the methods used 
to obtain it “are not to be depended upon.” 
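
Bayes's observation can be checked numerically. The following is a minimal sketch, assuming the series has the standard form $\sum_r (-1)^r a_r / (2\pi^2 z (2\pi z)^{2r})$ with $a_r = (2r)!\,\zeta(2r+2)$; the function names and the choice z = 2 are illustrative and not taken from the text. It lists the correction terms and the error left after each truncation.

```python
import math

def zeta(s, n_terms=1000):
    """Riemann zeta for real s >= 2: direct sum plus an Euler-Maclaurin tail."""
    n = n_terms
    head = sum(k ** -s for k in range(1, n + 1))
    tail = n ** (1 - s) / (s - 1) - 0.5 * n ** -s + s * n ** (-s - 1) / 12.0
    return head + tail

def stirling_terms(z, r_max):
    """Correction terms (-1)^r a_r / (2 pi^2 z (2 pi z)^(2r)), a_r = (2r)! zeta(2r+2)."""
    terms = []
    for r in range(r_max + 1):
        a_r = math.factorial(2 * r) * zeta(2 * r + 2)
        denom = 2.0 * math.pi ** 2 * z * (2.0 * math.pi * z) ** (2 * r)
        terms.append((-1) ** r * a_r / denom)
    return terms

def truncation_errors(z, r_max):
    """Absolute error in log(z!) after including correction terms 0..N, for each N."""
    leading = (z + 0.5) * math.log(z) + 0.5 * math.log(2.0 * math.pi) - z
    exact = math.lgamma(z + 1.0)          # log(z!) from the standard library
    errors, partial = [], leading
    for term in stirling_terms(z, r_max):
        partial += term
        errors.append(abs(partial - exact))
    return errors

z = 2.0
terms = stirling_terms(z, 25)
errors = truncation_errors(z, 25)
r_smallest = min(range(len(terms)), key=lambda r: abs(terms[r]))
print("index of smallest term:", r_smallest)
print("best truncation error :", min(errors))
print("error after 26 terms  :", errors[-1])
```

The terms shrink at first and then grow without bound, so adding more of them eventually makes the approximation worse, although stopping near the smallest term gives a very accurate value.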

Leonhard Euler, in extensive investigations of a wide 
variety of divergent series beginning several years after 
Bayes sent his letter, took the opposite view. He argued 
that such series have a precise meaning, to be decoded 
by suitable resummation techniques (several of which 
he invented): “if we employ [the] definition . . . that . . . the 
sum of a series is that quantity which generates the 
series, all doubts with respect to divergent series vanish 
and no further controversy remains.” 

With the development of rigorous analysis in the 
nineteenth century, Euler’s view, which as we will see 
is the modern one, was sidelined and even derided. As 
Niels Henrik Abel wrote in 1828: “Divergent series are 
the invention of the devil, and it is shameful to base 
on them any demonstration whatsoever.” Nevertheless,
divergent series, especially factorially divergent ones,
repeatedly arose in applications. Toward the end of the
century they were embraced by Oliver Heaviside, who 
used them in pioneering studies of radio-wave propa- 
gation. He obtained reliable results using undisciplined 
semiempirical arguments that were criticized by math- 
ematicians, much to his disappointment: “It is not easy 
to get up any enthusiasm after it has been artificially 
cooled by the wet blanket of rigorists.” 

3 The Neoclassical Period 

In 1886 Henri Poincaré published a definition of asymptotic
power series, involving a large parameter z, that
was both a culmination of previous work by analysts
and the foundation of much of the rigorous mathematics
that followed. A series of the form $\sum_{n=0}^{\infty} a_n/z^n$ is
defined as asymptotic by Poincaré if the error resulting
from truncation at the term n = N vanishes as $K/z^{N+1}$
(K > 0) as $|z| \to \infty$ in a certain sector of the complex
z-plane. In retrospect, Poincaré’s definition seems a retrograde
step because, although it encompasses convergent
as well as divergent series in one theory, it fails
to address the distinctive features of divergent series 
that ultimately lead to the correct interpretation that 
can also cure their divergence. 

It was George Stokes, in research inspired by physics 
nearly four decades before Poincaré, who laid the foundations
of modern asymptotics. He tackled the problem
of approximating an integral devised by George Airy to 
describe waves near caustics, the most familiar exam- 
ple being the rainbow. This is what we now call the Airy 
function Ai(z), defined by the oscillatory integral 

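\[
\mathrm{Ai}(z) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \exp\left(\tfrac{1}{3}\mathrm{i}t^{3} + \mathrm{i}zt\right)\mathrm{d}t,
\]

the rainbow-crossing variable being Re z (figure 1) and
the light intensity being $\mathrm{Ai}^{2}(z)$. Stokes derived the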
asymptotic expansion representing the Airy function 
for z > 0 and showed that it is factorially divergent. 
His innovation was to truncate this series not at a fixed 
order N but at its smallest term (optimal truncation),
corresponding to an order N(z) that increases with z. 
By studying the remainder left after optimal truncation,
he showed that it is possible to achieve exponential
accuracy (figure 1), far beyond the power-law accuracy
envisaged in Poincaré’s definition. We will call such
optimal truncation superasymptotics. 
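
The location of the smallest term can be illustrated with a short computation. This is a sketch under stated assumptions: it uses the standard large-z form $\mathrm{Ai}(z) \sim e^{-\zeta}/(2\sqrt{\pi}\,z^{1/4}) \sum_n (-1)^n u_n/\zeta^n$ with $\zeta = \tfrac{2}{3}z^{3/2}$, and the coefficient recurrence for $u_n$ is the usual one from reference tables; neither is written out in the text. The value of z is the one quoted in the caption of figure 1.

```python
def airy_series_terms(zeta, n_max):
    """Magnitudes u_n / zeta^n of the terms multiplying the leading exponential."""
    terms = [1.0]                      # u_0 = 1
    for n in range(1, n_max + 1):
        # Ratio u_n / u_{n-1} for the Airy coefficients.
        ratio = (6 * n - 5) * (6 * n - 3) * (6 * n - 1) / (216.0 * n * (2 * n - 1))
        terms.append(terms[-1] * ratio / zeta)
    return terms

z = 5.24148
F = 4.0 * z ** 1.5 / 3.0               # the quantity F = 4 z^(3/2) / 3
zeta = F / 2.0                         # zeta = (2/3) z^(3/2)
terms = airy_series_terms(zeta, 40)
n_opt = min(range(len(terms)), key=terms.__getitem__)

print("F =", F)
print("index of smallest term:", n_opt)
print("size of smallest term :", terms[n_opt])
print("size of term 40       :", terms[40])
```

The smallest term sits at the integer nearest to F, and its size is exponentially small in F, which is the origin of the exponential accuracy of optimal truncation.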

Figure 1 (a) The rainbow-crossing variable, along any line
transverse to the rainbow curve. (b) The Airy function.
(c) The error from truncating the asymptotic series for Ai(z) at
the term $n - 1$, for z = 5.24148; optimal truncation occurs
at the nearest integer to $F = 4z^{3/2}/3$, i.e., at n = 16.

Superasymptotics enabled Stokes to understand a
much deeper phenomenon, one that is fundamental to
the understanding of divergent series. In Ai(z), z > 0
corresponds to the dark side of the rainbow, where the
function decays exponentially; physically, this repre- 
sents an evanescent wave. On the bright side z < 0, the 
function oscillates trigonometrically, that is, as the sum 
of two complex exponential contributions, each repre- 
senting a wave; the interference of these waves gener- 
ates the “supernumerary rainbows” (whose observation 
was one of the phenomena earlier adduced by Thomas 
Young in support of his view that light is a wave phe- 
nomenon). One of these complex exponentials is the 
continuation across z = 0 of the evanescent wave on 
the dark side. But where does the other originate? 

Figure 2 The complex plane of argument z (whose real part
is the z of figure 1) of Ai(z), showing the Stokes line at
arg z = 120°.

Stokes’s great discovery was that this second exponential
appears during continuation of Ai(z) in the
complex plane from positive to negative z, across what
is now called a “Stokes line,” where the dark-side exponential
reaches its maximum size. Alternatively stated,
the small (subdominant) exponential appears when 
maximally hidden behind the large (dominant) one. For 
Ai(z) the Stokes line is arg z = 120° (figure 2).

Stokes thought that the least term in the asymp- 
totic series, representing the large exponential, con- 
stituted an irreducible vagueness in the description 
of Ai(z) in his superasymptotic scheme. By quanti- 
tative analysis of the size of this least term, Stokes 
concluded that only at maximal dominance could this 
obscure the small exponential, which could then appear 
without inconsistency. As we will explain later, Stokes 
was wrong to claim that superasymptotics — optimal 
truncation — represents the best approximation that 
can be achieved within asymptotics. But his identifi- 
cation of the Stokes line as the place where the small 
exponential is born (figure 3) was correct. Moreover, he 
also appreciated that the concept was not restricted to 
Ai(z) but applies to a wide variety of functions aris- 
ing from integrals, solutions of differential equations 
and recurrence equations, etc., for which the associated 
asymptotic series are factorially divergent. 

This Stokes phenomenon, connecting different expo- 
nentials representing the same function, is central to 
our current understanding of such divergent series, 
and it is the feature that distinguishes them most 
sharply from convergent ones. In view of this semi- 
nal contribution, it is ironic that George (“G. H.”) Hardy 
makes no mention of the Stokes phenomenon in his 
textbook Divergent Series. Nor does he exempt Stokes 
from his devastating assessment of nineteenth-century 
English mathematics: “there [has been] no first-rate 
subject, except music, in which England has occupied 
so consistently humiliating a position. And what have 
been the peculiar characteristics of such English math- 
ematics?... For the most part, amateurism, ignorance, 
incompetence, and triviality.”

Figure 3 Approximations to $\mathrm{Ai}(z e^{\mathrm{i}\theta})$ in the Argand plane,
(Re(Ai), Im(Ai)) for z = 1.31..., i.e., $F = 4z^{3/2}/3 = 2$,
plotted parametrically from $\theta = 0°$ (black circle) to
$\theta = 180°$ (black square). The curves are exact Ai (solid
black line), lowest-order asymptotics (no correction terms;
dotted black line), optimal truncation without the Stokes
jump (dot-dashed black line), optimal truncation including
the Stokes jump (dashed black line), and optimal truncation
including the smoothed Stokes jump (solid gray line).
For this value of F, the optimally truncated sum contains
only two terms; on the scale shown, the Stokes jump would
be invisible for larger F. Note that without the Stokes jump
(solid and dotted black lines), the asymptotics must deviate
from the exact function beyond the Stokes line at $\theta = 120°$.

4 The Modern Period 

Late in the nineteenth century, Jean-Gaston Darboux 
showed that for a wide class of functions the high 
derivatives diverge factorially. This would become an 
important ingredient in later research, for the follow- 
ing reason. Asymptotic expansions (particularly those 
encountered in physics and applied mathematics) are 
often based on local approximations: the steepest- 
descent method for approximating integrals is based 
on local expansion about a saddle point; the phase- 
integral method for solving differential equations (e.g., 
the Wentzel-Kramers-Brillouin (WKB) approximation 
to Schrödinger’s equation in quantum mechanics) is
based on local expansions of the coefficients; and so on. 
Therefore, successive orders of approximation involve 
successive derivatives, and the high orders, respon- 
sible for the divergence of the series, involve high 
derivatives. 

Another major late-nineteenth-century ingredient of 
our modern understanding was Émile Borel’s development
of a powerful summation method in which the
factorials causing the high orders to diverge are tamed 
by replacing them with their integral representations. 


Often this enables the series to be summed “under 
the integral sign.” Underlying the method is the formal 
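equality

\[
\sum_{r=0}^{\infty} \frac{a_r}{z^{r}} = \int_{0}^{\infty} \mathrm{d}t\, e^{-t} \sum_{r=0}^{\infty} \frac{a_r}{r!}\left(\frac{t}{z}\right)^{r}.
\]
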
Reading this from right to left is instructive. Inter- 
changing summation and integration shows why the 
series on the left diverges if the $a_r$ increase factorially
(as in the cases we are considering): the integral
is over a semi-infinite range, yet the sum in the inte- 
grand converges only for |t/z| < 1. Borel summation 
effectively repairs an analytical transgression that may 
have caused the divergence of the series. The power 
of Borel summation is that, as was fully appreciated 
only later, it can be analytically continued across Stokes 
lines, where some other summation techniques (e.g., 
Padé approximants) fail.
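
A small worked example may make the mechanism concrete. It is a sketch, not drawn from the text: it assumes the illustrative coefficients $a_r = (-1)^r r!$, for which replacing $r!$ by $\int_0^\infty e^{-t} t^r\,\mathrm{d}t$ and summing under the integral sign gives $\int_0^\infty e^{-t}/(1 + t/z)\,\mathrm{d}t$. The function names, the cutoff `t_max`, and the use of Simpson's rule are all choices made here for illustration.

```python
import math

def partial_sums(z, n_max):
    """Partial sums of the divergent series sum_r (-1)^r r! / z^r."""
    sums, total = [], 0.0
    for r in range(n_max + 1):
        total += (-1) ** r * math.factorial(r) / z ** r
        sums.append(total)
    return sums

def borel_sum(z, t_max=60.0, n_intervals=6000):
    """Simpson's rule for int_0^inf exp(-t) / (1 + t/z) dt, cut off at t_max."""
    h = t_max / n_intervals            # n_intervals must be even
    f = lambda t: math.exp(-t) / (1.0 + t / z)
    total = f(0.0) + f(t_max)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * f(i * h)
    return total * h / 3.0

z = 5.0
sums = partial_sums(z, 10)
borel = borel_sum(z)
for r, s in enumerate(sums):
    print(r, s, abs(s - borel))
print("Borel sum:", borel)
```

The partial sums approach the Borel-summed value while the terms decrease and then move away from it once the terms start to grow, whereas the integral itself is perfectly well defined for z > 0.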

Now we come to the central development in mod- 
ern asymptotics. In a seminal and visionary advance, 
motivated initially by mathematical difficulties in eval- 
uating some integrals occurring in solid-state physics 
and developed in a series of papers culminating in a 
book published in 1973, Robert Dingle synthesized earlier
ideas into a comprehensive theory of factorially
divergent asymptotic series. 

Dingle's starting point was Euler's insight that diver- 
gent series are obtained by a sequence of precisely 
specified mathematical operations on the integral or 
differential equation defining the function being ap- 
proximated, so the resulting series must represent the 
function exactly, albeit in coded form, which it is the 
task of asymptotics to decode. Next was the realiza- 
tion that Darboux’s discovery that high derivatives 
diverge factorially implies that the high orders of a 
wide class of asymptotic series also diverge factorially. 
This in turn means that the terms beyond Stokes’s opti- 
mal truncation— representing the tails of such series 
beyond superasymptotics — can all be Borel-summed in 
the same way. 

The next insight was Dingle’s most original contribu- 
tion. Consider a function represented by several differ- 
ent formal asymptotic series (e.g., those correspond- 
ing to the two exponentials in Ai(z)), each represent- 
ing the function differently in sectors of the complex 
plane separated by Stokes lines. Since each series is 
a formally complete representation of the function, 
each must contain, coded into its high orders, information
about all the other series. Darboux’s factorials are
therefore simply the first terms of asymptotic expansions
of each of the late terms of the original series.
Dingle appreciated that the natural variables implied by
Darboux’s theory are the differences between the various
exponents; these are usually proportional to the
large asymptotic parameter. In the simplest case, where
there are only two exponentials, there is one such variable,
which Dingle called the singulant, denoted by F.
For the Airy function Ai(z), $F = 4z^{3/2}/3$.

We display Dingle’s expression for the high orders for
an integral with two saddle points a and b, corresponding
to exponentials $e^{-F_a}$ and $e^{-F_b}$ with $F_{ab} = F_b - F_a$ and
series with terms $T_r^{(a)}$ and $T_r^{(b)}$; for $r \gg 1$, the terms
of the a series are related to those of the b series by

\[
T_r^{(a)} = K\,\frac{(r-1)!}{F_{ab}^{r}}\left(T_0^{(b)} + \frac{F_{ab}}{r-1}\,T_1^{(b)} + \frac{F_{ab}^{2}}{(r-1)(r-2)}\,T_2^{(b)} + \cdots\right),
\]

in which K is a constant. This shows that, although the
early terms $T_0^{(a)}, T_1^{(a)}, T_2^{(a)}, \ldots$ of an asymptotic series
can rapidly get extremely complicated, the high orders 
display a miraculous functional simplicity. 
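
The late-terms relation can be tested on the Airy coefficients. The sketch below rests on assumptions that are not in the text: it uses the standard Airy coefficients $u_n$ (with $u_0 = 1$, $u_1 = 5/72$), for which a relation of this type reads $u_n \approx \frac{1}{2\pi}\sum_j (-1)^j u_j (n-j-1)!/2^{\,n-j}$, corresponding to K = 1/(2π) and a singulant of magnitude 2 when the series is written in powers of $1/\zeta$, with $\zeta = \tfrac{2}{3}z^{3/2}$. The function names are hypothetical.

```python
import math
from fractions import Fraction

def airy_coefficients(n_max):
    """Exact u_0..u_n_max via u_n = u_{n-1} (6n-5)(6n-3)(6n-1) / (216 n (2n-1))."""
    u = [Fraction(1)]
    for n in range(1, n_max + 1):
        ratio = Fraction((6 * n - 5) * (6 * n - 3) * (6 * n - 1),
                         216 * n * (2 * n - 1))
        u.append(u[-1] * ratio)
    return u

def late_term_estimate(u, n, n_corrections):
    """(1/2pi) * sum_j (-1)^j u_j (n-j-1)! / 2^(n-j), keeping j = 0..n_corrections."""
    total = Fraction(0)
    for j in range(n_corrections + 1):
        total += (-1) ** j * u[j] * math.factorial(n - j - 1) / Fraction(2) ** (n - j)
    return float(total) / (2.0 * math.pi)

u = airy_coefficients(20)
exact = float(u[20])
err_leading = abs(late_term_estimate(u, 20, 0) / exact - 1.0)
err_three = abs(late_term_estimate(u, 20, 2) / exact - 1.0)
print("relative error, factorial only     :", err_leading)
print("relative error, three early terms  :", err_three)
```

Using only the factorial reproduces the twentieth coefficient to better than one percent, and feeding in the first three early coefficients improves this by roughly two orders of magnitude, which illustrates how the early terms reappear inside the late ones.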

With Borel’s as the chosen summation method, Din- 
gle's late-terms formula enabled the divergent tails of 
series to be summed in terms of certain terminant inte- 
grals and then reexpanded to generate new asymp- 
totic series that were exponentially small compared 
with the starting series. He envisaged that “these ter- 
minant expansions can themselves be closed with new 
terminants; and so on, stage after stage.” Such resum- 
mations, beyond superasymptotics, were later called 
hyperasymptotics. 

Dingle therefore envisaged a universal technique 
for repeated resummation of factorially divergent se- 
ries, to obtain successively more accurate exponential 
improvements far beyond that achievable by Stokes’s 
optimal truncation of the original series. The meaning 
of universality is that although the early terms— the 
ones that get successively smaller— can be very differ- 
ent for different functions, the summation method for 
the tails is always the same, involving terminant inte- 
grals that are the same for a wide variety of functions. 
The method automatically incorporates the Stokes phe- 
nomenon. Although Dingle clearly envisaged the hyper- 
asymptotic resummation scheme as described above, 
he applied it only to the first stage; this was sufficient to 
illustrate the high improvement in numerical accuracy 
compared with optimal truncation. 



Figure 4 The Stokes multiplier (solid lines) and
error-function smoothing (dashed lines) for $\mathrm{Ai}(z e^{\mathrm{i}\theta})$, for
z = 1.717... (i.e., $F = 4z^{3/2}/3 = 3$) and z = 3.831... (i.e.,
$F = 4z^{3/2}/3 = 10$).


Like Stokes before him, Dingle presented his new 
ideas not in the “lemma, theorem, proof” style famil- 
iar to mathematicians but in the discursive manner 
of a theoretical physicist, perhaps explaining why it 
has taken several decades for the originality of his 
approach to be widely appreciated and accepted. Mean- 
while, his explicit relation, connecting the early and late 
terms of different asymptotic series representing the 
same function, was rediscovered independently by several
people. In particular, Jean Écalle coined the term
resurgence for the phenomenon, and in a sophisticated 
and comprehensive framework applied it to a very wide 
class of functions. 

5 The Postmodern Period 

One of the first steps beyond superasymptotics into 
hyperasymptotics was an application of Dingle’s ideas 
to give a detailed description of the Stokes phe- 
nomenon. In 1988 one of us (Michael Berry) resummed 
the divergent tail of the dominant series of the expan- 
sion, near a Stokes line, of a wide class of functions, 
including Ai(z). The analysis demonstrated that the 
subdominant exponential did not change suddenly, as 
had been thought; rather, it changed smoothly and, 
moreover, in a universal manner. In terms of Dingle’s 
singulant F, now defined as the difference between the 
exponents of the dominant and subdominant exponen- 
tials, the Stokes line corresponds to the positive real 
axis in the complex F-plane; asymptotics corresponds 
to $\operatorname{Re} F \gg 1$; and the Stokes phenomenon corresponds
to crossing the Stokes line, that is, Im F passing through
zero.

The result of the resummation is that the change
in the coefficient of the small exponential (the Stokes
multiplier) is universal for all factorially divergent
series and it is proportional to

\[
\frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{\operatorname{Im} F}{\sqrt{2\operatorname{Re} F}}\right)\right).
\]

In the limit $\operatorname{Re} F \to \infty$ this becomes the unit step. For
Re F large but finite, the formula describes the smooth
change in the multiplier (figure 4) and makes precise 
the description given by Stokes in 1902 after thinking 
about divergent series for more than half a century: 

The inferior term enters as it were into a mist, is hid- 
den for a little from view, and comes out with its coef- 
ficient changed. The range during which the inferior 
term remains in a mist decreases indefinitely as the 
[large parameter] increases indefinitely. 

The smoothing shows that the “range” referred to by 
Stokes (that is, the effective thickness of the Stokes line)
is of order $\sqrt{\operatorname{Re} F}$.
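
The scaling of the thickness can be checked directly. This is a minimal sketch, assuming the smoothed multiplier has the error-function form $\tfrac{1}{2}(1 + \operatorname{erf}(\operatorname{Im} F/\sqrt{2\operatorname{Re} F}))$; the function names and the 10%-to-90% definition of "width" are illustrative choices, not taken from the text.

```python
import math

def stokes_multiplier(F):
    """Smoothed multiplier 0.5 * (1 + erf(Im F / sqrt(2 Re F))), for Re F > 0."""
    return 0.5 * (1.0 + math.erf(F.imag / math.sqrt(2.0 * F.real)))

def transition_width(re_F, level=0.9):
    """Range of Im F over which the multiplier rises from 1 - level to level."""
    lo, hi = 0.0, 10.0 * math.sqrt(re_F)
    for _ in range(80):                # bisection on the monotonic erf profile
        mid = 0.5 * (lo + hi)
        if stokes_multiplier(complex(re_F, mid)) < level:
            lo = mid
        else:
            hi = mid
    return 2.0 * hi

for re_F in (3.0, 10.0, 40.0):
    print(re_F, transition_width(re_F), transition_width(re_F) / math.sqrt(re_F))
```

The ratio of the width to $\sqrt{\operatorname{Re} F}$ is the same for every value of Re F, so the transition widens in Im F like $\sqrt{\operatorname{Re} F}$ while becoming narrower relative to Re F itself.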

Implementation of the full hyperasymptotic repeated 
resummation scheme envisaged by Dingle has been carried
out in several ways. We (the present authors) investigated
one-dimensional integrals with several saddle
points, each associated with an exponential and its 
corresponding asymptotic series. With each “hyper- 
series” truncated at its least term, this incorporated 
all subdominant exponentials and all associated Stokes 
phenomena; and the accuracy obtained far exceeded 
superasymptotics (figure 5) but was nevertheless lim- 
ited. 

It was clear from the start that in many cases unlim- 
ited accuracy could, in principle, be achieved with 
hyperasymptotics by truncating the hyperseries not at 
the smallest term but beyond it (although this intro- 
duces numerical stability issues associated with the 
cancelation of larger terms). This version of the hyper- 
asymptotic program was carried out by Adri Olde 
Daalhuis, who reworked the whole theory, introducing 
mathematical rigor and effective algorithms for com- 
puting Dingle’s terminant integrals and their multidi- 
mensional generalizations, and applied the theory to 
differential equations with arbitrary finite numbers of 
transition points. 

Figure 5 The magnitude of terms in the first four stages
in the hyperasymptotic approximation to Ai(4.326...) =
$4.496\ldots \times 10^{-4}$, i.e., $F = 4z^{3/2}/3 = 12$, normalized so that
the lowest approximation is unity. For the lowest approximation,
i.e., no correction terms, the fractional error is
$\varepsilon \approx 0.01$; after stage 0 of hyperasymptotics, i.e., optimal
truncation of the series (superasymptotics), $\varepsilon \approx 3.6 \times 10^{-7}$;
after stage 1, $\varepsilon \approx 1.3 \times 10^{-11}$; after stage 2, $\varepsilon \approx 4.4 \times 10^{-14}$;
after stage 3, $\varepsilon \approx 6.1 \times 10^{-15}$. At each stage, the error is of
the same order as the first neglected term.

There has been an explosion of further developments.
Écalle’s rigorous formal theory of resurgence
has been developed in several ways, based on the Borel
(effectively inverse-Laplace) transform. This converts
the factorially divergent series into a convergent one,
with the radius of convergence determined by singularities
on a Riemann sheet. These singularities are
responsible for the divergence of the original series, 
and for integrals discussed above, they correspond to 
the adjacent saddle points. In the Borel plane, com- 
plex and microlocal analysis allows the resurgence link- 
ages between asymptotic contributions to be uncov- 
ered and exact remainder terms to be established. 
Notable results include exponentially accurate rep- 
resentations of quantum eigenvalues (R. Balian and 
C. Bloch, A. Voros, F. Pham, E. Delabaere); this inspired 
the work of the current authors on quantum eigen- 
value counting functions, linking the divergence of the 
series expansion of smoothed spectral functions to 
oscillatory corrections involving the classical periodic 
orbits. 

T. Kawai and Y. Takei have extended “formally exact,” 
exponentially accurate WKB analysis to several areas, 
most notably to Painlevé equations. They have also
developed a theory of “virtual turning points” and “new 
Stokes curves.” In the familiar WKB situation, with only 
two wavelike asymptotic contributions, several Stokes 
lines emerge from classical turning points, and (if they 
are nondegenerate) they never cross. With three or 
more asymptotic contributions, Stokes lines can cross 
in the complex plane at points where the WKB solu- 
tions are not singular. Local analysis shows that an 
extra, active, “new Stokes line” sprouts from one side 
only of this regular point; this can be shown to emerge
from a distant virtual turning point, where, unexpectedly,
the WKB solutions are not singular. This discovery
has been explained by C. J. Howls, P. Langman, and A. B. 
Olde Daalhuis in terms of the Riemann sheet structure 
of the Borel plane and linked hyperasymptotic expan- 
sions, and independently by S. J. Chapman and D. B. 
Mortimer in terms of matched asymptotics. 

Groups led by S. J. Chapman and J. R. King have devel- 
oped and applied the work of M. Kruskal and H. Segur to 
a variety of nonlinear and partial differential equation 
problems. This involves a local matched-asymptotic 
analysis near the distant Borel singularities that gen- 
erate the factorially divergent terms in the expansion 
to identify the form of late terms, thereby allowing 
for an optimal truncation and exponentially accurate 
approach. Applications include selection problems in 
viscous fluids, gravity-capillary solitary waves, oscil- 
lating shock solutions in Kuramoto-Sivashinsky equa- 
tions, elastic buckling, nonlinear instabilities in pat- 
tern formation, ship wave modeling, and the seeking 
of reflectionless hull profiles. Using a similar approach, 
O. Costin and S. Tanveer have identified and quanti- 
fied the effect of “daughter singularities” that are not 
present in the initial data of partial differential equa- 
tion problems but that are generated at infinitesimally 
short times. In so doing, they have also found a formally 
exact Borel representation for small-time solutions of 
three-dimensional Navier-Stokes equations, offering a
promising tool to explore the global existence problem. 

Additional applications include quantum transitions, 
quantum spectra, the Riemann zeta function high on 
the critical line, nonperturbative quantum field theory 
and string theory, and even the philosophy of rep- 
resenting physical theories describing phenomena at 
different scales by singular relations.

V.9 Financial Mathematics

René Carmona and Ronnie Sircar


The complexity, unpredictability, and evolving nature 
of financial markets continue to provide an enor- 
mous challenge to mathematicians, engineers, and 
economists in identifying, analyzing, and quantifying 
the issues and risks they pose. This has led to problems 
in stochastic analysis, simulation, differential equa- 
tions, statistics and big data, and stochastic control 
and optimization (including dynamic game theory), all 
of which are reflected in the core of financial mathe- 
matics. Problems range from modeling a single risky 
stock and the risks of derivative contracts written on 
it, to understanding how intricate interactions between 
financial institutions may bring down the whole finan- 
cial edifice and in turn the global economy: the prob- 
lem of systemic risk. At the same time, hitherto spe- 
cialized markets, such as those in commodities (met- 
als, agriculturals, and energy), have become more finan- 
cialized, which has led to our need to understand how 
financial reduced-form models combine with supply 
and demand mechanisms. 

In its early days, financial mathematics used to rest 
on two pillars that could be characterized roughly as 
derivatives pricing and portfolio selection [V.10]. 
In this article we outline its broadening into newer 
topics including, among others, energy and commodi- 
ties markets, systemic risk, dynamic game theory and 
equilibrium, and understanding the impact of algorithmic
and high-frequency trading. We also touch on the
2008 financial crisis (and others) and the extent to
which such increasingly frequent tremors call for more 
mathematics, not less, in understanding and regulating 
financial markets and products. 

1 The Pricing and Development of 
Derivative Markets 

Central to the development of quantitative finance, as 
distinguished from classical economics, was the mod- 
eling of uncertainty about future price fluctuations as 
being phenomenological, rather than something that 
could be accurately captured by models of fundamen- 
tals, or demand and supply. The introduction of ran- 
domness into models of (initially) stock prices took 
off in the 1950s and 1960s, particularly in the work 
of the economist Paul Samuelson at MIT, who adopted 
the continuous-time Brownian motion based tools of 
stochastic calculus that had been developed by physi- 
cists and mathematicians such as Einstein in 1905; 
Wiener in the 1920s; Lévy, Ornstein, and Uhlenbeck in
the 1930s; and Chandrasekhar in the 1940s. Samuel- 
son, however, came to this mathematical machinery 
not through physics but via a little-known doctoral dis- 
sertation by Bachelier from 1900, which had formu- 
lated Brownian motion, for the purpose of modeling 
the Paris stock market, five years before Einstein’s land- 
mark paper in physics (“On the motion of small parti- 
cles suspended in a stationary liquid, as required by the 
molecular kinetic theory of heat”). Combined with the 
stochastic calculus developed by Itô in the early 1940s,
these kinds of models became, and still remain, central 
to the analysis of a wide range of financial markets. 

The Black-Scholes paradigm for equity derivatives 
was originally introduced in the context of Samuelson’s 
model, in which stocks evolve according to geometric 
Brownian motions. If we consider a single stock for the 
sake of exposition, one assumes that the value at time 
t of one share is $S_t$, and $S_t$ evolves according to the
stochastic differential equation

\[
\mathrm{d}S_t = \mu S_t\,\mathrm{d}t + \sigma S_t\,\mathrm{d}W_t, \tag{1}
\]

with $\mu$ representing the expected growth rate, and $\sigma > 0$
the volatility. Here, $(W_t)_{t \ge 0}$ is a standard Brownian
motion. A contingent claim (or derivative security) on
this stock is defined by its payoff at a later date T, called
its maturity, and modeled as a random variable $\xi$ whose
uncertain value will be revealed at time T. The typical
example is given by a European call option with maturity
T and strike K, in which case the payoff to the buyer
of the option would be $\xi = (S_T - K)^{+}$, where we use the
notation $x^{+} = \max\{x, 0\}$ for the positive part of the
real number x.

A desirable property of a model is to exclude arbi- 
trage opportunities, namely, the possibility of making 
money with no risk (often called a “free lunch”). If this 
property holds, then the value of owning the claim 
should be equal to the value of an investment in the 
stock and a safe bank account that could be managed 
by dynamic rebalancing into the same value as $\xi$ at
maturity T. Such a replicating portfolio would represent
a perfect hedge, mitigating the uncertainty in the
outcome of the option payoff $\xi$. The remarkable discovery
in the early 1970s of Fischer Black and Myron
Scholes, and independently by Robert Merton, was that 
it was possible to identify such a perfect replicating 
portfolio and compute the initial investment needed to 
set it up by solving a linear partial differential equa- 
tion (PDE) of parabolic type. They also provided formu- 
las for the no-arbitrage prices of European call and put 
options that are now known as Black-Scholes formulas. 
Later, in two papers that both appeared in 1979, the 
result was given its modern formulation, first by Cox, 
Ross, and Rubinstein in a simpler discrete model and 
then by Harrison and Kreps in more general settings. 
Accordingly, if there are no arbitrage opportunities, the 
prices of all traded securities (stocks, futures, options) 
are given by computing the expectation of the present 
value (i.e., the discounted value) of their future payoffs 
but with respect to a “risk-neutral” probability measure 
under which the growth rate $\mu$ in (1) is replaced by the
riskless interest rate of the bank account. 
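
Pricing by expectation can be sketched for a European call. The code below is an illustration under stated assumptions: it uses the well-known closed-form Black-Scholes call price and compares it with a Monte Carlo average of the discounted payoff, simulating the stock as a geometric Brownian motion with growth rate equal to the riskless rate r. The parameter values, function names, and random seed are arbitrary choices made here.

```python
import math
import random

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(s0, strike, r, sigma, maturity):
    """Closed-form Black-Scholes price of a European call (no dividends)."""
    vol = sigma * math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (r + 0.5 * sigma ** 2) * maturity) / vol
    d2 = d1 - vol
    return s0 * norm_cdf(d1) - strike * math.exp(-r * maturity) * norm_cdf(d2)

def monte_carlo_call(s0, strike, r, sigma, maturity, n_paths, seed=0):
    """Discounted average payoff (S_T - K)^+ under the risk-neutral measure."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        w_T = math.sqrt(maturity) * rng.gauss(0.0, 1.0)   # Brownian motion at time T
        s_T = s0 * math.exp((r - 0.5 * sigma ** 2) * maturity + sigma * w_T)
        total += max(s_T - strike, 0.0)
    return math.exp(-r * maturity) * total / n_paths

bs_price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
mc_price = monte_carlo_call(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=200_000)
print("Black-Scholes formula:", bs_price)
print("Monte Carlo estimate :", mc_price)
```

Note that the expected growth rate of the stock never enters the computation: only the riskless rate and the volatility appear, which is the content of the risk-neutral pricing statement.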

While pricing by expectation is one of the important 
consequences of the original works of Black, Scholes, 
and Merton, it would be misleading and unfair to 
reduce their contribution to this aspect, even if it is 
at the origin of a wave (in retrospect, it was clearly a 
tsunami) of interest in derivatives and the explosion 
of new and vibrant markets. Even though the origi- 
nal rationale was based on a hedging argument aimed 
at mitigating the uncertainties in the future outcomes 
of the value of the stock underlying the option, pric- 
ing, and especially pricing by expectation (whether the 
expectations were computed by Monte Carlo simula- 
tions or by PDEs provided by the Feynman-Kac for- 
mula), became the main motivation for many research 
programs. 

1.1 Stochastic Volatility Models 

It has long been recognized (even by Black and Scholes, 
and many others, in the 1970s) that the lognormal
distribution inherent in the geometric Brownian motion
model (1) is not reflected by historical stock price data
and that volatility is not constant over time. Market participants
were pricing options, even on a given day t, as
if the volatility parameter $\sigma$ depended upon the strike
K and the time to maturity $T - t$ of the option.

To this day, two extensions of the model (1) have 
been used with great success to account for these styl- 
ized facts observed in empirical market data. They 
both involve replacing the volatility parameter by a 
stochastic process, so they can be viewed as stochas- 
tic volatility models. The first extension is to replace 
(1) by 

\[
\mathrm{d}S_t = \mu S_t\,\mathrm{d}t + \sigma_t S_t\,\mathrm{d}W_t, \tag{2}
\]

where the volatility parameter $\sigma$ is replaced by the
value at time t of a stochastic process $(\sigma_t)_{t \ge 0}$ whose
time evolution could be (for example) of the form

\[
\mathrm{d}\sigma_t = \lambda(\bar{\sigma} - \sigma_t)\,\mathrm{d}t + \gamma\sqrt{\sigma_t}\,\mathrm{d}\tilde{W}_t \tag{3}
\]

for some constants $\gamma$ (known as volvol), $\bar{\sigma}$ (the mean
reversion level), and $\lambda$ (the rate of mean reversion) and
where $(\tilde{W}_t)_{t \ge 0}$ is another Brownian motion that is typically
negatively correlated with $(W_t)_{t \ge 0}$ to capture the
fact that, when volatility rises, prices usually decline.
Stochastic volatility models of this type have been (and 
are still) very popular. 
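
A simulation sketch may help to fix ideas. It assumes a model of the kind just described, with a mean-reverting square-root process for the volatility driven by a Brownian motion correlated with the one driving the stock; the discretization (a log-Euler step for the stock and an Euler step truncated at zero for the volatility), the parameter values, and all names are choices made here for illustration.

```python
import math
import random

def simulate_stochastic_volatility(s0, sigma0, mu, lam, sigma_bar, gamma, rho,
                                   maturity, n_steps, n_paths, seed=0):
    """Return terminal stock prices and volatilities for n_paths simulated paths."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    sqrt_dt = math.sqrt(dt)
    prices, vols = [], []
    for _ in range(n_paths):
        s, sigma = s0, sigma0
        for _step in range(n_steps):
            z1 = rng.gauss(0.0, 1.0)
            # Second shock, correlated with the first with correlation rho.
            z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            # Stock: log-Euler step, which keeps the price positive.
            s *= math.exp((mu - 0.5 * sigma ** 2) * dt + sigma * sqrt_dt * z1)
            # Volatility: Euler step for the square-root process, floored at zero.
            sigma += lam * (sigma_bar - sigma) * dt + gamma * math.sqrt(sigma) * sqrt_dt * z2
            sigma = max(sigma, 0.0)
        prices.append(s)
        vols.append(sigma)
    return prices, vols

prices, vols = simulate_stochastic_volatility(
    s0=100.0, sigma0=0.2, mu=0.05, lam=2.0, sigma_bar=0.2, gamma=0.3, rho=-0.7,
    maturity=1.0, n_steps=50, n_paths=5000)
mean_price = sum(prices) / len(prices)
print("sample mean of S_T:", mean_price)
print("S_0 * exp(mu * T) :", 100.0 * math.exp(0.05))
print("range of sigma_T  :", min(vols), max(vols))
```

The negative value of `rho` encodes the tendency of prices to fall when volatility rises; with two sources of randomness the model is no longer complete, which is the difficulty taken up next.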

Having two sources of random shocks (whether or 
not they are independent) creates some headaches for 
the quants, as the no-arbitrage prices are now plenti- 
ful, and in the face of the nonuniqueness of derivative 
prices, tricky calibration issues have to be resolved. 
So rather than dealing with the incompleteness of 
these stochastic volatility models, a more minimalist 
approach was proposed to capture the empirical prop- 
erties of the option prices while at the same time 
keeping only one single source of shocks and, hence, 
retaining the completeness of the model. These models
go under the name of local volatility models. They
are based on dynamics given by Markovian stochastic
differential equations of the form

$$dS_t = \mu S_t\,dt + \sigma(t, S_t)\,dW_t,$$

where $(t, s) \mapsto \sigma(t, s)$ is a deterministic function, which
can be computed from option prices using what is
known as Dupire’s formula in lieu of the geometric
Brownian motion equation (1). The other stochastic
volatility models are also the subject of active research.

According to the Black-Scholes theory, prices of con- 
tingent claims appear as risk-adjusted expectations of 
the discounted cash flows triggered by the settlement
of the claims. In the case of contingent claims with
European exercises, the random variable giving the pay- 
off is often given by a function of an underlying Markov 
process at the time of maturity of the claim. Using 
the Feynman-Kac formula, these types of expectations 
appear as solutions of PDEs of parabolic type, show- 
ing that prices can be computed by solving PDEs. The 
classical machinery of the numerical analysis of linear 
PDEs is the cornerstone of most of the pricers of contin- 
gent claims in low dimensions. However, the increasing 
size of the baskets of instruments underlying deriva- 
tive contracts and the complexity of the exercise con- 
tingencies have limited the efficacy of the PDE solvers 
because of the high dimensionality. This is one of the 
main reasons for the increasing popularity of Monte 
Carlo methods. Combined with regression ideas, they 
provide robust algorithms capable of pricing options 
(especially options with American exercises) on large 
portfolios while avoiding the curse of dimensional- 
ity [1.3 §2] that plagues traditional PDE methods. More- 
over, the ease with which one can often generate Monte 
Carlo scenarios for the sole purpose of backtesting has 
added to the popularity of these random simulation 
methods. 
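
To make the combination of Monte Carlo simulation and regression
concrete, here is a minimal sketch of a least-squares Monte Carlo
pricer for an American put on a single stock. The article does
not specify an algorithm; the regression-on-polynomials approach
used here (in the spirit of Longstaff and Schwartz), the
function names, the risk-neutral geometric Brownian motion
dynamics, and all numerical values are assumptions chosen for
illustration. A one-dimensional example does not show the
dimensional advantage, only the mechanics.

```python
import math
import numpy as np


def european_put_bs(s0, k, r, sigma, horizon):
    """Black-Scholes price of a European put (closed form), for comparison."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma**2) * horizon) / (sigma * math.sqrt(horizon))
    d2 = d1 - sigma * math.sqrt(horizon)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return k * math.exp(-r * horizon) * cdf(-d2) - s0 * cdf(-d1)


def american_put_lsm(s0=36.0, k=40.0, r=0.06, sigma=0.2, horizon=1.0,
                     n_steps=50, n_paths=50_000, degree=3, seed=0):
    """Regression-based Monte Carlo estimate of an American put price."""
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    disc = math.exp(-r * dt)
    z = rng.standard_normal((n_steps, n_paths))
    log_incr = (r - 0.5 * sigma**2) * dt + sigma * math.sqrt(dt) * z
    s = s0 * np.exp(np.cumsum(log_incr, axis=0))  # s[j] is the price at time (j + 1) * dt
    cash = np.maximum(k - s[-1], 0.0)             # payoff if held to maturity
    for j in range(n_steps - 2, -1, -1):
        cash = cash * disc                        # discount future cash flow to time (j + 1) * dt
        exercise = np.maximum(k - s[j], 0.0)
        itm = exercise > 0.0                      # regress on in-the-money paths only
        if itm.sum() > degree + 1:
            x = s[j, itm] / k
            coeffs = np.polyfit(x, cash[itm], degree)
            continuation = np.polyval(coeffs, x)  # estimated value of not exercising
            exercise_now = exercise[itm] > continuation
            idx = np.flatnonzero(itm)[exercise_now]
            cash[idx] = exercise[idx]
    # Compare with immediate exercise at time 0.
    return max(k - s0, disc * float(cash.mean()))
```

The estimate should exceed the European price returned by
european_put_bs for the same inputs, the difference being the
value of the early exercise right.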

1.2 Bond Pricing and Fixed-Income Markets 

The early and mid-1990s saw growth in the bond 
markets (a sudden increase in the traded volumes in 
Treasury, municipal, sovereign, corporate, and other 
bonds), and the fixed-income desks of many investment 
banks and other financial institutions became major 
sources of profit. The academic financial mathematics 
community took notice, and a burst of research into 
mathematical models for fixed-income instruments fol- 
lowed. 

Parametric and nonparametric models for the term 
structure of interest rates (yield curves describing the 
evolution of forward interest rates as a function of the 
maturity tenors of the bonds) were successfully devel- 
oped from classical data-analysis procedures. While 
spline smoothing was often used, principal compo- 
nent analysis of the yield data clearly points to a 
small number of easily identified factors, and least- 
squares regressions can be used to identify the term 
structure in parameterized families of curves success- 
fully. Despite the fact that infinite-dimensional func- 
tional analysis and stochastic PDEs were brought to 
bear in order to describe the data, model calibration 
ended up being easier than in the case of the equity
markets, where the nonlinear correspondence between
option prices and the implied volatility surface cre- 
ates challenges that, to this day, still remain mostly 
unsolved. 

The development of fixed-income markets was very 
rapid, and the complexity of interest rate deriva- 
tives (swaps, swaptions, floortions, captions, etc.) moti- 
vated the scaling-up of mathematical models from 
the mere analysis of one-dimensional stochastic
differential equations to the study of infinite-dimensional
stochastic systems and stochastic PDEs. Current
research in financial mathematics is geared toward the 
inclusion of jumps in these models and the understand- 
ing of the impact of these jumps on calibration, pricing, 
and hedging procedures. 

1.3 Default Models and Credit Derivative Markets 

Buying a bond is just making a loan; issuing a bond is 
nothing but borrowing money. While the debt of the 
U.S. government is still regarded by most as default 
free, sovereign bonds and most corporate bonds carry a 
significant risk associated with the nonnegligible prob- 
ability that the issuer may not be able to make good 
on his debt and may default by the time the principal 
of the loan is to be returned. Unsurprisingly, models of 
default were first successfully included in the pricing 
of corporate bonds. 

Structural models of default based on the fundamen- 
tals of a firm and the competing roles of its assets 
and liabilities were first introduced by Merton in 1974. 
Their popularity was due to a rationale that was solidly 
grounded on fundamental financial principles and data 
reporting. However, murky data and a lack of trans- 
parency have plagued the use of these models for the 
purpose of pricing corporate bonds and their deriva- 
tives. 

On the other hand, reduced-form models based on 
stochastic models for the intensity of arrival of the 
time of default have gained in popularity because of 
the versatility of the intensity-based models and the 
simplicity and robustness of the calibration from credit 
default swap (CDS) data. Indeed, one of the many rea- 
sons for the success of reduced-form models is readily 
available data. Quotes of the spreads on CDSs for most 
corporations are easy to get, and the fact that they are 
plentiful contrasts with the scarcity of corporate bond 
quotes, which are few and far between. In fact, since
the CDS market exploded in the early to mid-2000s,
the creditworthiness of a company has been more easily
gauged from its CDS spread than from its bond spread
(the difference between the interest paid on a corporate
bond and that paid on a riskless government bond).
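
The simplicity of calibrating an intensity-based model to CDS
quotes can be seen in the simplest possible case. The sketch
below assumes a constant default intensity, a constant recovery
rate, a flat interest rate, and quarterly premium payments with
no accrued premium at default; none of these choices, nor the
function names, come from the article. Under these assumptions
the par spread is close to intensity times (1 - recovery), so
the intensity can be read off a quoted spread by a single
division.

```python
import math


def cds_par_spread(hazard, recovery=0.4, r=0.02, maturity=5.0,
                   pay_per_year=4, n_grid=2000):
    """Par CDS spread under a constant default intensity (illustrative)."""
    # Protection leg: discounted expected loss, by midpoint integration of
    # (1 - recovery) * hazard * exp(-(r + hazard) * t) over [0, maturity].
    dt = maturity / n_grid
    protection = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * dt
        protection += (1.0 - recovery) * hazard * math.exp(-(r + hazard) * t) * dt
    # Premium leg per unit of spread: discounted payments weighted by the
    # survival probability at each payment date.
    delta = 1.0 / pay_per_year
    n_pay = int(round(maturity * pay_per_year))
    annuity = sum(delta * math.exp(-(r + hazard) * delta * (j + 1))
                  for j in range(n_pay))
    return protection / annuity


def implied_hazard(spread, recovery=0.4):
    """Approximate intensity implied by a quoted spread."""
    return spread / (1.0 - recovery)
```

For example, with an intensity of 2% per year and 40% recovery,
cds_par_spread returns a spread of roughly 120 basis points, and
implied_hazard applied to that spread recovers an intensity
close to 2%.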

A CDS is a form of insurance against the default of 
a corporation, say X. Such a CDS contract involves two 
counterparties, say Y and Z, the latter receiving a reg- 
ular premium payment from Y as long as X does not 
default and paying a lump sum to Y in the case of 
default of X before the maturity of the CDS contract. 
The existence of a CDS contract between Y and Z there- 
fore seems natural if the financial health of one of these 
two counterparties depends upon the survival of X and 
the other counterparty is willing to take the opposite 
side of the transaction. However, neither of the coun- 
terparties Y or Z who are gambling on the possibility 
of a serious credit event concerning X (both having dif- 
ferent views on the likelihood of default) need to have 
any financial interest, direct or indirect, in X. In other 
words, two agents can enter into a deal involving a third 
entity just as a pure bet on the creditworthiness of this 
third entity. While originally designed as a credit insur- 
ance, CDSs ended up increasing the overall risk in the 
system through the multiplication of private bets. As 
they spread like uncontrolled brush fires, they created 
an intricate network of complex dependencies between 
institutions, making it practically impossible to trace 
the sources of the risks. 

1.4 Securitization 

Investment is a risky business, and the risks of large 
portfolios of defaultable instruments (corporate bonds, 
loans, mortgages, CDSs) were clearly a major source 
of fear, at least until the spectacular growth of secu- 
ritization in recent years. A financial institution bun- 
dles together a large number of such defaultable instru- 
ments, slices the portfolio according to the different lev- 
els of default risk, forming a small number of tranches 
(say five), keeps the riskiest one (called the “equity 
tranche”), and sells the remaining tranches (known as 
mezzanine or senior tranches) to trusting investors 
as an investment far safer than the original portfolio 
itself. These instruments are called collateralized debt 
obligations (CDOs). Pooling risks and tranching them 
as reinsurance contracts to pass on to investors with 
differing risk profiles is natural, and it has been the 
basis of insurance markets for dozens of years at least. 
But, unlike insurance products, CDOs were not regu- 
lated. They enjoyed tremendous success for most of 
the 2000s; credit desks multiplied, and academics and
mathematicians tried to understand how practitioners
were pricing them. Needless to say, no effort was made 
at hedging the risk exposure as not only was it not 
understood but there was no reason to worry since it 
seemed that we could only make money in this game! 
Attempts at understanding their risk structure intensi- 
fied after the serious warnings of 2004 and 2005, but 
they did not come early enough to seriously impact 
the onset of the financial crisis, which was most cer- 
tainly triggered, or at least exacerbated by, the hous- 
ing freeze and the ensuing collapse of the huge market 
in mortgage-backed securities (MBSs), which are com- 
plex CDOs on pools of mortgages. Parts II and III of 
the report of the Financial Crisis Inquiry Commission 
are an enlightening read for the connections between 
CDOs and the credit crisis. 

The overuse of credit derivatives, particularly in the 
mortgage arena, contributed massively to the finan- 
cial crisis. While the May 2005 ripple in the corpo- 
rate CDO market served as an early warning of worse 
things to come, its lessons were largely ignored as the 
risks abated. The attraction of unfunded returns on 
default protection proved too great, and the culture 
of unbounded bonus-based compensation for traders 
led to excessive risk taking that jeopardized numerous 
long-standing and once fiscally conservative financial 
institutions. 

In the midst of the crisis, some questioned the role 
that quantitative models played in motivating or jus- 
tifying these trades. A Wall Street Journal article from 
2005 highlights the practice of playing CDO tranches 
off against each other to form a dubious hedge. In some 
sense this was “quantified” using the Gaussian copula 
(and similar) models (particularly implied correlations 
and their successor, so-called base correlations). These 
models were developed in-house by quants hired by 
and answering to traders. The raison d'être for these
models was twofold: simplicity and speed. Unfortu- 
nately, the short cut used to achieve these goals was 
rather drastic, capturing a complex dependence struc- 
ture among hundreds of correlated risks using a sin- 
gle correlation parameter. Many (surviving) banks have 
since restructured their quant/modeling teams so that
they now report directly to management instead of to 
traders. 
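
To show what a single correlation parameter means in practice,
the following sketch simulates a one-factor Gaussian copula for
a homogeneous portfolio and computes expected losses for five
tranches. It is a textbook-style illustration, not a
reconstruction of any bank's in-house model: the number of
names, default probability, recovery rate, attachment points,
and function name are all assumed for the example.

```python
import numpy as np
from statistics import NormalDist


def tranche_expected_losses(n_names=100, p_default=0.05, rho=0.3, recovery=0.4,
                            attachments=(0.0, 0.03, 0.07, 0.10, 0.15, 1.0),
                            n_sims=20_000, seed=0):
    """Expected loss (as a fraction of tranche size) for each tranche."""
    rng = np.random.default_rng(seed)
    threshold = NormalDist().inv_cdf(p_default)
    common = rng.standard_normal((n_sims, 1))        # one shared factor per scenario
    idio = rng.standard_normal((n_sims, n_names))    # name-specific shocks
    # Every pair of names has the same correlation rho.
    latent = np.sqrt(rho) * common + np.sqrt(1.0 - rho) * idio
    defaults = latent < threshold
    loss = (1.0 - recovery) * defaults.mean(axis=1)  # portfolio loss fraction
    expected = []
    for lower, upper in zip(attachments[:-1], attachments[1:]):
        width = upper - lower
        tranche_loss = np.clip(loss - lower, 0.0, width) / width
        expected.append(tranche_loss.mean())
    return np.array(expected), loss.mean()
```

Rerunning the function with a larger rho leaves the expected
portfolio loss essentially unchanged but moves losses from the
equity tranche toward the senior tranches, which is why the
tranche values are so sensitive to this one number.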

Numerous media outlets and commentators have 
blamed the crisis on financial models, and called for less 
quantification of risk. This spirit is distilled in Warren
Buffett’s “beware of geeks bearing formulas” comment.
Indeed, the caveat about the use of models to justify a
posteriori “foolproof” hedges is warranted. But this
crisis, like past crises, only highlights the need for more
mathematics and for quantitatively trained people at 
the highest level, not least at ratings and regulatory 
agencies. 

The real damage was done in the highly unquantified 
market for MBSs. Here the notions of independence and 
diversification through tranching were taken to ludi- 
crous extremes. The MBS desks at major banks were 
relatively free of quants and quantitative analysis com- 
pared with the corporate CDO desks, even though the 
MBS books were many times larger. The excuse given 
was that these products were AAA and therefore, like 
U.S. government bonds, did not need any risk analy- 
sis. As it turned out, this was a mass delusion willingly 
played into by banks, hedge funds, and the like. The 
quantitative analysis performed by the ratings agen- 
cies tested outcomes on only a handful of scenarios, in 
which, typically, the worst-case scenario was that U.S. 
house prices would appreciate at the rate of 0.0%. The 
rest is history. 

2 Portfolio Selection and Investment Theory 

A second central foundational pillar underlying mod- 
ern financial mathematics research is the problem of 
optimal investment in uncertain market conditions. 
Typically, optimality is with respect to the expected 
utility of portfolio value, where utility is measured by 
a concave increasing utility function, as introduced by 
von Neumann and Morgenstern in the 1940s. A major 
breakthrough in applying continuous-time stochastic 
models to this problem came with the work of Robert 
Merton that was published in 1969 and 1971. In these 
works Merton derived optimal strategies when stock 
prices have constant expected returns and volatilities 
and when the utility function has a specific conve- 
nient form. This remains among the few examples of 
explicit solutions to fully nonlinear Hamilton-Jacobi- 
Bellman (HJB) PDEs motivated by stochastic control 
applications. 

To explain the basic analysis, suppose an investor has
a choice between investing his capital in a single risky
stock (or a market index such as the S&P 500) or in
a riskless bank account. The stock price $S$ is uncertain
and is modeled as a geometric Brownian motion (1). The
choice (or control) for the investor is therefore over $\pi_t$,
the dollar amount to hold in the stock at time $t$, with his
remaining wealth deposited in the bank, earning interest
at the constant rate $r$. If $X_t$ denotes the portfolio
value at time $t$, and if we assume that the portfolio is
self-financing in the sense that no other monies flow in
or out, then

$$dX_t = \pi_t \frac{dS_t}{S_t} + r(X_t - \pi_t)\,dt$$

is the equation governing his portfolio time evolution.
We will take $r = 0$ for simplicity of exposition, and so,
from (1),

$$dX_t = \mu\pi_t\,dt + \sigma\pi_t\,dW_t.$$

Increasing the stock holding $\pi$ increases the growth
rate of $X$ while also increasing its volatility. The
investor is assumed to have a smooth terminal utility
function $U(x)$ on $\mathbb{R}_+$ that satisfies the “usual conditions”
(the Inada conditions and asymptotic elasticity)

$$U'(0^+) = \infty, \qquad U'(\infty) = 0, \qquad \lim_{x \to \infty} x\,\frac{U'(x)}{U(x)} < 1,$$

and he wants to maximize $E\{U(X_T)\}$, his expected
utility of wealth at a fixed time horizon $T$. To apply
dynamic programming principles, it is standard to
define the value function as

$$V(t, x) = \sup_{\pi} E\{U(X_T) \mid X_t = x\},$$

where the supremum is taken over admissible strategies
that satisfy $E\{\int_0^T \pi_t^2\,dt\} < \infty$. Then $V(t, x)$ is the
solution of the HJB PDE problem

$$V_t - \frac{1}{2}\lambda^2 \frac{V_x^2}{V_{xx}} = 0, \qquad V(T, x) = U(x),$$

where $\lambda := \mu/\sigma$ is known as the Sharpe ratio. Given the
value function $V$, the optimal stock holding is given by

$$\pi_t^* = -\frac{\mu}{\sigma^2} \frac{V_x(t, X_t)}{V_{xx}(t, X_t)}.$$
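
For readers who want the intermediate step, the following
sketch shows how the HJB equation and the optimal holding arise
from the wealth dynamics with r = 0. It is a formal derivation
that assumes the value function is smooth and strictly concave
in x; it is added here for illustration and is not quoted from
the article.

```latex
\begin{aligned}
0 &= V_t + \sup_{\pi}\Bigl\{ \mu\pi V_x + \tfrac{1}{2}\sigma^2\pi^2 V_{xx} \Bigr\}
  && \text{(dynamic programming, } dX_t = \mu\pi_t\,dt + \sigma\pi_t\,dW_t\text{)} \\
\pi^* &= -\frac{\mu}{\sigma^2}\,\frac{V_x}{V_{xx}}
  && \text{(first-order condition, using } V_{xx} < 0\text{)} \\
0 &= V_t - \frac{\mu^2}{2\sigma^2}\,\frac{V_x^2}{V_{xx}}
   = V_t - \tfrac{1}{2}\lambda^2\,\frac{V_x^2}{V_{xx}}
  && \text{(substituting } \pi^* \text{ with } \lambda = \mu/\sigma\text{)}
\end{aligned}
```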

Remarkably, Merton discovered an explicit solution
when the utility is a power function:

$$U(x) = \frac{x^{1-\gamma}}{1-\gamma}, \qquad \gamma > 0, \quad \gamma \neq 1.$$

Here, $\gamma$ measures the concavity of the utility function
and is known as the constant of relative risk aversion.
With this choice,

$$V(t, x) = \frac{x^{1-\gamma}}{1-\gamma} \exp\left(\frac{1}{2}\lambda^2\,\frac{1-\gamma}{\gamma}\,(T - t)\right)$$

and, more importantly,

$$\pi_t^* = \frac{\mu}{\sigma^2\gamma} X_t.$$

That is, the optimal strategy is to hold the fixed fraction
$\mu/(\sigma^2\gamma)$ of current wealth in the stock and the
rest in the bank. As the stock price rises, this strategy
says to sell some stock so that the fraction of the portfolio
comprised of the risky asset remains the same.


This fixed-mix result generalizes to multiple securities 
as long as they are also assumed to be (correlated) 
geometric Brownian motions. 
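
The fixed-mix result can be checked numerically in a restricted
setting. The sketch below considers only strategies that hold a
constant fraction of wealth in the stock (with r = 0), for which
terminal wealth is lognormal and expected power utility has a
closed form; a grid search over the fraction then recovers the
Merton fraction. The restriction to constant fractions, the
function names, and the parameter values are assumptions made
for the illustration.

```python
import numpy as np


def expected_power_utility(fraction, x0=1.0, mu=0.08, sigma=0.2,
                           gamma=3.0, horizon=1.0):
    """E[U(X_T)] when a constant fraction of wealth is held in the stock."""
    # With r = 0, X_T = x0 * exp((mu*f - 0.5*sigma^2*f^2) T + sigma*f*W_T),
    # so E[X_T^(1-gamma)] follows from the lognormal moment formula.
    growth = (1.0 - gamma) * (mu * fraction
                              - 0.5 * gamma * sigma**2 * fraction**2) * horizon
    return x0 ** (1.0 - gamma) / (1.0 - gamma) * np.exp(growth)


def merton_fraction(mu, sigma, gamma):
    """Optimal fraction of wealth in the stock: mu / (sigma^2 * gamma)."""
    return mu / (sigma**2 * gamma)


def merton_value(t, x, mu, sigma, gamma, horizon):
    """Value function V(t, x) for power utility."""
    sharpe = mu / sigma
    return (x ** (1.0 - gamma) / (1.0 - gamma)
            * np.exp(0.5 * sharpe**2 * (1.0 - gamma) / gamma * (horizon - t)))


if __name__ == "__main__":
    grid = np.linspace(0.0, 2.0, 2001)
    best = grid[np.argmax(expected_power_utility(grid))]
    print(best, merton_fraction(0.08, 0.2, 3.0))
```

With these inputs the grid maximizer agrees with
mu/(sigma^2 gamma) = 2/3 to within the grid spacing, and the
expected utility at that fraction coincides with V(0, x).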

Since Merton’s work, the basic problem has been 
generalized in many directions. In particular, devel- 
opments in duality theory (or the martingale method) 
led to a revolution in thinking as to how these prob- 
lems should be studied in abstract settings, culminat- 
ing in very general results in the context of semimartin- 
gale models of incomplete markets. One of the most 
challenging problems was to extend the theory in the 
presence of transaction costs, as this required the fine 
analysis of singular stochastic control problems. 

Optimal investment still generates challenging re- 
search problems as models evolve to try to incor- 
porate realistic features such as uncertain volatility 
or random jumps in prices using modern technology 
such as forward-backward stochastic differential equa- 
tions, asymptotic approximations for fully nonlinear 
HJB PDEs, and related numerical methods. 

Early portfolio optimization was achieved through 
the well-known Markowitz linear quadratic optimiza- 
tion problems based on the analysis of the mean vec- 
tor and the covariance matrix. The use of increas- 
ingly large portfolios and the introduction of exchange- 
traded fund (ETF) tracking indexes (S&P 500, Russell 
2000, etc.) created the need for more efficient estima- 
tion methods for large covariance matrices, typically 
using sparsity and robustness arguments. This prob- 
lem has been the major impetus behind a significant 
proportion of the big data research in statistics. 

3 Growing Research Areas 
3.1 Systemic Risk 

The mathematical theory of risk measures, like value at 
risk, expected shortfall, or maximum drawdown, was 
introduced to help policy makers and portfolio man- 
agers quantify risk and define unambiguously a form 
of capital requirement to preserve solvency. It enjoyed 
immediate success among the mathematicians work- 
ing on financial applications. However, their dynami- 
cal analogues did not enjoy the same popular support, 
mostly because of their complexity and the challenges 
in attempting to aggregate firm-level risks into system 
behavior. 
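
For concreteness, the three risk measures named above can be
estimated from data as follows. This is a minimal empirical
sketch: the function names, the 95% confidence level, and the
sign convention (losses reported as positive numbers) are
assumptions for the example, and practitioners use many
variants of these estimators.

```python
import numpy as np


def value_at_risk(returns, level=0.95):
    """Empirical value at risk: the level-quantile of the loss distribution."""
    losses = -np.asarray(returns, dtype=float)
    return float(np.quantile(losses, level))


def expected_shortfall(returns, level=0.95):
    """Average loss over the scenarios at or beyond the value at risk."""
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, level)
    return float(losses[losses >= var].mean())


def max_drawdown(values):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    values = np.asarray(values, dtype=float)
    running_max = np.maximum.accumulate(values)
    return float(np.max((running_max - values) / running_max))
```

For the value path 100, 120, 90, 110, 80 the maximum drawdown is
(120 - 80)/120 = 1/3, and for any sample the expected shortfall
is at least as large as the value at risk at the same level.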

Research on systemic risk began in earnest after the 
September 11 attacks in 2001. The financial crisis of 
2008 subsequently brought counterparty risk and the
propagation of defaults to the forefront. The analysis
of large complex systems in which all the entities
behave rationally at the individual level but which pro- 
duce calamities at the aggregate level is a very exciting 
challenge for mathematicians, who are now developing 
models for what some economists have called rational 
irrationality. 

3.2 Energy and the New Commodity Markets 

As for the models of defaults used in fixed-income 
markets, and subsequently in credit markets, mod- 
els for commodities can, roughly speaking, be divided 
into two distinct categories: reduced-form models and 
structural models. 

Commodities are physical in nature and are over- 
whelmingly traded on a forward basis, namely, for 
future delivery. So, as with fixed-income markets, a 
snapshot of the state of a market is best provided by 
a term structure of forwards of varying maturities. But 
unlike the forward rates, forward prices are actually 
prices of traded commodities, and as a consequence 
they should be modeled as martingales (pure fluctua- 
tion processes with no trend or bias). Despite this fun- 
damental difference, the first models for commodities 
were borrowed (at least in spirit) from the models devel- 
oped by sophisticated researchers in fixed-income mar- 
kets. However, the shortcomings of these ad hoc trans- 
plants are now recognized, and fundamental research 
in this area now focuses on structural models that are 
more in line with equilibrium arguments and the eco- 
nomic rationale of supply and demand. A case in point 
is the pricing of electricity contracts: figure 1 shows 
the time evolution of electric power spot prices in a 
recently deregulated market. Clearly, none of the math- 
ematical models used for equities, currencies, or inter- 
est rates can be calibrated to be consistent with these 
data, and structural models involving the factors that 
drive demand (like weather) and supply (like means
of production) need to be used to give a reasonable 
account of the spikes. 

Electricity production is one of the major sources of
greenhouse gas emissions, and various forms of market
mechanisms have been touted to control these harmful
externalities. Most notable is the implementation of
the Kyoto protocol in the form of mandatory cap-and-trade
for CO₂ allowance certificates by the European
Union. While policy issues are still muddying the final
form that the control of CO₂ emissions will take in the
United States, cap-and-trade schemes already exist in
the northeast of the country (the Regional Greenhouse
Gas Initiative) and, more recently, in California, giving a
new impetus to theoretical research into these markets.

Figure 1 Historical daily prices of electricity from
the PJM market in the northeastern United States.

The proliferation of commodity indexes and the dra- 
matic increase in investors gaining commodity expo- 
sure through ETFs that track indexes have changed the 
landscape of the commodity markets and increased 
the correlations between commodities and equity, and 
the correlations among commodities included in the 
same indexes. Figure 2 illustrates this striking change 
in correlations. These changes are difficult to explain 
if one relies solely on the fundamentals of these mar- 
kets. They seem to be part of a phenomenon, known 
as financialization of the commodity markets, that has 
been taking place over the last ten years and that is now 
being investigated by a growing number of economists, 
econometricians, and mathematicians. Also of great 
interest to theoretical research is the fact that financial 
institutions, including hedge funds, endowment funds, 
and pension funds, have realized that the road to suc- 
cess in the commodity markets was more often than 
not to rely on managing portfolios that included both 
physical and financial assets. Once more, this combi- 
nation raises new issues that are not addressed by tra- 
ditional financial mathematics or financial engineering. 
Finally, the physical nature of energy production led to 
the introduction of new financial markets in areas such 
as weather, freight, and emissions, the design, regula- 
tion, and investigation of which pose new mathematical 
challenges. 

3.3 High-Frequency Trading 

In the last few years, the notion of a single publicly
known price at which transactions can happen in arbitrary
sizes has been challenged. The existence and
the importance of liquidity frictions and price impact
due to the size and frequency of trades are recognized
as the source of many of the financial industry's
most spectacular failures (Long-Term Capital Management,
Amaranth, Lehman Brothers, etc.), prompting
new research into applications of stochastic optimization
to optimal execution and predatory trading, for
example.

Figure 2 Instantaneous dependence ($\beta$) of the Goldman
Sachs Total Return Commodity Index on S&P 500 returns.

Another important driver in the change of direction 
in quantitative finance research is the growing role of 
algorithmic and high-frequency trading. Indeed, it is 
commonly accepted that between 60% and 70% of trad- 
ing is electronic nowadays. Market makers and brokers 
are now mostly electronic, and while they are claimed to 
be liquidity providers (the jury is still out on that one), 
occurrences like the flash crash of May 6, 2010, and 
computer glitches like those which took down Knight 
Capital, have raised serious concerns. Research into 
the development of limit order book models and their 
impact on trading is clearly one of the emerging topics 
in quantitative finance research. 

3.4 Back to Basics: Stochastic Equilibrium and 
Stochastic Games 

More recently, these tools have been adapted to prob- 
lems involving multiple “agents” optimizing for them- 
selves but interacting with each other through a mar- 
ket. These problems involve analyzing and computing 
an equilibrium, which may come from a market clearing 
condition, for example, or in enforcing the strong com- 
petition of a Nash equilibrium. There has been much 
recent progress in stochastic differential games. We 
mention, for example, recent works of Lasry and Lions, 
who consider mean-field games in which there are a 
large number of players and competition is felt only 
through an average of one’s competitors, with each 
player’s impact on the average being negligible. 


Problems arising from price impact, stability, liq- 
uidity, and the formation of bubbles remain of vital 
interest and have produced very interesting mathe- 
matics along the way. Not long after the 1987 crash 
there was much concern about the extent to which 
the large drop was caused or exacerbated by program 
traders whose computers were automatically hedging 
options positions, causing them to sell mechanically 
when prices went down, pushing them down further. 
In a continuous-time framework, this type of feedback 
model of price impact was initiated in the late 1990s, 
and since then there have been many influential stud- 
ies of situations in which an investor is not simply a 
“price taker” and where large stock positions are sold 
off in pieces to avoid large price-depressing trades: the 
epitome of an optimal execution algorithm. 

Further Reading 

The books by Fouque et al. and Gatheral describe recent 
research taking place with stochastic volatility mod- 
els. A textbook reference for Monte Carlo methods 
for financial problems is the book by Glasserman. For 
more on infinite-dimensional interest rate models, see, 
for example, the recent textbooks by Carmona and 
Tehranchi, and by Filipovic. Merton’s classic works on 
portfolio optimization are reprinted in his 1992 col- 
lection. The recent handbook edited by Fouque and 
Langsam provides many viewpoints on the analysis of 
systemic risk. For further reading on energy and com- 
modities markets, see the recent survey by Carmona 
and Coulon and the book by Swindle. The recent book 
by Lehalle and Laruelle discusses market microstruc- 
ture in the context of high-frequency trading. A recent 
survey of work on asset price bubbles (and subsequent 
crashes) can be found in the survey article by Protter.

V.10 Portfolio Theory

Thomas J. Brennan, Andrew W. Lo, 
and Tri-Dung Nguyen 

1 Basic Mean-Variance Analysis 

Pioneered by the Nobel Prize-winning economist Harry 
Markowitz over half a century ago, portfolio theory is 
one of the oldest branches of modern financial eco- 
nomics. It addresses the fundamental question faced by 
an investor: how should money best be allocated across 
a number of possible investment choices? That is, what 
collection or portfolio of financial assets should be cho- 
sen? In this article, we describe the fundamentals of 
portfolio theory and methods for its practical imple- 
mentation. We focus on a fixed time horizon for invest- 
ment, which we generally take to be a year, but the 
period may be as short as days or as long as several 
years. We summarize many important innovations over 
the past several decades, including techniques for bet- 
ter understanding how financial prices behave, robust 
methods for estimating input parameters, Bayesian 
methods, and resampling techniques. 

A portfolio is a collection of financial securities, often 
called assets, and the return to an asset or a portfolio 
is the uncertain incremental percentage financial gain 
or loss that results from holding the asset or portfolio 
over a particular time horizon. If the price of an asset $i$
at date $t$ is denoted $p_{i,t}$, then its return $r_{i,t+1}$ between
$t$ and $t + 1$ is defined as

$$r_{i,t+1} = \frac{p_{i,t+1} + d_{i,t+1} - p_{i,t}}{p_{i,t}}, \qquad (1)$$

where $d_{i,t+1}$ denotes any cash payouts made by asset $i$
between $t$ and $t + 1$, such as a dividend payment. The
return on a portfolio of assets is defined in a similar
manner.
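
As a small worked example of this definition, the sketch below
computes the one-period return of an asset from its prices and
payout. It also computes a portfolio return as the
weight-averaged asset return, which assumes the weights are the
fractions of initial wealth in each asset. The function names
are chosen for the illustration.

```python
def simple_return(price_start, price_end, payout=0.0):
    """One-period return: price change plus cash payout, over the initial price."""
    return (price_end + payout - price_start) / price_start


def portfolio_return(weights, asset_returns):
    """Return of a portfolio whose initial wealth is split according to weights."""
    return sum(w * r for w, r in zip(weights, asset_returns))
```

An asset bought at 100 that ends the period at 104 after paying
a dividend of 1 has a return of (104 + 1 - 100)/100 = 5%.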


Markowitz’s seminal idea was to choose an optimal 
portfolio using two key features of the distribution of 
a portfolio’s return: its mean and variance. For any tar- 
get level of variance, an allocation yielding the greatest 
mean (or expected) return should be chosen. Similarly, 
for any target level of expected return, an allocation 
yielding the lowest possible level of variance should be 
chosen. So far, this does not allow a unique portfolio to 
be selected since either the target variance or the tar- 
get mean needs to be specified in advance. However, 
this does already allow us to sketch a curve in mean- 
variance space corresponding to the characteristics of 
the portfolios that are efficient inasmuch as they sat- 
isfy Markowitz’s requirements. This curve is known as 
the mean-variance efficient frontier (see figure 1 for an 
example). 

Variance is often replaced by its square root, stan- 
dard deviation, resulting in an equivalent curve of effi- 
cient portfolios in the space defined by mean and stan- 
dard deviation. We refer to this latter curve as the 
efficient frontier. 

Consider a collection of $n$ financial assets that are
available as investment choices and let them be indexed
by $i = 1, \ldots, n$, with $r_i$ denoting the return of asset $i$
over the applicable investment horizon (we suppress
the time subscript $t$ to simplify notation). An initial
investment of one dollar in $i$ thus yields $1 + r_i$ dollars at
the end of the period of the investment. The expected
return of $i$ is $\mu_i = \sum_s r_i(s) p_s$, and the covariance of the
returns of $i$ and $j$ is $\sigma_{ij} = \sum_s (r_i(s) - \mu_i)(r_j(s) - \mu_j) p_s$.
Here we assume that the returns have finite distributions
and that all the different possible sets of returns
$r = (r_1, \ldots, r_n)$ are indexed by the parameter $s$, with
$r_s = (r_1(s), \ldots, r_n(s))$. The probability of state $s$ is
denoted by $p_s$. We write $\mu = [\mu_i]_i$ and $\Sigma = [\sigma_{ij}]_{ij}$. All
vectors denote column vectors unless stated otherwise.
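
These definitions translate directly into a few lines of code.
The sketch below computes the mean vector and covariance matrix
from a table of state-dependent returns and state
probabilities; the function name and the array layout (one row
per state, one column per asset) are assumptions made for the
example.

```python
import numpy as np


def moments_from_states(returns_by_state, probs):
    """Mean vector and covariance matrix of returns over finitely many states."""
    r = np.asarray(returns_by_state, dtype=float)  # shape (n_states, n_assets)
    p = np.asarray(probs, dtype=float)             # shape (n_states,), sums to 1
    mu = p @ r                                     # mu_i = sum_s r_i(s) p_s
    centered = r - mu
    # sigma_ij = sum_s (r_i(s) - mu_i)(r_j(s) - mu_j) p_s
    cov = (centered * p[:, None]).T @ centered
    return mu, cov
```

For two equally likely states with returns (0.1, 0.0) and
(-0.1, 0.2), this gives expected returns (0, 0.1), variances of
0.01 for both assets, and a covariance of -0.01.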

An allocation of funds among the assets can be
thought of as a vector $\omega = [\omega_i]_i$ of weights, with $\omega_i$
representing the proportion of available funds invested
in asset $i$. The weights are subject to the constraint
$e^T\omega = 1$, where $e$ is a vector of all ones. This is necessary
so that exactly 100% of the available funds are invested.
In the general case, weights are permitted to be negative,
with the interpretation that these correspond to
short sales of those assets. A short sale is a specific
financial transaction in which an investor can sell a
security that he does not own by borrowing it from a
third party, such as a broker, with the promise to return
it at a later date. Short sales allow investors who expect
an asset’s price to decline to benefit from this expectation.
When less generality is desired, more constraints
can be added, such as the restriction that $\omega_i \ge 0$ for
all $i$, implying that no short sales are allowed.

Figure 1 Efficient frontier for the collection of eight assets
with characteristics described in table 1. The dashed line is
the constrained frontier that does not allow negative asset
weights in portfolios. For each individual asset the standard
deviation and expected return are indicated on the graph.

A portfolio formed with the weight vector $\omega$ has
expected return $\mu^T\omega$ and variance $\omega^T\Sigma\omega$. Thus, to
determine the efficient frontier, we need to solve the
minimization problem

$$\min_{\omega}\ \omega^T\Sigma\omega \quad \text{such that} \quad \mu^T\omega = \mu_p, \quad e^T\omega = 1 \qquad (2)$$

for each value of the portfolio expected return $\mu_p$. The
solution weight vector $\omega^*$ will be the optimal portfolio
corresponding to the target expected return $\mu_p$.

Figure 1 illustrates the efficient frontier for the set 
of eight assets with expected returns, standard devia- 
tions, and correlations listed in table 1. In the figure, 
as well as in what follows, we generally restrict our 
attention to the “upper branch” of the efficient fron- 
tier, meaning that we do not include portfolios on the 
frontier that have returns lower than the return of the 
global minimum-variance portfolio. For any such port- 
folios, a higher level of return is possible for the same 
amount of risk. 


Table 1 Expected returns, standard deviations, and corre- 
lations for a collection of eight assets. The asset numbers 
correspond to the named assets in figure 1. The values are 
annualized versions of the statistics appearing in Michaud 
(1998) and reflect historical data from 1978 through 1995. 
(ER, expected return; SD, standard deviation.)

1.1 Analytical Solutions

We can find an exact analytic solution to the problem (2). The method of Lagrange multipliers [I.3 §10] is applicable, and its use in this context was pioneered by Merton. The appropriate Lagrangian is

$$ L = \omega^T\Sigma\omega + \lambda_1(\mu^T\omega - \mu_p) + \lambda_2(e^T\omega - 1). \tag{3} $$

A calculation setting the gradient of $L$ equal to zero shows that the solution to the minimization problem is

$$ \omega^* = \frac{\mu_p C - B}{D}\,\Sigma^{-1}\mu + \frac{A - \mu_p B}{D}\,\Sigma^{-1}e, \tag{4} $$

where $A = \mu^T\Sigma^{-1}\mu$, $B = \mu^T\Sigma^{-1}e$, $C = e^T\Sigma^{-1}e$, and $D = AC - B^2$.

The optimal weight of (4) combined with the formula for the variance of the portfolio, namely $\omega^T\Sigma\omega$, allows us to calculate the minimum variance possible for a given level of expected return $\mu_p$. Specifically, we have

$$ (\omega^*)^T\Sigma\,\omega^* = \frac{C\mu_p^2 - 2B\mu_p + A}{D}. \tag{5} $$

The minimum variance is thus simply a quadratic function of the level of expected return. Moreover, we can use (5) to see that the global minimum variance is $1/C$ and occurs at the return level $\mu_p = B/C$. The efficient frontier can therefore be generated as the set of points

$$ F = \left\{(\sigma, \mu) : \sigma = \sqrt{\frac{C\mu^2 - 2B\mu + A}{D}},\ \mu \geq \frac{B}{C}\right\} = \left\{(\sigma, \mu) : \mu = \frac{B}{C} + \frac{\sqrt{D(C\sigma^2 - 1)}}{C},\ \sigma \geq \frac{1}{\sqrt{C}}\right\}. \tag{6} $$
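As a numerical check of the closed-form solution, the following Python sketch evaluates the weights in (4) and compares the resulting variance with (5). The three-asset inputs `mu` and `Sigma` and the helper name `frontier_weights` are illustrative assumptions, not the data of table 1.

```python
import numpy as np

def frontier_weights(mu, Sigma, mu_p):
    """Closed-form minimum-variance weights for a target return mu_p, as in (4)."""
    e = np.ones(len(mu))
    Si_mu = np.linalg.solve(Sigma, mu)   # Sigma^{-1} mu
    Si_e = np.linalg.solve(Sigma, e)     # Sigma^{-1} e
    A, B, C = mu @ Si_mu, mu @ Si_e, e @ Si_e
    D = A * C - B**2
    w = ((mu_p * C - B) * Si_mu + (A - mu_p * B) * Si_e) / D
    return w, (A, B, C, D)

# Hypothetical inputs (assumed for illustration only).
mu = np.array([0.08, 0.10, 0.12])
Sigma = np.array([[0.040, 0.006, 0.000],
                  [0.006, 0.090, 0.010],
                  [0.000, 0.010, 0.160]])

mu_p = 0.10
w, (A, B, C, D) = frontier_weights(mu, Sigma, mu_p)
variance = w @ Sigma @ w                               # omega^T Sigma omega
formula_5 = (C * mu_p**2 - 2 * B * mu_p + A) / D       # right-hand side of (5)
```

Sweeping `mu_p` upward from `B / C` traces the upper branch of the frontier described in (6).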


1.2 Inequality Constraints 

It is often desirable to add further constraints to the 
optimization problem presented in (2). For example, it 
may be required that the weight of a particular asset 
be exactly 10%, that all elements of $\omega$ be positive, or
that no one element of $\omega$ be greater than 50%. These
constrained portfolio selection problems do not gener- 
ally have simple closed-form solutions. Instead, numer- 
ical methods must be employed to compute optimal 
portfolios. 

To the extent that more constraints are stated in the form of equalities, the Lagrangian in (3) can be extended with additional appropriate terms. If some of the constraints are inequalities, then the method of Lagrange multipliers can be extended using the Karush-Kuhn-Tucker conditions [IV.11 §2].

In practice, computer software packages are used to 
calculate frontiers with constraints beyond the basic 
requirement that $e^T\omega = 1$. In figure 1, we illustrate a
portion of the efficient frontier, as well as an example 
of a constrained frontier, for the collection of assets 
with characteristics described in table 1. We computed 
the points on the efficient frontier analytically, but 
those on the constrained frontier were computed using 
quadratic programming software. 

1.3 Finding the “Best” Portfolio 

We have not yet introduced a way to choose one “best” 
portfolio for an investor from among all those on the 
efficient frontier. To do so, we need to know something 
about the investor’s value function, i.e., how he ranks 
different combinations of risk and return. In this sec- 
tion, we describe several commonly used methods for 
an investor to rank different risk-return profiles. We 
will refer to these ranking methodologies in subsequent 
sections. 

The Minimum-Variance Portfolio 

In an extreme case, for an investor who cares only about 
risk and not at all about expected return, the port- 
folio with the lowest level of risk should be chosen. 
As we saw in section 1.1, this minimum value corresponds to a variance of $1/C$ or a standard deviation of $1/\sqrt{C}$ and is displayed in figure 2. The corresponding set of weights can also be easily derived by solving the Lagrange multiplier problem corresponding to the Lagrangian $L = \omega^T\Sigma\omega + \lambda(e^T\omega - 1)$. The resulting portfolio weight is seen to be $\omega_{\min} = (1/C)\Sigma^{-1}e$.

Standard Value Functions 

More realistically, an investor may assign higher value to portfolios with higher expected return and lower value to those with higher risk. A commonly used simple value function along these lines is

$$ V_\gamma(\sigma_p, \mu_p) = \mu_p - \tfrac{1}{2}\gamma\sigma_p^2. \tag{7} $$

In this case, the value of a portfolio with parameters $\mu_p$ and $\sigma_p$ is a linear function of expected return ($\mu_p$) and variance ($\sigma_p^2$), and $\gamma$ is an investor-specific parameter indicating the investor's tolerance for risk. This value function is most appropriate when $\mu_p$ and $\sigma_p$ are relatively small, as may happen when the investment time horizon is short. For longer time periods, more complex functions are generally needed. These value functions are often derived from more primitive assumptions about an investor's utility for wealth, $U(W)$. By setting $W = W_0\omega^Tr$, where $W_0$ is initial wealth, the value function $V$ is given by $E[U(W_0\omega^Tr)]$ according to the axioms of von Neumann and Morgenstern's expected utility theory.

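For the special case of the value function defined in (7), we can find the unconstrained optimal portfolio weights $\omega_V$ and determine the corresponding highest achievable value using an appropriate Lagrangian. We find that

$$ \omega_V = \frac{1}{\gamma}\,\Sigma^{-1}\mu + \frac{\gamma - B}{\gamma C}\,\Sigma^{-1}e. $$

The expected return and risk at this point of optimal value are

$$ \mu_p = \frac{B}{C} + \frac{D}{\gamma C} \quad \text{and} \quad \sigma_p = \sqrt{\frac{D + \gamma^2}{\gamma^2 C}}. $$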

Sharpe Ratio 


An investor may also look at the reward-to-variability ratio represented by a portfolio, i.e., the expected excess return he receives from the portfolio divided by the risk the portfolio represents. This ratio is referred to as the Sharpe ratio and can be written as

$$ S(\sigma_p, \mu_p) = \frac{\mu_p - r_f}{\sigma_p}, $$

where $\sigma_p$ and $\mu_p$ are the risk and expected return of the portfolio, respectively, and where $r_f$ is the risk-free rate that the investor would receive if he did not invest in the portfolio. By subtracting the risk-free rate, we are measuring the incremental reward in excess of the risk-free rate that the investor receives by investing in the portfolio.

Using the characterization of points on the efficient frontier given in (6), we can calculate the optimal Sharpe ratio and corresponding portfolio weights. Specifically, we can express $\sigma_p$, and hence $S$, as a function of $\mu_p$, and we then maximize $S$ as a univariate function of $\mu_p$ and find that

$$ S_{\max} = \sqrt{Cr_f^2 - 2Br_f + A} $$

for values of $r_f$ less than the return of the minimum-variance portfolio (see figure 2). The portfolio weights are given by (4), with

$$ \mu_p = \frac{A - Br_f}{B - Cr_f}, $$

namely

$$ \omega_{\text{Sharpe}} = \frac{1}{B - Cr_f}\,\Sigma^{-1}(\mu - r_fe). $$

Figure 2 Optimal points on the efficient frontier for the collection of eight assets with characteristics described in table 1. The maximum Sharpe ratio is computed assuming a risk-free rate of $r_f = 6.96\%$, consistent with Michaud (1998), and the maximum value is calculated using a tolerance for risk of $\gamma = 4$.
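The maximum Sharpe ratio and its weights can be verified directly. The sketch below uses hypothetical three-asset inputs and a hypothetical risk-free rate of 3% (chosen to be below $B/C$, as the text requires); it is not the computation behind figure 2.

```python
import numpy as np

# Hypothetical inputs (assumed for illustration only).
mu = np.array([0.08, 0.10, 0.12])
Sigma = np.array([[0.040, 0.006, 0.000],
                  [0.006, 0.090, 0.010],
                  [0.000, 0.010, 0.160]])
r_f = 0.03
e = np.ones(len(mu))

Si_mu = np.linalg.solve(Sigma, mu)
Si_e = np.linalg.solve(Sigma, e)
A, B, C = mu @ Si_mu, mu @ Si_e, e @ Si_e

# Tangency (maximum Sharpe ratio) weights: Sigma^{-1}(mu - r_f e) / (B - C r_f).
w_sharpe = np.linalg.solve(Sigma, mu - r_f * e) / (B - C * r_f)

mu_p = mu @ w_sharpe
sigma_p = np.sqrt(w_sharpe @ Sigma @ w_sharpe)
sharpe = (mu_p - r_f) / sigma_p
s_max = np.sqrt(C * r_f**2 - 2 * B * r_f + A)   # closed-form maximum
```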

1.4 Connections to the Capital Asset Pricing Model (CAPM)

The mean-variance framework developed by Mark- 
owitz was fundamental to the development of the 
CAPM. In the 1950s and 1960s, Tobin, Sharpe, and 
Lintner derived the equilibrium implications under the 
assumption that all investors held efficient portfolios, 
and this led to the capital market line, the line connecting
the risk-free rate on the expected-return axis
with the tangency portfolio on the efficient frontier in 
mean-standard deviation space, which is the same as 
the optimal Sharpe ratio portfolio derived above. Under 
the assumptions of the model, all investors hold a com- 
bination of positions in this tangency portfolio and the 
risk-free asset. 


In the CAPM, aggregate positions in the risk-free 
asset net to zero, and the tangency portfolio repre- 
sents the aggregate position in risky assets across all 
investors. Accordingly, this portfolio is referred to as 
the market portfolio. In the CAPM world, the expected return, $\mu_i$, for a particular asset $i$ can be expressed in terms of the expected return and risk of the market portfolio, as well as the covariance of the asset with the market portfolio. The precise relationship is

$$ \mu_i = r_f + \beta_i(\mu_{mkt} - r_f), $$

where $\mu_{mkt}$ is the expected return of the market portfolio; $\beta_i = \mathrm{cov}(r_i, r_{mkt})/\sigma_{mkt}^2$; $r_i$ and $r_{mkt}$ represent the returns on asset $i$ and the market portfolio, respectively; and $\sigma_{mkt}$ represents the standard deviation of the market portfolio. The return $r_i$ can be written as

$$ r_i = r_f + \beta_i(r_{mkt} - r_f) + \epsilon_i, $$

where $\epsilon_i$ is a stochastic variable that is uncorrelated with $r_{mkt}$ and has zero mean. The risk represented by the $\epsilon_i$ is known as idiosyncratic risk. In matrix notation, the expected return vector and the covariance matrix of asset returns under the CAPM assumptions are thus

$$ \mu_{CAPM} = r_fe + \beta(\mu_{mkt} - r_f), \qquad \Sigma_{CAPM} = \sigma_{mkt}^2\beta\beta^T + \Omega_\epsilon, \tag{8} $$

where $\beta$ is the vector of $\beta$ values for the assets, and where $\Omega_\epsilon$ is the covariance matrix for the idiosyncratic risk.

Efficiency of the Market Portfolio 

The importance of the mean-variance efficiency of the 
market portfolio was recognized early on by many 
authors and led to a series of debates on the testable 
implications of the CAPM. Markowitz has argued that 
empirical deviations from the CAPM are not surprising 
in light of the counterfactual assumptions on which 
the CAPM is based. In particular, he observes that: 
“When one clearly unrealistic assumption of the CAPM 
is replaced by a real-world version, some of the dra- 
matic CAPM conclusions no longer follow.” An exam- 
ple is the fact that unlimited borrowing and lending at 
identical interest rates is not possible in practice, and 
this limitation implies that the market portfolio need 
not be mean-variance efficient in equilibrium. 

Impossible Frontiers 

If an efficient frontier contains no portfolio with all 
positive weights, it is incompatible with the CAPM
and is defined to be impossible because, by definition, the market portfolio must have a positive weight for each asset, where the weight is proportional to the market capitalization of that asset. Brennan and Lo have demonstrated that, as the number of assets grows large, all efficient frontiers are almost surely impossible for a randomly drawn (with respect to Haar
measure) covariance matrix. This result explains the 
near-universal disdain with which professional port- 
folio managers regard standard mean-variance opti- 
mization techniques: the vast majority of them are con- 
strained to hold long-only portfolios, and hence an 
impossible frontier is, in fact, literally impossible for 
them to implement. 

2 Techniques for Practical Implementation 

In practice, we generally do not know the exact nature 
of the distributions of returns for the assets we can 
invest in. In fact, we do not even know the exact values of the inputs required for mean-variance optimization, i.e., $\mu$ and $\Sigma$. Instead, we must find ways to estimate these quantities and then use the estimates when carrying out optimizations.

In this section we discuss several methods for the practical determination of optimal portfolios. In section 2.1 we describe the simple approach of using historical data to compute unbiased statistics under the assumption that asset returns follow a stable distribution over time. In section 2.2 we detail several methods for reducing noise, with a focus on estimating the covariance matrix, $\Sigma$, including methods that incorporate theoretical predictions for the structure of $\Sigma$. In section 2.3 we describe Bayesian methods that allow for better estimation of $\mu$, as well as $\Sigma$, and that also allow for incorporation of investor beliefs and theoretical models for the nature of asset returns. In section 2.4 we survey improved and more robust methods for selecting optimal portfolios.

2.1 Unbiased Estimators Using Historical Data 

If we assume that each period’s data represent an inde- 
pendent draw from a stable process governing asset 
returns, we can treat the observed historical returns as 
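a sample from which we can estimate the desired statistics $\mu$ and $\Sigma$. For $T$ observations of historical returns, unbiased estimators of $\mu$ and $\Sigma$ are

$$ \hat\mu_i = \frac{1}{T}\sum_{t=1}^{T} r_{it}, \qquad \hat\Sigma = \frac{1}{T-1}H^TH, \tag{9} $$

where $r_{it}$ is the observation of the return on asset $i$ in period $t$, and $H = [(r_{it} - \hat\mu_i)]_{ti}$ is a $T \times n$ matrix. An unbiased estimator of $\Sigma^{-1}$ is $\tilde\Sigma^{-1}$, with

$$ \tilde\Sigma = \frac{T-1}{T-n-2}\,\hat\Sigma. $$

The estimator $\tilde\Sigma$ may be more appropriate than $\hat\Sigma$ when the object of interest is $\Sigma^{-1}$ instead of $\Sigma$, as in the formulas for the values $A$, $B$, and $C$ in (4).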
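The estimators of this section take only a few lines to compute. The sketch below uses simulated returns (an assumption made purely for illustration) and checks the covariance estimator against NumPy's `np.cov`. The rescaling factor $(T-1)/(T-n-2)$ used for the inverse is the standard correction under normally distributed returns and is stated here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 60, 3
# Simulated returns, one row per period (hypothetical data).
R = rng.normal(loc=0.01, scale=0.05, size=(T, n))

mu_hat = R.mean(axis=0)               # sample mean of each asset
H = R - mu_hat                        # T x n matrix of demeaned returns
Sigma_hat = H.T @ H / (T - 1)         # unbiased covariance estimator

# Rescaled estimator whose inverse is unbiased for Sigma^{-1}
# (assumes normal returns and T > n + 2).
Sigma_tilde = (T - 1) / (T - n - 2) * Sigma_hat
```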

2.2 Covariance Matrix Estimation 

Unbiased estimates suffer from estimation error; we describe several techniques designed to provide better estimates, with a particular focus on the covariance matrix, $\Sigma$. These include methods for the reduction of noise as well as for the incorporation of theoretical results regarding the structure of $\Sigma$.

Factor Analysis 

The returns of the $n$ available assets will generally not be independent. In fact, the returns may be driven in large part by a small number of common factors. If this is the case, the covariance matrix, $\Sigma$, is less complex and estimators taking this into account may be expected to contain less noise. It may also be assumed that the common factors determine the mean vector, $\mu$, but for expositional purposes we allow $\mu$ to remain fully general and focus solely on reducing the complexity of $\Sigma$.

The return vector for the $n$ assets takes the form

$$ r = \mu + V^T\lambda + \epsilon, $$

where $\mu$ is a vector of constants; $\lambda$ is a stochastic vector of $n_f$ factors; $V$ is a constant $n_f \times n$ matrix of factor loadings; and $\epsilon$ is a stochastic vector of residual returns for the $n$ assets, each having zero mean. The factors represented by $\lambda$ are generally thought of as historically observable aggregate market variables, and various economic models give rise to factor structures for asset returns. For example, the CAPM yields a factor model with $n_f = 1$ in which the single factor is the market-value-weighted average of all asset returns.

Because we allow $\mu$ to remain completely general, the factors represented by the elements of $\lambda$ may each be assumed to have zero mean. This assumption involves no loss of generality because an adjustment to $\mu$ can be made, if necessary, to ensure that the expected returns of the factors are zero. In addition to having zero mean, it is often assumed that $\lambda$ follows a multivariate normal distribution with covariance matrix $\Sigma_\lambda$. The vector of residual returns $\epsilon$ is also assumed to follow a multivariate normal distribution, with covariance matrix $\Sigma_\epsilon$, and to be independent of $\lambda$. These normality assumptions do result in loss of generality of the distribution of returns, but they greatly simplify the analysis to follow. Indeed, under these assumptions $r$ follows a multivariate normal distribution, with

$$ r = \mu + V^T\lambda + \epsilon \sim N(\mu,\ V^T\Sigma_\lambda V + \Sigma_\epsilon). $$

We use the notation $N(\cdot,\cdot)$ to denote a multivariate normal distribution with mean specified by the first argument and covariance matrix specified by the second argument.

An estimator $\hat\Sigma_\lambda$ of the covariance matrix of the factors can be found using historical data for factor returns with a methodology similar to that described in section 2.1. Estimators for the matrix of factor loadings $V$ and the covariance matrix of the error terms $\Sigma_\epsilon$ can be obtained using historical data and asset-by-asset linear regressions. The regression for the $i$th asset yields estimators both for the $n_f$ elements in the $i$th column of $V$ and for the variance of the $i$th element of $\epsilon$, as well as for the $i$th element of $\mu$. An estimator for the full covariance matrix is, therefore,

$$ \hat\Sigma_{\text{factor}} = \hat V^T\hat\Sigma_\lambda\hat V + \hat\Sigma_\epsilon. $$
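The asset-by-asset regressions can be carried out in a single least-squares call, since each asset is one column of the response matrix. The sketch below simulates data from a two-factor model (all parameters are hypothetical) and assembles the factor-based covariance estimator; taking the residual covariance to be diagonal is an assumption of this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, n_f = 500, 4, 2

# Simulated "true" model (hypothetical values).
V_true = rng.normal(size=(n_f, n))                 # n_f x n factor loadings
mu_true = np.array([0.010, 0.020, 0.015, 0.005])
factors = rng.normal(size=(T, n_f))                # zero-mean factor returns
resid = 0.1 * rng.normal(size=(T, n))              # idiosyncratic returns
R = mu_true + factors @ V_true + resid             # T x n asset returns

# Regress every asset on an intercept and the factors.
X = np.column_stack([np.ones(T), factors])
coef, *_ = np.linalg.lstsq(X, R, rcond=None)
mu_hat, V_hat = coef[0], coef[1:]                  # intercepts and loadings

# Residual variances (diagonal Sigma_eps) and factor covariance.
E = R - X @ coef
Sigma_eps_hat = np.diag(E.var(axis=0, ddof=n_f + 1))
Sigma_lam_hat = np.cov(factors, rowvar=False)

Sigma_factor = V_hat.T @ Sigma_lam_hat @ V_hat + Sigma_eps_hat
```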

Covariance Shrinkage 

To reduce the estimation error of the covariance matrix, $\hat\Sigma$, one can take a weighted average of $\hat\Sigma$ and a known covariance matrix $F$, a process known as shrinking the covariance matrix $\hat\Sigma$ toward the target matrix $F$. The resulting revised estimator for the covariance matrix is

$$ \hat\Sigma_{LW} = \alpha F + (1 - \alpha)\hat\Sigma. $$

This is a special case of Bayesian shrinkage estimators in which $F$ plays the role of a prior and the posterior is given by $\hat\Sigma_{LW}$ (see section 2.3). There are various possibilities for the choice of $F$, but the basic motivation is to select an $F$ that has a known structure that is a plausible alternative to $\hat\Sigma$. In this way, the shrinkage process effectively reduces estimation error while still keeping a portion of the characteristics of the unbiased estimator $\hat\Sigma$. One possible choice for $F$ is the CAPM covariance matrix, $\Sigma_{CAPM}$, described in (8), with the assumption that the covariance matrix $\Omega_\epsilon$ for the idiosyncratic return components is diagonal.

To determine the appropriate $\alpha$ for the shrinkage procedure, Ledoit and Wolf, who first applied shrinkage estimation to covariance-matrix estimation, derive a consistent estimator for $\alpha$ that minimizes the norm

$$ \|(\alpha F + (1 - \alpha)\hat\Sigma) - \Sigma\|^2, $$

where $\|M\|^2$ is defined to be the sum of the squares of all the entries of a matrix $M$.
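The effect of shrinkage is easiest to see when there are fewer observations than assets, so that the sample covariance matrix is singular. The sketch below uses a fixed, hypothetical shrinkage intensity and a scaled-identity target; both are assumptions made for illustration, and the sketch does not implement Ledoit and Wolf's estimator of the optimal intensity.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 20, 30                                   # fewer periods than assets
R = rng.normal(size=(T, n))                     # simulated returns (hypothetical)

Sigma_hat = np.cov(R, rowvar=False)             # singular because T < n

# Assumed target: average sample variance times the identity.
F = np.mean(np.diag(Sigma_hat)) * np.eye(n)
alpha = 0.3                                     # hypothetical fixed intensity

Sigma_lw = alpha * F + (1 - alpha) * Sigma_hat  # shrunk estimator

def sq_norm(M):
    """Sum of the squares of all entries of M."""
    return float(np.sum(M**2))
```

Here the true covariance matrix is the identity, so `sq_norm(Sigma_lw - np.eye(n))` can be compared with `sq_norm(Sigma_hat - np.eye(n))` to see the reduction in estimation error.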

Random-Matrix Theory 

Simple estimators for $\Sigma$ based upon historical data become less reliable as the ratio of observation periods to assets ($q = T/n$) decreases. In the extreme case in which $q < 1$, the estimator in (9) is degenerate because the rank of $H$ is less than $n$. Laloux, Cizeau, Potters, and Bouchaud have addressed the problem of noisy estimation when $q$ is not much larger than 1 by arguing that $\hat\Sigma$ should behave like a random matrix in such cases. To the extent that it exhibits behavior other than that of a random matrix, there should be actual information present, and this insight leads to a procedure for purging the matrix of its random noise.

Let $H_r$ be a $T \times n$ matrix with elements independently drawn from a normal random distribution with mean zero and standard deviation $\sigma_r$. The matrix $M_r = H_r^TH_r$ follows a Wishart distribution, and as $T \to \infty$, $n \to \infty$, and the ratio $q = T/n$ remains constant, the eigenvalues of $M_r$ asymptotically all lie within an interval $[\lambda_-, \lambda_+]$, where

$$ \lambda_\pm = \sigma_r^2\bigl(1 + 1/q \pm 2\sqrt{1/q}\bigr). $$

If a matrix has, with only a few exceptions, eigenvalues that lie in this range, then it may be argued that the outliers correspond to the information content of the matrix while the other eigenvalues correspond to random noise.

Instead of focusing on the estimated covariance matrix $\hat\Sigma$ directly, it is more convenient to consider the corresponding correlation matrix $\hat C$ because random-matrix theory [IV.24] is most easily applied to cases in which all variances are equal. $\hat C$ is defined by its entries $\hat C_{ij} = \hat\Sigma_{ij}/\sqrt{\hat\Sigma_{ii}\hat\Sigma_{jj}}$ and is itself a covariance matrix for the returns of $n$ assets but with each return rescaled so it has unit variance.

Let $\hat C = Q\Lambda Q^{-1}$ be an eigendecomposition of $\hat C$ with the eigenvalues on the diagonal of $\Lambda$ and the corresponding eigenvectors in the columns of $Q$. We compare the distribution of the eigenvalues $\lambda_i$ to the theoretical distribution predicted for a random matrix $M_r = H_r^TH_r$, where the elements of $H_r$ are drawn independently from a normal distribution with unit variance. The eigenvalues may be separated into two groups by using a suitable method, e.g., by specifying a threshold, $\alpha_{rm} > 1$, such that any eigenvalue larger than $\alpha_{rm}\lambda_+$ is deemed to convey information, with the rest deemed to be random noise.

The correlation matrix can be cleaned by replacing eigenvalues corresponding to noise with the average of such eigenvalues and leaving the other eigenvalues as they are. This set of cleaned eigenvalues can be used to create a cleaned covariance matrix in two steps. First, define $\tilde C$ to be $Q\tilde\Lambda Q^T$, where $\tilde\Lambda$ is the diagonal matrix with diagonal entries equal to the cleaned eigenvalues. Second, define the cleaned covariance matrix $\hat\Sigma_{rm}$ by

$$ (\hat\Sigma_{rm})_{ij} = \tilde C_{ij}\sqrt{\hat\Sigma_{ii}\hat\Sigma_{jj}}. $$

This covariance matrix can now be used along with $\hat\mu$ as a basis for portfolio optimization.

It is important to underscore that the approach for 
cleaning a covariance matrix using the theory of random matrices is most appropriate when the value of $q = T/n$ is not much larger than 1.
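The cleaning procedure can be sketched as follows. The function name `clean_covariance`, the argument name `threshold` (playing the role of the multiplier applied to $\lambda_+$), and the simulated one-factor data are assumptions of this illustration. The upper edge $\lambda_+$ is computed with unit variance, which is appropriate for a correlation matrix built from a covariance estimator that is already normalized by the number of observations.

```python
import numpy as np

def clean_covariance(Sigma_hat, T, threshold=1.1):
    """Replace 'noise' eigenvalues of the correlation matrix by their average."""
    n = Sigma_hat.shape[0]
    q = T / n
    sd = np.sqrt(np.diag(Sigma_hat))
    C = Sigma_hat / np.outer(sd, sd)              # correlation matrix
    lam, Q = np.linalg.eigh(C)                    # C = Q diag(lam) Q^T
    lam_plus = 1 + 1 / q + 2 * np.sqrt(1 / q)     # upper edge for unit variance
    noise = lam <= threshold * lam_plus           # eigenvalues deemed to be noise
    lam_clean = lam.copy()
    if noise.any():
        lam_clean[noise] = lam[noise].mean()      # averaging preserves the trace
    C_clean = Q @ np.diag(lam_clean) @ Q.T
    return C_clean * np.outer(sd, sd)             # back to covariance scale

# Simulated returns with one common "market" factor (hypothetical data).
rng = np.random.default_rng(3)
T, n = 200, 50
market = rng.normal(size=T)
beta = rng.uniform(0.5, 1.5, size=n)
R = np.outer(market, beta) + rng.normal(size=(T, n))

Sigma_hat = np.cov(R, rowvar=False)
sd = np.sqrt(np.diag(Sigma_hat))
Sigma_rm = clean_covariance(Sigma_hat, T)
```

Because the noise eigenvalues are replaced by their mean, the trace of the cleaned correlation matrix still equals $n$, although its individual diagonal entries need not be exactly 1.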

Nearest Correlation Matrix 

A problem that is related to the estimation of a covari- 
ance matrix is that of computing a correlation matrix 
that is closest in some metric to a given matrix. This 
problem arises routinely in applications of portfolio 
optimization in which financial analysts wish to impose 
their priors by altering various entries of the correla- 
tion matrix to conform to their beliefs, e.g., that the 
correlation between stock A and stock B is 0. Arbitrary 
changes to elements of a bona fide correlation matrix 
can easily violate positive-semidefiniteness, implying 
negative variances for certain portfolios. Nonpositive- 
semidefiniteness can also arise if elements of the cor- 
relation matrix are estimated individually rather than 
via a matrix estimator. 

Under the Frobenius norm, a unique solution to the 
nearest correlation matrix (NCM) problem exists, and 
Higham has shown that it can be computed via an 
alternating projections algorithm that projects onto 
the space of matrices with unit diagonal and the cone 
of symmetric positive-semidefinite matrices. Although 
this algorithm is guaranteed to converge, it does so 
at a linear rate, which can be slow for large matrices. 
Using the dual of the NCM problem, it is possible to 
achieve global quadratic convergence by applying New- 
ton’s method, as shown by Qi and Sun. The NCM prob- 
lem has also been extended to the case where the true 
correlation matrix is assumed to have a k-factor struc- 
ture, where k is much less than the dimension of the 
correlation matrix. 
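The alternating projections idea can be sketched in a few lines. The version below alternates between the positive-semidefinite cone and the set of unit-diagonal matrices, with a Dykstra-style correction applied to the cone projection; the function names, the stopping rule, and the small test matrix are assumptions of this illustration, and the sketch omits the weighting and refinements of a production implementation.

```python
import numpy as np

def proj_psd(A):
    """Project a symmetric matrix onto the positive-semidefinite cone."""
    lam, Q = np.linalg.eigh(A)
    return (Q * np.maximum(lam, 0.0)) @ Q.T

def nearest_correlation(A, tol=1e-8, max_iter=1000):
    """Alternating projections with a Dykstra correction (Frobenius norm)."""
    Y = A.copy()
    dS = np.zeros_like(A)
    for _ in range(max_iter):
        R = Y - dS                      # remove the previous correction
        X = proj_psd(R)                 # project onto the PSD cone
        dS = X - R                      # update the correction
        Y = X.copy()
        np.fill_diagonal(Y, 1.0)        # project onto unit-diagonal matrices
        if np.linalg.norm(Y - X, "fro") <= tol * np.linalg.norm(Y, "fro"):
            break
    return Y

# A symmetric unit-diagonal matrix that is not positive semidefinite
# (hypothetical analyst-specified correlations).
A = np.array([[ 1.0, 0.9, -0.9],
              [ 0.9, 1.0,  0.9],
              [-0.9, 0.9,  1.0]])
X = nearest_correlation(A)
```

For this particular input the off-diagonal entries shrink from magnitude 0.9 to magnitude 0.5, the largest magnitude for which this sign pattern is positive semidefinite.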


Finding a Possible Frontier 

It is possible to modify a covariance matrix so that the 
corresponding efficient frontier is possible, rather than 
impossible, in the sense described in section 1.4. The 
tangency portfolio of the CAPM should satisfy $\omega_{mkt} = (1/(B - Cr_f))\Sigma^{-1}(\mu - r_fe)$, up to scaling by a positive constant. If this equality does not hold, then an adjustment to $\Sigma$ can be made to restore the equality. This approach was introduced by Brennan and Lo and is related to the Black-Litterman method of asset allocation with prior information, which is described further in section 2.3.

Brennan and Lo's covariance matrix, $\Sigma_{poss}$, has the desired property that $\omega_{mkt} = (1/(B - Cr_f))\Sigma_{poss}^{-1}(\mu - r_fe)$, and it is constructed to be the matrix requiring the least amount of change to the original covariance matrix $\Sigma$. Specifically, the nature of their proposed change alters $\Sigma^{-1}$ only with respect to the one-dimensional vector space spanned by $\mu - r_fe$, and the product of $\Sigma^{-1}$ with any vector orthogonal to $\mu - r_fe$ is unaffected. Brennan and Lo argue that this covariance matrix should be used by those whose best estimate of the covariance matrix is otherwise $\Sigma$ but who also have a strong conviction that the CAPM must hold and that $\mu$ and $\omega_{mkt}$ are, in fact, the correct expected returns and market weights.

2.3 Bayesian Methods 

Estimators for the mean vector, $\mu$, are often considered particularly problematic because of their large estimation errors. To improve the estimation of $\mu$, as well as $\Sigma$, techniques applying Bayes's theorem [V.11 §1] have been proposed. The basic idea is that an investor specifies certain prior information about the nature of asset returns, updates this prior with additional information and observations, and finally obtains a posterior distribution for $r$. Improved estimates for $\mu$ and $\Sigma$ are then recovered from this posterior distribution. In general, there is no restriction on the nature of the possible prior or additional information used in Bayesian methods. For the purposes of our discussion, we will restrict attention to the simple assumption that asset returns follow a normal distribution, so that $r \sim N(\mu, \Sigma)$. The values of $\mu$ and $\Sigma$ are unknown a priori, but the additional information and Bayesian updating procedure will allow estimates for these parameters to be made.

The additional information used throughout our discussion consists of a list of $m$ separate pieces of information, each specifying a probability distribution for the value of a linear transformation of $\mu$. For each $j = 1, \ldots, m$, the applicable transformation is specified by a $k_j \times n$ matrix $P_j$, and the $j$th piece of information is the assumption that $P_j\mu \sim N(Q_j, S_j)$ for some $k_j$-dimensional vector $Q_j$ and some $k_j \times k_j$ covariance matrix $S_j$. This information specifies that the transformation of $\mu$ given by $P_j\mu$ is normally distributed with mean $Q_j$ and covariance matrix $S_j$. The value of $k_j$ may vary with $j$.

With the prior and additional information in hand, we can apply Bayes's theorem for continuous distributions to obtain a posterior distribution for asset returns. The procedure is to evaluate the integral with respect to $\mu$ of the probability density for $r$, with unknown values of $\mu$ and $\Sigma$, multiplied by the product of all the probability densities corresponding to the additional information. The result is a posterior distribution for $r$ that is normally distributed with mean and covariance given by

$$ \mu_{post} = \Bigl(\sum_{j=1}^{m} P_j^TS_j^{-1}P_j\Bigr)^{-1}\sum_{j=1}^{m} P_j^TS_j^{-1}Q_j \tag{10} $$

and

$$ \Sigma_{post} = \Sigma\Bigl(I_n - \Bigl(I_n + \sum_{j=1}^{m} P_j^TS_j^{-1}P_j\Sigma\Bigr)^{-1}\Bigr)^{-1}, $$

where $I_n$ denotes the $n \times n$ identity matrix. Note that the expression for $\Sigma_{post}$ still involves the unknown covariance matrix $\Sigma$. Additional information can be assumed about $\Sigma$, and a further Bayesian posterior distribution can be calculated to eliminate the dependence on $\Sigma$. However, for our purposes we will simply replace $\Sigma$ with the unbiased estimator $\hat\Sigma$ derived from historical returns, as discussed in section 2.1. A shortcut of this sort is typical in many applications of Bayesian analysis because elimination of $\Sigma$ generally involves more complicated integrals.

There is an alternate derivation for the expression of $\mu_{post}$ from (10): $\mu_{post}$ is the value of $\mu$ that minimizes the function

$$ J(\mu) = \sum_{j=1}^{m} \|P_j\mu - Q_j\|^2_{S_j^{-1}}, $$

where we have used the notation $\|v\|^2_W = v^TWv$. That is, $\mu_{post}$ is the point most "compatible" with the constraints $P_j\mu = Q_j$, with the uncertainty around each constraint specified by $S_j$. To see that $\mu_{post}$ indeed minimizes $J$, compute the gradient of $J$ and solve for the value of $\mu$ that makes the gradient equal to 0. The resulting formula is the same as the formula for $\mu_{post}$ in (10).
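The posterior formulas can be evaluated and cross-checked numerically. In the sketch below the function names `posterior` and `J`, the covariance matrix, and the two pieces of information (a noisy statement about the whole mean vector and a single relative view) are hypothetical. The covariance expression is compared with the algebraically equivalent form $\Sigma + (\sum_j P_j^TS_j^{-1}P_j)^{-1}$.

```python
import numpy as np

def posterior(Sigma, P_list, Q_list, S_list):
    """Posterior mean (10) and posterior covariance of r."""
    n = Sigma.shape[0]
    M = sum(P.T @ np.linalg.solve(S, P) for P, S in zip(P_list, S_list))
    b = sum(P.T @ np.linalg.solve(S, Q)
            for P, Q, S in zip(P_list, Q_list, S_list))
    mu_post = np.linalg.solve(M, b)
    I = np.eye(n)
    Sigma_post = Sigma @ np.linalg.inv(I - np.linalg.inv(I + M @ Sigma))
    return mu_post, Sigma_post, M

def J(mu_vec, P_list, Q_list, S_list):
    """Sum of squared residuals P_j mu - Q_j, each weighted by S_j^{-1}."""
    return sum((P @ mu_vec - Q) @ np.linalg.solve(S, P @ mu_vec - Q)
               for P, Q, S in zip(P_list, Q_list, S_list))

# Hypothetical inputs (assumed for illustration only).
Sigma = np.array([[0.040, 0.006, 0.000],
                  [0.006, 0.090, 0.010],
                  [0.000, 0.010, 0.160]])
P_list = [np.eye(3), np.array([[1.0, -1.0, 0.0]])]
Q_list = [np.array([0.08, 0.10, 0.12]), np.array([0.01])]
S_list = [0.05 * Sigma, np.array([[0.001]])]

mu_post, Sigma_post, M = posterior(Sigma, P_list, Q_list, S_list)
```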


The Grand Mean 


Bayesian techniques can be used to determine a poste- 
rior mean and covariance matrix based on a combina- 
tion of historical information and information about a 
grand mean, an n-dimensional vector with all compo- 
nents equal to some real number g. The assumption is 
that asset returns will ultimately tend toward a com- 
mon average value and, by incorporating this tendency 
into a Bayesian analysis, it may be possible to obtain 
better values for $\mu_{post}$ and $\Sigma_{post}$.

Jorion uses a total of $m = T + 1$ pieces of additional information to update the prior that $r$ is normally distributed, where $T$ is the sample size of the available return data. The first $T$ items are based on the historical data observations and are of the form $\mu = r_j$ for $1 \leq j \leq T$, where $r_j$ is the $j$th observed return, with uncertainty $\hat\Sigma$, the historical covariance matrix. The final item has the form $\mu = ge$, the vector with all elements equal to the common grand mean, with uncertainty specified by $(1/\lambda)\hat\Sigma$ for a suitable $\lambda > 0$.

The appropriate value of $\lambda$ is estimated as

$$ \hat\lambda = \frac{n + 2}{(\hat\mu - g_0e)^T\hat\Sigma^{-1}(\hat\mu - g_0e)}, $$

where $g_0$ is an estimate for the expected return of the minimum-risk portfolio. After performing the necessary Bayesian updating, Jorion finds estimators for the expected return and covariance that depend only on historical data, namely

$$ \mu_J = \frac{T}{T + \hat\lambda}\,\hat\mu + \frac{\hat\lambda}{T + \hat\lambda}\,g_0e, $$

$$ \Sigma_J = \Bigl(1 + \frac{1}{T + \hat\lambda}\Bigr)\hat\Sigma + \frac{\hat\lambda}{T(T + 1 + \hat\lambda)}\,\frac{ee^T}{e^T\hat\Sigma^{-1}e}. $$

The values $\mu_J$ and $\Sigma_J$ tend toward $\hat\mu$ and $\hat\Sigma$ as $T \to \infty$, but for smaller values of $T$ there are "correction" terms that take on a greater weight.
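The grand-mean estimators can be computed directly from a sample of returns. The sketch below uses simulated data (hypothetical parameters) and takes $g_0$ to be the sample return of the estimated minimum-variance portfolio, $B/C$ in the notation of section 1.1; that choice of $g_0$ is an assumption of this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
T, n = 60, 4
# Simulated returns (hypothetical mean vector and covariance matrix).
true_mean = [0.010, 0.012, 0.008, 0.015]
true_cov = 0.002 * np.eye(n) + 0.001
R = rng.multivariate_normal(true_mean, true_cov, size=T)

mu_hat = R.mean(axis=0)
Sigma_hat = np.cov(R, rowvar=False)
e = np.ones(n)
Si_e = np.linalg.solve(Sigma_hat, e)

# Grand mean: return of the sample minimum-variance portfolio.
g0 = (mu_hat @ Si_e) / (e @ Si_e)

# Estimated precision of the grand-mean information.
d = mu_hat - g0 * e
lam = (n + 2) / (d @ np.linalg.solve(Sigma_hat, d))

# Shrunk mean and adjusted covariance.
mu_J = T / (T + lam) * mu_hat + lam / (T + lam) * g0 * e
Sigma_J = ((1 + 1 / (T + lam)) * Sigma_hat
           + lam / (T * (T + 1 + lam)) * np.outer(e, e) / (e @ Si_e))
```

Each component of `mu_J` lies between the corresponding sample mean and `g0`, which is the shrinkage effect described in the text.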


Market Equilibrium and Investor Beliefs 

One can approach portfolio optimization by starting 
from a neutral market-implied set of expected returns 
and allowing investors to overlay their views to inform 
and modify these values. This method avoids the well- 
known difficulty of correctly predicting future returns 
from historical data and also provides a mechanism 
through which investors can express views different 
from those implied by history or the market.

A Bayesian technique along these lines was developed by Black and Litterman, who started with the usual prior that the distribution of $r$ is normally distributed but assumed that the mean and covariance matrix are unknown. This prior is then updated with two types of additional information. The first is that the value of $\mu$ has a distribution centered at $\mu_{mkt}$, the expected return vector implied by the market, discussed further below. The second is information provided by the investor regarding his beliefs about the distribution of $\mu$. Using all this additional information, a posterior normal distribution for $r$ is computed.

To determine the expected return vector implied by the market, assume that the CAPM holds and that the tangency portfolio is given by $\omega_{mkt} = (1/\gamma)\Sigma^{-1}(\mu - r_fe)$, where $\gamma$ is a constant that can be interpreted as the level of investor risk tolerance. Because the portfolio of market weights is easily observed, it is useful to rewrite this formula as an expression for $\mu$ and to define this to be the value of $\mu$ implied by the market, namely $\mu_{mkt} = r_fe + \gamma\Sigma\omega_{mkt}$. The value $\gamma$ can be estimated using historical data as the level of risk tolerance that is compatible with the risk in the market portfolio and the excess return to the market portfolio over the risk-free rate. Black and Litterman assume that the uncertainty around $\mu_{mkt}$ is $\tau\Sigma$, where $\tau$ is a small positive number. The uncertainty is therefore assumed to be proportional to the covariance matrix for returns, but small. The exact value of $\tau$ should be chosen in such a way as to reflect the uncertainty in the mean estimator $\mu_{mkt}$. The value of $\tau$ should generally be close to zero to reflect the idea that the uncertainty in the market-implied mean return vector should not be very large.

The second step of Black and Litterman's method is the incorporation of specified investor beliefs. An investor is allowed to express a number of views of the form

$$ p_{k,1}\mu_1 + \cdots + p_{k,n}\mu_n = q_k + \epsilon_k, $$

where $k = 1, \ldots, K$ runs through the list of views expressed. The quantity $\epsilon_k$ represents a normally distributed random variable with zero mean, reflecting the uncertainty in the $k$th investor belief. The $p_{k,i}$ and $q_k$ are real numbers, and $\mu_i$ represents the expected return for the $i$th asset. This set of beliefs can be written compactly in matrix form as

$$ P_{inv}\mu = Q_{inv} + \epsilon_{inv}, \tag{11} $$

where $P_{inv} = [p_{k,i}]_{k,i}$ is a $K \times n$ matrix, $Q_{inv} = [q_k]_k$ is a $K$-dimensional vector, and $\epsilon_{inv}$ is a multivariate normal random variable with zero mean and diagonal covariance matrix $\Omega_{inv}$.

To combine the investor beliefs of (11) with the
market-implied returns and determine a posterior
distribution for $r$, one can follow the methodology
described at the start of section 2.3 with $m = 2$ pieces
of information. To incorporate market-implied returns,
Black and Litterman set $P_1 = I_n$, $Q_1 = \mu_{\mathrm{mkt}}$,
and $S_1 = \tau\Sigma$. To incorporate investor beliefs, they set
$P_2 = P_{\mathrm{inv}}$, $Q_2 = Q_{\mathrm{inv}}$, and
$S_2 = \Omega_{\mathrm{inv}}$. They then find that the posterior mean is

$$\mu_{\mathrm{BL}} = \bigl((\tau\Sigma)^{-1} + P_{\mathrm{inv}}^T\Omega_{\mathrm{inv}}^{-1}P_{\mathrm{inv}}\bigr)^{-1}
\bigl((\tau\Sigma)^{-1}\mu_{\mathrm{mkt}} + P_{\mathrm{inv}}^T\Omega_{\mathrm{inv}}^{-1}Q_{\mathrm{inv}}\bigr).$$

This is the expected value of $r$ that is most consistent
(subject to the specified levels of uncertainty) with
both the expected returns implied by the market and
investor beliefs.
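
The posterior mean described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the three-asset covariance matrix, market weights, risk-free rate `r_f`, `gamma`, `tau`, and the single view are all made-up values, and the function name `black_litterman_mean` is hypothetical.

```python
import numpy as np

def black_litterman_mean(Sigma, mu_mkt, tau, P_inv, Q_inv, Omega_inv):
    """Posterior mean combining market-implied returns with investor views."""
    prior_precision = np.linalg.inv(tau * Sigma)               # (tau*Sigma)^{-1}
    view_precision = P_inv.T @ np.linalg.inv(Omega_inv) @ P_inv
    rhs = prior_precision @ mu_mkt + P_inv.T @ np.linalg.solve(Omega_inv, Q_inv)
    return np.linalg.solve(prior_precision + view_precision, rhs)

# Illustrative (made-up) inputs for three assets.
Sigma = np.array([[0.040, 0.006, 0.004],
                  [0.006, 0.090, 0.010],
                  [0.004, 0.010, 0.160]])
w_mkt = np.array([0.5, 0.3, 0.2])            # observed market weights (assumed)
gamma, r_f, tau = 2.5, 0.01, 0.05            # assumed constants
mu_mkt = r_f * np.ones(3) + gamma * Sigma @ w_mkt   # market-implied returns

# One view: asset 3 outperforms asset 1 by 2%, held with some uncertainty.
P_inv = np.array([[-1.0, 0.0, 1.0]])
Q_inv = np.array([0.02])
Omega_inv = np.array([[1e-4]])

mu_bl = black_litterman_mean(Sigma, mu_mkt, tau, P_inv, Q_inv, Omega_inv)
```

Two limiting cases give a sanity check: a very uncertain view leaves the market-implied vector essentially unchanged, while a nearly certain view forces the posterior mean to satisfy the view almost exactly.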

2.4 Other Approaches and Metrics 

It is possible to extend the basic framework of mean- 
variance optimization by using measures of risk and 
reward other than the mean vector, $\mu$, and the covariance
matrix, $\Sigma$. It is also possible to find additional
ways to implement the basic mean-variance frame- 
work beyond the methods already described. In the 
remainder of this section we describe a few of these 
approaches. 

One-Sided Risk Measures 

Some of the most common alternative measures of 
portfolio risk are “one sided,” in that they focus on 
downside risk rather than symmetric risk around an 
expected return. Examples include value at risk (VaR) 
and shortfall risk. To measure VaR, an investor must 
specify a threshold $0 < \theta < 1$. The VaR is an amount
such that the probability that portfolio losses will equal
or exceed such an amount is exactly equal to $\theta$. Thus,
an investor knows that with probability $1 - \theta$ losses will
not exceed the VaR level. To measure shortfall risk, an 
investor must specify a benchmark, b, relative to which 
performance can be measured. The shortfall risk is the 
probability that the portfolio return will fall below b 
multiplied by the average amount by which the port- 
folio return falls below b conditional on being below 
b. The shortfall risk thus provides an investor with an 
expected value of downside exposure. 
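
Both measures are easy to estimate from a sample of losses or returns. The sketch below is an empirical approximation under the assumption that a sample of outcomes is available; the function names are hypothetical, and the quantile-based VaR only matches the definition above approximately for finite samples.

```python
import numpy as np

def value_at_risk(losses, theta):
    """Empirical VaR: losses reach or exceed it with probability about theta."""
    return np.quantile(losses, 1.0 - theta)

def shortfall_risk(returns, b):
    """P(return < b) times the mean amount below b, given a shortfall."""
    below = returns < b
    if not below.any():
        return 0.0
    prob_below = below.mean()                  # probability of falling below b
    mean_gap = (b - returns[below]).mean()     # average shortfall, conditional
    return prob_below * mean_gap

losses = np.arange(1, 101, dtype=float)        # toy sample: losses 1, 2, ..., 100
var_5 = value_at_risk(losses, 0.05)

returns = np.array([-0.10, -0.02, 0.01, 0.03, 0.08])
shortfall = shortfall_risk(returns, 0.0)       # benchmark b = 0
```

Note that the product of the probability and the conditional average equals the unconditional mean of max(b - r, 0), which is why shortfall risk reads as an expected downside exposure.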

By replacing variance with a one-sided measure in 
choosing optimal portfolios, an investor is able to 
control his worst-case scenarios to within a certain 
confidence threshold (for VaR) or his average loss below
a benchmark (for shortfall risk). A difficulty with using 
these measures, however, is that they do not yield 
closed-form expressions and they are less intuitive. 
Also, from a computational perspective, nonlinear opti- 
mization will generally be necessary to find optimal 
portfolios with respect to these measures, which is far 
less efficient than the linear and quadratic program- 
ming algorithms applicable to standard mean-variance 
optimization problems. Nonetheless, despite the addi- 
tional computational complexity, it is possible to opti- 
mize relative to these alternative measures of risk, 
and doing so may be desirable for investors with pref- 
erences or asset-return dynamics that are especially 
asymmetric. 

Resampling the Efficient Frontier 

A technique known as resampling (which is closely 
related to the bootstrap technique in statistics) can be 
used to determine portfolio weights. The idea is to 
smooth out errors arising from uncertainty in estimators
for $\mu$ and $\Sigma$ by generating a large number of alternative
possibilities for these values from a single data
set, constructing a resampled efficient frontier in each 
alternative case, and then averaging all of the alterna- 
tives to find an average resampled frontier. Optimal 
portfolios are then selected from among the points on 
the average resampled frontier. 

To construct a resampled frontier, Michaud gener- 
ates a hypothetical alternative history of realized asset 
returns. These are chosen from a multivariate normal 
distribution with mean and covariance equal to the 
unbiased estimates $\hat{\mu}$ and $\hat{\Sigma}$, determined based on the
true history. The alternative history is then used to cal- 
culate portfolios on a resampled efficient frontier. This 
process is repeated a large number of times, with the 
alternative history being chosen independently each 
time. The number of portfolios computed on the fron- 
tier is fixed across all resamplings. In addition, an upper 
bound for possible returns on the resampled frontiers 
can be specified in order to limit the returns on all port- 
folios considered to a finite range. This limit may be 
taken to be the largest element in $\mu$, for example.

The individual portfolios on the resampled frontiers 
are averaged together to form an average resampled 
frontier. All these frontiers are discrete, rather than 
continuous, but if the number of portfolios computed 
is large, a close approximation to a continuous frontier
is obtained. We calculate expected returns and standard
deviations for each portfolio on the average resampled
frontier using estimates for mean and covariance such as
$\hat{\mu}$ and $\hat{\Sigma}$. These values allow us to determine which
portfolio is optimal under a specified metric, such as
minimum risk, maximum utility, or maximum Sharpe ratio.
Alternatively, we could compute the
optimal portfolio along each resampled frontier and 
average the results. The answer in this situation is gen- 
erally not the same as the optimal portfolio on the aver- 
age resampled frontier, but it is generally computation- 
ally much easier because it avoids the need to compute 
all points on the frontier for each resampling. 
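
The cheaper variant just mentioned (optimize on each resampled history, then average) can be sketched as follows. This is a simplified illustration, not Michaud's procedure: it assumes an unconstrained maximum-Sharpe portfolio with zero risk-free rate, proportional to $\Sigma^{-1}\mu$, in place of a constrained frontier, and all inputs and function names are made up.

```python
import numpy as np

def max_sharpe_weights(mu, Sigma):
    """Unconstrained tangency-style weights, normalised to sum to one (assumed rule)."""
    raw = np.linalg.solve(Sigma, mu)
    return raw / raw.sum()

def resampled_weights(mu_hat, Sigma_hat, n_obs, n_resamples, seed=0):
    """Average the optimal weights over independently simulated histories."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(mu_hat)
    for _ in range(n_resamples):
        # Hypothetical alternative history drawn from N(mu_hat, Sigma_hat).
        history = rng.multivariate_normal(mu_hat, Sigma_hat, size=n_obs)
        mu_alt = history.mean(axis=0)
        Sigma_alt = np.cov(history, rowvar=False)
        total += max_sharpe_weights(mu_alt, Sigma_alt)
    return total / n_resamples

mu_hat = np.array([0.05, 0.07, 0.10])          # made-up estimates
Sigma_hat = np.array([[0.040, 0.006, 0.004],
                      [0.006, 0.090, 0.010],
                      [0.004, 0.010, 0.160]])
w_avg = resampled_weights(mu_hat, Sigma_hat, n_obs=500, n_resamples=200)
```

Averaging over many simulated histories damps the sensitivity of the weights to estimation error in any single history, which is the point of the technique.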

Ordering of Returns 

Mean-variance portfolio optimization can be extended 
by allowing an investor to specify less information 
about asset returns than is encompassed by complete 
knowledge of the vector $\mu$ of expected returns. For
example, an investor may specify a list of inequalities 
and interrelationships that will hold for elements of 
the return vector r. This type of information leads to 
a much larger set of “efficient” portfolios than mean- 
variance optimization, and it is thus more complicated 
to select a single optimal portfolio from among all 
the efficient ones. Almgren and Chriss resolve this 
difficulty by introducing a methodology for ranking 
portfolios in light of the information specified by the 
investor. They also cast their methodology in a manner 
that makes it computationally feasible to determine an 
optimal portfolio. 

V.11 Bayesian Inference in Applied
Mathematics

Desmond J. Higham 


1 Use of Bayes’s Theorem 

Deterministic models can be used to represent a huge 
variety of physical systems. Applied mathematicians 
are traditionally schooled in a framework where a 
given mathematical model is presented, or developed, 
with known values for all input data, such as prob- 
lem domain, initial conditions, boundary conditions, 
and model parameters. Our task is then to analyze 
properties of the solution by whatever means we have 
at our disposal, for example, finding the exact solu- 
tion, developing approximations under asymptotic lim- 
its, or applying computational methods. However, even 
when a deterministic model is appropriate, there are 
many realistic scenarios where uncertainty arises and 
it becomes beneficial to employ tools from statistics. 
Here are four illustrative examples. 

(1) A partial differential equation describes the spread 
of a pollutant after an environmental disaster. How- 
ever, the initial location and quantity of the pol- 
lutant can only be estimated. What is our uncer- 
tainty in the pollutant level after one week, given 
the uncertainty in the initial data? This is a problem- 
sensitivity or conditioning question. 

(2) A pair of dice are rolled on a table. It is not prac- 
tical to measure the initial location and velocity of 
the dice and then solve their equations of motion 
until they come to a halt. Instead, we can model 
each die independently as a random variable that 
is equally likely to take the value 1, 2, . . . , 6. Here, 
we are introducing randomness as a convenient 
modeling approximation. 


(3) A large chemical reaction network is represented by 
a system of ordinary differential equations. How- 
ever, the value of one reaction rate constant is 
not currently known and is too tricky to measure 
directly from a laboratory experiment. In this case, 
we would like to infer a value for the unknown rate 
constant using whatever data is available about the 
full system. This is a parameter-fitting, or
model-calibration, problem, where the key question is
what parameter value causes the model to best 
fit the data or, from a statistical inference per- 
spective: for each possible choice of the parameter 
value, what is my degree of belief that this choice 
is correct? 

(4) Biologists have two competing, and incompatible, 
theories for the mechanism by which signals are 
passed through a transduction network. These lead 
to two different deterministic mathematical mod- 
els, each having one or more unknown modeling 
parameters. Given some experimental data, repre- 
senting outputs from the network, which of the 
two models is most likely to be correct? This is a 
model-comparison problem. 


Questions of this type, where models based on mechanistic
laws of motion meet real data, lie at the intersection
between applied mathematics and applied statistics.
In recent years the term uncertainty quantification
[II.34] has been coined to describe this field,
although the phrase also has many other connotations.

A powerful tool for statistical inference, and hence 
for uncertainty quantification, is Bayes’s theorem (also 
commonly referred to as Bayes’ theorem), which allows 
us to update our beliefs according to data. The theorem 
is named after the Reverend Thomas Bayes (1701-61), 
a British mathematician and Presbyterian minister. 

Considering item (3) from the list above, let $\gamma$ denote
the unknown problem parameter in our model and let
$Y$ denote observational data consisting of a time series
of chemical species levels. Bayes’s theorem may then
be written

$$P(\gamma \mid Y) = \frac{P(Y \mid \gamma)\,P(\gamma)}{P(Y)}.$$

Here, $P(\gamma \mid Y)$, known as the posterior distribution,
answers our question. It quantifies the probability of
the model parameter $\gamma$ given the data $Y$. On the
right-hand side we have $P(Y \mid \gamma)$, known as the likelihood,
quantifying the probability of the data $Y$ arising given
the model parameter $\gamma$. This value is available to us,
since we have access to the mathematical model. Also
appearing on the right-hand side is the prior distribution,
$P(\gamma)$. This factor quantifies our original degree of
belief in the parameter value $\gamma$, before we see any data.

The denominator $P(Y)$ does not depend on the value
of the model parameter $\gamma$, so we have the proportionality
relationship

$$P(\gamma \mid Y) \propto P(Y \mid \gamma)\,P(\gamma).$$

In words, the posterior is proportional to the product
of the likelihood and the prior.


2 Example 


To illustrate these ideas we consider the very simple
production and decay system

$$\emptyset \xrightarrow{\;1\;} X, \qquad X \xrightarrow{\;\gamma\;} \emptyset.$$

The first reaction indicates that a species $X$ is created
at a constant, unit rate. Conversely, the second reaction
tells us that $X$ also degrades at a rate proportional to
its current level, with rate constant $\gamma$. The corresponding
mass action ordinary differential equation for the
abundance of $X$ at time $t$ is given by

$$\frac{dx(t)}{dt} = 1 - \gamma x(t),$$

with solution

$$x(t) = \frac{1}{\gamma} + \left(x(0) - \frac{1}{\gamma}\right) e^{-\gamma t}.$$

Suppose that our data takes the form $Y = \{t_j, y_j\}_{j=1}^{D}$,
where $y_j$ denotes the observed level of $x(t)$ at time $t_j$.
Our aim is to use this data to infer the unknown rate
constant, $\gamma$. To emphasize that $\gamma$ is not known, we will
also write the solution as $x_\gamma(t)$.

To define a likelihood, we must impose some assumptions.
The simplest choice is to suppose that the
data contains experimental measurement errors that
are independent across time points and normally
distributed about the “exact” value, with a common
standard deviation, $\sigma$. This leads to a likelihood function
of the form

$$P(Y \mid \gamma) = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\left(-\frac{(y_j - x_\gamma(t_j))^2}{2\sigma^2}\right).$$

In words, the right-hand side quantifies the probability
of the data $Y$ arising given that the rate constant is $\gamma$.



Figure 1 (a) The solid curve shows the underlying exact
solution, and the circles indicate the synthetic data generated
by adding Gaussian noise. (b) The posterior distribution
for the model parameter, $\gamma$ (plotted over $1.0 \le \gamma \le 3.0$),
whose “correct” value is known by construction to be $\gamma = 2$.


Figure 1(a) uses a solid line to plot the solution when
$x(0) = 1$ and $\gamma = 2$. To generate some synthetic data,
we take the solution $x(t_j)$ at the time points $t_1 = 0.1$,
$t_2 = 0.2$, ..., $t_6 = 0.6$ and add independent normally
distributed noise with mean zero and standard deviation
$\sigma = 0.05$. These data points are shown as circles.
Figure 1(b) plots the resulting posterior distribution for
the rate constant $\gamma$. Here, we have used the known values
$x(0) = 1$ and $\sigma = 0.05$ and taken the prior to be
uniform over $[1, 3]$; that is, $P(\gamma) = \tfrac{1}{2}$ for
$1 \le \gamma \le 3$ and $P(\gamma) = 0$ otherwise. Because we
generated the data ourselves, we are able to judge this
result. We see that the inference procedure has assigned
nontrivial weight to the “correct” value of $\gamma = 2$ but it
assigns more weight to slightly higher values of $\gamma$, with
a peak at around 2.18. Intuitively, this mismatch has
arisen because, by chance, the two noisiest data points,
at $t = 0.2$ and $t = 0.5$, are both below the true solution,
encouraging the decay rate to be overestimated. Of course,
the inference would become more accurate if we were to
increase the number of data points (or reduce the noise
level).
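
The experiment just described can be reproduced in outline by evaluating the posterior on a grid of rate constants. This is a minimal sketch rather than the code behind figure 1: the random seed, the grid resolution, and the function names are assumptions, so the noisy data (and hence the location of the peak) will differ from the figure.

```python
import numpy as np

def solution(t, gamma, x0=1.0):
    """Exact solution of dx/dt = 1 - gamma*x with x(0) = x0."""
    return 1.0 / gamma + (x0 - 1.0 / gamma) * np.exp(-gamma * t)

def grid_posterior(t_obs, y_obs, sigma, gammas):
    """Posterior density on a uniform grid, for a prior that is flat over the grid."""
    resid = y_obs[None, :] - solution(t_obs[None, :], gammas[:, None])
    log_like = -0.5 * np.sum(resid**2, axis=1) / sigma**2
    post = np.exp(log_like - log_like.max())   # flat prior contributes a constant
    d_gamma = gammas[1] - gammas[0]
    return post / (post.sum() * d_gamma)       # normalise to unit area

rng = np.random.default_rng(1)                 # arbitrary seed
t_obs = np.arange(1, 7) * 0.1                  # t = 0.1, 0.2, ..., 0.6
sigma = 0.05
y_obs = solution(t_obs, 2.0) + sigma * rng.standard_normal(t_obs.size)

gammas = np.linspace(1.0, 3.0, 401)            # support of the uniform prior
posterior = grid_posterior(t_obs, y_obs, sigma, gammas)
gamma_peak = gammas[np.argmax(posterior)]
```

Subtracting the maximum log-likelihood before exponentiating avoids underflow; the normalising constant $P(Y)$ is never needed explicitly because the grid sum plays that role.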

It is slightly more realistic to consider the case where
the standard deviation, $\sigma$, for the experimental noise is
not known. Conceptually, this presents no added difficulty,
since Bayes’s theorem holds when there are multiple
variables to be inferred. Figure 2 shows the resulting
posterior when we regard $\sigma$ as a second unknown
parameter, with a uniform prior over $0.01 \le \sigma \le 0.1$.

Figure 2 The posterior distribution for the experiment in
figure 1 when both the model parameter, $\gamma$, and the
experimental noise level, $\sigma$, are regarded as unknowns. Here, the
“correct” values are known by construction to be $\gamma = 2$ and
$\sigma = 0.05$.

In this very simple setting, with an assumption of 
independent Gaussian noise, the posterior distribu- 
tion is, after taking logarithms, effectively a least- 
squares measure of the mismatch between model 
and data. With a uniform prior, optimizing this least- 
squares objective function corresponds to computing 
a maximum-likelihood estimate: a point where the pos- 
terior is largest. More generally, using a nonuniform 
prior corresponds to adding a penalty term to the 
least-squares objective function, which is a standard 
approach in nonlinear optimization for constraining 
the solution to lie in a predetermined domain. However, 
a key philosophical difference between a least-squares 
best fit and the full computation of a Bayesian poste- 
rior is that in the latter case we are concerned with 
all possible parameter values and wish to know which 
regions of parameter space support likely values, even 
if they are not globally optimal. Moreover, the posterior 
gives access to higher levels of inference. For example, 
we may sample parameter values from the posterior 
and run the model forward in order to build up a dis- 
tribution for the output. Figure 3 illustrates this idea: 
in part (a) we show solution curves arising from five
independent samples of $\gamma$ from the posterior in figure 1,
so values where the posterior is larger are more
likely to be chosen; in part (b) we show a histogram
of the $X$ level at $t = 1$ based on 10 000 samples of
$\gamma$ from the posterior. Knowledge of the full posterior
distribution also allows us to compute integrals over
parameter space, with more weight attributed to the
more likely values. With this approach, given competing
models we may use so-called Bayes factors in order
to judge systematically which one best describes the
data.

Figure 3 (a) As in figure 1, the solid curve shows the underlying
exact solution and the circles indicate the synthetic
data generated by adding Gaussian noise. Five output solutions
are shown as dashed curves, where the rate constant,
$\gamma$, has been sampled from the posterior shown in figure 1(b).
(b) A histogram for the predicted level of species $X$ at the
future time $t = 1$, based on 10 000 samples of $\gamma$ from the
posterior. The “correct” value is known to be $x(1) = 0.568$.
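
The idea of sampling from the posterior and running the model forward can be sketched as follows. This is an illustrative reconstruction under assumed settings (arbitrary seed, a 401-point grid over the prior range, sampling from the discretised posterior), not the computation used for the published histogram.

```python
import numpy as np

def solution(t, gamma, x0=1.0):
    """Exact solution of dx/dt = 1 - gamma*x with x(0) = x0."""
    return 1.0 / gamma + (x0 - 1.0 / gamma) * np.exp(-gamma * t)

rng = np.random.default_rng(2)                  # arbitrary seed
t_obs = np.arange(1, 7) * 0.1
sigma = 0.05
y_obs = solution(t_obs, 2.0) + sigma * rng.standard_normal(t_obs.size)

# Discretised posterior for a uniform prior on [1, 3].
gammas = np.linspace(1.0, 3.0, 401)
resid = y_obs[None, :] - solution(t_obs[None, :], gammas[:, None])
log_post = -0.5 * np.sum(resid**2, axis=1) / sigma**2
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()

# Draw rate constants in proportion to the posterior, then push them
# through the model to obtain a distribution for the output at t = 1.
gamma_samples = rng.choice(gammas, size=10_000, p=weights)
x1_samples = solution(1.0, gamma_samples)
x1_exact = solution(1.0, 2.0)                   # value for the true gamma = 2
```

A histogram of `x1_samples` plays the role of figure 3(b); summary statistics such as `x1_samples.mean()` or percentiles give a predictive interval for the future level.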

3 Challenges 

The growing popularity of Bayesian inference can be 
attributed, at least in part, to the availability of increas- 
ing computing power and better-quality/more abun- 
dant data sets. Generic challenges in the field include 
the following. 

Priors. A fundamental tenet of Bayesian inference is 
that we must quantify our inherent beliefs at the out- 
set. This allows expert opinions to be incorporated 
into the process but, of course, also places a burden 
on the researcher and introduces a level of subjectiv- 
ity. Even the use of uniform priors, as an indication 
of no inherent preferences, is problematic; for example,
being indifferent about $\gamma$, and hence imposing a
uniform prior, is not equivalent to being indifferent
about $\gamma^2$.

Identifiability. The task of inferring model parame- 
ters may be inherently ill-defined. For example, there 
may be insufficient data or, more fundamentally, 
model parameters may be very highly correlated. In 
a dynamical system, for instance, increasing a decay 
rate may have a very similar effect to decreasing a
production rate, making it difficult to uncover a pre- 
cise value for each, that is, causing the posterior to 
have a flat, elongated peak in that direction. 

High dimension. Many models in applied mathemat- 
ics involve a large number of parameters. Moving 
from the simple two-parameter example in figure 2 
into very high dimensions poses significant chal- 
lenges in terms of both exploring the parameter 
space and visualizing the results. Progress in this 
area is largely based on the use of Markov chain 
Monte Carlo algorithms, where a Markov chain
[II.25] is constructed whose equilibrium distribution
matches the desired posterior, allowing approximate 
samples to be computed via long-time path simula- 
tion. Here, the applied mathematics/applied statis- 
tics interface comes back into focus, with many 
Markov chain Monte Carlo notions having counter- 
parts in the fields of optimization and numerical time 
stepping. 
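
To make the Markov chain Monte Carlo idea concrete, here is a minimal random-walk Metropolis sketch applied to the one-parameter production and decay example. It is only an illustration under assumed settings (seed, step size, chain length, burn-in); practical samplers for high-dimensional problems are considerably more sophisticated.

```python
import numpy as np

def solution(t, gamma, x0=1.0):
    """Exact solution of dx/dt = 1 - gamma*x with x(0) = x0."""
    return 1.0 / gamma + (x0 - 1.0 / gamma) * np.exp(-gamma * t)

def log_posterior(gamma, t_obs, y_obs, sigma):
    """Log posterior up to a constant: Gaussian likelihood, uniform prior on [1, 3]."""
    if gamma < 1.0 or gamma > 3.0:
        return -np.inf
    resid = y_obs - solution(t_obs, gamma)
    return -0.5 * np.sum(resid**2) / sigma**2

def metropolis(log_target, start, n_steps, step_size, rng):
    """Random-walk Metropolis: the chain's equilibrium distribution is the target."""
    chain = np.empty(n_steps)
    current, current_lp = start, log_target(start)
    for i in range(n_steps):
        proposal = current + step_size * rng.standard_normal()
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[i] = current
    return chain

rng = np.random.default_rng(3)                  # arbitrary seed
t_obs = np.arange(1, 7) * 0.1
sigma = 0.05
y_obs = solution(t_obs, 2.0) + sigma * rng.standard_normal(t_obs.size)

chain = metropolis(lambda g: log_posterior(g, t_obs, y_obs, sigma),
                   start=1.5, n_steps=20_000, step_size=0.2, rng=rng)
samples = chain[2_000:]                         # discard an assumed burn-in
```

Only ratios of the target appear in the acceptance test, so the normalising constant $P(Y)$ is never required; this is what makes the approach viable when the parameter space is too large for a grid.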

In the applied mathematics setting, a further issue of 
fundamental importance is the construction of a suit- 
able likelihood function. As we saw in our simple exam- 
ple, a deterministic model does not sit well with the 
notion of likelihood, that is, the probability of seeing 
this data, given the model parameters. With a determin- 
istic model, we either recover the data exactly or we do 
not recover it at all. To get around the issue we took the 
commonly used step of placing all the blame on the
data, assuming that the mismatch is caused entirely
by measurement error. This step clearly ignores the 
fundamental truth that mathematical models are based 
on idealizations and can never reflect all aspects of a 
physical system. 

At first sight, this issue seems to disappear if we 
begin with a stochastic model; for example, the sim- 
ple production and decay system used in our tests 
could be cast as a discrete-state birth and death pro- 
cess, or a continuous-state stochastic differential equa- 
tion. In either case, we are not forced to assume that 
the data-model mismatch is driven solely by experi- 
mental error. Given the model parameters, by construc- 
tion a stochastic model assigns a probability to any 
observed data, and hence we have an automatic like- 
lihood. However, although stochastic modeling offers a 
seamless transition to a likelihood function, it clearly 
does not overcome objections that the model itself is 
a source of error. Overall, the systematic incorporation 
of uncertainties arising from modeling, discretization, 


and experimental observation remains a high-profile 
challenge. 

Further Reading 

Sivia, D. S., and J. Skilling. 2006. Data Analysis: A Bayesian
Tutorial, 2nd edn. Oxford: Oxford University Press.
Smith, R. C. 2013. Uncertainty Quantification. Philadelphia, 
PA: SIAM. 


V.12 A Symmetric Framework with 
Many Applications 

Gilbert Strang 


1 Introduction 


This article describes one of the structures on which 
applied mathematics is built. It has a discrete form, 
expressed by matrix equations, and a continuous form, 
expressed by differential equations. The matrices are 
symmetric and the differential equations are self- 
adjoint. These properties appear naturally when there 
is an underlying minimum principle. 

The important fact about this structure is that it 
is truly useful (and extremely widespread). Problems 
from mechanics, physics, economics, and more fit into 
this framework. The temptation to overwhelm the 
reader with a list of applications is almost irresistible. 
Instead, we go directly to the KKT matrix, which shows 
the form that all of these applications share: 

$$M = \begin{bmatrix} C^{-1} & A \\ A^T & 0 \end{bmatrix}
\quad \begin{matrix} m \text{ rows} \\ n \text{ rows} \end{matrix}$$

$C$ is square, symmetric, and positive-definite (and so
is $C^{-1}$). $A$ is rectangular with $n$ independent columns
($n \le m$). The block matrix $M$ is then symmetric and
invertible, but certainly not positive-definite. This is a
saddle-point matrix.

The first $m$ pivots are positive, coming from $C^{-1}$. The
elimination steps multiply the first block row by $A^TC$
and subtract from the second block row to produce
zeros below $C^{-1}$:

$$\begin{bmatrix} I & 0 \\ -A^TC & I \end{bmatrix}
\begin{bmatrix} C^{-1} & A \\ A^T & 0 \end{bmatrix}
= \begin{bmatrix} C^{-1} & A \\ 0 & -A^TCA \end{bmatrix}.$$

Elimination has produced the $A^TCA$ matrix that is
fundamental to so many problems in computational
science. This $n \times n$ matrix appears in equation (2), and
it is the focus of section 3.

The Schur complement $-A^TCA$ in the last block is
negative-definite, so the last $n$ pivots of $M$ will all be
negative. Then $M$ must have $m$ positive eigenvalues and
$n$ negative eigenvalues (pivots and eigenvalues have
the same signs for symmetric matrices). A graph of the
quadratic,

$$Q(w,u) = \tfrac{1}{2} w^TC^{-1}w + u^TA^Tw
= \tfrac{1}{2} \begin{bmatrix} w_1 & \cdots & w_m & u_1 & \cdots & u_n \end{bmatrix}
\begin{bmatrix} C^{-1} & A \\ A^T & 0 \end{bmatrix}
\begin{bmatrix} w \\ u \end{bmatrix},$$

would go upward in $m$ eigenvector directions and
downward in the (orthogonal) $n$ eigenvector directions.
The saddle point is at $(w,u) = (0, 0)$.
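
The eigenvalue count for the saddle-point matrix is easy to check numerically. The sketch below uses randomly generated matrices (an assumption made purely for illustration; the sizes and seed are arbitrary) to build $M$ and count the signs of its eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
m, n = 5, 3

B = rng.standard_normal((m, m))
C = B @ B.T + m * np.eye(m)                    # symmetric positive-definite
A = rng.standard_normal((m, n))                # independent columns (almost surely)

# Saddle-point (KKT) matrix M = [[C^{-1}, A], [A^T, 0]].
M = np.block([[np.linalg.inv(C), A],
              [A.T, np.zeros((n, n))]])

eigenvalues = np.linalg.eigvalsh(M)            # M is symmetric
n_positive = int(np.sum(eigenvalues > 0))
n_negative = int(np.sum(eigenvalues < 0))

schur = -A.T @ C @ A                           # Schur complement from elimination
```

With these sizes one expects five positive and three negative eigenvalues, and a negative-definite Schur complement, in line with the pivot argument above.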


1.1 Constrained Least Squares 


Let us see how these matrices appear in a broad class
of optimization problems. The minimum principle is
quadratic with linear constraints:

$$\text{minimize } F(w) = \tfrac{1}{2} w^TC^{-1}w - w^Tb
\quad \text{subject to } A^Tw = f.$$

The $n$ constraints $A^Tw = f$ lead to $n$ Lagrange multipliers
$u_1, \ldots, u_n$. Lagrange’s beautiful idea was to
build the constraints (using the multipliers) into the
function $L$:

$$L(w, u) = \tfrac{1}{2} w^TC^{-1}w - w^Tb + u^T(A^Tw - f).$$

The optimal $w$ and $u$ are found at a stationary point
(not a minimum!) of $L$. The coefficient matrix is $M$:

$$\frac{\partial L}{\partial w} = 0 \;\rightarrow\; C^{-1}w + Au = b,
\qquad
\frac{\partial L}{\partial u} = 0 \;\rightarrow\; A^Tw = f. \tag{1}$$

Examples show convincingly that when the variables
$w_1, \ldots, w_m$ have physical meaning (that is, when they
are currents in circuit theory, stresses in mechanics,
velocity in fluids, momentum in physics), the dual variables
$u_1, \ldots, u_n$ have meaning too. These Lagrange
multipliers represent the forces needed to impose the
constraints $A^Tw = f$ (they are voltages in circuit
theory, displacements in mechanics, pressure in fluids,
position in physics).

The $w$s and $u$s are dual unknowns. The fundamental
theorem of optimization is the minimax or duality
theorem. A minimization over $w$ (the primal problem)
connects to a maximization over $u$ (the dual problem).
The first is a minimax of $L(w,u)$, the second is the
maximin, and they are equal.


2 Numerical Optimization/Finite Elements 

These two big subjects are not usually combined. The
finite-element method [II.12] is an approach (a successful
one) to solving differential equations; optimization
[IV.11] is generally seen in a different world.
But both subjects present us with the same choices,
precisely because equation (1) is central to both. Here
are three options for computing $u$ and $w$.

(i) The mixed method for finite elements and the
primal-dual method for optimization solve for $u$
and $w$ together. We work with $M$.

(ii) The stress method (or nullspace method) finds the
best $w$ among all candidates that satisfy the constraints
$A^Tw = f$. If $w_p$ is one particular solution,
we add any solution of $A^Tw = 0$ (Kirchhoff’s
current law). We therefore need a basis for the
nullspace of $A^T$.

(iii) The displacement method eliminates $w = C(b - Au)$
from equation (1) and solves for $u$. Multiply
the first row by $A^TC$ and subtract from the second
row, and then reverse all signs to work with $A^TCA$:

$$A^TCAu = A^TCb - f. \tag{2}$$

Equation (2) is a central problem of scientific computing.
The matrix $A^TCA$ is symmetric positive-definite
(sign reversal produced a minimization). $K = A^TCA$ is
the stiffness matrix in finite elements. It is the graph
Laplacian matrix that appears everywhere in applied
mathematics. In statistics and linear regression, we
have the weighted normal equations.
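
The equivalence of the mixed method (i) and the displacement method (iii) can be checked on a small random problem. This is only a sketch: the matrices, right-hand sides, sizes, and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                  # arbitrary seed
m, n = 6, 3
A = rng.standard_normal((m, n))                 # independent columns (almost surely)
C = np.diag(rng.uniform(0.5, 2.0, size=m))      # positive diagonal C
b = rng.standard_normal(m)
f = rng.standard_normal(n)

# Mixed method: solve the saddle-point system (1) for w and u together.
M = np.block([[np.linalg.inv(C), A],
              [A.T, np.zeros((n, n))]])
wu = np.linalg.solve(M, np.concatenate([b, f]))
w_mixed, u_mixed = wu[:m], wu[m:]

# Displacement method: solve A^T C A u = A^T C b - f, then w = C(b - A u).
K = A.T @ C @ A
u_disp = np.linalg.solve(K, A.T @ C @ b - f)
w_disp = C @ (b - A @ u_disp)
```

Both routes give the same $w$ and $u$ up to rounding; the difference in practice lies in which matrix ($M$, indefinite, or $K$, positive-definite) the solver has to handle.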

For solid mechanics, this $A^TCA$ approach is the
nearly universal choice. For fluid mechanics, with pressures
and velocities, the mixed method is often preferred.
This mathematical contest is reflected in battles
between software packages: variations of SPICE
for electronic circuits and of finite-element codes like
NASTRAN, FEMLAB, and ABAQUS.

An excellent reference for the numerical solution of 
saddle-point problems is the survey by Benzi et al. 
(2005). For fluid problems and mixed methods (with 
preconditioning), Elman et al. (2014) is outstanding. 

3 The Framework 

Here is the key sentence from Strang (1988): “I believe
that this $A^TCA$ pattern is the central framework for
problems of equilibrium ... continuous and discrete,
linear and nonlinear, equations and variational principles.”
$A^TCA$ is the system matrix or stiffness matrix that
contains the geometry in $A$, the material properties in
$C$, and the physical laws in $A^T$. The problems that we
need to solve are

$$A^TCAu = f, \qquad A^TCAx = \lambda x, \qquad
\frac{du}{dt} = -A^TCAu, \qquad \frac{d^2u}{dt^2} = -A^TCAu.$$

The mathematical model produces $A^TCA$. The computational
problem then deals with that matrix. We build
up the equations in three steps ($A$ and $C$ and $A^T$). Then
we attack the equations numerically and analytically.

The continuous model has functions $F$ instead of vectors
$f$, and derivatives $A = d/dx$ and $A = \text{gradient}$
instead of matrices. But $A^TCA$ is still present:

Continuous linear ODE:

$$-\frac{d}{dx}\left(c(x)\frac{du}{dx}\right) = F(x), \qquad A = \frac{d}{dx}.$$

Continuous linear partial differential equation:

$$-\operatorname{div}(c \operatorname{grad} u) = F(x,y).$$

Continuous nonlinear partial differential equation:

$$-\operatorname{div}\left(\frac{\operatorname{grad} u}{\sqrt{1 + |\operatorname{grad} u|^2}}\right) = 0.$$

When $A$ is $d/dx$, its transpose is $-d/dx$. The diagonal
matrix $C$, which multiplies every component of a vector
$Au$, is replaced by $c(x)$, which multiplies $du/dx$ at
every point.

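The partial differential equations with $A = \operatorname{grad}$ and
$A^T = -\operatorname{div}$ involve partial derivatives:

$$A = \text{gradient} = \begin{bmatrix} \partial/\partial x \\ \partial/\partial y \end{bmatrix},
\qquad
A^T = -\text{divergence} = -\begin{bmatrix} \dfrac{\partial}{\partial x} & \dfrac{\partial}{\partial y} \end{bmatrix}.$$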

Between $A^T$ and $A$ comes $C$, often linear but not always.
We will write $e$ for $Au$:

the three-step framework,

$$e = Au, \qquad w = Ce, \qquad F = A^Tw,$$

$$\text{potential } u(x,y) \xrightarrow{\;A\;} \text{gradient } e(x,y)
\xrightarrow{\;C\;} \text{flow } w(x,y) \xrightarrow{\;A^T\;} \text{source } F(x,y).$$

For flow in a network, $A$ and $A^T$ express Kirchhoff’s
voltage and current laws and $w = Ce$ is Ohm’s law:

$$\text{voltages } u \rightarrow \text{voltage drops } e
\rightarrow \text{currents } w \rightarrow \text{sources } F.$$

$A$, $C$, and $A^T$ are matrices or differential operators.
They combine into a positive-definite stiffness matrix
$K = A^TCA$. Positive-definite matrices are associated
with minimum principles for the total energy in the
system:

$$E(u) = \tfrac{1}{2} u^TKu - u^TF = \tfrac{1}{2} (Au)^TC(Au) - u^TF.$$

The minimum of $E(u)$ occurs where $Ku = F$. We have
recovered $A^TCAu = F$. When $C$ is a matrix multiplication,
the energy $E(u)$ is quadratic and its derivative is
linear. When $C$ is nonlinear, $CAu$ is really $C(Au)$. In this
case, complicated by nonlinearity, minimum principles
are often most natural.
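
As a concrete instance of the three-step framework, the continuous ODE $-\frac{d}{dx}(c\,\frac{du}{dx}) = F$ can be discretized by finite differences so that the stiffness matrix is literally $A^TCA$. The sketch below is one assumed discretization (uniform grid on $[0,1]$, zero boundary values, $c$ sampled at cell midpoints); the function name is hypothetical.

```python
import numpy as np

def stiffness_1d(c_mid, h):
    """K = A^T C A for -d/dx(c du/dx) on a uniform grid with u = 0 at both ends."""
    n = c_mid.size - 1                      # number of interior nodes
    A = np.zeros((n + 1, n))                # difference matrix: one row per interval
    for i in range(n):
        A[i, i] = 1.0 / h                   # +u_i in the difference ending at node i
        A[i + 1, i] = -1.0 / h              # -u_i in the difference starting at node i
    C = np.diag(c_mid)                      # material property on each interval
    return A.T @ C @ A

n = 9
h = 1.0 / (n + 1)
x = h * np.arange(1, n + 1)                 # interior nodes
K = stiffness_1d(np.ones(n + 1), h)         # c(x) = 1
F = np.ones(n)                              # constant source
u = np.linalg.solve(K, F)                   # minimiser of E(u) = u^T K u / 2 - u^T F

def energy(v):
    return 0.5 * v @ K @ v - v @ F
```

For $c = 1$ and $F = 1$ the exact solution is $u(x) = x(1-x)/2$, and the second difference reproduces it at the nodes because the solution is quadratic. Any perturbation of `u` raises the energy, which is the minimum principle in action.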


4 Nonlinear Equations and 
Minimum Principles 


When the material properties are nonlinear, $w = Ce$
becomes $w = C(e)$. At any point on that curve, the
tangent gives the local linearization
$\Delta w \approx C'(e)\Delta e$. A linear to nonlinear example
is the step from Newton’s law,

$$F = ma,$$

to Einstein’s law. For relativity the mass $m$ is not
constant. The momentum is not a linear function:

$$\text{Newton's momentum } p = mv, \qquad
\text{Einstein’s momentum } p(v) = \frac{m_0 v}{\sqrt{1 - v^2/c^2}}.$$

Returning from $p$, $m$, $v$ to $w$, $c$, $e$, we want the
equation $w = C(e)$ to appear in the derivative of the energy.
The energy is, therefore, an integral:

$$\text{energy} = \int C(e)\,de, \qquad
\text{linear case } \int ce\,de = \tfrac{1}{2} ce^2.$$

The reverse direction uses the variable $w$. In the linear
case we simply divide: $e = w/c$. In the nonlinear case
$w = C(e)$ is monotonic, and $e = w/c$ changes to the
inverse function $e = C^{-1}(w)$. When we want the energy
as a function of $w$, we integrate as before:

$$\text{energy} = \int C^{-1}(w)\,dw, \qquad
\text{linear case } \int \frac{w}{c}\,dw = \frac{1}{2}\frac{w^2}{c}.$$

These two functions $F^*(e)$ and $F(w)$, the integrals of
inverse functions $C(e)$ and $C^{-1}(w)$, are related by the
Legendre-Fenchel transform.
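
The relationship between the two energies can be checked numerically for a specific monotonic law. The sketch below assumes the illustrative choice $C(e) = \sinh e$ (so $C^{-1}(w) = \operatorname{arcsinh} w$), which does not come from the text, and uses a hand-written trapezoidal rule; the function names are hypothetical.

```python
import numpy as np

def trapezoid(y, x):
    """Composite trapezoidal rule on the sample points x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def energy(e, n=20001):
    """Integral of C(s) = sinh(s) from 0 to e (assumed nonlinear law)."""
    s = np.linspace(0.0, e, n)
    return trapezoid(np.sinh(s), s)

def complementary_energy(w, n=20001):
    """Integral of the inverse law arcsinh(s) from 0 to w."""
    s = np.linspace(0.0, w, n)
    return trapezoid(np.arcsinh(s), s)

e = 1.2
w = np.sinh(e)                               # the matching flow w = C(e)
gap_matched = energy(e) + complementary_energy(w) - e * w
gap_other = energy(e) + complementary_energy(2.5) - e * 2.5
```

When $w = C(e)$ the two energies add up to $ew$ (the areas under the curve and beside it fill the rectangle), so `gap_matched` is zero up to quadrature error. For any other $w$ the sum exceeds $ew$, which is the inequality behind the Legendre-Fenchel transform.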

5 The Graph Laplacian Matrices:
$A^TA$ and $A^TCA$

If we plan to offer one example in greater detail, the 
graph Laplacian must be our choice. Graphs are the 
dominant model in discrete applied mathematics. 

Start with $n$ nodes. Connect them with $m$ undirected
edges (a complete graph would have all $\tfrac{1}{2}n(n - 1)$
edges; a spanning tree would have $n - 1$ edges and no
loops). The figure below shows $n = 4$ and $m = 5$. We
may imagine that each edge leaves from its lower numbered
node, indicated by $-1$ in the edge-node incidence
matrix $A$:

$$A = \begin{bmatrix}
-1 & 1 & 0 & 0 \\
-1 & 0 & 1 & 0 \\
0 & -1 & 1 & 0 \\
-1 & 0 & 0 & 1 \\
0 & -1 & 0 & 1
\end{bmatrix}
\quad
\begin{matrix} \text{edge } 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{matrix}$$


The $n$ columns are not independent. $A$ is a “difference
matrix.” All differences between pairs of $u$s are zero
when the voltages $u_1, u_2, u_3, u_4$ are equal:

$$Au = A \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ u_4 \end{bmatrix}
= \begin{bmatrix} u_2 - u_1 \\ u_3 - u_1 \\ u_3 - u_2 \\ u_4 - u_1 \\ u_4 - u_2 \end{bmatrix},
\qquad Au = 0 \text{ for } u = (c, c, c, c).$$

With the vector $(1, 1, 1, 1)$ in the nullspace, the sum of
the four columns is the zero column. The rank of $A$
is 3. For every connected graph the rank is $n - 1$, and
the one-dimensional nullspace contains the constant
vectors $u_c = (c, \ldots, c)$.

Since these vectors have $Au_c = 0$, they also have
$A^TCAu_c = 0$. The Laplacian $A^TA$ and the weighted
Laplacian matrix $A^TCA$ will be positive semidefinite but
not positive-definite. Each row will sum to zero. When
we fix one voltage at zero and remove it from the set
of $n$ unknown $u$s, this removes a column of $A$ and row
of $A^T$ to leave reduced Laplacians of size $n - 1$. Then
$A^TA$ and $A^TCA$ are positive-definite.

For networks, $C$ is an $m \times m$ diagonal matrix that
comes from Ohm's law $w_i = c_i e_i$. The numbers
$c_1, \ldots, c_m$ on the diagonal are positive (so $C$ is
positive-definite, which is the property we need).

Matrix multiplication produces A T A (and then we 
identify the pattern): 


3 

-1 

-1 

-1 

-1 

3 

-1 

-1 

-1 

-1 

2 

0 

-1 

-1 

0 

2 


The diagonal entries are the degrees of the nodes: the
number of adjacent edges. The off-diagonal entry in the
$j, k$ position is $-1$ if an edge connects nodes $j$ and $k$.
The zero entries in the 3, 4 and 4, 3 positions reflect
the nonexistence of an edge between those nodes.

It is useful to separate the unweighted Laplacian $A^{\mathrm T}A$
into $D - W$. The diagonal matrix $D = \mathrm{diag}(3, 3, 2, 2)$
shows the degrees of the nodes. The off-diagonal
matrix $W$ is the adjacency matrix, with $W_{jk} = 1$ when
nodes $j$ and $k$ are connected.
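
The construction above can be reproduced in a few lines. This is a sketch in plain Python, assuming the edge ordering of the example (edge 1 joins nodes 1 and 2, edge 2 joins 1 and 3, edge 3 joins 2 and 3, edge 4 joins 1 and 4, edge 5 joins 2 and 4); the function names are illustrative only.

```python
# Each edge leaves its lower-numbered node (entry -1) and enters the other (+1).
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4)]
n = 4

def incidence(edges, n):
    """Edge-node incidence matrix A, one row per edge."""
    A = [[0] * n for _ in edges]
    for row, (j, k) in zip(A, edges):
        row[j - 1] = -1
        row[k - 1] = 1
    return A

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(P, Q):
    """Ordinary rows-times-columns matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

A = incidence(edges, n)
L = matmul(transpose(A), A)                  # unweighted Laplacian A^T A
D = [L[i][i] for i in range(n)]              # node degrees
W = [[-L[i][j] if i != j else 0 for j in range(n)]
     for i in range(n)]                      # adjacency matrix, so L = D - W

# A applied to a constant vector gives the zero vector of differences.
Au = [sum(a * 1 for a in row) for row in A]

print(L)
print(D)
```

The degrees read off the diagonal are 3, 3, 2, 2, and the zero in `W` at positions (3, 4) and (4, 3) records the missing edge between nodes 3 and 4.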

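Finally, we compute $A^{\mathrm T}CA$. Why not multiply matrices
as columns times rows instead of always rows times
columns? Column 1 of $A^{\mathrm T}$ multiplies row 1 of $CA$ (which
is just $c_1$ times row 1 of $A$). So the first piece of $A^{\mathrm T}CA$
comes entirely from edge 1 in the network, connecting
nodes 1 and 2:

\[
\begin{bmatrix} -1 & 1 & 0 & 0 \end{bmatrix}^{\mathrm T}
c_1
\begin{bmatrix} -1 & 1 & 0 & 0 \end{bmatrix}
= \begin{bmatrix}
c_1 & -c_1 & 0 & 0 \\
-c_1 & c_1 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{bmatrix}.
\]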

This is the element stiffness matrix for edge 1. Each 
edge i will produce such a matrix, full size but with only 
four nonzeros. For the edge connecting nodes j and k, 
the only nonzeros will be in those rows and columns. 

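The product $A^{\mathrm T}CA$ then assembles (adds) these five
element matrices containing $c_1, \dots, c_5$. They overlap
only on the main diagonal, to produce $D - W$:

\[
A^{\mathrm T}CA =
\begin{bmatrix}
c_1 + c_2 + c_4 & & & \\
& c_1 + c_3 + c_5 & & \\
& & c_2 + c_3 & \\
& & & c_4 + c_5
\end{bmatrix}
-
\begin{bmatrix}
0 & c_1 & c_2 & c_4 \\
c_1 & 0 & c_3 & c_5 \\
c_2 & c_3 & 0 & 0 \\
c_4 & c_5 & 0 & 0
\end{bmatrix}.
\]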

The weighted adjacency matrix $W$ contains the $c_i$
(previously all 1s). The weighted degree matrix $D$ shows
the edges touching each node. More simply, the diagonal
$D$ ensures that every row of $A^{\mathrm T}CA$ adds to zero.
The smallest eigenvalue is then $\lambda_1 = 0$.
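
The columns-times-rows assembly can be sketched as follows, in plain Python. The conductances `c` are hypothetical values chosen only for illustration, and the function name `weighted_laplacian` is not from the text; each edge contributes its 2 by 2 pattern of four nonzeros to the rows and columns of its two end nodes.

```python
edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4)]
c = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical positive weights c_1..c_5

def weighted_laplacian(edges, c, n):
    """Assemble A^T C A by adding one element matrix per edge."""
    K = [[0.0] * n for _ in range(n)]
    for (j, k), ci in zip(edges, c):
        j, k = j - 1, k - 1
        K[j][j] += ci      # diagonal entries accumulate into D
        K[k][k] += ci
        K[j][k] -= ci      # off-diagonal entries give -W
        K[k][j] -= ci
    return K

K = weighted_laplacian(edges, c, 4)
row_sums = [sum(row) for row in K]

print(K)
print(row_sums)
```

With these weights the diagonal is (c1 + c2 + c4, c1 + c3 + c5, c2 + c3, c4 + c5) = (7, 9, 5, 9), and every row sums to zero, so the constant vector is an eigenvector with eigenvalue zero.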

The next eigenvalue, $\lambda_2$, plays an important role
in applications. Its eigenvector $x_2$ will be orthogonal
(since $A^{\mathrm T}CA$ is symmetric) to the first eigenvector
$x_1 = (1, 1, 1, 1)$. Separating positive and negative
components of this Fiedler eigenvector $x_2$ is a useful
way to cluster the nodes. More heavily weighted
edges lie within the two clusters than between them.
When a graph comes to us with no clear structure or
interpretation, clustering is an important step toward
understanding.
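
As an illustration of clustering by the sign pattern of the Fiedler vector, here is a sketch in plain Python. The weights are hypothetical (edges 1-3 and 2-4 heavy, the others light), and the eigenvector is found by a shifted power iteration with the constant vector projected out, which is one simple technique among many and is an assumption of this sketch rather than a method prescribed by the text.

```python
import math

edges = [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4)]
# Hypothetical weights: edges 1-3 and 2-4 are heavy, the rest are light.
c = [0.1, 5.0, 0.1, 0.1, 5.0]

def weighted_laplacian(edges, c, n):
    K = [[0.0] * n for _ in range(n)]
    for (j, k), ci in zip(edges, c):
        j, k = j - 1, k - 1
        K[j][j] += ci
        K[k][k] += ci
        K[j][k] -= ci
        K[k][j] -= ci
    return K

def fiedler_vector(K, iterations=200):
    """Eigenvector for the second-smallest eigenvalue of a Laplacian K.

    Power iteration on (shift*I - K), restricted to vectors orthogonal
    to (1,...,1). The shift 2*max(diagonal) bounds the largest eigenvalue
    (Gershgorin), so the dominant remaining mode is the Fiedler vector.
    """
    n = len(K)
    shift = 2.0 * max(K[i][i] for i in range(n))
    x = [float(i) for i in range(n)]          # arbitrary non-constant start
    for _ in range(iterations):
        mean = sum(x) / n
        x = [xi - mean for xi in x]           # remove the constant component
        y = [shift * x[i] - sum(K[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(yi * yi for yi in y))
        x = [yi / norm for yi in y]
    return x

K = weighted_laplacian(edges, c, 4)
x2 = fiedler_vector(K)
cluster_a = {i + 1 for i, xi in enumerate(x2) if xi > 0}
cluster_b = {i + 1 for i, xi in enumerate(x2) if xi <= 0}
print(sorted(cluster_a), sorted(cluster_b))
```

For these weights the two clusters are {1, 3} and {2, 4}: the heavy edges lie inside the clusters and only light edges cross between them. Which cluster receives the positive sign depends on the starting vector.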

So we end with modern applications of the same
graph Laplacian that is at the heart of network theory
and mass-spring systems. The fundamental constructions
of electronics and mechanics (the laws of Ohm
and Kirchhoff and Hooke) produce the saddle-point
matrix $M$ in equation (1) before elimination removes
the $w$'s (currents and stresses). From that elimination
the graph Laplacian matrix appears!

$A^{\mathrm T}CA$ expresses a structure that is fundamental for
all graphs and networks: in engineering, in statistics
(where $C^{-1}$ is the covariance matrix), in science, and in
mathematics.

V.13 Granular Flows

Joe D. Goddard 


1 Introduction 

Granular materials represent a major object of human 
activities: as measured in tons, the first material 
manipulated on Earth is water; the second is granular 
matter.... This may show up in very different forms: 
rice, corn, powders for construction.... In our suppos- 
edly modern age, we are extraordinarily clumsy with 
granular systems. 

P. G. de Gennes, “From Rice to Snow,” 
Nishina Memorial Lecture 2008 

This quote by a French Nobel laureate is familiar to
many in the field of granular mechanics and reflects a
long-standing scientific fascination and practical interest.
This state of affairs is acknowledged by many others
and is summarized in the articles listed under
further reading at the end of the article.

As a general definition, we understand by granu- 
lar medium a particle assembly dominated by pair- 
wise nearest-neighbor interactions and usually limited 
to particles larger than 1 micrometer in diameter, for 
which the direct mechanical effects of van der Waals 
and ordinary thermal (“Brownian”) forces are negligi- 
ble. This includes a large class of materials, such as 
cereal grains, pharmaceutical tablets and capsules, geo- 
materials such as sand, and the masses of rock and ice 
in planetary rings. 

Most of this article is concerned either with dry gran- 
ular materials in which there are negligible effects of air 
or other gases in the interstitial space, or with granular 
materials that are completely saturated by an intersti- 
tial liquid. In this case, liquid surface tension at “cap- 
illary necks” between grains, as in the wet sand of 
sand castles, or related forms of cohesion are largely 
negligible. 

There are several important and interrelated aspects 
of granular mechanics: 

(i) experiment and industrial or geotechnical applica- 
tions; 

(ii) analytical and computational micromechanics (dy- 
namics at the grain level); 

(iii) homogenization (“upscaling” or “coarse graining”)
to obtain smoothed continuum models from the
Newtonian mechanics of discrete grains;

(iv) mathematical classification and solution of the
continuum field equations for granular flow; and

(v) development of continuum models (“constitutive 
equations”) for stress-deformation behavior. 

This article focuses on item (v), since the study of 
most of the preceding items either depends on it 
or is strongly motivated by it. The following discus- 
sion of (iv) sets the stage for the subsequent coverage 
of (v). 

2 Field Equations and Constitutive Equations 

By field equations we mean the partial differential equations
(PDEs) (with time $t$ and spatial position $x = [x_i]$
as independent variables) that represent the continuum-level
mass and linear momentum balances governing
the density $\rho(x, t)$, the velocity $v(x, t) = [v_i]$,
and the symmetric stress tensor $T(x, t) = [T_{ij}]$ for any
material:

\[
\dot\rho = -\rho\,\nabla \cdot v
\quad\text{and}\quad
\rho\dot v = \nabla \cdot T. \tag{1}
\]

The indices refer to components in Cartesian coordinates,
and the notation “$= [\,\cdot\,]$” indicates components
of vectors and tensors in those coordinates. Written
out, (1) becomes $\partial_t\rho + \partial_j(\rho v_j) = 0$ and
$\partial_t(\rho v_i) = \partial_j(T_{ij} - \rho v_i v_j)$ for $i = 1, 2, 3$,
where $\dot w = \partial_t w + v_j\partial_j w$
for any quantity $w$, sums from $j = 1$ to 3 are taken over
terms with repeated indices, $\partial_i w = \partial w/\partial x_i$,
$\partial_t w = \partial w/\partial t$, and the product rule has been used. For more
information on (1), see continuum mechanics [IV.26].

There are ten dependent variables ($\rho$, $v_i$, and
$T_{ij} = T_{ji}$) but only four equations in (1). In order to “close”
them (to make them soluble), we need six more equations.
For example, the closure representing the constitutive
equations for an incompressible Newtonian fluid
(see NAVIER-STOKES EQUATIONS [III.23]) is

\[
T' = 2\eta D' \quad\text{and}\quad \rho = \text{const.}, \tag{2}
\]

where $\eta$ is a coefficient of shear viscosity and $D$ denotes
the rate of deformation, also a symmetric second-rank
tensor, $D = \operatorname{sym}\nabla v = \tfrac12[\partial_i v_j + \partial_j v_i]$,
where “sym” denotes the symmetric part of a second-rank tensor.
The prime denotes the deviator (“traceless” or “shearing”
part) $X' = X - \tfrac13(\operatorname{tr}X)I$ or
$X'_{ij} = X_{ij} - \tfrac13 X_{kk}\delta_{ij}$,
where $\delta_{ij}$ are the components of the identity $I$.
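
The closure (2) is simple enough to evaluate by hand, but a short sketch makes the roles of “sym” and the deviator concrete. The code below is plain Python; the function names and the numerical values of the viscosity and shear rate are assumptions made for illustration. The velocity gradient is stored as the array of components d_i v_j, matching the index convention used for D above.

```python
def sym(G):
    """Symmetric part of a 3x3 second-rank tensor."""
    return [[0.5 * (G[i][j] + G[j][i]) for j in range(3)] for i in range(3)]

def deviator(X):
    """Traceless part X' = X - (1/3) tr(X) I."""
    mean = (X[0][0] + X[1][1] + X[2][2]) / 3.0
    return [[X[i][j] - (mean if i == j else 0.0) for j in range(3)]
            for i in range(3)]

def newtonian_deviatoric_stress(grad_v, eta):
    """T' = 2 * eta * D' with D = sym(grad v), as in closure (2)."""
    D_dev = deviator(sym(grad_v))
    return [[2.0 * eta * D_dev[i][j] for j in range(3)] for i in range(3)]

# Simple shear v = (gamma_dot * x_2, 0, 0): the only nonzero component
# of [d_i v_j] is d_2 v_1 = gamma_dot. Values are illustrative.
gamma_dot, eta = 3.0, 2.0
grad_v = [[0.0, 0.0, 0.0],
          [gamma_dot, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
T_dev = newtonian_deviatoric_stress(grad_v, eta)
print(T_dev[0][1])
```

In simple shear the only nonzero deviatoric stresses are the two equal shear components, each equal to the viscosity times the shear rate.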

For rheologically more complicated (“complex”) fluids,
such as viscoelastic liquids, the stress at a material
point may depend on the entire past history of
the velocity gradient $\nabla v$ at that point. Viscoelasticity is
exemplified by the simplest form of the Maxwell fluid,
giving the rate of change of $T'$ as a linear function of
$D'$ and $T'$:

\[
\mathring{T}' = 2\mu D' - \lambda T' = \lambda(2\eta D' - T'), \tag{3}
\]

where $\mathring{X} = \dot X + \operatorname{sym}(XW)$, $XW = [X_{ik}W_{kj}]$, $W$ is the skew
part of the velocity gradient, $W = \tfrac12[\partial_i v_j - \partial_j v_i]$, $\lambda$ is
the relaxation rate or inverse of the relaxation time,
and $\mu$ is the elastic modulus. In this model, $\mu$ can be
identified with the elastic shear modulus $G$, which is
discussed below.

According to (3), which was proposed in a slightly different
form by James Clerk Maxwell in 1867, the coefficient
of viscosity and the elastic modulus are connected
by $\eta = \mu/\lambda$, and a material responds elastically on a
timescale $\lambda^{-1}$ with viscosity and viscous dissipation
arising from relaxing elastic stress.

One obtains from (3) elastic-solid or viscous-fluid
behavior in the respective limits $\lambda \to 0$ or $\lambda \to \infty$. This
is illustrated by the quintessential viscoelastic material
“silly putty,” which bounces elastically but undergoes
viscous flow in slow deformations. Rheologists
employ a Deborah number of the form $\dot\gamma/\lambda$ to distinguish
between rapid solid-like deformations and slow
fluid-like deformations.
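
The two limits can be seen in a scalar caricature of the Maxwell model. The sketch below, in plain Python, drops the rotational (Jaumann) terms and considers a single shear-stress component driven by a constant deformation rate D switched on at time zero, so that the stress obeys d(tau)/dt = lambda (2 eta D - tau); this reduction, the function name, and the numbers used are assumptions made for illustration, not part of the model as stated.

```python
import math

def maxwell_shear_stress(t, D, eta, lam):
    """Exact solution of d(tau)/dt = lam * (2*eta*D - tau), tau(0) = 0.

    Written with expm1 so that small values of lam*t are handled accurately.
    """
    return -2.0 * eta * D * math.expm1(-lam * t)

eta, lam, D = 10.0, 2.0, 0.5      # illustrative values
mu = eta * lam                    # elastic modulus, from eta = mu / lam

# Short times (lam*t << 1): stress grows in proportion to strain D*t.
tau_short = maxwell_shear_stress(1e-4, D, eta, lam)
elastic_estimate = 2.0 * mu * D * 1e-4

# Long times (lam*t >> 1): stress saturates at the viscous value.
tau_long = maxwell_shear_stress(50.0, D, eta, lam)
viscous_estimate = 2.0 * eta * D

print(tau_short, elastic_estimate)
print(tau_long, viscous_estimate)
```

On timescales short compared with the relaxation time the response is that of an elastic solid with modulus mu; on long timescales it is that of a viscous fluid with viscosity eta = mu / lambda.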

The superposed “°” in (3) denotes a Jaumann deriva- 
tive, which gives the time rate of change in a frame 
translating with a material point and rotating with 
the local material spin. The Jaumann derivative is the 
simplest objective time derivative that embodies the 
principle of material frame indifference. This principle, 
which is not a fundamental law of mechanics, is tan- 
tamount to the assumption that arbitrary rigid-body 
rotations superimposed on a given motion of a mate- 
rial merely rotate stresses in the same way. Stated 
more generally, accelerations of a body relative to an 
inertial (“Newtonian”) frame do not affect the stress- 
deformation behavior. The principle is already reflected 
in the simpler constitutive equation (2), where only the
symmetric part of $\nabla v$ is allowed.

While frame-indifference is a safe assumption for 
most molecular materials, which are dominated at the 
molecular level by random thermal motion, it could 
conceivably break down for granular flows, where ran- 
dom granular motions are comparable to those asso- 
ciated with macroscopic shearing. Nevertheless, the 
assumption will be adopted in the constitutive models 
considered here. 

Note that, for a fixed material particle $x^\circ$, (3) represents
a set of five simultaneous ordinary differential
equations, which we call Lagrangian ordinary differential
equations (LODEs). In the Eulerian or spatial
description, they must be combined with (1) to provide
nine equations in the nine dependent variables,
$p = -\tfrac13\operatorname{tr}T$, $v$, and $T'$, governing the velocity field of
the Maxwell fluid. Solving the problem, subject to various
initial and boundary conditions, is no easy task and
usually requires advanced numerical methods.

We identify below a broad class of plausible consti- 
tutive equations, which basically amount to general- 
izations of (3). They include the so-called hypoplastic 
models for granular plasticity. It turns out that all of 
these can be obtained from elastic and inelastic poten- 
tials, which are familiar in classical theories of elasto- 
plasticity. The above models can be enlarged to include 
models of visco-elastoplasticity that are broadly applicable
to all the prominent regimes of granular flow. After
a brief review of phenomenology and flow regimes
in the next section, a systematic outline of
the genesis of such models will be presented.

This article does not deal with the rather large body 
of literature on numerical methods, neither the direct 
simulation of micromechanics by the distinct element 
method (DEM) nor solution of the continuum field equa- 
tions based on finite-element methods (FEMs) or related 
techniques. 

3 Phenomenological Aspects 

We consider here the important physical parameters 
and dimensionless groups that characterize the various 
regimes of granular mechanics and flows. The review 
article of Forterre and Pouliquen that appears in the 
further reading section at the end of this article pro- 
vides more quantitative comparisons for fairly simple 
shearing flows. 

3.1 Key Parameters and Dimensionless Groups 

Apart from various dimensionless parameters describing
grain shape, the most prominent physical parameters
for noncohesive granular media are grain elastic
(shear) modulus $G_s$, intrinsic grain density $\rho_s$,
representative grain diameter $d$, intergranular (Coulomb)
contact-friction coefficient $\mu_s$ or macroscopic counterpart
$\mu_C$, and confining pressure $p_s$. These parameters
define the key dimensionless groups that serve
to delineate various regimes of granular flow: namely,
an elasticity number, an inertia number, and a viscosity
number, given, respectively, by

\[
E = G_s/p_s, \qquad
I = \dot\gamma\, d\sqrt{\rho_s/p_s}, \qquad
H = \eta_s\dot\gamma/p_s,
\]

and involving $\dot\gamma$, a representative value of the shear rate
$|D'|$. We will also refer briefly below to a Knudsen number
based on the ratio of microscopic to macroscopic
length scales.

Note that $I$ is the analogue of the Deborah number
mentioned above, but it now involves a relaxation rate
that represents the competition between grain inertia
and Coulomb friction, $\mu_C p_s$. The quantity $I^2$ represents
the ratio of representative granular kinetic energy
$\rho_s d^2\dot\gamma^2$ to frictional confinement.

The transition from a fluid-saturated granular medium
to a dense fluid-particle suspension occurs when
the viscosity number $H \approx 1$, where viscous and frictional
contact forces are comparable.

In the case of fluid-particle suspensions, it is customary
to identify $I^2/H$ as the “Stokes number” or the
“Bagnold number” (after a pioneer in granular flow),
representing the ratio of grain-inertial forces to
viscous forces.
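
A short calculation gives a feeling for the magnitudes involved. The sketch below, in plain Python, evaluates the elasticity, inertia, and viscosity numbers and the Stokes number, reading their definitions as E = G_s / p_s, I = (shear rate) d sqrt(rho_s / p_s), and H = (viscosity)(shear rate) / p_s. The function name, the argument names, and all numerical values (millimeter-sized stiff grains sheared in a water-like fluid) are assumptions chosen for illustration.

```python
import math

def dimensionless_groups(G_s, rho_s, d, p_s, gamma_dot, eta):
    """Return (E, I, H, Stokes) for the given physical parameters (SI units)."""
    E = G_s / p_s                              # elasticity number
    I = gamma_dot * d * math.sqrt(rho_s / p_s) # inertia number
    H = eta * gamma_dot / p_s                  # viscosity number
    stokes = I ** 2 / H                        # grain inertia vs. viscous forces
    return E, I, H, stokes

# Illustrative values only.
E, I, H, stokes = dimensionless_groups(
    G_s=100e9,       # grain shear modulus, Pa
    rho_s=2500.0,    # grain density, kg/m^3
    d=1e-3,          # grain diameter, m
    p_s=100e3,       # confining pressure, Pa
    gamma_dot=10.0,  # shear rate, 1/s
    eta=1e-3,        # fluid viscosity, Pa s
)
print(E, I, H, stokes)
```

For these values I is of order 0.001 and H of order 1e-7, so both inertial and viscous stresses are tiny compared with the confining pressure, while the Stokes number of 25 indicates that grain inertia outweighs viscous forces. Note that the Stokes number reduces to rho_s d^2 (shear rate) / (viscosity), independent of p_s.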

In the following sections we focus our attention on 
dry granular media, including only a very brief mention 
of fluid-particle systems. 

3.2 Granular Flow Regimes 

Although granular materials are devoid of intrinsic 
thermal motion at the grain level, they nevertheless 
exhibit states that resemble the solid, liquid, and 
gaseous states of molecular systems. All these granular 
states may coexist in the same flow field as the analogs 
of “multiphase flow.” There are several open questions 
as to the proper matching of solid-like immobile states 
with the rapidly sheared states, but there is not enough 
space in this brief article to address them here. 

With $\tau$ denoting a representative shear stress and $\gamma$
a representative shear strain relative to a rest state (in
which $\tau/p_s = 0$ and $I = 0$), the various flow regimes are
shown in the qualitative and highly simplified sketch
in figure 1 and in table 1. In the figure, the dimensionless
ratio $\tau/p_s$ is represented as a function of a single
dimensionless variable representing an interpolating
form:

\[
X \sim \mu_C E\gamma/(E\gamma + \mu_C) + I^2. \tag{4}
\]

(A compound representation, closer to the constitutive
models considered below and illustrated in figure 5,
is $\tau/p_s = \mu_C + I^2$, with $\gamma_E = (\tau/p_s)/E$, of which
the first represents a Coulomb-Bagnold interpolation
found in certain constitutive models and where $\gamma_E$ is
elastic deformation at any stress state.)

Figure 1 fails to capture the strong nonlinearity and
history dependence of granular plasticity in regime Ib:
a matter that is addressed by the constitutive models
discussed below.

Table 1 The granular-flow regimes shown in figure 1
(the last column gives the scaling of the stress $\tau$).

Regime   Description                                       Stress scaling
I        Quasi-static: (Hertz-Coulomb) elastoplastic “solid”
Ia       (Hertz) elastic                                   $G_s\gamma$
Ib       (Coulomb) elastoplastic                           $\mu_C p_s$
II       Dense-rapid: viscoplastic “liquid”                $p_s f(I)$
III      Rarified-rapid: (Bagnold) viscous “granular gas”  $\rho_s d^2\dot\gamma^2$

Like the liquid states of molecular systems, dense 
rapid flow, represented as a general dimensionless 
function $f(I)$ in table 1, may be the most poorly understood
regime of granular mechanics. It involves important
phenomena such as the fascinating granular size
segregation. There is some evidence that this regime 
may involve an additional dependence on E for soft 
granular materials. 

3.2.1 The Elastic Regime 

The geometry of contact between quasispherical linear
(Hookean) elastic particles should lead to nonlinear
elasticity of a granular mass at low confining pressures
$p_s$ and to an interesting scaling of elastic moduli and
elastic wave speeds with pressure.

Based upon Hertzian contact mechanics, a rough
“mean-field” estimate of the continuum shear modulus
$G$ in terms of grain shear modulus $G_s$ is given
by $G/G_s \sim E^{-1/3}$, which indicates a $\tfrac13$-power dependence
on pressure. For example, with $p_s \sim 100$ kPa and
$G_s \sim 100$ GPa (a rather stiff geomaterial), one finds
$E \sim 10^6$ and $G \sim 10^{-2}G_s$, amounting to a huge reduction
of global stiffness due to relatively soft Hertzian
contact.

In principle, we should replace $G_s$ by $G$ in table 1 and
(4), and $E = G_s/p_s$ by $G/p_s = E^{2/3}$. In so doing, we
obtain an estimate of the limiting elastic strain for the
onset of Coulomb slip (with $\mu_C \sim 1$) to be
$\gamma_E \sim E^{-2/3} \sim 10^{-4}$, given the above numerical value. Although crude,
this provides a reasonable estimate of the small elastic
range of stiff geomaterials such as sand.
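
The numerical estimates of this subsection follow from two power laws, and the arithmetic can be sketched as below in plain Python. The function name is illustrative, the inputs are the values quoted above (confining pressure 100 kPa, grain shear modulus 100 GPa, macroscopic friction coefficient of order one), and the limiting strain is taken as the strain at which the elastic stress G times strain reaches the Coulomb stress, which is how the estimate is read here.

```python
import math

def hertz_estimates(G_s, p_s, mu_C=1.0):
    """Mean-field Hertzian estimates for a confined granular mass.

    Returns (E, G/G_s, gamma_E), where E = G_s/p_s, G/G_s = E**(-1/3),
    and gamma_E = mu_C * p_s / G is the strain at the onset of Coulomb slip.
    """
    E = G_s / p_s
    stiffness_ratio = E ** (-1.0 / 3.0)
    G = stiffness_ratio * G_s
    gamma_E = mu_C * p_s / G        # equals mu_C * E**(-2/3)
    return E, stiffness_ratio, gamma_E

E, stiffness_ratio, gamma_E = hertz_estimates(G_s=100e9, p_s=100e3)
print(E, stiffness_ratio, gamma_E)
```

The results are E of order 10^6, a hundredfold reduction of the bulk shear modulus relative to the grain modulus, and an elastic strain range of order 10^-4.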

3.2.2 The Elastoplastic Regime 

Because of its venerable history, dating back to the clas- 
sical works of Coulomb, Rankine, and others in the 
eighteenth and nineteenth centuries, and its enduring 
relevance to geomechanics, the field of elastoplastic- 
ity is the most thoroughly studied area of granular 
mechanics. Here we touch on a few salient phenom- 
ena, and the theoretical issues surrounding them, by 
way of background for the discussion of constitutive 
modeling that follows. 

Figure 2 shows the results of Wolfgang Ehlers’s two- 
dimensional DEM simulations of a quasistatic hop- 
per discharge and a biaxial compression test, both of 
which illustrate a well-known localization of deforma- 
tion into shear bands. A hallmark of granular plastic- 
ity, this localized slip (or “failure”) may be implicated 
in dynamic “arching,” with large transient stresses on 
bounding surfaces such as hopper walls or structural 
retaining walls. Similar phenomena are implicated in 
large-scale landslides. 

Figure 3 shows the development of shear bands in a 
standard experimental quasistatic compression test on 
a sand column surrounded by a thin elastic membrane. 
The literature abounds with many interesting experi- 
mental observations and numerical simulations (see, 
for example, the book by Tejchman that appears in the 
further reading below). 

The occurrence of shear bands can be viewed mathe- 
matically as material bifurcation and instability arising 
from loss of convexity in the underlying constitutive 
equations, accompanied by a change of type in the PDEs 
involved in the field equations. We recall that similar 
changes of type, e.g., from elliptic PDEs to hyperbolic 
PDEs, are associated with phenomena such as the gas 
dynamic transition from subsonic to supersonic flows 
with formation of thin “shocks.” 

To help understand the elastoplastic instability, figure 4
presents a qualitative sketch of the typical stress-strain/dilatancy
behavior in the axial compression of
dense and loose sands, with $\sigma$ denoting compressive
stress, $\varepsilon$ compressive strain, and $\varepsilon_V = \log(V/V_0)$
volumetric strain (where a volume $V_0$ has been deformed
into $V$). While no numerical scales are shown on the
axes, the peak stress and the change of $\varepsilon_V$ from negative
(contraction) to positive (dilation) typically occur at
strains of a few percent ($\varepsilon \sim 0.01$-$0.05$) for dense
sands.

Figure 2 DEM simulations: (a) shear bands in slow hopper
flow and (b) shear bands in axial compression. (Courtesy of
Wolfgang Ehlers, University of Stuttgart.)

Figure 3 Experimental axial compression of a dry Hostun
sand specimen: (a) before compression and (b) after
compression. (Courtesy of Wolfgang Ehlers.)

Although it is tempting to regard the initial growth of
$\sigma$ with $\varepsilon$ as elastic in nature, the elastic regime is
represented by much smaller strains (corresponding to the
estimates of order $10^{-4}$ given above), corresponding to
a nearly vertical unloading from any point on the $\sigma$-$\varepsilon$
curve. It is therefore much more plausible that the initial
stress growth represents an almost completely dissipative
plastic “hardening” associated with compaction
($\varepsilon_V < 0$) accompanied by growth of contact number
density $n_c$ and contact anisotropy, whereas the maximum
in stress can be attributed to the subsequent
decrease in $n_c$ that accompanies dilation ($\varepsilon_V > 0$).

As will be discussed below, one can rationalize the
formation of shear bands as the result of “strain softening”
or unstable “wrong-way” behavior (a decrease
of incremental force with incremental displacement)
following the peak stress. According to certain analyses,
this can arise as a purely dissipative process associated
with the decrease of plastic stress, whereas others
hypothesize that it may be the result of a peculiar
quasielastic response.

Figure 4 Schematic of triaxial stress/dilatation-strain
curves for initially dense (solid curves) and loose (dashed
curves) sands.

Whatever the precise nature of the material instabil- 
ity, it generally calls for multipolar or “higher-gradient” 
constitutive models involving an intrinsic material 
length scale. Hence, to our list of important dimension- 
less parameters we must now add a Knudsen number 
K, as discussed below. Here, it suffices to say that a 
material length scale $\ell$ is necessary to regularize field
equations in order to avoid sharp discontinuities in 
strain rate by assigning finite thickness to zones of 
strain localization. Such a scale, clearly evident in the 
DEM simulations and experiments of figures 2 and 3, 
cannot be predicted with the nonpolar models that 
constitute the main subject of this article. 

3.2.3 Springs and Slide Blocks 

Setting aside length-scale effects, the simple mechani- 
cal model in figure 5 provides a useful intuitive view 
of the continuum model considered below. There, 
the applied force T' is the analogue of continuum- 
mechanical shear stress, and the rate of extension of 
the device represents deviatoric deformation rate D' . 

The serrated slide block at the top of figure 5 
is a modification of the standard flat plastic slide 
block, where sliding stress $\tau$ represents a pressure-independent
yield stress, or a frictional slide block
with $\tau/p_s = \mu_C$ representing the (Amontons-Coulomb)
coefficient of sliding friction. The serrated version rep- 
resents the granular “interlocking” model of Taylor 



Figure 5 Slide block/spring/dashpot 
analogue of visco-elastoplasticity. 


(1948) or the “sawtooth” model of Rowe (1962) based 
on the concept of granular dilatancy introduced in 
1885 by Osborne Reynolds (who is also renowned for 
his work in various fields of fluid mechanics). Accord- 
ing to Reynolds, the shearing resistance of a gran- 
ular medium is partly due to volumetric expansion 
against the confining pressure. Thus, if the sawtooth 
angle vanishes, then one obtains a standard model of 
plasticity, with resistance due solely to yield stress or 
pressure-dependent friction. 

At the point of sliding instability in the model, where 
the maximum volume expansion occurs, part of the 
stored volumetric energy must be dissipated by sub- 
sequent collapse and collisional impact. This may be 
assumed to occur on extremely short timescales, giving 
rise to an apparent sliding friction coefficient $\mu_C$ even if
there is no sliding friction between grains ($\mu_s = 0$). This
rapid energy dissipation is emblematic of tribological 
and plastic-flow processes, where, owing to topological 
roughness and instability in the small, stored energy is 
thermalized on negligibly small timescales giving rise 
to rate-independent forces in the large. 

As a second special feature of the model, the spring
with constant $\mu_P$ converts plastic deformation into
“frozen elastic energy” and also provides an elemen- 
tary model of plastic work hardening. This elastic ele- 
ment is intended to illustrate the fact that some of 
the stored elastic energy can never be entirely recov- 
ered solely by the mechanical action of T' , which leads 
to history effects and associated complications in the 
thermodynamic theories of plasticity. 

3.2.4 Viscoplasticity 

For rapid granular flow, the viscous dashpot with (Bagnold)
viscosity $\eta_B$ (see figure 5) adds a rate-dependent
force associated with granular kinetic energy and collisional
dissipation. This leads to a form of Bingham
plasticity that is discussed below and is described by 
certain rheological models, providing a rough interpo- 
lation between regimes I and III in figure 1. The viscous 
dashpot can also represent the effects of interstitial flu- 
ids. Note that removal of the plastic slide block gives 
the Maxwell model of viscoelasticity (3) whenever the 
spring and dashpot are linear. 

Apart from presenting constitutive models that en- 
compass those currently employed, we do not deal 
directly with the numerous issues and challenges in 
modeling the dense-rapid flow regime. 

4 Constitutive Models 

Inevitably, constitutive models for granular flow are 
complicated because of the wide range of phenomenol- 
ogy observed. We use a class of models that relate 
Cauchy stress T to deformation rate D, often referred 
to as the Eulerian description. In particular, we focus on 
a class of generalized hypoplastic models, based on a 
stress-space description that allows for nonlinear func- 
tions of stress. Although the approximation of linear 
elasticity is suitable for many granular materials, par- 
ticularly stiff geomaterials, the nonlinear theory may 
find application in the field of soft granular materials 
such as pharmaceutical capsules and clayey soils. 

4.1 Hypoplasticity

Let us start with an isotropic nonlinearly elastic material.
This is one for which the stress $T$ and the finite
(Eulerian) strain tensor $E = \tfrac12(FF^{\mathrm T} - I)$ are connected
by isotropic relations of the form

\[
T = \boldsymbol{\tau}(E) \quad\text{or}\quad E = \boldsymbol{\tau}^{-1}(T). \tag{5}
\]

Here, $F = \partial x/\partial x^\circ$, where $x^\circ$ and $x(x^\circ)$ denote the
reference position and current placement, respectively,
of material points. Hyperelasticity (or “Green elasticity”)
is based on the thermodynamic consistency condition
that the functions in (5) be derivable from
elastic potentials. Truesdell's hypoelasticity, a generalization
of elasticity that allows for a more general
rate-independent but path-dependent relation between
stress and strain, is represented by the LODE

\[
\mathring{T} = \mu_E(T) : D, \tag{6}
\]

where $\mu_E$ is the fourth-rank hypoelastic modulus. With
suitable integrability conditions on this modulus, it is
possible to show that (6) implies an elastic relation of
the form (5).


To obtain visco-elastoplasticity from (6), we adapt
the first fundamental postulate of incremental plasticity:
that the deformation rate can be decomposed
into elastic and inelastic (or “plastico-viscous”) parts,
$D = D_E + D_P$. Then replace $D$ in the relations above
by $D_E = D - D_P$ and provide a constitutive equation
for $D_P$.

As the second fundamental postulate, we assume
that the elastic stress $T$ conjugate to $D_E$ is identical
to the inelastic stress conjugate to $D_P$, which follows
from an assumption of internal equilibrium of the type
suggested by the simple model in figure 5.

Finally, as the third fundamental postulate, we as- 
sume that the system is strongly dissipative such that 
T and Dp are related by dual inelastic potentials. Vis- 
coelasticity follows and yields a nonlinear form of the 
Maxwell fluid (3). 

Plasticity, as defined by rate-independent stress, represents
a singular exception to the above: the magnitude
$|D_P|$ becomes indeterminate. The assumption
of overall independence of rate and history implies a
relation of the form

\[
|D_P| = |D|\,\vartheta(T, \hat D),
\]

where $\hat D = D/|D|$ and $|D| = \sqrt{D_{ij}D_{ij}}$, and, hence, a
generalized form of isotropic hypoplasticity,

\[
\mathring{T} = \mu_H(T, \hat D) : D, \tag{7}
\]

with a formula for $\mu_H$ in terms of the “inelastic clock”
function $\vartheta$. In the standard theory of hypoplasticity,
$\vartheta(T, \hat D)$ is independent of $\hat D$, there is no distinction
between elastic and plastic deformation rates, and the
constitutive equation reduces to (7) with hypoplastic
modulus taking the form

\[
\mu_H = \mu_E(T) - K(T) \otimes \hat D, \tag{8}
\]

where $A \otimes B = [A_{ij}B_{kl}]$. Such models are popular with
many in the granular mechanics community (particu- 
larly those interested in geomechanics), mainly because 
they do not rely directly on the concept of a yield 
surface or inelastic potentials (whence the qualifier 
“hypo”), although they do involve limit states where 
Jaumann stress rate vanishes asymptotically under the 
action of a constant deformation rate D. This asymp- 
totic state may be identified with the so-called critical 
state of soil mechanics, where granular dilatancy van- 
ishes and the granular material essentially flows like an 
incompressible liquid. 
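
The character of the hypoplastic form can be seen in a one-dimensional caricature. In the sketch below, written in plain Python, stress and deformation rate are scalars, the normalized rate reduces to the sign of D, and the rate equation becomes d(tau)/dt = mu_E D - K(tau) |D|. The particular choice K = mu_E tau / tau_c, the parameter values, and the explicit-Euler integration are all assumptions made for illustration; they are not taken from any specific model in the literature.

```python
import math

def shear_stress_history(D, total_strain, steps,
                         mu_E=100.0, tau_c=1.0, tau0=0.0):
    """Integrate d(tau)/dt = mu_E*D - K(tau)*|D| by explicit Euler.

    K(tau) = mu_E * tau / tau_c is a hypothetical choice that gives limit
    states at tau = +tau_c (D > 0) and tau = -tau_c (D < 0).
    """
    dt = total_strain / (abs(D) * steps)   # time needed to reach total_strain
    tau = tau0
    for _ in range(steps):
        K = mu_E * tau / tau_c
        tau += (mu_E * D - K * abs(D)) * dt
    return tau

# Same strain reached a million times faster: the stress is unchanged.
tau_slow = shear_stress_history(D=1e-3, total_strain=0.02, steps=2000)
tau_fast = shear_stress_history(D=1e3, total_strain=0.02, steps=2000)

# Large monotonic strain: the stress approaches the limit state tau_c.
tau_limit = shear_stress_history(D=1.0, total_strain=0.2, steps=2000)

# Reversed shearing from the limit state: approach to -tau_c.
tau_reversed = shear_stress_history(D=-1.0, total_strain=0.2, steps=2000,
                                    tau0=1.0)
print(tau_slow, tau_fast, tau_limit, tau_reversed)
```

Three features of hypoplasticity appear without any yield surface: the response depends on strain but not on the rate at which it is applied, the stress rate vanishes asymptotically at a limit state under continued shearing, and the incremental stiffness on reversal (here twice mu_E at the limit state) differs from that on continued loading.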

The problem with the general form (8) is that
it is exceedingly difficult to establish mathematical
restrictions that will guarantee its thermodynamic
admissibility, e.g., such that steady periodic cycles of
deformation do not give positive work output. By con- 
trast, models built up from elastic and dissipative 
potentials are much more likely to be satisfactory in 
that regard. 

The treatise by Kolymbas that appears in the further 
reading provides an excellent overview and history of 
hypoplastic modeling. 

4.2 Anisotropy, Internal Variables, and Parametric 
Hypoplasticity 

Initially isotropic granular masses exhibit flow-induced 
anisotropy, particularly in flow that is quasistatic 
elastoplastic, since the grain kinetic energy or “temper- 
ature” is insufficient to randomize granular microstruc- 
ture. This anisotropy is often modeled by an assumed 
dependence of elastic and inelastic potentials, and 
associated moduli, on a symmetric second-rank fabric 
tensor, $A = [A_{ij}]$, which is subject to a rate-independent
evolution equation of the form

\[
\mathring{A} = \alpha(T, \hat D, A) : D.
\]

This represents a special case of the “isotropic extension”
of anisotropic constitutive relations, where the
anisotropic dependence on stress T or strain E is 
achieved by the introduction of a set of “structural ten- 
sors,” consisting in the simplest case of a set of second- 
rank tensors and vectors. The structural tensors rep- 
resent in turn a special case of a set X of evolution- 
ary internal variables consisting of scalars, vectors, and 
second-order tensors. 

At present, the origins of the evolution equations are 
unclear, although there may be a possibility of obtain- 
ing them from a generalized balance of internal forces 
derived on elastic and inelastic potentials that depend 
on the internal variables or their rate of change. 

Whatever the origin of their evolution equations, it is 
easy to see that one can enlarge the set of dependent 
variables from $T$ to $\{T, X\}$ with the evolution equations
leading to parametric hypoplasticity defined by
a generalization of (7). This gives a set of LODEs that 
describe the effects of initial anisotropy, density, or vol- 
ume fraction, etc., represented as initial conditions, and 
their subsequent evolution under flow. This formula- 
tion includes most of the current nonpolar models of 
the evolutionary plasticity of granular media. 

4.3 Multipolar Effects 

In a system that is strongly inhomogeneous, such as 
the shear field associated with shear bands, we expect 


to encounter departures from the response of a clas- 
sical simple (nonpolar) material having no intrinsic 
length scale. The situation is generally characterized 
by nonnegligible magnitudes of a Knudsen number 
$K = \ell/L$, where $\ell$ is a characteristic microscale and $L$
is a characteristic macroscale. 

In the case of elastoplasticity, various empirical models
suggest taking $\ell$ to be about five to ten times the
median grain diameter: a scale that is most plausi- 
bly associated with the length of the ubiquitous force 
chains in static granular assemblies. These are chain- 
like structures in which the contact forces are larger 
than the mean force, and it is now generally accepted 
that they represent the microscopic force network that 
supports granular shear stress and rapidly reorganize 
under a change of directional loading. The microme- 
chanics determining the length of force chains is still 
poorly understood. 

Whatever the physical origins, a microscopic length 
scale serves as a parameter in various enhanced contin- 
uum models, including various micropolar and higher- 
gradient models, all of which represent a form of weak 
nonlocality referred to by the blanket term multipo- 
lar. Perhaps the simplest is the Cosserat model, often 
referred to as micropolar. This model owes its origins 
to a highly influential treatise on structured continua 
by the Cosserat brothers, Eugène and François, which
was celebrated internationally in 2009 on the occasion 
of the centenary of its original publication. Tejchman’s 
book provides a comprehensive summary of a fairly 
general form of Cosserat hypoplasticity. 

4.3.1 Particle Migration and Size Segregation 

Another fascinating aspect of granular mechanics is 
the shear-driven separation or “unmixing” of large par- 
ticles from an initially uniform mixture of large and 
small particles. Various models of particle migration in 
fluid-particle suspensions or size segregation in gran- 
ular media involve diffusion-like terms that suggest 
multipolar effects. While some models involve gravita- 
tionally driven sedimentation opposed by diffusional 
remixing, other models involve a direct effect of gradi- 
ents in shearing akin to those found in fluid-particle 
suspensions. 

It may be significant that many granular size-segrega- 
tion effects are associated with dense flow in thin lay- 
ers, which again suggests that Knudsen-number or mul- 
tipolar effects are likely. Whatever the origins of parti- 
cle migration, it can probably be treated as a strictly dis- 
sipative process, implying that it can be represented as 
a generalized velocity in a dissipation potential, thus
suggesting a convenient way of formulating properly 
invariant constitutive relations. 

4.3.2 Relevance to Material Instability

There is a bewildering variety of instabilities in granu- 
lar flow. These range from the shear-banding instabili- 
ties in quasistatic flow discussed above to gravitational 
layering in moderately dense rapid flow and clustering 
instabilities in granular gases. 

The author advocates a distinction between mate- 
rial or constitutive instability, representing the instabil- 
ity of homogeneous states in the absence of boundary 
influences, and the dynamical or geometric instability 
that occurs in materially stable media, such as elastic 
buckling and inertial instability of viscous flows. With 
this distinction, it is easier to assess the importance of 
multipolar and other effects. 

Past studies reveal multipolar effects on elastoplastic 
instability, not only on post-bifurcation features such 
as the width of shear bands but also on the mate- 
rial instability itself. This represents an interesting and 
challenging area for further research based on the para- 
metric viscoelastic or hypoplastic models of the type 
discussed above. The general question is whether and 
how the length scales that lend dimensions to subse- 
quent patterned states enter into the initial instability 
leading to those patterns. 

V.14 Modern Optics

Miguel A. Alonso 

1 Introduction 

Since ancient times, optics has been inextricably linked 
to geometry and other branches of mathematics. Euclid
himself wrote the earliest known treatise on optics,
in which he postulated the laws of perspective. Simi- 
larly, Hero of Alexandria formulated perhaps the earli- 
est variational principle when he observed that a light 
ray traveling between two points with an intermediate 
reflection by a mirror corresponds to the shortest such 
path. The parallel development of optics and geom- 
etry continued in the Arabic world with the likes of 
Ibn-Sahl and Ibn al-Haytham (also known as Alhazen). 
More recently, modern vector calculus was inspired by 
the studies of Gibbs and Heaviside in electromagnetic 
theory. 

Modern optics is ubiquitous in contemporary science 
and technology. It provides direct tests for fundamen- 
tal physics and is the basis of applications including 
telecommunications, data storage, integrated circuit 
manufacture, astronomy, environmental monitoring, 
and medical diagnosis and therapy. Optics and pho- 
tonics incorporate many mathematical methods cov- 
ered by typical undergraduate curricula in mathemat- 
ics, physics, and engineering. Topics such as linear alge- 
bra, Fourier theory, separation of variables, and the 
theory of analytic functions acquire meanings that are 
palpably physical and, by the very nature of the topic, 
visual. This article is an overview of the mathematical 
techniques involved in the study of light, emphasiz- 
ing free-space propagation and some simple photonic 
devices. 

2 Free-Space Propagation 

Consider first the simple case of light propagating 
through free space. It is well known that Maxwell’s
equations [III.22] for the electromagnetic field can be 
combined into a vector wave equation whose solutions 
include waves that travel at a speed given by a com- 
bination of the free-space electric permittivity and the 
magnetic permeability. Maxwell, aware of the numeri- 
cal similarity of this combination with recent measure- 
ments of the speed of light, corrected the equations of 
electricity and magnetism in such a way that they could 
be combined into an equation that also described the 
phenomenon of light. 

As well as explaining the speed of light, Maxwell’s 
equations also explain the fact that light is a trans- 
verse wave: for an extended wave with fairly uniform 
intensity, the electric field (and magnetic field, which 
we do not consider further) is a vector whose direction 
is perpendicular to the wave’s direction of propagation. 
For simplicity, however, we ignore the vector nature of 
the electric field and consider a scalar wave that satisfies
a similar wave equation (this may be a component
of the vector field). We make the further simplifying
assumption that at every spatial point, the dependence
of the field on time $t$ is a sinusoidal wave with fixed
angular frequency $\omega$. Since our eyes translate temporal
frequency into color, such fields are referred to as
monochromatic. A real monochromatic field $E(\mathbf{r}, t)$ can
be defined as

$$E(\mathbf{r}, t) = A(\mathbf{r}) \cos[\Phi(\mathbf{r}) - \omega t],$$

where $A$ is called the amplitude and $\Phi$ the phase, both
being real functions of position $\mathbf{r} = (x, y, z)$. The intensity
of light at each point is given by $I = A^2$. The optical
wave can therefore be written as

$$E(\mathbf{r}, t) = \mathrm{Re}\{U(\mathbf{r}) e^{-i\omega t}\},$$

where Re denotes the real part and $U(\mathbf{r}) = A(\mathbf{r}) e^{i\Phi(\mathbf{r})}$
is a complex function of $\mathbf{r}$, independent of $t$. The wave
equation then reduces to the Helmholtz equation

$$\nabla^2 U(\mathbf{r}) + k^2 U(\mathbf{r}) = 0, \tag{1}$$

with wave number $k = \omega/c = 2\pi/\lambda$, with $c$ the speed
of light in vacuum and $\lambda$ the wavelength.

2.1 Modes of Free Space 

In many practical problems in optics one seeks to 
model the field’s propagation away from a given initial 
plane where it is known. For this purpose, one of the 
Cartesian axes, by convention the z-axis, is singled out 
as a propagation parameter. Suppose that we are interested
in modeling only the field for $z \geq 0$ and that the
field’s source is located within the half-space $z < 0$, so
within our region of interest light travels with increas- 
ing z. Planes of constant z are referred to as transverse 
planes. This allows us to introduce the concept of an 
optical mode. In this context, a mode is a field whose 
dependence on z is separable. The simplest mode is 
a plane wave, for which the amplitude is completely 
uniform: 

$$U^{\mathrm{pw}}(\mathbf{r}; \mathbf{u}) = e^{ik\mathbf{u}\cdot\mathbf{r}},$$

where $\mathbf{u} = (u_x, u_y, u_z)$ is a unit vector that specifies
the wave’s propagation direction. Since we assume forward
propagation in $z$, plane waves for which $u_z < 0$
are not allowed, so $u_z$ can be expressed as a function
of $u_x$ and $u_y$, i.e.,

$$u_z(u_x, u_y) = \sqrt{1 - u_x^2 - u_y^2}. \tag{2}$$

This in fact allows solutions where $u_x^2 + u_y^2 > 1$, for
which $u_z$ is purely imaginary. In this case, one must
choose the branch of the square root that is positive
imaginary, so that the field remains bounded for $z > 0$
and $U^{\mathrm{pw}}$ decays exponentially under propagation in $z$.
Such waves are called evanescent waves, and they play 
an important role in fields close to interfaces. 
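
The branch choice in (2) can be made concrete with a short numerical sketch. The function names `u_z` and `plane_wave` and the sample values (unit wavelength, a propagation distance of three wavelengths) are illustrative assumptions, not part of the article.

```python
import cmath
import math


def u_z(u_x: float, u_y: float) -> complex:
    """Longitudinal direction cosine from equation (2).

    Returns a real value for propagating waves and a positive-imaginary
    value when u_x^2 + u_y^2 > 1 (evanescent waves).
    """
    s = 1.0 - u_x**2 - u_y**2
    if s >= 0.0:
        return complex(math.sqrt(s), 0.0)
    return complex(0.0, math.sqrt(-s))


def plane_wave(x: float, y: float, z: float,
               u_x: float, u_y: float, k: float) -> complex:
    """Plane-wave mode exp(i k u.r) with u_z fixed by u_x and u_y."""
    phase = k * (u_x * x + u_y * y + u_z(u_x, u_y) * z)
    return cmath.exp(1j * phase)


k = 2.0 * math.pi  # wave number for a wavelength of 1 (arbitrary units)

# A propagating wave keeps unit modulus; an evanescent one decays with z.
propagating = abs(plane_wave(0.0, 0.0, 3.0, 0.6, 0.0, k))
evanescent = abs(plane_wave(0.0, 0.0, 3.0, 1.2, 0.0, k))
```

Choosing the negative-imaginary root instead would make the second modulus grow exponentially with z, which is why that branch is excluded for fields propagating toward larger z.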

Forward-propagating plane-wave modes, together 
with forward-decaying evanescent modes, form a com- 
plete set over the initial transverse plane. Fields prop- 
agating toward larger values of z can therefore be 
expressed as a linear superposition of these waves: 

$$U(\mathbf{r}) = \iint \tilde{U}(u_x, u_y) e^{ik\mathbf{u}\cdot\mathbf{r}} \, du_x \, du_y, \tag{3}$$

where we assume that integrals are over all real numbers
unless otherwise specified. $\tilde{U}$ is a function known
as the angular spectrum that determines the amplitude
of each plane wave. Note that the angular spectrum
is easily determined from the known field at $z = 0$,
since at this initial plane (3) takes the form of a Fourier
superposition, which can be easily inverted:

$$\tilde{U}(u_x, u_y) = \left(\frac{k}{2\pi}\right)^2 \iint U(x, y, 0) e^{-ik(x u_x + y u_y)} \, dx \, dy.$$

This gives the key result that the two-dimensional spa- 
tial Fourier transform of a known monochromatic field 
at some initial plane gives access to the amplitudes of 
the plane waves that compose such a field. 

This description of optical propagation can be sum- 
marized in terms of Fourier transforms, i.e., 

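$$U(x, y, z) = \mathcal{F}^{-1}\left[e^{ikz u_z(u_x, u_y)} \mathcal{F} U(x, y, 0)\right],$$

where

$$\mathcal{F} g = \left(\frac{k}{2\pi}\right)^2 \iint g(x, y) e^{-ik(x u_x + y u_y)} \, dx \, dy,$$

$$\mathcal{F}^{-1} \tilde{g} = \iint \tilde{g}(u_x, u_y) e^{ik(x u_x + y u_y)} \, du_x \, du_y.$$

Free propagation can thus be modeled by suitable
Fourier transformations of the known two-dimensional
initial field via the propagation transfer function $e^{ikz u_z}$.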
This is not surprising, as it is well known that systems 
that are both linear and translation invariant in x and 
y can be modeled by a transfer function. Free space is 
also translation invariant in z, and hence propagations 
over consecutive intervals must equal the single prop- 
agation over the total distance. In such systems, the 
transfer function must be an exponential whose expo- 
nent is proportional to z, so the product of the trans- 
fer functions for various distances equals the transfer 
function for their sum. 
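
The transfer-function picture can be illustrated with a minimal sketch in one transverse dimension. Everything here is an assumption made for illustration: the field is sampled on a periodic grid, a naive discrete Fourier transform stands in for the continuous one, and the names `dft` and `propagate` and the sample parameters are hypothetical.

```python
import cmath
import math


def dft(values, sign):
    """Naive discrete Fourier transform.

    sign = -1 gives the forward transform; sign = +1 gives the inverse,
    including the 1/N normalization.
    """
    n = len(values)
    out = []
    for j in range(n):
        acc = 0j
        for m, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * math.pi * j * m / n)
        out.append(acc / n if sign > 0 else acc)
    return out


def propagate(field, dx, wavelength, z):
    """Propagate a sampled 1D field by z with the factor exp(i k z u_z)."""
    n = len(field)
    k = 2.0 * math.pi / wavelength
    spectrum = dft(field, -1)
    for j in range(n):
        jj = j if j < n // 2 else j - n      # signed frequency index
        u_x = wavelength * jj / (n * dx)     # direction cosine of this component
        u_z = cmath.sqrt(1.0 - u_x**2)       # positive imaginary when |u_x| > 1
        spectrum[j] *= cmath.exp(1j * k * z * u_z)
    return dft(spectrum, +1)


# Sampled Gaussian profile; with dx < wavelength / 2 the grid also
# carries evanescent components.
n, dx, wavelength = 32, 0.25, 1.0
field = [math.exp(-(((m - n / 2) * dx) ** 2) / 2.0) for m in range(n)]

# Two consecutive propagations agree with one over the total distance.
twice = propagate(propagate(field, dx, wavelength, 1.5), dx, wavelength, 2.5)
once = propagate(field, dx, wavelength, 4.0)
```

The agreement of `twice` and `once` is the discrete counterpart of the statement that the exponent of the transfer function is proportional to z.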

The transfer function depends on the directional
parameters $u_x$ and $u_y$ only through the combination
$u_x^2 + u_y^2 = u_\perp^2$. Therefore, linear superpositions of
plane waves for which $u_\perp^2$ is fixed preserve their transverse
functional form and are therefore also modes that
could be more useful in particular situations. A natural
choice for cylindrical light beams is the mode set
labeled by polar coordinates $u_\perp$, $\varphi$ in the Fourier plane,
so $u_x = u_\perp \cos\varphi$ and $u_y = u_\perp \sin\varphi$, for which the
transfer function is $e^{ikz\sqrt{1 - u_\perp^2}}$. Alternative complete
families of modes result from combining mode families
with different dependence on the azimuth $\varphi$. For
instance, the angular spectrum may be expanded as a
Fourier series in this angle,

$$\tilde{U}(u_\perp \cos\varphi, u_\perp \sin\varphi) = \sum_m \frac{\tilde{U}_m(u_\perp)}{2\pi i^m} e^{im\varphi},$$

where the sum is over all integers. The plane-wave
superposition is then replaced by

$$U(\mathbf{r}) = \sum_m \int_0^\infty \tilde{U}_m(u_\perp) J_m(k u_\perp \rho) e^{i(m\phi + k u_z z)} \, u_\perp \, du_\perp,$$

where $J_m(t)$ are Bessel functions [III.2] of the first
kind, and $\rho = \sqrt{x^2 + y^2}$ and $\phi = \arctan(x, y)$ are
polar coordinates in the spatial transverse plane. The
functions $J_m(k u_\perp \rho) e^{i(m\phi + k u_z z)}$ are a new set of complete
orthogonal modes, known as Bessel beams. (When
$u_\perp > 1$, $u_z = (1 - u_\perp^2)^{1/2}$ is purely imaginary, so the
corresponding Bessel beam is evanescent.) Unlike plane
waves, Bessel beams have a nonuniform intensity distribution
over transverse planes, taking maximum values
near a ring of radius $|m|/(k u_\perp)$. Furthermore, for $m \neq 0$,
these modes have an $m$th-order phase vortex on the z-axis.
That is, on the line $x = y = 0$ the amplitude is
zero and the phase varies by $2\pi m$ as $\phi$ increases from
0 to $2\pi$. Optical phase vortices have been the focus
of significant attention recently; among other reasons, 
this is because they endow beams with angular momen- 
tum. For example, a beam with a phase vortex along its 
axis is capable of making a small particle spin. Modeling 
such transfer of angular momentum, however, requires 
knowledge of the theory of light-matter interaction, 
which is not discussed here. 
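
The phase vortex of a Bessel beam can be checked numerically with a small sketch. The Bessel function is evaluated here from its integral representation with a simple periodic quadrature, which is an implementation choice made only to keep the example self-contained; the names `bessel_j`, `bessel_beam`, and `winding_number` and the sample parameters are assumptions.

```python
import cmath
import math


def bessel_j(m: int, x: float, n: int = 256) -> float:
    """J_m(x) from (1/2pi) * integral over one period of cos(m*tau - x*sin(tau))."""
    total = 0.0
    for j in range(n):
        tau = 2.0 * math.pi * j / n
        total += math.cos(m * tau - x * math.sin(tau))
    return total / n


def bessel_beam(x, y, z, m, u_perp, k):
    """Mode J_m(k u_perp rho) exp(i (m phi + k u_z z))."""
    rho = math.hypot(x, y)
    phi = math.atan2(y, x)                # azimuth in the transverse plane
    u_z = cmath.sqrt(1.0 - u_perp**2)     # positive imaginary if u_perp > 1
    return bessel_j(m, k * u_perp * rho) * cmath.exp(1j * (m * phi + k * u_z * z))


def winding_number(m, u_perp, k, radius, steps=360):
    """Net phase change, in units of 2*pi, around a circle about the z-axis."""
    total = 0.0
    prev = cmath.phase(bessel_beam(radius, 0.0, 0.0, m, u_perp, k))
    for s in range(1, steps + 1):
        ang = 2.0 * math.pi * s / steps
        cur = cmath.phase(bessel_beam(radius * math.cos(ang),
                                      radius * math.sin(ang), 0.0, m, u_perp, k))
        d = (cur - prev + math.pi) % (2.0 * math.pi) - math.pi  # wrap to [-pi, pi)
        total += d
        prev = cur
    return round(total / (2.0 * math.pi))


k = 2.0 * math.pi                         # unit wavelength
on_axis = abs(bessel_beam(0.0, 0.0, 0.0, 2, 0.3, k))
charge = winding_number(2, 0.3, k, 0.5)
```

Here `on_axis` vanishes up to rounding and `charge` recovers the order m of the vortex, in line with the description above.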

Plane waves and Bessel beams are only two of an infi- 
nite range of possible modes, those whose transverse 
profiles are separable in Cartesian and polar coordin- 
ates, respectively. Other separable families include so- 
called Mathieu and parabolic beams (separable in ellip- 
tic and parabolic cylindrical coordinates, respectively), 
and there are also numerous nonseparable families. 
Notice that for all of these modes, the integral of the 
intensity over the transverse plane diverges, imply- 
ing that they are not physically realizable fields in a 
strict sense, as they would require an infinite amount
of power. However, appropriate continuous superpositions
of them are square integrable over transverse
planes and they are therefore physical. Furthermore, 
finite-power approximations to individual modes are 
achievable and useful for a range of applications. 


2.2 Paraxial Approximation 


Light beams often have strong directionality, in the 
sense that they can be expressed as superpositions 
of plane waves that all travel within a narrow angular
range with respect to a central direction, conveniently
the z-axis of the angular spectrum decomposition.
In this case $\tilde{U}$ differs significantly from zero
only for $|u_x|, |u_y| \ll 1$. Therefore, from (2), $u_z \approx
1 - (u_x^2 + u_y^2)/2$, so that the plane-wave expansion
becomes

$$U(\mathbf{r}) \approx \iint \tilde{U}(u_x, u_y) \exp\left\{ik\left[z + u_x x + u_y y - \frac{z}{2}(u_x^2 + u_y^2)\right]\right\} du_x \, du_y. \tag{4}$$

Of course, since an approximation was performed, the
superposition in this expression is no longer a solution
of the Helmholtz equation. In fact, $U_p = U e^{-ikz}$ is a general
solution to the so-called paraxial wave equation,
written here as

$$\frac{i}{k} \frac{\partial U_p}{\partial z} = -\frac{1}{2k^2} \nabla_\perp^2 U_p, \tag{5}$$

where $\nabla_\perp^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2$. This equation has the
same mathematical form as the Schrödinger equation
that determines the time evolution of a wave function
in quantum mechanics [IV.23], with zero potential
and z as the time propagation parameter. The similarity
between the paraxial and Schrödinger equations suggests
that the latter is also an approximation of a more
precise equation under some limit. Solutions of the
Schrödinger equation are approximations to solutions
of relativistic quantum mechanical equations, such as
the Dirac equation [III.9], when all velocities involved
are small compared with c. In other words, nonrela- 
tivistic wave functions are “paraxial” around the time 
axis, since they are composed of plane-wave compo- 
nents forming small angles with respect to the time axis 
(scaled by c to have spatial units) in space-time. 
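
A quick way to judge when the paraxial approximation is adequate is to compare the exact and approximate longitudinal direction cosines and the phase error they accumulate over a distance z. The following sketch does this; the function names and the sample angles and distance are illustrative assumptions.

```python
import math


def u_z_exact(u_perp: float) -> float:
    """Exact longitudinal direction cosine for a propagating plane wave."""
    return math.sqrt(1.0 - u_perp**2)


def u_z_paraxial(u_perp: float) -> float:
    """Second-order (paraxial) approximation to u_z."""
    return 1.0 - 0.5 * u_perp**2


def phase_error(u_perp: float, z: float, wavelength: float) -> float:
    """Phase difference k z (u_z_paraxial - u_z_exact), in radians."""
    k = 2.0 * math.pi / wavelength
    return k * z * (u_z_paraxial(u_perp) - u_z_exact(u_perp))


# After 1000 wavelengths, a component at about 0.6 degrees from the axis
# is essentially unaffected, while one at about 17 degrees is badly dephased.
small_angle = phase_error(0.01, 1000.0, 1.0)
large_angle = phase_error(0.3, 1000.0, 1.0)
```

To leading order the error is k z u_perp^4 / 8, so it grows with propagation distance as well as with angle.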

Substituting $\tilde{U}$ as a Fourier transform into (4) leads
to an alternative expression for paraxial field propagation
in the form of a convolution, known as the Fresnel
diffraction formula,

$$U(\mathbf{r}) = \iint U(x', y', 0) K(x - x', y - y', z) \, dx' \, dy',$$

where the Fresnel propagator $K$ is proportional to the
paraxial approximation to a spherical wave:

$$K(x, y, z) = \frac{1}{i\lambda z} \exp\left[ik\left(z + \frac{x^2 + y^2}{2z}\right)\right].$$


This description is related to Huygens’s interpretation 
of wave propagation: each point along a transverse 
plane can be regarded as a secondary wave whose 
amplitude and phase of emission are determined by the 
field at that point. This interpretation can be extended 
to quantum mechanics via the Schrödinger equation,
where it is the basis of Feynman’s path integral formu- 
lation. 

The paraxial approximation allows the closed-form 
calculation of several types of fields. For example, 
let the initial field be a Gaussian function of $x$ and
$y$ with height $U_0$ and width $w_0$, i.e., $U_G(x, y, 0) =
U_0 e^{-(x^2 + y^2)/2w_0^2}$. The propagated paraxial field is

$$U_G(\mathbf{r}) = \frac{U_0}{1 + iz/z_R} \exp\left[ikz - \frac{x^2 + y^2}{2w_0^2(1 + iz/z_R)}\right],$$

where $z_R = k w_0^2$ is known as the Rayleigh range. The
beam’s intensity remains Gaussian under propagation,
broadening away from the waist plane $z = 0$:

$$|U_G(\mathbf{r})|^2 = |U_0|^2 \frac{w_0^2}{w^2(z)} \exp\left(-\frac{x^2 + y^2}{w^2(z)}\right),$$

where $w(z) = w_0(1 + z^2/z_R^2)^{1/2}$. $U_G$ is a good approximation
to the beams generated by standard continuous-wave
lasers, such as laser pointers. This is because,
within the paraxial approximation, they are modes of 
three-dimensional resonant laser cavities with slightly 
concave, partially reflecting surfaces resembling a seg- 
ment of a sphere. The light is transmitted out of the cav- 
ity through one of these surfaces after many reflections 
while keeping the same transverse profile. 
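
The Gaussian-beam expressions can be evaluated directly. The sketch below assumes the conventions used in this section (a waist profile proportional to exp(-(x^2 + y^2)/2w_0^2) and a Rayleigh range k w_0^2); the function names and the sample numbers are hypothetical.

```python
import cmath
import math


def rayleigh_range(w0: float, wavelength: float) -> float:
    """Rayleigh range z_R = k * w0**2 in this section's convention."""
    return 2.0 * math.pi / wavelength * w0**2


def beam_width(z: float, w0: float, wavelength: float) -> float:
    """Width w(z) = w0 * sqrt(1 + (z / z_R)**2)."""
    return w0 * math.sqrt(1.0 + (z / rayleigh_range(w0, wavelength)) ** 2)


def gaussian_beam(x, y, z, w0, wavelength, u0=1.0):
    """Paraxial Gaussian beam U_G(r), including the carrier exp(i k z)."""
    k = 2.0 * math.pi / wavelength
    q = 1.0 + 1j * z / rayleigh_range(w0, wavelength)
    return u0 / q * cmath.exp(1j * k * z - (x * x + y * y) / (2.0 * w0 * w0 * q))


# Compare |U_G|^2 with the closed-form intensity at an arbitrary point.
w0, wavelength = 2.0, 0.5
x, y, z = 1.5, -0.7, 30.0
w = beam_width(z, w0, wavelength)
intensity_from_field = abs(gaussian_beam(x, y, z, w0, wavelength)) ** 2
intensity_closed_form = (w0 / w) ** 2 * math.exp(-(x * x + y * y) / w**2)
```

At z = z_R the width has grown by a factor of sqrt(2), which is one common way of characterizing the Rayleigh range.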

There is an interesting relation between Gaussian 
beams and the Fresnel propagator: in the limit of
small $w_0$, $U_G$ becomes this propagator, i.e., the paraxial
counterpart to a diverging spherical wave:

$$K(x, y, z) = \lim_{w_0 \to 0} \frac{U_G(\mathbf{r})}{2\pi w_0^2 U_0}.$$

Conversely, a Gaussian beam of arbitrary width $w_0$ can
be expressed as proportional to a spherical wave that
has been displaced in $z$ by an imaginary distance $z_R =
k w_0^2$:

$$U_G(\mathbf{r}) = U_0 \lambda z_R e^{-k z_R} K(x, y, z - iz_R).$$
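
Both relations between the Gaussian beam and the Fresnel propagator can be verified numerically. This is a minimal sketch under the same conventions as above; the function names and the test point are assumptions, and the complex displacement is implemented simply by passing a complex z to the propagator.

```python
import cmath
import math


def fresnel_propagator(x, y, z, wavelength):
    """Fresnel propagator K(x, y, z); z may be complex."""
    k = 2.0 * math.pi / wavelength
    return cmath.exp(1j * k * (z + (x * x + y * y) / (2.0 * z))) / (1j * wavelength * z)


def gaussian_beam(x, y, z, w0, wavelength, u0=1.0):
    """Paraxial Gaussian beam with Rayleigh range z_R = k * w0**2."""
    k = 2.0 * math.pi / wavelength
    q = 1.0 + 1j * z / (k * w0 * w0)
    return u0 / q * cmath.exp(1j * k * z - (x * x + y * y) / (2.0 * w0 * w0 * q))


wavelength, w0 = 1.0, 1.0
k = 2.0 * math.pi / wavelength
z_r = k * w0 * w0
x, y, z = 0.8, -0.3, 5.0

# Gaussian beam as a spherical wave displaced by the imaginary distance i*z_R.
direct = gaussian_beam(x, y, z, w0, wavelength)
displaced = wavelength * z_r * math.exp(-k * z_r) * fresnel_propagator(x, y, z - 1j * z_r, wavelength)

# Narrow-waist limit: U_G / (2 pi w0^2 U_0) approaches K(x, y, z).
tiny = 1e-3
limit_value = gaussian_beam(x, y, z, tiny, wavelength) / (2.0 * math.pi * tiny**2)
target = fresnel_propagator(x, y, z, wavelength)
```

The second comparison is only approximate for any finite waist; the discrepancy shrinks as the ratio of the Rayleigh range to z goes to zero.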


There are families of fields that, like Gaussian beams, 
are modes of cavities with curved mirrors. While these 
are not strictly speaking modes of free space, their
intensity profile is preserved upon propagation, with
width proportional to $w(z)$ defined above, and a corresponding
longitudinal attenuation proportional to the
inverse square of this factor. One such family is the so-called
Hermite-Gaussian beams, where the initial field
is the product of a Gaussian and appropriately scaled
Hermite polynomials [II.29] $H_n$ of different orders in
$x$ and $y$:


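$$U^{\mathrm{HG}}_{m,n}(\mathbf{r}) = \left[\frac{w_0}{w(z)}\left(1 - \frac{iz}{z_R}\right)\right]^{m+n} H_m\!\left(\frac{x}{w(z)}\right) H_n\!\left(\frac{y}{w(z)}\right) U_G(\mathbf{r}),$$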

for $m, n = 0, 1, \ldots$. Similarly, Laguerre-Gaussian
beams are defined as


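$$U^{\mathrm{LG}}_{\ell,p}(\mathbf{r}) = \left[\frac{w_0}{w(z)}\left(1 - \frac{iz}{z_R}\right)\right]^{2p+|\ell|} \left[\frac{x + i\,\mathrm{sgn}(\ell)\,y}{w(z)}\right]^{|\ell|} L_p^{|\ell|}\!\left[\frac{x^2 + y^2}{w^2(z)}\right] U_G(\mathbf{r}),$$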

for integer $\ell$ and $p = 0, 1, \ldots$, and where $L_p^{|\ell|}$ is an
associated Laguerre polynomial [II.29]. Like Bessel
beams, Laguerre-Gaussian beams with $\ell \neq 0$ have
an $\ell$th-order phase vortex at the z-axis and therefore
carry orbital angular momentum. For fixed z,
the Hermite-Gaussian beams are separable in two- 
dimensional Cartesian coordinates, while the Laguerre- 
Gaussian beams are separable in polar coordinates 
(and Gaussian beams are separable in each). A third 
family, known as the Ince-Gaussian beams, is separa- 
ble in elliptical coordinates. With appropriate constant 
prefactors, each of these families is orthonormal in 
any plane of constant z. Therefore, any paraxial field 
can be expressed as a discrete linear superposition of 
Hermite-Gaussian or Laguerre-Gaussian beams as 


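$$U(\mathbf{r}) = \sum_{m,n=0}^{\infty} a_{m,n} U^{\mathrm{HG}}_{m,n} = \sum_{p=0}^{\infty} \sum_{\ell=-\infty}^{\infty} b_{\ell,p} U^{\mathrm{LG}}_{\ell,p},$$

where $a_{m,n}$ and $b_{\ell,p}$ are appropriate coefficients.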


2.3 Connection with Ray Optics 

Maxwell’s equations give a description of light as an 
electromagnetic wave. However, the older and sim- 
pler ray model is often closer to our experience-based 
intuition and is sufficient for many practical purposes 
including the design of simple imaging systems. To 
properly understand the connection between the ray 
and wave models, one must delve into the theory of 
asymptotics [II.1]. In this section, however, we give a
simplified version of the connection for basic optical 
systems in the paraxial approximation. 
In the ray model, optical power travels along mutually
independent lines called rays, which are straight
for propagation through homogeneous media such as
free space. A ray crossing a plane of constant $z$ is
labeled by its intersection $\mathbf{r}_\perp = (x, y)$, together with
its direction, specified by the transverse part of its
direction vector, $\mathbf{u}_\perp = (u_x, u_y)$. These define the ray
state vector $\mathbf{v} = (\mathbf{r}_\perp, \mathbf{u}_\perp)$. In the paraxial approximation
and for simple systems, $\mathbf{v}$ evolves under propagation
according to simple rules. Under free propagation
between two planes of constant $z$ separated by a distance
$d$, the ray direction is constant while the transverse
position changes by approximately $d\,\mathbf{u}_\perp$. In this
case, $\mathbf{v} \to F(d) \cdot \mathbf{v}$ under free propagation, determined
by the $4 \times 4$ matrix

$$F(d) = \begin{pmatrix} \mathbf{1} & d\,\mathbf{1} \\ \mathbf{O} & \mathbf{1} \end{pmatrix},$$

with $\mathbf{1}$ and $\mathbf{O}$ being the $2 \times 2$ identity and zero matrices,
respectively. Similarly, propagation across a thin lens
with focal distance $f$ does not affect $\mathbf{r}_\perp$ but does change
the ray’s direction in proportion to $\mathbf{r}_\perp$, so that the effect
of the lens on the ray follows the linear relation $\mathbf{v} \to
L(f) \cdot \mathbf{v}$, with

$$L(f) = \begin{pmatrix} \mathbf{1} & \mathbf{O} \\ -\mathbf{1}/f & \mathbf{1} \end{pmatrix}.$$

Other basic optical elements are also described by
matrices that multiply the state vector. When a ray
propagates along a series of elements, the state vector
must be multiplied by the corresponding succession of
matrices, so the complete system is characterized by a
single matrix $S$ given by the product of the matrices for
the corresponding elements. This matrix is generally
written in terms of its four $2 \times 2$ submatrices as

$$S = \begin{pmatrix} A & B \\ C & D \end{pmatrix},$$


and in the literature it is somewhat clumsily referred 
to as an ABCD matrix. 

A related description of propagation exists in the 
wave domain. It can be shown that the propagation of 
the wave field is given by the so-called Collins formula : 

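$$U(\mathbf{r}) = \iint U(x', y', 0) K_C(x', y'; x, y) \, dx' \, dy',$$

where the Collins propagator is

$$K_C = \frac{\exp\left[\frac{ik}{2}\left(\mathbf{r}'_\perp \cdot B^{-1}A \cdot \mathbf{r}'_\perp - 2\mathbf{r}'_\perp \cdot B^{-1} \cdot \mathbf{r}_\perp + \mathbf{r}_\perp \cdot DB^{-1} \cdot \mathbf{r}_\perp\right)\right]}{i\lambda\sqrt{\det(B)}}.$$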

That is, for simple paraxial systems, the same matrices
that describe the propagation of rays also describe
the propagation of waves. For free propagation, $S =
F(z)$, and the Collins propagator reduces to the Fresnel
propagator. 
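
The matrix description of rays is easy to experiment with. The sketch below is restricted to one transverse dimension, so each element is a 2 x 2 matrix acting on (x, u_x) rather than the 4 x 4 form used in the text; that restriction, the function names, and the sample distances are assumptions made for brevity.

```python
def matmul(p, q):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(p[i][m] * q[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]


def free_space(d):
    """Free propagation over a distance d: position shifts by d * u."""
    return [[1.0, d], [0.0, 1.0]]


def thin_lens(f):
    """Thin lens of focal distance f: direction changes by -x / f."""
    return [[1.0, 0.0], [-1.0 / f, 1.0]]


def system(*elements):
    """ABCD matrix of elements listed in the order the ray meets them."""
    total = [[1.0, 0.0], [0.0, 1.0]]
    for element in elements:
        total = matmul(element, total)   # later elements multiply from the left
    return total


def trace(matrix, x, u):
    """Apply an ABCD matrix to the ray state (x, u)."""
    return (matrix[0][0] * x + matrix[0][1] * u,
            matrix[1][0] * x + matrix[1][1] * u)


# Object 30 units before a lens with f = 20; the image forms 60 units after it.
S = system(free_space(30.0), thin_lens(20.0), free_space(60.0))
```

In this imaging configuration the B entry of `S` vanishes, so the output position is independent of the input direction, and the A entry gives the magnification.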

2.4 Phase Retrieval 

When measuring light, CCD arrays and photographic 
film detect only a field’s intensity $I = |U|^2$, so all information
about the phase is lost. However, many applications
require accurate knowledge of the phase, which
contains most of the information about the source 
that generated the field and the medium it traveled 
through. For example, in astronomical and ophthalmic 
measurements it is important to determine phase fluc- 
tuations caused by the atmosphere or the eye’s imper- 
fections for their real-time correction through adaptive 
elements. 

Several methods have been developed to retrieve the 
phase based only on the intensity. For example, know- 
ing the intensity at two nearby planes of constant z 
is nearly equivalent to knowing the intensity and its 
derivative in $z$. By writing $U_p = \sqrt{I} e^{i\Phi}$ and taking the
real part of (5) times $-i e^{-i\Phi}$ (and then multiplying
through by $2k^2\sqrt{I}$), one finds the so-called
transport-of-intensity equation:

$$k \frac{\partial I}{\partial z} + \nabla_\perp \cdot (I \nabla_\perp \Phi) = 0.$$

Therefore, knowledge of $I$ and its $z$ derivative over a
plane of constant $z$ allows, in principle, the solution for
$\Phi(x, y)$ at that plane, provided appropriate boundary
conditions are used. 

In other situations, the intensity is known at two dis- 
tant planes, between which there might be some optical 
elements. (In some cases, the intensity is known at one 
plane and only certain restrictions are known at the 
second, e.g., the support of the field.) Several strategies 
exist for estimating the phase in these cases, the best-known
one being the Gerchberg-Saxton iterative algorithm.
The form of the propagator between the planes
(such as the Fresnel or Collins propagator) is assumed, 
then an initial field estimate at plane 1 is proposed 
whose amplitude is the square root of the known inten- 
sity and whose phase is an ansatz. This field estimate 
is then propagated numerically to plane 2, where the 
resulting intensity typically does not match the known 
one. A corrected estimate is then proposed by replacing 
the amplitude with the square root of the known inten- 
sity (or enforcing the known restrictions) while leaving 
the phase unchanged. This corrected estimate is then 
propagated back to plane 1, where the same amplitude 
replacement procedure is applied. Usually, after several 
such iterations, the estimate converges to the correct
field, giving access to the phase. However, there can
be issues with convergence as well as with uniqueness
of the solution, depending on the system’s propagator.
For example, this approach would not be able to resolve
the sign of $\ell$ of the propagating Laguerre-Gaussian field
introduced above. 
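
The iteration described above can be sketched in a few lines. As a simplifying assumption, the propagator between the two planes is taken here to be a unitary discrete Fourier transform in one dimension (a stand-in for a far-field relation), rather than the Fresnel or Collins propagator; the function names and the synthetic test field are hypothetical.

```python
import cmath
import math


def dft(values, sign):
    """Unitary naive DFT; sign = -1 propagates 1 -> 2, sign = +1 propagates 2 -> 1."""
    n = len(values)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(v * cmath.exp(sign * 2j * math.pi * j * m / n)
                        for m, v in enumerate(values))
            for j in range(n)]


def impose_amplitude(field, amplitude):
    """Keep the phase of each sample but replace its modulus."""
    return [a * cmath.exp(1j * cmath.phase(f)) for f, a in zip(field, amplitude)]


def gerchberg_saxton(amp1, amp2, initial_phase, iterations):
    """Return the plane-1 estimate and the plane-2 amplitude error per iteration."""
    estimate = [a * cmath.exp(1j * p) for a, p in zip(amp1, initial_phase)]
    errors = []
    for _ in range(iterations):
        at_plane2 = dft(estimate, -1)
        errors.append(sum((abs(g) - b) ** 2 for g, b in zip(at_plane2, amp2)))
        corrected = impose_amplitude(at_plane2, amp2)   # enforce known amplitude at plane 2
        back = dft(corrected, +1)
        estimate = impose_amplitude(back, amp1)         # enforce known amplitude at plane 1
    return estimate, errors


# Synthetic data: amplitudes of a known field at both planes, phase discarded.
n = 16
true_field = [math.exp(-((m - 7.5) / 4.0) ** 2)
              * cmath.exp(1j * 0.8 * math.sin(2.0 * math.pi * m / n))
              for m in range(n)]
amp1 = [abs(v) for v in true_field]
amp2 = [abs(v) for v in dft(true_field, -1)]
estimate, errors = gerchberg_saxton(amp1, amp2, [0.0] * n, 40)
```

With a unitary propagator the recorded error cannot increase from one iteration to the next, but that alone does not guarantee convergence to the true phase, which is the caveat raised in the text.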


3 Guided Waves 

The great success of photonic technology stems in large 
part from the convenience of using light for transmit- 
ting and/or processing information. The basis of such 
technology is the ability to guide light through specially 
designed channels whose transverse dimensions can be 
of the order of the light’s wavelength. 


3.1 Waveguides and Optical Fibers 


The propagation of monochromatic light through a 
medium is determined approximately by a modified 
version of the Helmholtz equation (1): 

$$\nabla^2 U(\mathbf{r}) + k^2 n^2(\mathbf{r}) U(\mathbf{r}) = 0,$$

where $n(\mathbf{r})$ is known as the refractive index, which can
vary with position. The real part of the refractive index,
$n_r$, determines the speed at which the field’s wavefronts
travel at that point in the medium, according to
$c/n_r$. On the other hand, the imaginary part, $n_i$, determines
the rate of absorption of light by the material:
after propagation by a distance $d$ in a homogeneous
medium, the amplitude is damped by a factor $e^{-k n_i d}$.

In order to understand light guiding, we consider 
a medium in which n depends only on the trans- 
verse variables x, y. By using the paraxial approxima- 
tion, one finds a modified version of the paraxial wave 
equation (5), 


$$\frac{i}{k n_0} \frac{\partial U_p}{\partial z} = -\frac{1}{2k^2 n_0^2} \nabla_\perp^2 U_p + \frac{n_0^2 - n^2(x, y)}{2n_0^2} U_p,$$


where $U_p = U e^{-ikn_0 z}$ with $n_0 = n(0, 0)$. We simplify
further by setting $n$ as purely real and by assuming that
it takes its largest value along the line $x = y = 0$, away
from which it decreases monotonically as transverse
distance increases. In this case, the paraxial wave equation
is analogous to Schrödinger’s equation for a quantum
particle in a two-dimensional attractive potential,
and therefore it accepts solutions that are concentrated
within a region surrounding the z-axis. To see this,
consider the case of a slab of transparent material
with refractive index $n_0$ restricted to $|x| \leq a$, surrounded
by a transparent medium of refractive index
$n_1$ ($< n_0$). Note that this structure is independent of
$y$. The paraxial wave equation then accepts separable
solutions $U_p = f(x) e^{-ikn_0 \delta z}$, where $\delta$ is a real constant
to be determined. Let $\Delta = (n_0 - n_1)/n_0 \ll 1$, so that
$n_0^2 - n_1^2 \approx 2n_0^2 \Delta$. Symmetric solutions for $f$ can be
found to approximately have the form

$$f(x) = \begin{cases} f_0 \cos(k n_0 \sqrt{2\delta}\, x), & |x| \leq a, \\ f_1 \exp[-k n_0 \sqrt{2(\Delta - \delta)}\, |x|], & |x| > a, \end{cases}$$

where $\delta$ and the constant amplitudes $f_0$ and $f_1$ must
be chosen so that $f(x)$ and its derivative are continuous.
It is easy to find that this condition leads to a
transcendental equation for $\delta$:

$$\tan(k n_0 \sqrt{2\delta}\, a) = \sqrt{\frac{\Delta - \delta}{\delta}}. \tag{6}$$

This equation has a finite, discrete set of solutions for
$\delta$, restricted to the range $(0, \Delta)$. Therefore, only a finite
set of even solutions for $f$ exist where the field is confined
exponentially to the central slab. Odd solutions
also exist where, for $|x| \leq a$, $f$ is proportional to a sine
rather than a cosine and the tangent in (6) is replaced
by minus a cotangent. These even and odd solutions
correspond to the modes of the waveguide, each of
which preserves its transverse profile while accumulating
a phase $e^{ikn_0(1-\delta)z}$ under propagation in $z$. (In
more physical models, $n_0$ and $n_1$ have small imaginary
parts that cause a small amount of absorption.) Different
modes accumulate different phases under propagation
(given their different $\delta$ values) since they propagate
at different speeds. This effect, known as modal
dispersion, can pose a problem for applications in data 
transmission, since multiple replicas of a signal arrive 
at different times at the end of the transmission line, 
scrambling the transmitted message. This problem can 
be avoided by using waveguides with sufficiently small
width $a$ and/or index mismatch $\Delta$ for only one mode
to exist.
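
The even-mode condition (6) is straightforward to solve numerically. The sketch below rewrites it in terms of the angle theta = k n_0 sqrt(2 delta) a and brackets one root in each branch of the tangent; the bisection approach, the function name `even_mode_deltas`, and the sample parameters (lengths in units of the wavelength) are assumptions, and odd modes are not treated.

```python
import math


def even_mode_deltas(k, n0, a, index_mismatch):
    """Solve tan(theta) = sqrt(v^2 - theta^2) / theta for the even slab modes.

    Here theta = k*n0*sqrt(2*delta)*a and v = k*n0*a*sqrt(2*index_mismatch),
    which is equation (6) rewritten. Returns the list of delta values.
    """
    v = k * n0 * a * math.sqrt(2.0 * index_mismatch)

    def g(theta):
        return theta * math.tan(theta) - math.sqrt(max(v * v - theta * theta, 0.0))

    deltas = []
    j = 0
    while j * math.pi < v:
        # One root lies in each interval (j*pi, j*pi + pi/2) below v,
        # where g increases from a negative to a positive value.
        lo = j * math.pi
        hi = min(lo + 0.5 * math.pi - 1e-12, v)
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if g(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        theta = 0.5 * (lo + hi)
        deltas.append(theta * theta / (2.0 * (k * n0 * a) ** 2))
        j += 1
    return deltas


k = 2.0 * math.pi                       # unit wavelength
narrow = even_mode_deltas(k, 1.5, 2.0, 0.01)
wide = even_mode_deltas(k, 1.5, 5.0, 0.01)
```

For these sample values the narrower slab supports a single even mode while the wider one supports three, illustrating how reducing a or the index mismatch removes higher modes.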

Now consider an optical fiber, where the refractive
index depends only on $\rho$, the radial distance
from the z-axis. So-called step-index fibers are composed
of a cylindrical core of radius $a$ with refractive
index $n_0$ surrounded by a cladding with refractive
index $n_1 < n_0$. Solutions of the relevant paraxial
wave equation are similar to those for the slab. Using
solutions (modes) separated in cylindrical coordinates
$U_p = f(\rho) e^{im\phi} e^{-ikn_0 \delta z}$, one finds that, for $\rho \leq a$, the
solutions for $f$ are proportional to Bessel functions
$J_m(k n_0 \sqrt{2\delta}\, \rho)$ while for $\rho > a$ they are modified Bessel
functions $K_m[k n_0 \sqrt{2(\Delta - \delta)}\, \rho]$, which decay rapidly as
$\rho$ increases. The conditions of continuity and smoothness
of $f$ at $\rho = a$ lead once more to a transcendental
equation (now involving Bessel functions) that restricts
the allowed values of $\delta$. Again, for small enough $a$
and $\Delta$, only one mode is guided, and the structure is
referred to as a single-mode fiber, free of modal dispersion.
For some applications, though, it is convenient
to use multimode fibers accepting many modes. Note
that the modes for which $m \neq 0$ carry orbital angular
momentum. 

3.2 Surface Plasmons 

Guided modes can also be localized around a single 
interface between two media provided that one of them 
is a metal. The refractive index of a metal, due to the 
presence of free electrons, is dominated by its imagi- 
nary part, explaining why these materials are opaque. 
To understand the resulting modes, known as surface 
plasmons, we need to consider the vector character of 
the field. These vector mode solutions are exponen- 
tially localized near the interface between a metal and 
a transparent material, chosen to be the x = 0 plane, 

$$\mathbf{E}(\mathbf{r}) = \begin{cases} \mathbf{a}\, e^{-\gamma_0 x} e^{i\beta z}, & x > 0, \\ \mathbf{b}\, e^{\gamma_1 x} e^{i\beta z}, & x < 0, \end{cases}$$

where the constant vectors $\mathbf{a}$ and $\mathbf{b}$ have zero $y$ components
for reasons that will soon become apparent. Let
the refractive index of the transparent material ($x > 0$)
and the metal ($x < 0$) be $n_0$ and $n_1 = i\nu$, respectively.
In both media, $\mathbf{E}$ satisfies the Helmholtz equation
$(\nabla^2 + k^2 n^2)\mathbf{E} = 0$ and the divergence condition
$\nabla \cdot \mathbf{E} = 0$, leading to the constraints $\gamma_0^2 = \beta^2 - k^2 n_0^2$,
$\gamma_1^2 = \beta^2 + k^2 \nu^2$, $a_x \gamma_0 = i a_z \beta$, and $b_x \gamma_1 = -i b_z \beta$.

Different boundary conditions apply to field components
that are tangent and normal to the interface: the
tangent component must be continuous, as must be
the normal one times the square of the refractive
index (for media with negligible magnetic response).
This results in the relations $a_z = b_z$ and $n_0^2 a_x =
-\nu^2 b_x$, which in combination with the four constraints
found earlier give

$$\frac{\gamma_0}{n_0^2} = \frac{\gamma_1}{\nu^2} = \frac{\beta}{n_0 \nu} = \frac{k}{\sqrt{\nu^2 - n_0^2}}.$$

That is, for surface plasmons to exist, the imaginary 
part of the metal’s refractive index must be larger than 
the real part of the transparent medium’s refractive 
index, and hence the exponential localization within 
the metal is faster. 
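
The closed-form result for the plasmon constants can be checked against the constraints it was derived from. In this sketch the metal is idealized as having a purely imaginary index i*nu, as in the text; the function name and the numerical values of the wavelength, n_0, and nu are illustrative assumptions rather than data for any particular material.

```python
import math


def surface_plasmon(k: float, n0: float, nu: float) -> dict:
    """Decay rates and propagation constant of the surface mode.

    gamma0 and gamma1 are the decay rates in the transparent medium and
    in the metal; beta is the propagation constant along the interface.
    """
    if nu <= n0:
        raise ValueError("a surface plasmon requires nu > n0")
    t = k / math.sqrt(nu * nu - n0 * n0)
    return {"gamma0": n0 * n0 * t, "gamma1": nu * nu * t, "beta": n0 * nu * t}


k = 2.0 * math.pi / 0.633       # hypothetical wavelength of 0.633 (arbitrary units)
n0, nu = 1.0, 3.0               # hypothetical indices
mode = surface_plasmon(k, n0, nu)

# The Helmholtz constraints on either side of the interface.
residual_dielectric = mode["gamma0"] ** 2 - (mode["beta"] ** 2 - (k * n0) ** 2)
residual_metal = mode["gamma1"] ** 2 - (mode["beta"] ** 2 + (k * nu) ** 2)
```

Both residuals vanish up to rounding, and the ratio gamma1 / gamma0 = nu^2 / n0^2 exceeds one, which is the faster localization within the metal noted above.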


Surface plasmons are also supported by curved inter- 
faces such as the surfaces of wires or particles. They 
are the basis for many modern photonic devices like 
sensors (given the strong dependence of their behavior 
on shape, frequency, and material properties) and field 
enhancement techniques used in microscopy and solar 
cells. 

4 Time Dependence and Causality 

The ideas described so far rely on the assumption that the field is monochromatic, so that the temporal dependence is accounted for simply through a factor $e^{-i\omega t}$. While this assumption is useful when modeling light emitted by highly coherent continuous-wave lasers, it is an idealization as it implies that the field presents this behavior for all times. Fields with more general time dependence require a superposition of temporal frequencies:

$$
E(\mathbf{r}, t) = \int_0^\infty U(\mathbf{r}; \omega)\, e^{-i\omega t}\, d\omega.
$$

Note that this choice of Fourier superposition extends only over positive $\omega$, which implies that $E$ is a complex function. While the physical field represented by $E$ corresponds to its real part, it is convenient to keep the imaginary part for several reasons. One is that the effective spectral support (the range in $\omega$ where $U$ takes on significant values) is reduced by at least half if negative frequencies are excluded, which is advantageous for, say, sampling purposes. The second and more important reason is that this complex representation allows us to preserve the concept of phase, which is very useful as it is often connected to physical parameters in the problem. This complex field representation, known as the analytic signal representation, can be obtained from the real field by Fourier transforming, removing the negative-frequency components and doubling the positive-frequency ones, and then inverse Fourier transforming. Equivalently, one can simply add to the real field its temporal Hilbert transform times the imaginary unit $i$.
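
The Fourier recipe for the analytic signal can be sketched on sampled data as follows. This is a minimal illustration, assuming a uniformly sampled periodic record and a naive discrete Fourier sum; the helper names are hypothetical. It uses the common signal-processing sign convention, in which the retained components behave as $e^{+i\omega t}$; the $e^{-i\omega t}$ convention of the text corresponds to the complex conjugate of the result.

```python
import cmath
import math

def dft(samples, sign):
    """Naive O(N^2) discrete Fourier sum with kernel exp(sign*2*pi*i*k*m/N)."""
    n = len(samples)
    return [
        sum(samples[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
            for m in range(n))
        for k in range(n)
    ]

def analytic_signal(real_samples):
    """Remove negative frequencies, double positive ones, transform back."""
    n = len(real_samples)
    spectrum = dft(real_samples, -1)
    weights = [0.0] * n
    weights[0] = 1.0                       # zero-frequency term kept once
    for k in range(1, (n + 1) // 2):
        weights[k] = 2.0                   # positive frequencies doubled
    if n % 2 == 0:
        weights[n // 2] = 1.0              # shared Nyquist bin kept once
    filtered = [w * s for w, s in zip(weights, spectrum)]
    return [value / n for value in dft(filtered, +1)]

n = 32
field = [math.cos(2 * math.pi * 3 * m / n) for m in range(n)]
z = analytic_signal(field)                 # complex samples; real part is `field`
```

For this cosine input the result is a pure complex exponential, so its modulus is constant and its argument gives the phase directly, which is the practical benefit described above.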

The optical properties of a material or system depend on the field’s frequency, and can also be represented by complex functions of $\omega$. These properties include the refractive index $n(\omega)$ or the propagation transfer functions $e^{i\beta z}$ for guided modes in waveguides (with $\beta = \omega n_0 (1 - \delta)/c$) or plasmons. Such properties appear in products with the field, so that in the time domain they appear in convolutions with the field:

$$
\mathcal{F}^{-1}[g(\omega)\, U(\omega)] = \int E(t')\, g(t - t')\, dt',
$$

where $g$ denotes any such quantity. The integral in $t'$ can be interpreted as the superposition of medium responses $g$ (that depend on the time delay $t - t'$) at the “present” $t$ due to external field stimuli $E$ at all “past times” $t'$. However, from physical considerations the medium cannot predict the future, so the integral should only extend over $t' \in (-\infty, t]$, which can be enforced if $g(\tau) = 0$ for all $\tau < 0$. This implies that $g(\omega)$ must be analytic over the upper half complex $\omega$-plane, all singularities being at complex frequencies with negative imaginary part. The locations of these singularities are directly related to physical properties of the medium; the real part usually corresponds to a resonant frequency, while the imaginary part is related to the resonance’s width. Given these properties, it can easily be shown through standard residue theory that one can calculate (via Hilbert transformation) the imaginary part of $g$ given the knowledge of the corresponding real part for all real frequencies, and vice versa. Consider the case of the refractive index of a material. We can measure $n_r(\omega)$ for a sufficiently wide range in $\omega$ by measuring refraction angles at an interface and using Snell’s law. Similarly, we can measure $n_i(\omega)$ by measuring light absorption for the same range of frequencies. The analytic relation reveals that these two physical aspects of wave propagation that appear to be completely separate are in fact intimately linked, to the point that in theory only one of the measurements is needed to predict the results of the other.
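
The link between real and imaginary parts can be checked numerically on a toy model. The sketch below assumes a single-resonance causal response, $g(\omega) = i/(\omega - \omega_0 + i\Gamma)$, whose only pole sits at $\omega_0 - i\Gamma$ in the lower half-plane; this model, the function names, and the integration parameters are illustrative assumptions. Its real part is a Lorentzian, and the imaginary part is recovered from it through the principal-value integral $\operatorname{Im} g(\omega) = -\frac{1}{\pi}\,\mathrm{PV}\!\int \operatorname{Re} g(\omega')/(\omega' - \omega)\, d\omega'$.

```python
import math

OMEGA0, GAMMA = 0.0, 1.0                    # illustrative resonance and width

def re_g(w):
    """Real part of g(w) = i / (w - OMEGA0 + i*GAMMA): a Lorentzian."""
    return GAMMA / ((w - OMEGA0) ** 2 + GAMMA ** 2)

def im_g_exact(w):
    """Imaginary part of the same model, for comparison."""
    return (w - OMEGA0) / ((w - OMEGA0) ** 2 + GAMMA ** 2)

def im_g_from_re(w, half_width=200.0, h=0.01):
    """Hilbert-transform estimate of Im g(w) from Re g alone.

    The principal value is taken by pairing the points w+d and w-d, so the
    1/d singularity cancels; a midpoint rule is used in d."""
    steps = int(half_width / h)
    total = 0.0
    for j in range(steps):
        d = (j + 0.5) * h
        total += (re_g(w + d) - re_g(w - d)) / d
    return -total * h / math.pi

estimate = im_g_from_re(0.5)
exact = im_g_exact(0.5)
```

In an experiment the real part would come from measured data over a finite band, so the truncation of the integral (here `half_width`) limits the accuracy of the prediction.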

5 Concluding Remarks 

The brief overview of mathematical aspects of optics 
and photonics given here is by no means comprehen- 
sive. Optics is an extremely broad discipline encom- 
passing pure science and engineering, and as such the 
range of mathematics used in its study is immense. It 
is the author’s hope that this brief overview will motivate the reader to explore the vast literature in this field.

V.15 Numerical Relativity 

Ian Hawke 


1 The Size of the Problem 

There is a beautiful simplicity in Einstein's field equations [III.10] (EFEs),

$$
G_{ab} = 8\pi G\, T_{ab}, \tag{1}
$$

relating the curvature of space-time, encoded in the Einstein tensor $G_{ab}$, to the matter content, encoded in the stress-energy tensor $T_{ab}$. When expanded as equations determining the space-time metric $g_{ab}$, the EFEs are a complex system of coupled partial differential equations, which are analytically tractable only in situations with a high degree of symmetry or if one solves them perturbatively. For many mathematically and physically interesting cases (such as the evolution of the very early universe or the formation, evolution, and merger of black holes) the fully nonlinear EFEs must be tackled; this requires a numerical approach. The aim of this approach is to construct, from given initial data, the space-time metric $g_{ab}$ and the matter variables that form the stress-energy tensor $T_{ab}$ over as much of the space-time as possible.

At first glance, we could take the partial differential equations for $g_{ab}$ and solve them using standard numerical methods such as those based on finite differences [IV.13 §3]. However, numerical calculations require specific coordinate systems and cannot cope with the infinities associated with singularities, whether real or due to purely coordinate effects. The fundamental steps of numerical relativity are, therefore, to first transform the EFEs (1) into a form suitable for numerical calculation, then to choose a suitable coordinate system that covers as much of the space-time as possible while avoiding singularities, and finally to interpret the results.

2 Formulating the Equations 

Our first task is to break the coordinate-free nature of the EFEs and introduce “time” and “space” coordinates. This will allow us to write the EFEs in a form reminiscent of the wave equation [III.31] and therefore suitable for numerical evolution. Even before discussing explicit choices of coordinates, significant mathematical questions arise; while the motivation of this section is to prepare the ground for numerical work, nothing numerical will occur until we explicitly choose coordinates in section 3.

2.1 The 3 + 1 Decomposition 

There are many ways of splitting the space-time that 
are suitable for numerical evolution. The standard 
approach starts by introducing a coordinate time t, a 
scalar field on space-time. The four-dimensional space- 
time is then foliated into three-dimensional “slices,” 
where on each slice t is constant. We must restrict t 
so that each slice is space-like. The line element thus 
takes the form 

$$
ds^2 = g_{ab}\, dx^a dx^b = (-\alpha^2 + \beta_i \beta^i)\, dt^2 + 2\beta_i\, dt\, dx^i + \gamma_{ij}\, dx^i dx^j.
$$

The three-dimensional metric $\gamma_{ij}$ measures proper distances within the slice of constant $t = x^0$, and here the spatial indices $i, j, \ldots = 1, 2, 3$, while the space-time indices $a, b, \ldots = 0, 1, 2, 3$. Relativistic index notation, as used in general relativity and cosmology [IV.40] and tensors and manifolds [II.33], is used throughout. The lapse function $\alpha$ measures the proper time $\tau$ between neighboring slices: $d\tau = \alpha\, dt$. The shift vector $\beta^i$ measures the relative velocity between observers moving perpendicular (usually called normal) to the slices and the lines of constant spatial coordinates, $x^i_{t+dt} = x^i_t - \beta^i\, dt$. All of the new functions that we have introduced, $\{\alpha, \beta^i, \gamma_{ij}\}$, are themselves functions of the coordinates $(t, x^i)$.

The coordinate freedom is contained within the lapse and shift. The full metric is determined from the field equations (1), which must be rewritten to determine the spatial metric $\gamma_{ij}$.
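
To make the decomposition concrete, the following sketch assembles the four-dimensional metric components from a lapse, a shift, and a spatial metric according to the line element $ds^2 = (-\alpha^2 + \beta_i\beta^i)\,dt^2 + 2\beta_i\,dt\,dx^i + \gamma_{ij}\,dx^i dx^j$. The function names and sample numbers are assumptions for illustration. As a consistency check it uses the standard contravariant form of the unit normal, $n^a = (1/\alpha, -\beta^i/\alpha)$, which should have norm $-1$.

```python
def four_metric(alpha, beta_up, gamma):
    """Build g_ab (4x4 nested lists) from lapse alpha, shift beta^i and
    spatial metric gamma_ij (3x3 nested lists)."""
    beta_down = [sum(gamma[i][j] * beta_up[j] for j in range(3)) for i in range(3)]
    beta_sq = sum(beta_down[i] * beta_up[i] for i in range(3))
    g = [[0.0] * 4 for _ in range(4)]
    g[0][0] = -alpha ** 2 + beta_sq
    for i in range(3):
        g[0][i + 1] = g[i + 1][0] = beta_down[i]
        for j in range(3):
            g[i + 1][j + 1] = gamma[i][j]
    return g

def normal_vector(alpha, beta_up):
    """Contravariant unit normal n^a = (1/alpha, -beta^i/alpha)."""
    return [1.0 / alpha] + [-b / alpha for b in beta_up]

def norm(g, v):
    """g_ab v^a v^b."""
    return sum(g[a][b] * v[a] * v[b] for a in range(4) for b in range(4))

alpha = 0.8                                # illustrative values only
beta_up = [0.2, -0.1, 0.3]
gamma = [[1.0, 0.1, 0.0], [0.1, 2.0, 0.0], [0.0, 0.0, 1.5]]
g = four_metric(alpha, beta_up, gamma)
n_norm = norm(g, normal_vector(alpha, beta_up))
```

The normal is time-like for any positive lapse, which reflects the requirement that each slice be space-like.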

2.2 A Slice Embedded in Space-Time 

The intrinsic curvature of the slice can be found from the spatial metric $\gamma_{ij}$ in the same way as the full space-time curvature is found from the space-time metric. The remaining information about the space-time curvature, necessary to describe the full space-time at a point, is contained in the extrinsic curvature that describes how the three-dimensional slice is embedded in the four-dimensional space-time.

We require two key vectors. The first is the vector normal to the slice, $n_a = (-\alpha, 0)$. The second is the time vector $t^a = \alpha n^a + \beta^a$, which is tangent to the lines of constant spatial coordinates. The projection operator $P^a_b = \delta^a_b + n^a n_b$ acts on any tensor to project it into the slice. For example, $P^a_b t^b = \beta^a$, as the piece normal to the slice is “projected away.” The extrinsic curvature

$$
K_{ab} = -P^c_a \nabla_c n_b = -(\nabla_a n_b + n_a n^c \nabla_c n_b)
$$

measures how $n_a$ changes within the slice, where $\nabla_a$ is the covariant derivative.

By defining the spatial (three-dimensional) covariant derivative $D_i = P^a_i \nabla_a$, the definition of the extrinsic curvature can be used to show that

$$
\partial_t \gamma_{ij} = -2\alpha K_{ij} + D_i \beta_j + D_j \beta_i. \tag{2}
$$

So if the extrinsic curvature is known, we can solve for the spatial metric.

2.3 The ADM Equations 

The field equations (1) can be projected in two ways: either normal to the slice using $n^a$ or into the slice using $P^a_i$. Projections including $n^a$ give

$$
{}^{(3)}R + K^2 - K_{ij} K^{ij} = 16\pi\, n^a n^b T_{ab} := 16\pi\rho, \tag{3}
$$
$$
D_j (K^{ij} - \gamma^{ij} K) = -8\pi\, P^{ia} n^b T_{ab} := 8\pi j^i. \tag{4}
$$

The 3-Ricci scalar ${}^{(3)}R$ is determined from the Riemann curvature tensor of the spatial metric $\gamma_{ij}$, and $K = K^i_i$ is the trace of the extrinsic curvature. These constraint equations do not involve time derivatives and are independent of the coordinate gauge functions $\alpha$, $\beta^i$. They constrain the admissible solutions inside a slice based on its geometry and the matter sources.

Projecting the field equations purely into the slice gives an evolution equation for the extrinsic curvature, which can be written as

$$
\begin{aligned}
\partial_t K_{ij} ={}& \beta^k \partial_k K_{ij} + K_{ki} \partial_j \beta^k + K_{kj} \partial_i \beta^k - D_i D_j \alpha \\
& + \alpha \left[ {}^{(3)}R_{ij} + K K_{ij} - 2 K_{ik} K^k{}_j \right] \\
& + 4\pi\alpha \left[ \gamma_{ij} (S - \rho) - 2 S_{ij} \right].
\end{aligned} \tag{5}
$$

Here we have defined the projected stress term $S_{ab} := P^c_a P^d_b T_{cd}$. The equations (2) and (5) (which, when combined with a choice of coordinate gauge, determine the evolution of the space-time) are typically referred to as the ADM equations (after Arnowitt, Deser, and Misner) in the numerical relativity literature. However, this system is mathematically not equivalent to the original system derived by Arnowitt, Deser, and Misner. The difference is the addition of a term proportional to the Hamiltonian constraint (3), so the solutions should be physically equivalent.

This highlights a theme of research in numerical relativity over the past twenty years: the search for “better” formulations of the equations, either by changing variables or by adding multiples of the constraint equations (which should, for physical solutions, vanish). One mathematical problem is finding criteria for the system of equations that ensure accurate and reliable numerical evolutions.

2.4 Hyperbolicity 

Two key criteria have been considered in numerical relativity, both borrowed from general numerical analysis. First, the system of equations should be well-posed: solutions to the system should depend continuously on the initial data. Formally, we define this to mean that solutions $u(t, x)$ must in some norm $\|\cdot\|$ satisfy

$$
\|u(t, x)\| \le C e^{kt}\, \|u(0, x)\|, \tag{6}
$$

where the constants $C$, $k$ must be independent of the initial data. The state vector $u$, which for the ADM equations is $u = (\alpha, \gamma_{ij}, K_{ij})^T$, should not be confused with a space-time vector. An example system that is not well-posed is



where $A$, $B$ are constants and $i = \sqrt{-1}$. As the solution for $u_1$ grows as $t e^{kt}$, the bound (6) does not hold. Systems of this form are particularly relevant for the ADM equations.

A closely related concept is that of hyperbolicity. For systems of the form

$$
\partial_t u + M^i \partial_i u = s(u),
$$

we can analyze the system solely from the set of matrices $M^i$. Specifically, choosing an arbitrary direction specified by a unit vector $n_i$, the principal symbol $P(n_i) := M^i n_i$ is a matrix determining the hyperbolicity of the system. If $P$ has real eigenvalues and a complete set of eigenvectors, the system is strongly hyperbolic; if its eigenvectors are not complete, the system is weakly hyperbolic. If the eigenvalues are not real, the system is not hyperbolic.

The central result is that only strongly hyperbolic systems are well-posed. The original ADM equations are not hyperbolic. The standard ADM equations in section 2.3 are weakly hyperbolic in general (in low-dimensional cases such as spherical symmetry, the system may become strongly hyperbolic). Therefore, neither system is useful for large simulations without symmetries.

2.5 Current Formulations

As the standard ADM equations are unsuitable for numerical evolution, considerable effort has been put into finding better formulations. There have been two approaches: the first, more experimental, approach started from the ADM equations and attempted to find the minimal set of modifications that lead to an acceptable formulation; the second approach starts from the mathematical requirements of well-posedness and hyperbolicity and systematically constructs acceptable formulations. The two most widely used formulations are as follows.

2.5.1 BSSNOK

The BSSNOK formulation originates from a modification of the ADM equations introduced by Nakamura, Oohara, and Kojima in 1987, which was then modified by Shibata and Nakamura and systematically studied by Baumgarte and Shapiro. It introduces a number of new auxiliary variables with constraints resulting from their definition. It is written in terms of the conformal metric $\tilde\gamma_{ij}$, which is the rescaling $\tilde\gamma_{ij} := e^{-4\phi} \gamma_{ij}$ such that $\tilde\gamma_{ij}$ has unit determinant. The evolution equation for the conformal metric now depends on the conformal traceless extrinsic curvature $\tilde A_{ij} = e^{-4\phi} [K_{ij} - \tfrac{1}{3} \gamma_{ij} K]$. Finally, the contracted Christoffel symbols of the conformal metric $\tilde\Gamma^i := \tilde\gamma^{jk} \tilde\Gamma^i_{jk} = -\partial_j \tilde\gamma^{ij}$ are used, leading to a formulation containing the seventeen equations

$$
\frac{d}{dt}
\begin{pmatrix} \tilde\gamma_{ij} \\ \phi \\ \tilde A_{ij} \\ K \\ \tilde\Gamma^i \end{pmatrix}
=
\begin{pmatrix}
-2\alpha \tilde A_{ij} \\
-\tfrac{1}{6} \alpha K \\
e^{-4\phi} H_{ij} + \alpha (K \tilde A_{ij} - 2 \tilde A_{ik} \tilde A^k{}_j) \\
-D_i D^i \alpha + \alpha (\tilde A_{ij} \tilde A^{ij} + \tfrac{1}{3} K^2) \\
\cdots
\end{pmatrix}
+ \text{matter terms}, \tag{7}
$$

where $H_{ij}$ is the trace-free part of $-D_i D_j \alpha + \alpha R_{ij}$. Note that the constraints (3), (4) have been used to eliminate certain terms, and $d/dt = \partial_t - \mathcal{L}_\beta$, where $\mathcal{L}_v$ is the Lie derivative giving the change of the quantity along integral curves of $v$, which obeys, for example,

$$
\begin{aligned}
\mathcal{L}_v \phi &= v^c \nabla_c \phi, \\
\mathcal{L}_v w^a &= v^c \nabla_c w^a - w^c \nabla_c v^a, \\
\mathcal{L}_v w_a &= v^c \nabla_c w_a + w_c \nabla_a v^c, \\
\mathcal{L}_v T_{ab} &= v^c \nabla_c T_{ab} + T_{cb} \nabla_a v^c + T_{ac} \nabla_b v^c.
\end{aligned}
$$

The motivations for introducing these new variables include

• better control over coordinate conditions, as discussed in section 3, and

• the fact that all second derivative terms in (7) appear as simple scalar Laplace operators.

Empirically, the key step was using the momentum constraint (4) in the reformulation of the evolution equation for the $\tilde\Gamma^i$. It was only considerably after its introduction that the BSSNOK system was shown to be strongly hyperbolic.


2.5.2 The Generalized Harmonic Formalism 


The generalized harmonic formalism starts from the original proof, by Choquet-Bruhat, that the EFEs are well-posed. That used a specific coordinate system (harmonic coordinates defined by d’Alembert’s equation $0 = \Box x^c := \nabla^b \nabla_b x^c$) to show that the EFEs (1) could be rewritten such that the principal part resembled the simple wave equation. Mathematical requirements such as well-posedness are therefore straightforward to prove.

For numerical evolution, relying on a specific choice of coordinates is frequently a bad idea, as will be seen later. Instead, the generalized harmonic formulation relies on a set of arbitrary functions $H^c = \Box x^c$, which allow the EFEs (1) to be written as

$$
\tfrac{1}{2} g^{cd} g_{ab,cd} + g^{cd}{}_{(,a}\, g_{b)d,c} + H_{(a,b)} - H_d \Gamma^d_{ab} + \Gamma^c_{bd} \Gamma^d_{ac} = -8\pi \left( T_{ab} - \tfrac{1}{2} g_{ab} T \right).
$$

Again the principal part of the system can be related to the wave equation and, provided suitable equations are given to evolve the arbitrary functions $H^c$, the system is manifestly hyperbolic. The arbitrary functions $H^c$ do not play exactly the same role as the gauge functions $(\alpha, \beta^i)$ do in standard formulations such as BSSNOK but instead determine their evolution, as expressing the definition of the $H^c$ functions in 3 + 1 terms gives

$$
(\partial_t - \beta^k \partial_k)
\begin{pmatrix} \alpha \\ \beta^i \end{pmatrix}
=
\begin{pmatrix}
-\alpha (H_t - \beta^k H_k + \alpha K) \\
\alpha g^{ij} \left[ \alpha (H_j + g^{kl} \Gamma_{jkl}) - \partial_j \alpha \right]
\end{pmatrix}.
$$

It is expected that for every coordinate choice there will be a suitable choice of functions $H^c$, and vice versa, so the standard approach within the field is to consider the gauge functions $(\alpha, \beta^i)$.


3 The Choice of Coordinates 

In the 3 + 1 picture we are free to choose the gauge $(\alpha, \beta^i)$ as we wish. However, to be suitable for numerical evolution, the gauge must satisfy some additional criteria. In particular, the choice of gauge must avoid the formation of coordinate singularities (as illustrated by standard Schwarzschild coordinates) and avoid reaching physical singularities, as the representation of the infinities involved in these singularities is problematic in numerical calculations. In addition, the conditions must be well-behaved mathematically, retaining the well-posedness of the system. Ideally, the conditions should also be easy to implement numerically, respect underlying symmetries of the space-time, and not complicate the analysis of the results. Clearly this is a difficult list of criteria to meet!

The slicing condition, specifying the lapse function $\alpha$, has been the focus of most work in this area. As the lapse describes the relation between one slice and the next, while the shift $\beta^i$ “merely” describes how the 3-coordinates are arranged on the slice, it is clear that the lapse determines whether the slice intersects with the physical singularity. To discuss the effect of the gauge it is useful to borrow the idea of an Eulerian observer from fluid dynamics: an idealized observer who stays at a fixed spatial coordinate location. A useful quantity in analyzing slicing conditions is the acceleration $a_c$ of the Eulerian observers $n^c$ given by $a_c := n^b \nabla_b n_c$. Expressing this in terms of the gauge, we obtain

$$
a_t = \beta^k \partial_k \log(\alpha), \qquad a_i = \partial_i \log(\alpha),
$$

where $a_t = a_0$ is the “time” component of the acceleration, illustrating the problems that would arise from coordinate singularities. We also note that the definition of the extrinsic curvature implies

$$
\nabla_c n^c = -K,
$$

illustrating the potentially catastrophic growth (or collapse) of the volume elements (given by $n^c$) should the extrinsic curvature diverge.

3.1 Geodesic Slicing 

The simplest slicing condition is to set $\alpha = 1$; each slice is equally spaced in coordinate time. This geodesic slicing condition means that the acceleration of the Eulerian observers vanishes; they are in free fall. This is catastrophically bad for two reasons.

First, the standard calculation for a particle freely falling into a (Schwarzschild) black hole shows that it reaches the singularity within a time $\tau = \pi M$, where $M$ is the black hole mass. The geodesic slicing condition will not therefore avoid physical singularities.

Second, considering the evolution equations for the extrinsic curvature (and simplifying such that $\beta^i = 0$) we find, for geodesic slicing, that

$$
\partial_t K = K_{ij} K^{ij} + 4\pi (\rho + S).
$$

Assuming the strong energy condition, the right-hand side is strictly positive and hence $K$ increases without bound, meaning the volume elements $n^c$ collapse to zero. This indicates that a coordinate singularity is inevitable when the extrinsic curvature is nontrivial.
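
The unbounded growth can be made quantitative with a short estimate. The following is a sketch under stated assumptions (vanishing shift, the strong energy condition so that $\rho + S \ge 0$, and a positive initial value $K_0$ at $t = 0$); the trace-free part $A_{ij}$ is introduced here only for the argument.

```latex
% Split K_ij into its trace-free part A_ij and its trace:
%   K_{ij} = A_{ij} + \tfrac{1}{3} \gamma_{ij} K ,  with  \gamma^{ij} A_{ij} = 0 .
\begin{align*}
K_{ij} K^{ij} &= A_{ij} A^{ij} + \tfrac{1}{3} K^2 \;\ge\; \tfrac{1}{3} K^2 , \\
\partial_t K &= K_{ij} K^{ij} + 4\pi(\rho + S) \;\ge\; \tfrac{1}{3} K^2 , \\
K(t) &\ge \frac{K_0}{1 - K_0 t / 3} \qquad (K_0 > 0) .
\end{align*}
% The lower bound diverges at t = 3 / K_0, so K blows up no later than that
% coordinate time along the observer's worldline.
```

The bound shows that the blow-up occurs in finite coordinate time, not merely asymptotically, once $K$ is positive anywhere on the slice.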

3.2 Maximal Slicing 

To avoid the “focusing” problem suffered by geodesic slicing, we could insist that the volume elements remain constant, i.e., $K = 0 = \partial_t K$. This condition, known as maximal slicing, satisfies

$$
D^2 \alpha = \alpha \left[ K_{ij} K^{ij} + 4\pi (\rho + S) \right].
$$

The “maximal” part of the name comes from the proof that, when $K = 0$, the volume of the slice is maximal with respect to small variations of the hypersurface itself.

Maximal slicing leads to an elliptic equation, with the key advantages that the slice will be smooth and will avoid physical singularities. To illustrate the behavior of the slice, Smarr and York introduced a model in which the space-time is spherically symmetric, the spatial metric $\gamma_{ij}$ is flat, the scalar curvature $R$ is a constant $R_0$ within a sphere of radius $r_0$ and zero outside, and there is no matter. This is inconsistent (as the scalar curvature clearly depends on the spatial metric) but sufficiently accurate to be useful when interpreted in a perturbative sense. Using the Hamiltonian constraint (3), the maximal slicing equation in vacuum can be rewritten as $D^2 \alpha = \alpha R$. We thus find that for this model the maximal slicing equation becomes

$$
\frac{1}{r^2} \frac{d}{dr} \left( r^2 \frac{d\alpha}{dr} \right) =
\begin{cases}
\alpha R_0, & r < r_0, \\
0 & \text{otherwise},
\end{cases}
$$

with solution

$$
\alpha(r) = \frac{1}{r \sqrt{R_0}}\, \frac{\sinh(r \sqrt{R_0})}{\cosh(r_0 \sqrt{R_0})}, \qquad r < r_0.
$$

The minimal value of the lapse occurs at the origin and is $\alpha_{\min} = \alpha(0) = 1/\cosh(r_0 \sqrt{R_0}) \sim e^{-r_0}$. This exponential collapse of the lapse is a standard feature of singularity-avoiding slicings.
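
The Smarr and York model solution can be evaluated and checked directly. The sketch below uses the interior solution $\alpha(r) = \sinh(r\sqrt{R_0}) / (r\sqrt{R_0}\cosh(r_0\sqrt{R_0}))$; the exterior branch $\alpha = 1 + C/r$ with $C = \tanh(r_0\sqrt{R_0})/\sqrt{R_0} - r_0$ is an assumption added here, obtained by requiring $\alpha \to 1$ at infinity and continuity at $r_0$. Function names and parameter values are illustrative.

```python
import math

def maximal_lapse(r, r0, R0):
    """Lapse for the constant-curvature model; exterior branch is assumed."""
    k = math.sqrt(R0)
    amplitude = 1.0 / (k * math.cosh(k * r0))
    if r < r0:
        if r == 0.0:
            return 1.0 / math.cosh(k * r0)       # limit of sinh(k r)/(k r)
        return amplitude * math.sinh(k * r) / r
    c = math.tanh(k * r0) / k - r0               # matches value at r = r0
    return 1.0 + c / r

def interior_residual(r, r0, R0, h=1e-3):
    """(1/r^2) d/dr (r^2 d(alpha)/dr) - R0*alpha via central differences.

    Only meaningful for h < r < r0 - h."""
    a_minus = maximal_lapse(r - h, r0, R0)
    a_mid = maximal_lapse(r, r0, R0)
    a_plus = maximal_lapse(r + h, r0, R0)
    first = (a_plus - a_minus) / (2 * h)
    second = (a_plus - 2 * a_mid + a_minus) / h ** 2
    return second + 2 * first / r - R0 * a_mid

r0, R0 = 2.0, 1.0                                # illustrative values only
alpha_min = maximal_lapse(0.0, r0, R0)
residual = interior_residual(1.0, r0, R0)
```

Increasing `r0` at fixed `R0` drives `alpha_min` toward zero exponentially, which is the collapse of the lapse described above.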

While maximal slicing avoids both physical and coordinate singularities, it has a number of problems that restrict its use in numerical simulations. First, as an elliptic condition it can be difficult to solve efficiently and accurately on grids adapted for solving hyperbolic equations. Elliptic equations also depend strongly on the boundary conditions, which for astrophysical problems should be enforced at infinity, where the space-time is expected to be flat. This is not possible on a standard finite numerical grid, leading to inherent approximations. Most important, however, is the phenomenon of slice stretching. In the vicinity of a black hole the slice is sucked toward the singularity (as observers infall) and also wraps around it (as the slice avoids the singularity). This leads to a peak in the metric components; for a Schwarzschild black hole the peak is in the radial components and grows as $\tau^{4/3}$, where $\tau$ is proper time at infinity.


3.3 Hyperbolic Slicings 


While the key problem with maximal slicing is the 
stretching of the slice (a 3-coordinate effect that 
requires a shift condition to rectify), considerable effort 
has been invested in hyperbolic slicing conditions. The 
two main reasons for this are that the computational 
effort required is considerably reduced when using sim- 
ple numerical techniques and that the analysis of the 
well-posedness of the full system, coupling any for- 
mulation in section 2 to the gauge, is considerably 
simplified. 

The simplest hyperbolic slicing condition is the harmonic gauge condition $\Box x^c = 0$ introduced in section 2.5.2. Written in terms of the gauge variables, this becomes

$$
\frac{d}{dt} \alpha = -\alpha^2 K \quad \Longrightarrow \quad \frac{d}{dt} \left( \frac{\alpha}{\sqrt{\gamma}} \right) = 0,
$$

where $\gamma$ is the determinant of the 3-metric $\gamma_{ij}$. Harmonic slicing thus directly relates the lapse to the volume elements of Eulerian observers. However, numerical experiments rapidly showed that harmonic slicing is not singularity avoiding.

The Bona-Masso family of slicings is an ad hoc generalization satisfying the condition

$$
\frac{d}{dt} \alpha = -\alpha^2 f(\alpha) K,
$$

where $f(\alpha)$ is an arbitrary, positive function. Note that, if $f(\alpha) = 2/\alpha$, this can be directly integrated to give $\alpha = 1 + \log(\gamma)$, the “1 + log slicing” condition, which closely mimics maximal slicing. Harmonic slicing is clearly given by $f(\alpha) = 1$.

To study the singularity-avoiding properties of the slicing, note that as $\partial_t \gamma^{1/2} = -\gamma^{1/2} \alpha K$ (when the shift vanishes) we have

$$
d \log(\gamma^{1/2}) = \frac{d\alpha}{\alpha f(\alpha)}.
$$

Thus, $\gamma^{1/2} \propto \exp\left[ \int (\alpha f(\alpha))^{-1}\, d\alpha \right]$, and it follows that, as $f$ is positive, the lapse must collapse as the volume elements go to zero in the approach to the singularity. For the specific case of 1 + log slicing we have $\gamma^{1/2} \propto e^{\alpha/2}$, which is finite as the lapse collapses, implying singularity avoidance. For the case of harmonic slicing we have $\gamma^{1/2} \propto \alpha$, so as the lapse collapses the volume elements do as well. This marginal singularity avoidance is insufficient for numerical simulations.
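
The difference between the two slicings can be seen by integrating $d\log(\gamma^{1/2}) = d\alpha / (\alpha f(\alpha))$ numerically from a reference lapse down to a small value. This is a minimal sketch: the function name, the reference lapse $\alpha = 1$, and the midpoint quadrature are assumptions made for illustration.

```python
import math

def log_volume_element(alpha, f, alpha_ref=1.0, steps=20000):
    """Integrate d(alpha) / (alpha * f(alpha)) from alpha_ref to alpha.

    Returns log(gamma^{1/2}) relative to its value at alpha_ref, using the
    midpoint rule."""
    h = (alpha - alpha_ref) / steps
    total = 0.0
    for j in range(steps):
        a = alpha_ref + (j + 0.5) * h
        total += h / (a * f(a))
    return total

def one_plus_log(a):
    return 2.0 / a          # f(alpha) = 2/alpha

def harmonic(a):
    return 1.0              # f(alpha) = 1

collapsed_lapse = 0.01
volume_one_plus_log = math.exp(log_volume_element(collapsed_lapse, one_plus_log))
volume_harmonic = math.exp(log_volume_element(collapsed_lapse, harmonic))
```

With 1 + log slicing the volume element stays of order one as the lapse collapses, while with harmonic slicing it shrinks in proportion to the lapse, in line with the scalings $e^{\alpha/2}$ and $\alpha$ quoted above.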

3.4 Shift Conditions 

As with the development of hyperbolic slicing conditions, the standard hyperbolic shift conditions are motivated by elliptic conditions. The distortion tensor $\Sigma_{ij} = \tfrac{1}{2} \gamma^{1/3} \mathcal{L}_t \tilde\gamma_{ij}$ is essentially the velocity of the conformal metric. The minimal distortion condition, which minimizes the integral of $\Sigma_{ij} \Sigma^{ij}$ over the slice, is

$$
D^j \Sigma_{ij} = 0.
$$

When rewritten in terms of variables introduced for the BSSNOK formulation, this becomes the Gamma freezing condition

$$
\partial_t \tilde\Gamma^i = 0. \tag{8}
$$

Driver conditions (hyperbolic equations that aim to asymptotically satisfy elliptic conditions such as (8)) have become the method of choice in the field. The Gamma driver condition

$$
\partial_t \beta^i = \tfrac{3}{4} B^i, \qquad \partial_t B^i = \partial_t \tilde\Gamma^i - \eta B^i
$$

is intended to resemble a wave equation. With it, we assume there exists a (slowly varying) stationary state for the shift satisfying (8), to which the “current” shift is a small perturbation. The driver condition then propagates and damps (using the hand-tuned coefficient $\eta$) these perturbations, driving the final result to a stationary Gamma freezing state. While this condition has been used successfully for evolving single black holes, once a black hole moves on the grid (as it must in the example of binary black holes), it is necessary to modify the gauge to the moving puncture condition, where

$$
\partial_t \to \partial_t - \beta^k \partial_k.
$$

The combination of 1 + log slicing with moving puncture gauges produces the “trumpet” slice, which has been shown to cover the exterior and part of the interior of the space-time of a Schwarzschild black hole, without coordinate singularities, while settling down to a steady state. The slice does not cover the entire space-time; in the wormhole picture, the slice just penetrates the “neck” or “throat,” as shown in figure 1.

Figure 1 A Penrose diagram showing the relationship between the initial and evolved wormhole slices and a trumpet slice. The heavy dots represent the distribution of numerical grid points. (Reprinted figure with permission from David J. Brown, Physical Review D 80:084042 (2009). Copyright (2009) by the American Physical Society.)


4 Covering the Space-Time 


The simplest example of a numerical space-time considers 1 + 1 dimensions with Cartesian coordinates and no matter. Unfortunately, there can be no interesting dynamics: the Riemann tensor has only one independent component, which must vanish, so the space-time is Minkowski. However, this case can be used to illustrate the techniques and, in particular, the gauge dynamics. In this restricted case we will use the notation $g = g_{xx}$ and $K = K^x_x$ for the only nontrivial components of the metric and the extrinsic curvature.

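The ADM equations together with 1 + log slicing can be written, in 1 + 1 dimensions, in the following form:

$$
\partial_t
\begin{pmatrix} \alpha \\ g \\ D_\alpha \\ D_g \\ g^{1/2} K \end{pmatrix}
+ \partial_x
\begin{pmatrix} 0 \\ 0 \\ 2K \\ 2\alpha K \\ \alpha D_\alpha / g^{1/2} \end{pmatrix}
=
\begin{pmatrix} -2\alpha K \\ -2\alpha g K \\ 0 \\ 0 \\ 0 \end{pmatrix},
$$

where the notation $D_\alpha := \partial_x \log(\alpha)$, $D_g := \partial_x \log(g)$ has been used. Only the last three components contribute to the principal symbol

$$
P(n_i = (1, 0, 0)) =
\begin{pmatrix}
0 & 0 & 2/g^{1/2} \\
0 & 0 & 2\alpha/g^{1/2} \\
\alpha/g^{1/2} & 0 & 0
\end{pmatrix},
$$

which has eigenvalues and eigenvectors

$$
\lambda_0 = 0, \quad \lambda_\pm = \pm\sqrt{\frac{2\alpha}{g}}, \qquad
e_0 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad
e_\pm = \begin{pmatrix} 2/\alpha \\ 2 \\ \pm\sqrt{2/\alpha} \end{pmatrix}.
$$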

These are real and distinct when the lapse is positive, 
so in this special case the system is strongly hyperbolic. 
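
The strong hyperbolicity claim can be verified numerically for sample values of the lapse and metric. The sketch below assumes the principal symbol acts on the ordered variables $(D_\alpha, D_g, g^{1/2}K)$ with rows $(0, 0, 2/g^{1/2})$, $(0, 0, 2\alpha/g^{1/2})$, $(\alpha/g^{1/2}, 0, 0)$; that ordering, the helper names, and the sample values are assumptions for illustration. It checks that each claimed pair satisfies $P e = \lambda e$ and that the three eigenvectors are linearly independent.

```python
import math

def principal_symbol(alpha, g):
    s = math.sqrt(g)
    return [[0.0, 0.0, 2.0 / s],
            [0.0, 0.0, 2.0 * alpha / s],
            [alpha / s, 0.0, 0.0]]

def claimed_eigenpairs(alpha, g):
    lam = math.sqrt(2.0 * alpha / g)       # real only if alpha/g > 0
    c = math.sqrt(2.0 / alpha)
    return [(0.0, [0.0, 1.0, 0.0]),
            (lam, [2.0 / alpha, 2.0, c]),
            (-lam, [2.0 / alpha, 2.0, -c])]

def max_residual(alpha, g):
    """Largest component of P e - lambda e over the three claimed pairs."""
    p = principal_symbol(alpha, g)
    worst = 0.0
    for lam, vec in claimed_eigenpairs(alpha, g):
        image = [sum(p[i][j] * vec[j] for j in range(3)) for i in range(3)]
        worst = max(worst, max(abs(image[i] - lam * vec[i]) for i in range(3)))
    return worst

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

alpha, g = 0.7, 1.3                        # illustrative values only
residual = max_residual(alpha, g)
independence = det3([vec for _, vec in claimed_eigenpairs(alpha, g)])
```

A nonzero determinant means the eigenvectors form a complete set; as the lapse approaches zero the speeds $\lambda_\pm$ merge with $\lambda_0$ and the eigenvector components grow without bound, so the check becomes ill-conditioned.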

By diagonalizing the principal symbol, we see that the eigenfunctions $\omega_\pm = g^{1/2} K \pm D_\alpha (\alpha/2)^{1/2}$ satisfy the advection equations

$$
\partial_t \omega_\pm + \partial_x (\lambda_\pm \omega_\pm) = 0,
$$

while the eigenfunction $\omega_0 = \alpha D_\alpha / 2 - D_g / 2$ does not evolve. However, the characteristic speeds $\lambda_\pm$ are themselves evolving. It is straightforward to check that

$$
\partial_t \lambda_\pm + \lambda_\pm \partial_x \lambda_\pm = \frac{\alpha \lambda_\pm}{g^{1/2}} \left[ \left( 1 - \frac{1}{\alpha} \right) \omega_\mp \pm \sqrt{\frac{2}{\alpha}}\, \omega_0 \right].
$$

If we start from a region with, for example, $\omega_- = 0 = \omega_0$, then this equation for $\lambda_+$ is equivalent to the Burgers equation [III.4], for which the solutions generically become discontinuous in finite time. A discontinuity in $\lambda_+$ must correspond to a discontinuity in $g$ or $\alpha$, a gauge pathology that is completely unphysical!
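
For the Burgers equation $\partial_t \lambda + \lambda\, \partial_x \lambda = 0$, characteristics are straight lines $x = x_0 + \lambda(x_0, 0)\, t$, and they first cross at $t_b = -1/\min_x \partial_x \lambda(x, 0)$ whenever the initial profile has a negative slope somewhere. The sketch below estimates that time for a given initial speed profile; the function name, the sampling scheme, and the sinusoidal profile are illustrative assumptions and are not taken from an actual gauge evolution.

```python
import math

def breaking_time(initial_speed, x_min, x_max, samples=20000, h=1e-5):
    """Estimate when Burgers characteristics first cross.

    Scans the interval for the most negative slope of the initial profile
    (central differences) and returns -1/slope, or infinity if the profile
    is nowhere decreasing."""
    steepest = 0.0
    for j in range(samples + 1):
        x = x_min + (x_max - x_min) * j / samples
        slope = (initial_speed(x + h) - initial_speed(x - h)) / (2 * h)
        steepest = min(steepest, slope)
    if steepest >= 0.0:
        return math.inf
    return -1.0 / steepest

def sample_profile(x):
    return 1.0 + 0.5 * math.sin(x)         # illustrative initial speed

t_break = breaking_time(sample_profile, 0.0, 2.0 * math.pi)
```

For this profile the steepest negative slope is $-1/2$, so the speed becomes discontinuous at $t_b = 2$; a smaller initial gradient only delays the discontinuity.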

It should be noted that these gauge pathologies rarely 
occur in evolutions of physically interesting space- 
times. Firstly, the restriction to a flat space-time signif- 
icantly simplifies the analysis but drops the curvature 
terms, which are important in most cases. Secondly, the 
use of a shift condition, such as the moving puncture 
gauge outlined in section 3.4, adds a number of terms 
to the principal symbol, changing the nonlinear dynam- 
ics. The results in this section do illustrate, however, 
the essential importance of ensuring a well-behaved 
coordinate system in numerical evolutions. 


5 Evolving Black Holes 

To evolve fully general space-times without symmetries, one of the complete formulations outlined in section 2.5 must be used. In special circumstances such as spherical symmetry, it is possible to use reduced systems; it is even possible to use variants of the simple ADM equations. This allows detailed understanding of the numerical behavior of the evolution to be built up in simpler situations: for example, the only spherically symmetric black hole space-time is Schwarzschild. As detailed in general relativity and cosmology [IV.40], there are a wide range of coordinate representations of the Schwarzschild metric, but (as explained in section 4) not all are suitable for numerical evolution. The additional regularity problems at the origin in spherical symmetry mean that some care is needed that is not necessary in the general case.

One key question that can be studied in spherical symmetry is whether the formulation used for the numerical evolution respects the physics being modeled. For a black hole space-time, the main point is that no information should leave the horizon. As the eigenvalues of the principal symbol correspond to the “speed” with which information (specified by the associated eigenvectors) is propagating, the mathematical question to check is whether the radially “outgoing” eigenvalues are negative with respect to the horizon.

This question depends on the choice of formulation and gauge. For a specific variant of the BSSNOK formulation with the standard gauge, the eigenvalues are

$$
\pm 1, \quad \pm\sqrt{2/\alpha}, \quad 0, \quad \bar\beta^r, \quad \pm(\alpha\chi)^{-1/2},
$$

where $\chi = e^{-4\phi}$ is related to the metric determinant, and $\bar\beta^r = \sqrt{g_{rr}/\chi}\, \beta^r / \alpha$ is the proper shift per unit length per unit time. Note that some of the information can propagate faster than light, particularly where the lapse collapses. However, by checking the corresponding eigenvectors it can be shown that no information about the physics propagates outside a region within the horizon.

There are additional numerical reasons for perform- 
ing such a mathematical analysis for black hole space- 
times. While the use of moving puncture gauges has 
been remarkably successful for evolving black hole 
space-times, the physics suggests another solution for 
avoiding the singularity: excising it from the numeri- 
cal grid completely. This requires placing a boundary 
inside the numerical domain, within the horizon, so 
that the infinities associated with the singularity have 
no effect. Of course, this means that boundary condi- 
tions must be imposed on this surface and the inher- 
ent errors introduced from these conditions must not 
be allowed to propagate outside of the horizon. As 
seen for the horizon above, the sign of the eigenvalues 
of the principal symbol will determine which informa- 
tion is “coming into” the numerical domain and hence 
must be set by the boundary conditions. This tech- 
nique has been particularly successful when combined 
with the generalized harmonic formalism described in 
section 2.5.2. 

6 Extensions and Further Reading

The extension of the methods outlined above to cases 
without symmetries (which is necessary for studying 
cutting-edge cosmological models or the astrophysical 
problem of binary black hole merger) is well covered in 
a number of references. This article has relied heavily 
on Alcubierre’s excellent book, and a complementary 
viewpoint is given by Baumgarte and Shapiro. For the 
inclusion of relativistic matter, necessary for both mod- 
eling the early universe and for astrophysical applica- 
tions such as supernovas and neutron stars, see the 
book by Rezzolla and Zanotti. 

In addition, relativity in general is extremely well 
served by Living Reviews, a series of online review articles
covering the field that can be found at www.livingreviews.org.
Within numerical relativity there is a range
of articles on important issues not touched on above. 
One such issue is the construction of initial data for the 
numerical evolution that satisfies any constraint equa- 
tions while also being physically meaningful. A second 
issue concerns black holes in a dynamical space-time: 
what types of horizons are meaningful and useful, and 
how can they be found in numerical data? 

Finally, we should note the key topic that has driven 
large parts of modern numerical relativity: the quest 
to detect gravitational waves. These “ripples in space- 
time” carry energy and information from violent astro- 
physical events, such as binary black hole mergers. 
There are a number of currently running detectors that, 
when they successfully measure gravitational waves, 
will give unprecedented insight into the physics of 
strongly gravitating systems. The development of the 
mathematical theory of gravitational waves can be 
traced from the classic texts of Misner, Thorne, and 
Wheeler, and d'Inverno through to the recent monographs
and review articles mentioned above. With
recent advances in computational power, numerical 
methods, and mathematical techniques, numerical rel- 
ativity can simulate the gravitational waves from binary 
black hole mergers with sufficient accuracy for current 
detectors, and research into relativistic matter, such as 
binary neutron star simulations, continues. 

V.16 The Spread of Infectious Diseases

Fred Brauer and P. van den Driessche 


1 Introduction 

The first recorded example of a mathematical model for 
an infectious disease is the study by Daniel Bernoulli in 
1760 of the effect of vaccination against smallpox on 
life expectancy. This study illustrates the use of math- 
ematical modeling to try to predict the outcome of a 
control strategy. An underlying goal of much mathe- 
matical modeling in epidemiology is to estimate the 
effect of a control strategy on the spread of disease 
or to compare the effects of different control strate- 
gies. Another striking example is the work of Ronald 
Ross on malaria. He received the second Nobel Prize
in Physiology or Medicine for his demonstration of 
the mechanism of the transmission of malaria between 
humans and mosquitoes. However, his conclusion that 
malaria could be controlled by controlling mosquitoes 
was originally dismissed on the grounds that it would 
be impossible to eradicate mosquitoes from a region. 
Subsequently, Ross formulated a mathematical model 
predicting that it would suffice to reduce the mosquito 
population below a critical level, and this conclusion 
was supported by field trials.

The idea that there is a critical level of transmissi- 
bility for a disease is a fundamental one in epidemiol- 
ogy, and it developed from models rather than from 
experimental data. Some of the predictions of infec- 
tious disease models may be counterintuitive. While 
it is considered obvious that treatment of a disease 
should decrease the prevalence of the disease (i.e., the 
proportion of people infected at a given time), there 
are situations in which drug treatment may incite the 
development of a drug-resistant strain of the disease
and an increase in the treatment level may actually 
increase the prevalence of disease, with the histories 
of tuberculosis and HIV/AIDS being good examples of 
this phenomenon. Modeling is essential to identify the 
possibility of such counterintuitive effects. 

While the foundations of mathematical epidemiology 
were laid by public health physicians, there have been 
many theoretical elaborations. An elaborate mathemat- 
ical theory has developed, and there has been a diver- 
gence of interests between mathematicians and public 
health professionals. One result of this has been that 
there are both strategic models, which concentrate on 
general classes of models and theoretical understand- 
ing, and tactical models, which attempt to incorporate 
as much detail as possible into a model in order to be 
able to make good quantitative predictions. In recent 
years there have been serious attempts to encourage 
interaction between mathematical modelers and pub- 
lic health professionals to improve understanding of 
different views. 

We concentrate here on relatively simple models and 
qualitative understanding, but we recognize that the 
models used to make quantitative predictions must 
have much more detail and incorporate data. They 
must therefore usually be solved numerically. The 
development of high-speed computers has made it 
possible to simulate very complicated models quickly, 
and this has had a great influence on disease-outbreak 
management. 

There are many different types of infectious disease 
transmission models. We consider only deterministic 
models suitable for human disease, although there is 
an extensive parallel theory of stochastic models. Also, 
we confine our attention to continuous models, though 
it can be argued that since disease transmission data 
are obtained at discrete times it would be reasonable 
to use discrete models. 

For modeling an infectious disease, as for modeling 
other natural phenomena, it is essential to understand 
the biology, then to translate the biological problem 
into a mathematical framework for the important fea- 
tures, and finally to translate results back to the biol- 
ogy, bearing in mind the assumptions made. An impor- 
tant distinction is made between short-term models, 
as for disease outbreaks and epidemics, in which 
demographic effects such as births and deaths can be 
ignored, and long-term models, as for endemic situa- 
tions, in which it is necessary to include demographic 


effects. In fact, an essential difference between these 
two categories of models is the absence or presence of 
a flow of new susceptible people into the population, 
and such a flow may come from demographic effects or 
from recovery of infectious individuals without immu- 
nity against reinfection. 

We mainly concentrate on compartmental models, 
originally introduced by Kermack and McKendrick, in 
which a given population is separated into compart- 
ments identified by the disease status of the individ- 
uals in those compartments. For example, there are 
SIR (susceptible-infectious-removed) models in which 
individuals are susceptible to infection; are infectious; 
or are removed from infection by recovery from infec- 
tion with immunity against reinfection, by inoculation 
against being infected, or by death from the disease. 
The model is described by assumptions about the rates 
of passage between compartments. In an SIR model 
it is assumed that recovery provides complete immu- 
nity against reinfection. There are also SIS (susceptible- 
infectious-susceptible) models, describing a situation 
in which recovered individuals are immediately suscep- 
tible to reinfection rather than being removed. These 
two types of model describe diseases with fundamen- 
tally different properties. Typically, diseases caused 
by a virus (e.g., influenza) are of SIR type, while dis- 
eases caused by a bacterium (e.g., gonorrhea) are of 
SIS type. More elaborate compartmental structures are 
possible, such as SLIR (susceptible-latent-infectious- 
removed) models, in which there is a latent period 
between becoming infected and becoming infectious 
(the compartment L represents individuals who are 
infected but not yet infectious); models in which some 
individuals are selected for treatment; or models in 
which some individuals are asymptomatic, i.e., infec- 
tious but without disease symptoms. Also, there are 
diseases of SIRS type, for which recovery provides 
immunity but only temporarily. For example, influenza 
strains mutate rapidly, and recovery from one strain 
provides immunity against reinfection only until the 
strain mutates enough to act as a different strain. 
We consider SIR and SIS as the basic disease types 
and describe their analysis, but we also indicate how 
more complicated compartmental structures may be 
described. 

Our aim is to give a taste of this exciting area of 
applied mathematics, and to encourage further explo- 
ration, rather than to present a complete portrait of 
mathematical epidemiology. 

Figure 1 An SIR model flow chart with demographics.


2 SIR Models: Measles and Influenza

Measles and influenza are viral diseases and, in most 
cases, an individual who has been infected and has sub- 
sequently recovered has lifelong immunity to the dis- 
ease; thus, an SIR model is appropriate. Measles (with- 
out vaccination) often stays endemic in a population. 
By contrast, seasonal influenza is a relatively short- 
lived epidemic, with the number of infectious individ- 
uals peaking and then going to zero. We use these two 
diseases as examples of general SIR models. 


2.1 Models with Demographics: Measles 


We want to know what happens if a small number of 
people infected with measles are introduced into a sus- 
ceptible population of N people. To investigate this, 
we formulate an ordinary differential equation (ODE) 
system that determines the change in the numbers of 
susceptible, infectious, and recovered people with time, 
denoted by S, I, and R, respectively, and includes demo- 
graphics but assumes that there are no deaths due to 
measles. A flow chart for the model is given in figure 1. 
Let $\Lambda > 0$ be the number of newborns per unit time in
the population, let $d > 0$ be the natural death rate, let
$1/\gamma$ be the average duration of measles infection (this
is about five days), and let $\lambda$ be the disease transmission
parameter. The rate of change of $S$ is then given as
the input rate ($\Lambda$) minus the output rate ($\lambda SI/N + dS$),
and similarly for the rates of change of $I$ and $R$. Thus,
measles evolves in time according to the equations

$$
\begin{aligned}
\frac{\mathrm{d}S}{\mathrm{d}t} &= \Lambda - \frac{\lambda SI}{N} - dS,\\
\frac{\mathrm{d}I}{\mathrm{d}t} &= \frac{\lambda SI}{N} - (d + \gamma)I,\\
\frac{\mathrm{d}R}{\mathrm{d}t} &= \gamma I - dR.
\end{aligned}
$$


Here $N = S + I + R$, and it is assumed that an average
infectious person makes $\lambda$ contacts in unit time, of
which a fraction $S/N$ are with susceptible people and
thus transmit infection, giving $\lambda SI/N$ new infections in
unit time. This is called standard incidence. The model
above assumes that all newborns enter the susceptible 
class, thus ignoring passive immunity from maternal 
antibodies. 

The $R$ equation is not in fact needed as $R$ does not
enter into the other equations: it can therefore be determined
from $S$ and $I$. The first two equations always
have one equilibrium: that is, a constant solution with
$\mathrm{d}S/\mathrm{d}t = \mathrm{d}I/\mathrm{d}t = 0$. This is given by $(S, I) = (\Lambda/d, 0)$
and is called the disease-free equilibrium (DFE). The
equations may have another equilibrium $(S^*, I^*)$ with
$I^* > 0$, and this is called an endemic equilibrium.

Assume that a small number of people in a susceptible
population are infected with measles. Will measles
die out or become endemic? Analysis of the equilibria
shows that this depends on the product of the disease
transmission parameter and the average death-adjusted
infectious duration. This product, denoted by
$\mathcal{R}_0$, with $\mathcal{R}_0 = \lambda/(d + \gamma)$, is called the basic reproduction
number and is the average number of new infections
caused by an infectious person introduced into
a susceptible population. If $\mathcal{R}_0 < 1$, then the solution
tends to the DFE and measles dies out in the population.
If $\mathcal{R}_0 > 1$, then measles tends to an endemic level
with the number of infectious people

$$I^* = \frac{dN}{d + \gamma}\left(1 - \frac{1}{\mathcal{R}_0}\right),$$

with $N = \Lambda/d$. Thus $\mathcal{R}_0 = 1$ acts as a critical level, or
threshold, at which the model exhibits a forward bifurcation.
There is an exchange of stability between the
disease-free and endemic equilibria. Stochastic models
usually exhibit a similar threshold, but if $\mathcal{R}_0 > 1$, then
there is a finite probability of disease extinction.
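
The threshold behavior described above can be checked numerically. The following is a minimal sketch rather than the authors' own computation: it integrates the three model equations with a hand-written fourth-order Runge-Kutta step and compares the number of infectious people at late times with the endemic level obtained by setting the right-hand sides to zero. All function names and parameter values are illustrative assumptions.

```python
# Illustrative sketch: SIR model with demographics and standard incidence.
# Parameter values are assumptions chosen for the demonstration, not data
# from the article.

def sir_rhs(state, births, death, transmission, recovery):
    """Return (dS/dt, dI/dt, dR/dt) for the SIR model with demographics."""
    s, i, r = state
    n = s + i + r
    new_infections = transmission * s * i / n  # standard incidence
    return (
        births - new_infections - death * s,
        new_infections - (death + recovery) * i,
        recovery * i - death * r,
    )


def rk4_step(rhs, state, dt, *params):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    k1 = rhs(state, *params)
    k2 = rhs(tuple(x + 0.5 * dt * k for x, k in zip(state, k1)), *params)
    k3 = rhs(tuple(x + 0.5 * dt * k for x, k in zip(state, k2)), *params)
    k4 = rhs(tuple(x + dt * k for x, k in zip(state, k3)), *params)
    return tuple(
        x + dt * (a + 2.0 * b + 2.0 * c + e) / 6.0
        for x, a, b, c, e in zip(state, k1, k2, k3, k4)
    )


def long_time_infectious(transmission, births=20.0, death=0.02, recovery=0.2,
                         initial=(995.0, 5.0, 0.0), t_end=1000.0, dt=0.1):
    """Integrate up to t_end and return the final number of infectious people."""
    state = initial
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(sir_rhs, state, dt, births, death, transmission, recovery)
    return state[1]


def endemic_level(transmission, births=20.0, death=0.02, recovery=0.2):
    """Endemic number of infectious people (meaningful only when R0 > 1)."""
    n = births / death
    r0 = transmission / (death + recovery)
    return death * n / (death + recovery) * (1.0 - 1.0 / r0)


if __name__ == "__main__":
    for transmission in (0.11, 0.44):  # reproduction numbers 0.5 and 2
        r0 = transmission / (0.02 + 0.2)
        late = long_time_infectious(transmission)
        print(f"R0 = {r0:.2f}: infectious at late time = {late:.4f}")
    print(f"endemic level predicted for R0 = 2: {endemic_level(0.44):.4f}")
```

With the reproduction number below 1 the infection decays to the DFE, and with it above 1 the trajectory spirals in to the endemic level. The step size and end time are chosen generously for these particular parameters and would need revisiting for others.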

In order to prevent measles from becoming endemic,
a fraction $p > 1 - 1/\mathcal{R}_0$ of newborns needs to be vaccinated
to reduce $\mathcal{R}_0$ below 1 and give the population
herd immunity. This was achieved worldwide for
smallpox (eradicated in 1977), which had $\mathcal{R}_0 \approx 5$, so
that theoretically 80% vaccination provided herd immunity.
However, measles has a higher $\mathcal{R}_0$ value, up to 18
in some populations, making it unrealistic to achieve 
herd immunity. Data on measles in the pre-vaccine era 
show biennial oscillations about an endemic level, so 
more complicated models including seasonal forcing 
are needed for accurate predictions. 
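
The arithmetic behind the herd immunity figures can be made explicit. The short sketch below assumes a perfectly effective vaccine, so that vaccinating a fraction p of the population scales the reproduction number by 1 - p; the function name is hypothetical.

```python
# Illustrative sketch: critical vaccination fraction for herd immunity,
# assuming a perfectly effective vaccine.

def critical_vaccination_fraction(r0):
    """Return 1 - 1/r0, or zero if r0 is already at most 1."""
    if r0 <= 0:
        raise ValueError("the basic reproduction number must be positive")
    return max(0.0, 1.0 - 1.0 / r0)


if __name__ == "__main__":
    for name, r0 in (("smallpox", 5.0), ("measles, upper estimate", 18.0)):
        p = critical_vaccination_fraction(r0)
        print(f"{name}: R0 = {r0:g}, vaccinate more than {100 * p:.1f}%")
```

For a reproduction number of 5 the threshold is 80%, while for 18 it rises to roughly 94%, which illustrates why herd immunity is so much harder to reach for measles.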

2.2 Models Ignoring Demographics: Influenza 

Interest in epidemic models, which had been largely
ignored for many years, was reignited by the severe
acute respiratory syndrome (SARS) outbreak of 2002-3,
and this interest carried over into concerns about the
possibility of an influenza pandemic.

Figure 2 An SIR model flow chart with no demographics.

Figure 3 Simulation of the influenza SIR model showing
the number of infectious people against time, with $\lambda = 0.5$,
$\gamma = 0.25$, $N = 1000$, $I(0) = 5$.

For seasonal influenza, birth and death can be
ignored, since this is a single outbreak of short duration
and the demographic timescale is much slower than the
epidemiological timescale (see figure 2 for a flow chart).
Setting $\Lambda = d = 0$ in the measles model gives a model
with no endemic equilibrium since $I = 0$ is the only
equilibrium condition. With an initial number of infectious
people $I(0) > 0$, the parameter $\mathcal{R}_0 = \lambda/\gamma$ still
acts as a threshold. Influenza first increases to a peak
and then decreases to zero (an influenza epidemic) if
$\mathcal{R}_0 > 1$, but it simply decays to zero (no epidemic) if
$\mathcal{R}_0 < 1$. Figure 3 shows a numerical solution of the
model equations (done with Maple) with parameters
that give $\mathcal{R}_0 = 2$ for a population of 1000 starting with
5 infectious people. With $\mathcal{R}_0 > 1$, the model shows that
not all susceptible people are infected before the epidemic
dies out. In fact, the number of susceptible people
remaining uninfected, $S(\infty)$, is given implicitly as
the positive root of the final size equation:

$$\ln\frac{S(0)}{S(\infty)} = \mathcal{R}_0\left(1 - \frac{S(\infty)}{N}\right),$$

where $S(0) \approx N$ is the initial number of susceptible
people. The total number infected by the disease is
$I(0) + S(0) - S(\infty)$. The attack rate (the fraction of
the population infected by the disease) is $1 - S(\infty)/N$.
The epidemic first grows approximately exponentially,
since for small time,

$$I(t) \approx I(0)\exp\{\gamma(\mathcal{R}_0 - 1)t\}.$$

This initial growth rate can be determined from data,
and together with a value of $\gamma$ from data it can be used
to estimate $\mathcal{R}_0$. For seasonal influenza, $\mathcal{R}_0$ is usually
found to be in the range 1.4-2.4. The above formula
for herd immunity also applies for estimating the fraction
of the population that would need to be vaccinated
to theoretically eradicate influenza. However, at
the beginning of a disease outbreak, stochastic effects
are important.
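
The final size relation and the early exponential growth can both be explored with a few lines of code. The sketch below is illustrative only: it solves the final size relation by bisection, taking it in the standard form ln(S(0)/S(∞)) = R0 (1 - S(∞)/N), which assumes that everyone is initially either susceptible or infectious. It then integrates the epidemic model directly using the parameter values quoted for figure 3. The function names are hypothetical, and the Runge-Kutta integrator stands in for whatever solver was used to produce the figure.

```python
import math


def final_size(r0, n, s0):
    """Solve ln(s0/s) = r0 * (1 - s/n) for the root s in (0, s0) by bisection.

    Assumes s0 < n (at least one initial infective), so a sign change exists.
    """
    def g(s):
        return math.log(s0 / s) - r0 * (1.0 - s / n)

    hi = s0              # g(hi) < 0 because s0 < n
    lo = s0 / 2.0
    while g(lo) <= 0.0:  # g tends to +infinity as s tends to 0
        lo /= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def simulate_epidemic(transmission, recovery, n, i0, t_end=200.0, dt=0.05):
    """RK4 integration of the SIR model without demographics.

    Returns the final susceptible count, final infectious count, and the
    largest infectious count seen at any step.
    """
    def rhs(s, i):
        new_infections = transmission * s * i / n
        return -new_infections, new_infections - recovery * i

    s, i = n - i0, i0
    peak = i
    for _ in range(int(round(t_end / dt))):
        k1 = rhs(s, i)
        k2 = rhs(s + 0.5 * dt * k1[0], i + 0.5 * dt * k1[1])
        k3 = rhs(s + 0.5 * dt * k2[0], i + 0.5 * dt * k2[1])
        k4 = rhs(s + dt * k3[0], i + dt * k3[1])
        s += dt * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        i += dt * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
        peak = max(peak, i)
    return s, i, peak


def r0_from_growth_rate(growth_rate, recovery):
    """Invert the early-growth relation: growth_rate = recovery * (R0 - 1)."""
    return 1.0 + growth_rate / recovery


if __name__ == "__main__":
    s_inf = final_size(2.0, 1000.0, 995.0)
    s_end, i_end, peak = simulate_epidemic(0.5, 0.25, 1000.0, 5.0)
    print(f"final size root: {s_inf:.3f}, simulated S at late time: {s_end:.3f}")
    print(f"attack rate: {1.0 - s_inf / 1000.0:.3f}, peak infectious: {peak:.1f}")
```

The root of the final size relation and the simulated number of remaining susceptibles should agree closely, which is a useful consistency check on any numerical solution of the model.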

For more realistic models, other factors such as a 
latent period, deaths due to influenza, asymptomatic 
cases (people with no symptoms of disease but able 
to transmit infection), and age structure need to be 
included. These can all be put in a more general frame- 
work in which the essential properties continue to hold. 
However, the quantitative analysis of such complicated 
models requires numerical simulations. Antiviral treat- 
ment is also available for influenza, and model results 
can guide public health policy on vaccination and 
antiviral schedules. In fact, model predictions helped 
to guide such policies during the 2009 H1N1 influenza 
pandemic. Treatment with antiviral drugs can cause the 
emergence of resistant strains. Some recent models for 
optimal treatment strategies during an influenza pan- 
demic suggest that it is best to do nothing initially and 
then quickly ramp up treatment. However, policy mak- 
ers must weigh modeling predictions (and their inher- 
ent assumptions) with ethical, economic, and political 
concerns. 

As noted in our introduction, recovery from influ- 
enza may provide temporary immunity, indicating an 
SIRS model, which has a basic reproduction number $\mathcal{R}_0$
such that $\mathcal{R}_0 = 1$ acts as a sharp threshold (between
the DFE and the endemic equilibrium), as it does in the
SIR model with demographics. Models of this type that
take into account that vaccination is not totally effective
and also wanes with time can give rise to a backward
bifurcation, in which $\mathcal{R}_0$ must be reduced below
a value $\mathcal{R}_c$ that is less than one to eradicate the disease.
In models that exhibit backward bifurcation, for
$\mathcal{R}_c < \mathcal{R}_0 < 1$ there are two endemic equilibria, with
the larger $I$ value locally stable and the smaller $I$ value
unstable, in addition to the locally stable DFE. Thus, in
this range of parameter values the disease may die out
or tend to the larger endemic equilibrium depending
on the initial conditions.

Figure 4 An SIS model flow chart.

If we assume that the period of temporary immunity 
is a constant, then a delay differential equation SIRS 
model can be formulated, and it predicts that there 
are periodic solutions arising through a Hopf bifurcation
[IV.21 §2] about an endemic equilibrium for some
values of this period of temporary immunity. Data on 
disease dynamics often exhibit periodicity; it is there- 
fore an interesting challenge to determine what factors 
contribute to this oscillatory behavior. Delay differen- 
tial equations and systems that assume more general 
distributions pose challenging mathematical problems. 


3 SIS Models: Gonorrhea 


Bacterial diseases do not usually give immunity on 
recovery, so an SIS model is appropriate as a simple 
model for such diseases (see figure 4 for a flow chart). 
Ignoring death due to the disease, taking the total population
at its equilibrium value $N = \Lambda/d$, and using
$S = N - I$, the model can be described by the equation
for the rate of change of the number of infectious
people with time:

$$\frac{\mathrm{d}I}{\mathrm{d}t} = \lambda I\,\frac{N - I}{N} - (d + \gamma)I.$$

Given an initial number of infectious people, this logistic
equation can be solved by separation of variables
to give the number of infectious people at any later
time. Alternatively, an analysis of the equilibria can
be used to show that the basic reproduction number
$\mathcal{R}_0 = \lambda/(d + \gamma)$ acts as a threshold, with the disease
dying out if $\mathcal{R}_0 < 1$ and going to an endemic level
$I^* = N(1 - 1/\mathcal{R}_0)$ if $\mathcal{R}_0 > 1$. Even if demographics
are ignored, these conclusions still hold, unlike the
epidemic found for an SIR model.
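
Because the SIS equation is logistic, the solution obtained by separation of variables can be written in closed form. The sketch below evaluates that closed form; it is an illustration with hypothetical names, in which `removal` stands for the sum of the natural death rate and the recovery rate, and the parameter values in the usage lines are assumptions.

```python
import math


def sis_infectious(t, i0, n, transmission, removal):
    """Closed-form number of infectious people at time t for the SIS model.

    The equation dI/dt = transmission*I*(n - I)/n - removal*I is logistic
    with growth rate r = transmission - removal and carrying capacity
    k = n*r/transmission, which equals n*(1 - 1/R0).
    """
    r = transmission - removal
    if r == 0.0:
        # Degenerate case R0 = 1: algebraic rather than exponential decay.
        return i0 / (1.0 + transmission * i0 * t / n)
    k = n * r / transmission  # negative when R0 < 1
    return k * i0 / (i0 + (k - i0) * math.exp(-r * t))


if __name__ == "__main__":
    for transmission in (0.2, 0.5):  # R0 = 0.8 and R0 = 2 with removal 0.25
        late = sis_infectious(400.0, 5.0, 1000.0, transmission, 0.25)
        print(f"R0 = {transmission / 0.25:.1f}: infectious at t = 400 is {late:.6f}")
```

The same formula covers both sides of the threshold: when the reproduction number is below 1 the carrying capacity is negative and the expression decays to zero, and when it is above 1 the expression tends to the endemic level.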

Now consider a model for the heterosexual transmission
of gonorrhea: a sexually transmitted bacterial
disease. It is necessary to divide the population into
females and males and to model transmission of the
disease between them. Assume that there are no deaths
due to disease, so that the populations of females $N_F$
and of males $N_M$ are constants. Set $\beta_F = d + \gamma_F$ and
$\beta_M = d + \gamma_M$, which are the natural death-corrected
transition rates. The equations governing the number
of infectious females $I_F$ and infectious males $I_M$ are

$$
\begin{aligned}
\frac{\mathrm{d}I_F}{\mathrm{d}t} &= \lambda_{MF}\,\frac{N_F - I_F}{N_F}\,I_M - \beta_F I_F,\\
\frac{\mathrm{d}I_M}{\mathrm{d}t} &= \lambda_{FM}\,\frac{N_M - I_M}{N_M}\,I_F - \beta_M I_M,
\end{aligned}
$$

where $\lambda_{MF}$ and $\lambda_{FM}$ are the transmission coefficients
from male to female and female to male, and $1/\gamma_F$ and
$1/\gamma_M$ are the mean infectious periods for females and
males. Suppose that there is initially a small number
of males or females infected with gonorrhea. Will the 
outbreak persist or will it die out? The equations have 
a DFE with $(I_F, I_M) = (0, 0)$, and the behavior close to
this can be determined by linearizing the system (i.e.,
by neglecting terms of higher order than linear). Linear
stability of the DFE is determined by setting $I_F = C_1 e^{zt}$
and $I_M = C_2 e^{zt}$ in this linearized system and determining
the sign of the real part of $z$ for nonzero constants
$C_1$, $C_2$. This is equivalent to finding the eigenvalues of
the Jacobian matrix $J$ at the DFE, where

$$J = \begin{pmatrix} -\beta_F & \lambda_{MF} \\ \lambda_{FM} & -\beta_M \end{pmatrix}.$$

The eigenvalues of J are given by the roots of the 
characteristic equation 

$$z^2 + z(\beta_F + \beta_M) + \beta_F\beta_M - \lambda_{MF}\lambda_{FM} = 0.$$


Both roots of this quadratic equation have negative real
parts if and only if the constant term is positive. With
this condition, small perturbations from the DFE decay,
and the DFE is linearly stable. Drawing on experience
from the previous models and taking into account biological
considerations, $\mathcal{R}_0$ can be determined from this
condition (see the formula for $\mathcal{R}_0$ given below). Alternatively,
there is a systematic way to find $\mathcal{R}_0$. Write $J$
as $J = F - V$, where $F$ is the matrix associated with new
infections and $V$ is the matrix associated with transitions.
Then $\mathcal{R}_0 = \rho(FV^{-1})$, where $\rho$ denotes the spectral
radius (i.e., the largest absolute value of an eigenvalue)
and $FV^{-1}$ is the next-generation matrix. For the
above model,

$$F = \begin{pmatrix} 0 & \lambda_{MF} \\ \lambda_{FM} & 0 \end{pmatrix}, \qquad V = \begin{pmatrix} \beta_F & 0 \\ 0 & \beta_M \end{pmatrix},$$

giving

$$\mathcal{R}_0 = \sqrt{\frac{\lambda_{MF}\lambda_{FM}}{\beta_M\beta_F}}.$$

This expression contains a square root because the
basic reproduction number is the geometric mean of
the reproduction number for each sex, namely $\lambda_{MF}/\beta_F$
and $\lambda_{FM}/\beta_M$. Linear stability is governed by $\mathcal{R}_0$, but in
fact global results also hold. If $\mathcal{R}_0 < 1$, then the disease
dies out in both sexes; whereas if $\mathcal{R}_0 > 1$, the disease
tends to an endemic state $(I_F^*, I_M^*)$ and persists in both
sexes. This endemic state can be found by solving for
the positive equilibrium, giving

$$I_M^* = \frac{(\lambda_{MF}\lambda_{FM} - \beta_F\beta_M)N_F N_M}{\beta_M\lambda_{MF}N_M + \lambda_{MF}\lambda_{FM}N_F},$$

with a corresponding formula for $I_F^*$. The global stability
of the endemic state can be proved by using Lyapunov
functions, but as far as we know it is still an open
problem to prove this if deaths due to the disease are
incorporated into the model.
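
The next-generation construction for the two-sex model is small enough to carry out by hand, but a short computation makes the steps concrete. The sketch below is illustrative: it forms the product of the new-infection matrix and the inverse of the diagonal transition matrix, takes the spectral radius, and checks that the endemic state makes both right-hand sides vanish. The names `rate_f` and `rate_m` stand for the death-corrected transition rates of females and males; all names and numerical values are assumptions made for the example.

```python
import math


def spectral_radius_2x2(m):
    """Largest absolute value of an eigenvalue of a real 2x2 matrix."""
    (a, b), (c, d) = m
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4.0 * det
    if disc >= 0.0:
        root = math.sqrt(disc)
        return max(abs(0.5 * (trace + root)), abs(0.5 * (trace - root)))
    return math.sqrt(det)  # complex pair: squared modulus equals determinant


def next_generation_r0(lam_mf, lam_fm, rate_f, rate_m):
    """Spectral radius of F V^{-1} for the two-sex SIS model."""
    f = ((0.0, lam_mf), (lam_fm, 0.0))
    v_inv = (1.0 / rate_f, 1.0 / rate_m)  # inverse of diag(rate_f, rate_m)
    fv_inv = tuple(tuple(f[r][c] * v_inv[c] for c in range(2)) for r in range(2))
    return spectral_radius_2x2(fv_inv)


def gonorrhea_rhs(i_f, i_m, lam_mf, lam_fm, rate_f, rate_m, n_f, n_m):
    """Right-hand sides of the equations for infectious females and males."""
    return (
        lam_mf * (n_f - i_f) / n_f * i_m - rate_f * i_f,
        lam_fm * (n_m - i_m) / n_m * i_f - rate_m * i_m,
    )


def endemic_state(lam_mf, lam_fm, rate_f, rate_m, n_f, n_m):
    """Positive equilibrium (I_F*, I_M*), meaningful when R0 > 1."""
    i_m = ((lam_mf * lam_fm - rate_f * rate_m) * n_f * n_m
           / (rate_m * lam_mf * n_m + lam_mf * lam_fm * n_f))
    # I_F* follows from setting the female equation to zero.
    i_f = lam_mf * i_m * n_f / (rate_f * n_f + lam_mf * i_m)
    return i_f, i_m


if __name__ == "__main__":
    params = (0.6, 0.4, 0.3, 0.2)  # lam_mf, lam_fm, rate_f, rate_m
    print("R0 from next-generation matrix:", next_generation_r0(*params))
    i_f, i_m = endemic_state(*params, 800.0, 1200.0)
    print("endemic state:", i_f, i_m)
    print("residuals:", gonorrhea_rhs(i_f, i_m, *params, 800.0, 1200.0))
```

Computing the female endemic level from the equilibrium condition, rather than from a second closed-form expression, keeps the example tied to the one formula stated in the text.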


4 Models for HIV/AIDS 
4.1 Population Models 

The acquired immune deficiency syndrome (AIDS) epi- 
demic that began in the early 1980s spread world- 
wide and has prompted much research on this and 
other sexually transmitted diseases. AIDS is caused 
by the human immunodeficiency virus (HIV) and once 
infected, individuals never recover. Data show that 
after initial infection with the virus an individual has 
high infectivity, but infectivity then drops to a low level 
for a period of up to ten years before rising sharply due 
to the onset of AIDS. Infectivity therefore has a “bath- 
tub” shape. Compartmental ODE models with several 
stages of infectivity subdividing the infectious state 
can be formulated to approximate this shape. Alterna- 
tively, this can be handled by partial differential equa- 
tion models using age of infection as a variable. Cur- 
rently, HIV can be treated with highly active antiretro- 
viral therapy drugs, which both reduce the symptoms 
and prolong the period of low infectivity. 

Models for HIV/AIDS often focus on one population: 
a male homosexual community, for example, or a group 
of female sex workers and their male clients. Such mod- 
els need to take account of many factors, including level 


of sexual activity, drug use, condom use, and sexual 
contact network, and this results in large-scale systems 
with many parameters that need to be estimated from 
data. The homogeneous models described previously 
therefore need to be extended. 

To illustrate this with a simple model with one infectious
stage, consider a male homosexual community in
which there are $k$ risk groups that are delineated by
the numbers of partners an individual has each month,
thus introducing heterogeneity. Letting $S_i$, $I_i$, and $A_i$
be the proportion of males with $i$ partners each month
who are susceptible, infectious, and have developed
AIDS, respectively, the ODEs for the system are

$$
\begin{aligned}
\frac{\mathrm{d}S_i}{\mathrm{d}t} &= dn_i - \sum_{j=1}^{k}\beta_{ij}S_iI_j - dS_i,\\
\frac{\mathrm{d}I_i}{\mathrm{d}t} &= \sum_{j=1}^{k}\beta_{ij}S_iI_j - dI_i - \gamma I_i,\\
\frac{\mathrm{d}A_i}{\mathrm{d}t} &= \gamma I_i - dA_i - mA_i,
\end{aligned}
$$

for $i = 1, \dots, k$, where $n_i$ is the proportion of males
with $i$ contacts per month, $dn_i$ is the recruitment into
the male sexually active population with $i$ partners per
month, $m$ is mortality due to AIDS, and $\beta_{ij}$ contains the
contact and transmission rates between a susceptible
male with $i$ partners per month and an infectious male
with $j$ partners per month. This formulation assumes
that all infectious males proceed to AIDS with rate $\gamma$
and at that stage they withdraw from sexual activity
and do not continue to contribute to disease spread.
The contact matrix among the males in the population
$(\beta_{ij})$ can be determined by surveying their distribution
and the number of sexual partners they have. This
matrix is sometimes termed a WAIFW (who acquires
infection from whom) matrix. If it is assumed that partnerships
are formed at random but in proportion to
their expected number of partners, then

$$\beta_{ij} = \beta\,\frac{ij}{\sum_{l=1}^{k} l\,n_l}.$$


Making this assumption and taking $\beta_i = \beta$ and $\gamma_i = \gamma$,
the next-generation matrix method has $F$ matrix with
rank 1, and $V$ a scalar matrix. Thus, the nonzero eigenvalue
of $FV^{-1}$ is equal to its trace, giving the basic
reproduction number as

$$\mathcal{R}_0 = \frac{\beta}{d + \gamma}\,\frac{\sum_{i=1}^{k} i^2 n_i}{\sum_{i=1}^{k} i\,n_i} = \frac{\beta}{d + \gamma}\left(M + \frac{V}{M}\right),$$

where $M$ and $V$ are the mean and variance of the number
of sexual partners per month. If all males have the
same number of partners, then $\mathcal{R}_0 = \beta M/(d + \gamma)$, and
the above formula therefore shows that $\mathcal{R}_0$ increases
with the variance in the number of partners. Other
assumptions about the formation of partnerships can 
be incorporated into this model structure. For example, 
males could be more likely to form partnerships with 
others having the same or a similar number of contacts 
per month (assortative mixing). 
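
The effect of heterogeneity on the basic reproduction number under random (proportionate) mixing can be illustrated with a small calculation. The sketch below evaluates the reproduction number both as a ratio of sums over risk groups and in terms of the mean and variance of the number of partners, so that the two forms can be compared. The function name, the dictionary representation of the partner distribution, and the numerical values are all assumptions for the example.

```python
def r0_proportionate_mixing(partner_fractions, beta, death, progression):
    """Basic reproduction number under proportionate mixing.

    partner_fractions maps the number of partners per month, i, to the
    fraction n_i of males in that risk group (the fractions must sum to 1).
    Returns (r0_from_sums, r0_from_moments, mean, variance).
    """
    total = sum(partner_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("group fractions must sum to 1")
    mean = sum(i * n for i, n in partner_fractions.items())
    second_moment = sum(i * i * n for i, n in partner_fractions.items())
    variance = second_moment - mean * mean
    r0_from_sums = beta * second_moment / ((death + progression) * mean)
    r0_from_moments = beta / (death + progression) * (mean + variance / mean)
    return r0_from_sums, r0_from_moments, mean, variance


if __name__ == "__main__":
    # Two hypothetical communities with the same mean number of partners.
    homogeneous = {4: 1.0}
    heterogeneous = {1: 0.5, 7: 0.5}
    for label, groups in (("homogeneous", homogeneous),
                          ("heterogeneous", heterogeneous)):
        r0, _, mean, variance = r0_proportionate_mixing(groups, 0.05, 0.02, 0.08)
        print(f"{label}: mean = {mean}, variance = {variance}, R0 = {r0:.3f}")
```

Both communities have a mean of four partners per month, yet the heterogeneous one has the larger reproduction number because of the variance term.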

To extend the above model to study the heterosexual 
spread of HIV/AIDS, the population must be divided 
into two sexes as well as being split according to activ- 
ity level. Contact rates together with parameters related 
to the disease need to be determined to parametrize 
the model. It is generally observed that the probabil- 
ity of transmission per sexual act from an infectious 
male to a susceptible female is greater than that from 
an infectious female to a susceptible male, so there is 
asymmetry in the model. 

Control strategies such as condoms, highly active 
antiretroviral therapy treatment, and education can be 
incorporated into the models for HIV/AIDS to guide 
planning so that control can be optimized. However, 
HIV develops resistance against the drugs that are currently
used for its treatment, and this, as well as a
lack of treatment compliance, needs to be built into 
models. In addition, individuals who are HIV positive 
have a higher than average risk of developing other 
diseases, such as tuberculosis and pneumonia. Rather 
than developing such complicated models further, we 
turn instead to a brief discussion of intra-host models. 

4.2 Virus Dynamics 

When HIV virus enters an individual's body it attacks 
the cells susceptible to infection, called target cells, and 
produces infected cells that in turn produce new virus. 
A basic model of intra-host virus infection consists of
the following three ODEs for the number of susceptible
target cells $T$, actively infected target cells $I$, and free
mature virus particles (virions) $V$:

$$
\begin{aligned}
\frac{\mathrm{d}T}{\mathrm{d}t} &= R - d_T T - kVT,\\
\frac{\mathrm{d}I}{\mathrm{d}t} &= kVT - d_I I,\\
\frac{\mathrm{d}V}{\mathrm{d}t} &= pI - d_V V.
\end{aligned}
$$

Here, $d_T$, $d_I$, and $d_V$ are the death rates of target cells,
infected cells, and virus particles, respectively; target
cells are infected by virus with rate constant $k$ assuming
mass action incidence; $R$ is the production rate of
new target cells; and $p$ is the production rate of new
virus per infected cell. A flow chart for this model is
given in figure 5.

Figure 5 Virus dynamics flow chart.

If a person is initially uninfected and then a small
amount of HIV is introduced, will the virus be successful
in establishing a persistent HIV infection? This
question can be addressed by a linear stability analysis,
as developed in previous sections. The DFE is given by
$(T, I, V) = (R/d_T, 0, 0)$, and the endemic equilibrium (if
it exists) is given by

$$(T^*, I^*, V^*) = \left(\frac{d_I d_V}{kp},\; \frac{R}{d_I} - \frac{d_T d_V}{kp},\; \frac{Rp}{d_I d_V} - \frac{d_T}{k}\right).$$

Assuming that the production of new virus (i.e., the
$pI$ term in the equation for $V$) is not considered to be a
new infection, the next-generation matrix method gives

$$\mathcal{R}_0 = \frac{kpR}{d_T d_I d_V}.$$

However, if this production is assumed to be a new
infection, then this term must go in the $F$ matrix and
gives

$$\mathcal{R}_0 = \sqrt{\frac{kpR}{d_T d_I d_V}}.$$

The endemic equilibrium exists and persistent HIV 
infection occurs if these numbers are greater than 1, 
which can be proved by using a Lyapunov function. 
In fact, this differential equation system for the early 
stage of HIV infection is equivalent to the population- 
level SLIR system with a constant host population size. 
Virus dynamic models therefore have features in com- 
mon with disease transmission models even though the 
biological background is different. 

Notice that these formulas for $\mathcal{R}_0$ agree at the 1-threshold
value. This example illustrates the fact that
the formula for $\mathcal{R}_0$ depends on the biological assumptions
of the model, and several formulas can give the
same expression at the 1-threshold value. However,
the sensitivity of $\mathcal{R}_0$ to perturbations in parameters
depends on the exact formula. Drug treatment aims
to reduce the parameter $k$ so that $\mathcal{R}_0$ is below 1. The
parameters and effects of drugs can be estimated from
data on individuals who are HIV positive.
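
The two expressions for the basic reproduction number of the virus dynamics model, and their different sensitivities to the infection rate constant, can be compared directly. The sketch below is an illustration under assumed parameter values: it evaluates both expressions, confirms that the infected equilibrium makes the right-hand sides of the three ODEs vanish, and estimates by finite differences the relative change in each expression per relative change in the infection rate constant. All function names are hypothetical.

```python
import math


def r0_production_as_transition(k, p, production, d_t, d_i, d_v):
    """R0 when new virus production is treated as a transition."""
    return k * p * production / (d_t * d_i * d_v)


def r0_production_as_infection(k, p, production, d_t, d_i, d_v):
    """R0 when new virus production is treated as a new infection."""
    return math.sqrt(k * p * production / (d_t * d_i * d_v))


def virus_rhs(state, k, p, production, d_t, d_i, d_v):
    """Right-hand sides for target cells T, infected cells I, and virions V."""
    t, i, v = state
    return (production - d_t * t - k * v * t, k * v * t - d_i * i, p * i - d_v * v)


def infected_equilibrium(k, p, production, d_t, d_i, d_v):
    """Equilibrium (T*, I*, V*) with infection present (needs R0 > 1)."""
    t_star = d_i * d_v / (k * p)
    i_star = production / d_i - d_t * d_v / (k * p)
    v_star = production * p / (d_i * d_v) - d_t / k
    return t_star, i_star, v_star


def elasticity_in_k(r0_func, k, *rest, rel_step=1e-6):
    """Relative change in R0 per relative change in k (central difference)."""
    up = r0_func(k * (1.0 + rel_step), *rest)
    down = r0_func(k * (1.0 - rel_step), *rest)
    return (up - down) / (2.0 * rel_step * r0_func(k, *rest))


if __name__ == "__main__":
    params = (0.002, 10.0, 10.0, 0.1, 0.5, 2.0)  # k, p, R, d_T, d_I, d_V
    print("R0 (transition form):", r0_production_as_transition(*params))
    print("R0 (infection form): ", r0_production_as_infection(*params))
    print("equilibrium:", infected_equilibrium(*params))
    print("elasticities in k:",
          elasticity_in_k(r0_production_as_transition, *params),
          elasticity_in_k(r0_production_as_infection, *params))
```

Both expressions cross 1 at the same parameter values, but a given proportional reduction in the infection rate constant lowers the first expression about twice as much, in relative terms, as the second.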

5 Model Extensions 

Some infectious diseases have features that may neces- 
sitate models with compartments additional to those 
described above. For example, some diseases are trans- 
mitted to humans by vectors, as is the case for 
malaria and West Nile virus, which are transmitted by 
mosquitoes. For such diseases, vector compartments are
needed. Waterborne diseases such as cholera can be 
transmitted directly from person to person or indi- 
rectly via contaminated water; thus some cholera mod- 
els include a pathogen compartment. Other infectious 
diseases, for example HIV/AIDS and hepatitis B, can 
be transmitted vertically from mother to offspring. 
In the case of hepatitis B, the mother may even be 
asymptomatic. This alternative route of transmission 
can be modeled by adding an input term into the 
infectious class that represents infectious newborns, 
and this modifies the basic reproduction number. For 
Ebola, recently dead bodies are an important source of 
infection that needs to be included in a model. From 
the recent outbreak in West Africa, it appears that 
health-care workers have an increased risk of infec- 
tion, and there may be a significant number of asymp- 
tomatic cases; these compartments should therefore be 
explicitly included in an Ebola model. 

The simplest compartmental models assume homo- 
geneous mixing of individuals, but it is possible to 
extend the structure to include heterogeneity of mix- 
ing, as briefly described above in the model for HIV/ 
AIDS in a male heterosexual population. Network mod- 
els take this still further by concentrating on the fre- 
quencies of contacts between individuals. Agent-based 
models separate the population into individuals, lead- 
ing to very large systems that can be analyzed only by 
numerical simulations. A particularly important het- 
erogeneity in disease transmission is age structure, 
since in many diseases, especially childhood diseases 
such as measles, most transmission of infection occurs 
between individuals of similar ages. Spatial heterogene- 
ity appears in two quite different forms: namely, local 
motion such as diffusion and long-distance travel such 
as by airlines between distant locations. The former
is usually modeled by partial differential equations,
whereas the latter is usually modeled by a large sys- 
tem of ODEs, giving a metapopulation model. With the 
availability of good travel data, metapopulation models 
are especially important for public health planning for 
mass gatherings such as the Olympic Games. New or 
newly emerging infectious diseases often call for new 
modeling ideas; for example, metapopulation models 
were further developed for SARS, and coinfection mod- 
els have been developed for HIV and tuberculosis. 
In addition, social behavior and the way that people 
change their behavior during an epidemic are factors 
that should be integrated into models, especially those 
designed for planning vaccination and other control 
strategies. 
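
As a rough illustration of how a metapopulation model couples local epidemics through travel, the following Python sketch advances an SIR model on several patches with forward Euler steps. The SIR form, the per-capita travel rates, the function name, and all numerical values are assumptions made for this example; they are not taken from a specific published model.

```python
def metapop_sir_step(state, beta, gamma, travel, dt):
    """One forward-Euler step of an SIR model on coupled patches.

    state  : list of (S, I, R) tuples, one per patch
    travel : travel[i][j] is the per-capita rate of moving from patch i to j
    """
    n = len(state)
    new_state = []
    for i, (S, I, R) in enumerate(state):
        N = S + I + R
        infection = beta * S * I / N if N > 0 else 0.0
        dS, dI, dR = -infection, infection - gamma * I, gamma * I
        for j in range(n):
            if j == i:
                continue
            Sj, Ij, Rj = state[j]
            # Arrivals from patch j minus departures to patch j.
            dS += travel[j][i] * Sj - travel[i][j] * S
            dI += travel[j][i] * Ij - travel[i][j] * I
            dR += travel[j][i] * Rj - travel[i][j] * R
        new_state.append((S + dt * dS, I + dt * dI, R + dt * dR))
    return new_state

def simulate(state, beta, gamma, travel, dt, steps):
    """Repeat the Euler step a fixed number of times."""
    for _ in range(steps):
        state = metapop_sir_step(state, beta, gamma, travel, dt)
    return state

if __name__ == "__main__":
    start = [(990.0, 10.0, 0.0), (1000.0, 0.0, 0.0)]  # infection seeded in patch 0
    rates = [[0.0, 0.01], [0.01, 0.0]]
    print(simulate(start, beta=0.5, gamma=0.2, travel=rates, dt=0.01, steps=1000))
```

With many patches and travel rates estimated from data, the same structure gives the large ODE systems mentioned above; a production code would use an adaptive ODE solver rather than fixed Euler steps.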

For interested readers the literature listed in the fur- 
ther reading section below, as well as current jour- 
nal articles and online resources, will provide more 
information about these and other models.

V.17 The Mathematics of Sea Ice

Kenneth M. Golden 


1 Introduction 

Among the large-scale transformations of the Earth’s 
surface that are apparently due to global warming, the
sharp decline of the summer Arctic sea ice pack is probably
the most dramatic. For example, the area of the
Arctic Ocean covered by sea ice in September of 2012 
was less than half of its average over the 1980s and 
1990s. While global climate models generally predict 
declines in the polar sea ice packs over the twenty-first 
century, these precipitous losses have significantly out- 
paced most projections. Here we will show how math- 
ematics is being used to better understand the role 
of sea ice in the climate system and improve projec- 
tions of climate change. In particular, we will focus on 
how mathematical models of composite materials and 
statistical physics are being used to study key sea ice 
structures and processes, as well as represent sea ice 
more rigorously in global climate models. Also, we will 
briefly discuss these climate models as systems of par- 
tial differential equations (PDEs) solved using computer 
programs with millions of lines of code, on some of 
the world’s most powerful computers, with particular 
focus on their sea ice components. 

1.1 Sea Ice and the Climate System 

Sea ice is frozen ocean water, which freezes at a tem- 
perature of about -1.8 °C, or 28.8 °F. As a material, 
sea ice is quite different from the glacial ice in the 
world’s great ice sheets covering Antarctica and Green- 
land. When salt water freezes, the result is a composite 
of pure ice with inclusions of liquid brine, air pockets, 
and solid salts. As the temperature of sea ice increases, 
the porosity or volume fraction of brine increases. The 
brine inclusions in sea ice host extensive algal and bac- 
terial communities that are essential for supporting life 
in the polar oceans. For example, krill feed on the algae, 
and in turn they support fishes, penguins, seals, and 
Minke whales, and on up the food chain to the top 
predators: killer whales, leopard seals, and polar bears. 
The brine microstructure also facilitates the flow of salt 
water through sea ice, which mediates a broad range of 
processes, such as the growth and decay of seasonal 
ice, the evolution of ice pack reflectance, and biomass 
buildup. 

As the boundary between the ocean and the atmo- 
sphere in the polar regions of the Earth, sea ice plays 
a critical role as both a leading indicator of climate 
change and as a key player in the global climate sys- 
tem. Roughly speaking, most of the solar radiation that 
is incident on snow-covered sea ice is reflected, while 
most of the solar radiation that is incident on darker 
sea water is absorbed. The sea ice packs serve as part
of Earth’s polar refrigerator, cooling it and protecting
it from absorbing too much heat from sunlight. The 
ratio of reflected sunlight to incident sunlight is called 
albedo. While the albedo of snow-covered ice is usually 
larger than 0.7, the albedo of sea water is an order of 
magnitude smaller, around 0.06. 
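
To make the contrast concrete, the following Python sketch computes the fraction of incident sunlight absorbed by a surface that is partly ice covered. It assumes a simple area-weighted average of the two albedo values quoted above (0.7 for snow-covered ice, taken as the lower end of the quoted range, and 0.06 for open water); the function names are hypothetical.

```python
ALBEDO_SNOW_COVERED_ICE = 0.7   # text: "usually larger than 0.7"
ALBEDO_OPEN_WATER = 0.06        # text: "around 0.06"

def surface_albedo(ice_fraction,
                   a_ice=ALBEDO_SNOW_COVERED_ICE,
                   a_water=ALBEDO_OPEN_WATER):
    """Area-weighted albedo of a mixed ice and open-water surface (assumed linear mixing)."""
    if not 0.0 <= ice_fraction <= 1.0:
        raise ValueError("ice_fraction must lie in [0, 1]")
    return ice_fraction * a_ice + (1.0 - ice_fraction) * a_water

def absorbed_fraction(ice_fraction):
    """Fraction of incident sunlight absorbed, 1 minus the albedo."""
    return 1.0 - surface_albedo(ice_fraction)

if __name__ == "__main__":
    for f in (1.0, 0.5, 0.0):
        print(f, absorbed_fraction(f))
```

Under these assumptions, replacing full ice cover by open water raises the absorbed fraction from about 0.3 to about 0.94, roughly a factor of three.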

1.1.1 Ice-Albedo Feedback 

As warming temperatures melt more sea ice over time, 
fewer bright surfaces are available to reflect sunlight, 
more heat escapes from the ocean to warm the atmo- 
sphere, and the ice melts further. As more ice is 
melted, the albedo of the polar oceans is lowered, lead- 
ing to more solar absorption and warming, which in 
turn leads to more melting, creating a positive feed- 
back loop. It is believed that this so-called ice-albedo 
feedback has played an important role in the recent 
dramatic declines in summer Arctic sea ice extent. 

Thus even a small increase in temperature can lead 
to greater warming over time, making the polar regions 
the most sensitive areas to climate change on Earth. 
Global warming is amplified in the polar regions. 
Indeed, global climate models consistently show ampli- 
fied warming in the high-latitude Arctic, although the 
magnitude varies considerably across different mod- 
els. For example, the average surface air temperature 
at the North Pole by the end of the twenty-first cen- 
tury is predicted to rise by a factor of about 1.5 to 4 
times the predicted increase in global average surface 
air temperature. 

While global climate models generally predict de- 
clines in sea ice area and thickness, they have sig- 
nificantly underestimated the recent losses observed 
in summer Arctic sea ice. Improving projections of 
the fate of Earth’s sea ice cover and its ecosystems 
depends on a better understanding of important pro- 
cesses and feedback mechanisms. For example, dur- 
ing the melt season the Arctic sea ice cover becomes 
a complex, evolving mosaic of ice, melt ponds, and 
open water. The albedo of sea ice floes is determined 
by melt pond evolution. Drainage of the ponds, with 
a resulting increase in albedo, is largely controlled by 
the fluid permeability of the porous sea ice underly- 
ing the ponds. As ponds develop, ice-albedo feedback 
enhances the melting process. Moreover, this feedback 
loop is the driving mechanism in mathematical models 
developed to address the question of whether we have 
passed a so-called tipping point or critical threshold in 
the decline of summer Arctic sea ice. Such studies often
focus on the existence of saddle-node bifurcations in
dynamical system models of sea ice coverage of the
Arctic Ocean. In general, sea ice albedo represents a
significant source of uncertainty in climate projections
and a fundamental problem in climate modeling.

Figure 1 Sea ice exhibits composite structure on length
scales over many orders of magnitude: (a) the submillimeter-scale
brine inclusions in sea ice (credit: CRREL (U.S.
Army Cold Regions Research and Engineering Lab) report);
(b) pancake ice in the Southern Ocean, with microstructural
scale on the order of tens of centimeters; (c) melt ponds on
the surface of Arctic sea ice with meter-scale microstructure
(courtesy of Donald Perovich); and (d) ice floes in the Arctic
Ocean on the kilometer scale (courtesy of Donald Perovich).
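
The way an albedo feedback can produce multiple equilibria, and a saddle-node bifurcation when a parameter is varied, can be seen in a generic zero-dimensional energy balance model. The Python sketch below is not one of the sea ice models referred to above: the balance Q(1 - albedo(T)) = A + BT, the smooth tanh albedo transition, and every numerical value are illustrative assumptions.

```python
import math

def net_heating(T, Q=342.0, A=202.0, B=1.9,
                a_cold=0.6, a_warm=0.3, width=10.0):
    """Absorbed sunlight minus outgoing radiation for surface temperature T (deg C)."""
    albedo = 0.5 * (a_cold + a_warm) - 0.5 * (a_cold - a_warm) * math.tanh(T / width)
    return Q * (1.0 - albedo) - (A + B * T)

def equilibria(f, lo=-60.0, hi=40.0, n=2000):
    """Bracket sign changes of f on a uniform grid and refine each by bisection."""
    roots = []
    step = (hi - lo) / n
    for i in range(n):
        a, b = lo + i * step, lo + (i + 1) * step
        a_negative = f(a) < 0.0
        if a_negative == (f(b) < 0.0):
            continue  # no sign change in this cell
        for _ in range(60):
            mid = 0.5 * (a + b)
            if (f(mid) < 0.0) == a_negative:
                a = mid
            else:
                b = mid
        roots.append(0.5 * (a + b))
    return roots

if __name__ == "__main__":
    print(equilibria(net_heating))                            # three equilibria
    print(equilibria(lambda T: net_heating(T, Q=300.0)))      # one equilibrium
```

With the default values the model has a cold state, a warm state, and an unstable state between them. Lowering Q far enough makes the warm and unstable states merge and disappear, which is the saddle-node mechanism behind the notion of a tipping point.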

1.1.2 Multiscale Structure of Sea Ice 

One of the fascinating, yet challenging, aspects of mod- 
eling sea ice and its role in global climate is the sheer 
range of relevant length scales of structure, over ten 
orders of magnitude, from the submillimeter scale to 
hundreds of kilometers. In figure 1 we show exam- 
ples of sea ice structure illustrating such a range of 
scales. Modeling sea ice on a large scale depends on 
some understanding of the physical properties of sea 
ice at the scale of individual floes, and even on the 
submillimeter scale since the brine phase in sea ice 
is such a key determinant of its bulk physical proper- 
ties. Today’s climate models challenge the most pow- 
erful supercomputers to their fullest capacity. How- 
ever, even the largest computers still limit the hor- 
izontal resolution to tens of kilometers and require 
clever approximations and parametrizations to model 
the basic physics of sea ice. One of the central themes of 
this article is how to use information on smaller scales 
to predict behavior on larger scales. We observe that
this central problem of climate science shares commonality
with, for example, the key challenges in theoretical
computations of the effective properties of composites. 

Here we will explore some of the mathematics used 
in studying sea ice and its role in the climate system, 
particularly through the lens of sea ice albedo and 
processes related to its evolution. 

2 Global Climate Models and Sea Ice 

Global climate models, also known as general circu- 
lation models, are systems of PDEs derived from the 
basic laws of physics, chemistry, and fluid motion. They 
describe the state of the ocean, ice, atmosphere, and 
land, as well as the interactions between them. The 
equations are solved on very powerful computers using 
three-dimensional grids of the air-ice-ocean-land system,
with horizontal grid sizes on the order of tens
of kilometers. Consideration of general climate models 
will take us too far off course, but here we will briefly 
consider the sea ice components of these large-scale 
models. 

The polar sea ice packs consist primarily of open 
water, thin first-year ice, thicker multiyear ice, and pres- 
sure ridges created by ice floes colliding with each 
other. The dynamic and thermodynamic characteris- 
tics of the ice pack depend largely on how much ice 
is in each thickness range. One of the most basic prob- 
lems in sea ice modeling is thus to describe the evolu- 
tion of the ice thickness distribution in space and time. 
The ice thickness distribution $g(x, t, h)\,dh$ is defined
(informally) as the fractional area covered by ice in the
thickness range $(h, h + dh)$ at a given time $t$ and location
$x$. The fundamental equation controlling the evolution
of the ice thickness distribution, which is solved
numerically in sea ice models, is

\[ \frac{\partial g}{\partial t} = -\nabla \cdot (g u) - \frac{\partial}{\partial h}(\beta g) + \psi, \]

where $u$ is the horizontal ice velocity, $\beta$ is the rate of
thermodynamic ice growth, and $\psi$ is a ridging redistribution
function that accounts for changes in ice
thickness due to ridging and mechanical processes, as
illustrated in figure 2.

Figure 2 Different factors contributing to the evolution of
the ice thickness distribution $g(x, t, h)$. Panel labels: divergence,
convergence, ridge sails. (Adapted, courtesy
of Christian Haas.)

The momentum equation, or Newton's second law for
sea ice, can be deduced by considering the forces on a
single floe, including interactions with other floes:

\[ m \frac{Du}{Dt} = \nabla \cdot \sigma + \tau_a + \tau_w - m \alpha\, n \times u - m g \nabla H, \]

where each term has units of force per unit area of the
sea ice cover, $m$ is the combined mass of ice and snow
per unit area, $\tau_a$ and $\tau_w$ are wind and ocean stresses,
and $D/Dt = (\partial/\partial t) + u \cdot \nabla$ is the material or convective
derivative. This is a two-dimensional equation obtained
by integrating the three-dimensional equation through
the thickness of the ice in the vertical direction.

The strength of the ice is represented by the internal
stress tensor $\sigma_{ij}$. The other two terms on the right-hand
side are, in order, stresses due to Coriolis effects and
the sea surface slope, where $n$ is a unit normal vector
in the vertical direction, $\alpha$ is the Coriolis parameter, $H$
describes the sea surface, and in this equation $g$ is the
acceleration due to gravity.

The temperature field T(x, t) inside the sea ice (and 
snow layer), which couples to the ocean below and 
the atmosphere above through appropriate boundary 
conditions, satisfies an advection-diffusion equation

\[ \frac{\partial T}{\partial t} = \nabla \cdot (D(T) \nabla T) - v \cdot \nabla T, \]

where $D = K/\rho C$ is the thermal diffusivity of sea ice,
$K$ is its thermal conductivity, $\rho$ is its bulk density, $C$
is the specific heat, and $v$ is an averaged brine velocity
field in the sea ice.
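
A minimal sketch of how such an advection-diffusion equation might be stepped forward numerically is given below, restricted to one vertical dimension with fixed end temperatures. The explicit scheme (conservative central differences for diffusion, first-order upwind for advection), the constant velocity, and the function name are assumptions made for illustration; they are not the schemes used in any particular sea ice model.

```python
def step_temperature(T, D, v, dz, dt):
    """Advance a 1D temperature profile by one explicit time step.

    T  : list of temperatures on a uniform vertical grid (end values held fixed)
    D  : callable returning the thermal diffusivity at a given temperature
    v  : constant vertical brine velocity
    dz : grid spacing, dt : time step (stability needs roughly dt <= dz**2 / (2 * max D))
    """
    new_T = list(T)
    for i in range(1, len(T) - 1):
        # Diffusivity evaluated at the two cell faces, giving a flux-form update.
        D_up = D(0.5 * (T[i] + T[i + 1]))
        D_dn = D(0.5 * (T[i - 1] + T[i]))
        diffusion = (D_up * (T[i + 1] - T[i]) - D_dn * (T[i] - T[i - 1])) / dz**2
        # Upwind difference: take the gradient from the side the flow comes from.
        if v >= 0.0:
            gradient = (T[i] - T[i - 1]) / dz
        else:
            gradient = (T[i + 1] - T[i]) / dz
        new_T[i] = T[i] + dt * (diffusion - v * gradient)
    return new_T

if __name__ == "__main__":
    profile = [0.0, 0.0, 1.0, 0.0, 0.0]
    print(step_temperature(profile, lambda temp: 1.0, 0.0, 1.0, 0.1))
```

A linear profile with constant diffusivity and no brine flow is left unchanged by this update, which is a quick consistency check on the discretization.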

The bulk properties of low-Reynolds-number flow of
brine of viscosity $\eta$ through sea ice can be related to the
geometrical properties of the porous brine microstructure
using homogenization theory [II.17]. The volume
fractions of brine and ice are $\phi$ and $1 - \phi$. The
local velocity and pressure fields in the brine satisfy the
Stokes equations for incompressible fluids, where the
length scale of the microstructure (e.g., the period in
periodic media) is $\epsilon$. Under appropriate assumptions, in
the homogenization limit as $\epsilon \to 0$, the averaged velocity
$v(x)$ and pressure $p(x)$ satisfy Darcy's law and the
incompressibility condition

\[ v = -\frac{1}{\eta} \mathbf{k} \nabla p, \qquad \nabla \cdot v = 0. \tag{1} \]

Here, $\mathbf{k}$ is the permeability tensor, with vertical component
$k_{zz} = k$ in units of m$^2$. The permeability $k$ is an
example of an effective or homogenized parameter. The
existence of the homogenized limits $v$, $\mathbf{k}$, and $p$ in (1)
can be proven under broad assumptions, such as for
media with inhomogeneities that are periodic or have
translation-invariant statistics.
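
Darcy's law in (1) is a linear map from the pressure gradient to the averaged velocity, as the following Python sketch shows. The function name and the numerical values of the permeability, viscosity, and pressure gradient are hypothetical and serve only to show the units working out.

```python
def darcy_velocity(k, eta, grad_p):
    """Averaged velocity v = -(1/eta) k grad(p) for a 3x3 permeability tensor.

    k      : 3x3 nested list, permeability tensor in m^2
    eta    : fluid viscosity in Pa s
    grad_p : pressure gradient components in Pa/m
    """
    return [-sum(k[i][j] * grad_p[j] for j in range(3)) / eta for i in range(3)]

if __name__ == "__main__":
    k_zz = 1e-10  # illustrative vertical permeability, m^2
    k = [[k_zz, 0.0, 0.0], [0.0, k_zz, 0.0], [0.0, 0.0, k_zz]]
    # Purely vertical pressure gradient; the result is in m/s.
    print(darcy_velocity(k, eta=2e-3, grad_p=[0.0, 0.0, 100.0]))
```

All of the microstructural detail enters only through the tensor k, which is why obtaining its value from partial information about the microstructure is the central question.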

Obtaining quantitative information on k or other 
effective transport coefficients — such as electrical or 
thermal conductivity and how they depend on, say, the 
statistical properties of the microstructure — is a cen- 
tral problem in the theory of composites. A broad range 
of techniques have been developed to obtain rigorous 
bounds, approximate formulas, and general theories 
of effective properties of composite and inhomogen- 
eous media in terms of partial information about the 
microstructure. This problem is, of course, quite sim- 
ilar in nature to the fundamental questions of calcu- 
lating bulk properties of matter from information on 
molecular interactions, which is central to statistical 
mechanics. 

We note that one of the fundamental challenges of
climate modeling is how to rigorously account for
sub-grid scale processes and structures. That is, how do
we incorporate important effects into climate models
when the scale of the relevant phenomena is far smaller
than the grid size of the numerical model, which may be
tens of kilometers? For example, it is obviously unrealistic
to account for every detail of the submillimeter-
scale brine microstructure in sea ice in a general cir- 
culation model! However, the volume fraction and 
connectedness properties of the brine phase control 
whether or not fluid can flow through the ice. The 
on-off switch for fluid flow in sea ice, known as the 
rule of fives (see below), in turn controls such critical 
processes as melt pond drainage, snow-ice formation 
(where sea water percolates upward, floods the snow 
layer on the sea ice surface, and subsequently freezes), 
the evolution of salinity profiles, and nutrient replen- 
ishment. It is the homogenized transport coefficient 
(the effective fluid permeability) that is incorporated 
into sea ice and climate models to account for these 
and related physical and biogeochemical processes. 
This effective coefficient is a well-defined parameter 
(under appropriate assumptions about the microstructure)
that captures the relevant microstructural transitions
and determines how a number of sea ice processes
evolve. In this example we will see that rigorous
mathematical methods can be employed to analyze
effective sea ice behavior on length scales much greater
than the submillimeter scale of the brine inclusions.

3 Mathematics of Composites 

Here we give a brief overview of some of the mathemati- 
cal models and techniques that are used in studying the 
effective properties of sea ice. 

3.1 Percolation Theory 

Percolation theory was initiated in 1957 with the intro- 
duction of a simple lattice model to study the flow 
of air through permeable sandstones used in miners’
gas masks. In subsequent decades this theory has 
been used to successfully model a broad array of dis- 
ordered materials and processes, including flow in 
porous media like rocks and soils; doped semicon- 
ductors; and various types of disordered conductors 
like piezoresistors, thermistors, radar-absorbing com- 
posites, carbon nanotube composites, and polar firn. 
The original percolation model and its generalizations 
have been the subject of intensive theoretical investi- 
gations, particularly in the physics and mathematics 
communities. One reason for the broad interest in the 
percolation model is that it is perhaps the simplest 
purely probabilistic model that exhibits a type of phase 
transition. 

The simplest form of the lattice percolation model
is defined as follows. Consider the $d$-dimensional integer
lattice $\mathbb{Z}^d$, and the square or cubic network of bonds
joining nearest-neighbor lattice sites. We assign to each
bond a conductivity $\sigma_0 > 0$ (not to be confused with
the stress tensor above) with probability $p$, meaning
it is open, and a conductivity 0 with probability $1 - p$,
meaning it is closed. Two examples of lattice configurations
are shown in figure 3, with $p = \frac{1}{3}$ in (a) and $p = \frac{2}{3}$
in (b). Groups of connected open bonds are called open
clusters. In this model there is a critical probability $p_c$,
$0 < p_c < 1$, called the percolation threshold, at which
the average cluster size diverges and an infinite cluster
appears. For the two-dimensional bond lattice, $p_c = \frac{1}{2}$.
For $p < p_c$, the density of the infinite cluster $P_\infty(p)$ is
0, while for $p > p_c$, $P_\infty(p) > 0$ and near the threshold,

\[ P_\infty(p) \sim (p - p_c)^{\beta}, \qquad p \to p_c^{+}, \]

where $\beta$ is a universal critical exponent, that is, it
depends only on dimension and not on the details of
the lattice. Let $x, y \in \mathbb{Z}^d$ and let $\tau(x, y)$ be the probability
that $x$ and $y$ belong to the same open cluster.
The correlation length $\xi(p)$ is the mean distance
between points on an open cluster, and it is a measure
of the linear size of finite clusters. For $p < p_c$,
$\tau(x, y) \sim e^{-|x - y|/\xi(p)}$ and $\xi(p) \sim (p_c - p)^{-\nu}$ diverges
with a universal critical exponent $\nu$ as $p \to p_c^{-}$, as
shown in figure 3(c).

The effective conductivity $\sigma^*(p)$ of the lattice, now
viewed as a random resistor (or conductor) network,
defined via Kirchhoff’s laws, vanishes for $p < p_c$ as
does $P_\infty(p)$ since there are no infinite pathways, as
shown in figure 3(e). For $p > p_c$, $\sigma^*(p) > 0$, and near
$p_c$,

\[ \sigma^*(p) \sim \sigma_0 (p - p_c)^{t}, \qquad p \to p_c^{+}, \]

where $t$ is the conductivity critical exponent, with
$1 \le t \le 2$ if $d = 2, 3$ (for an idealized model), and numerical
values $t \approx 1.3$ if $d = 2$ and $t \approx 2.0$ if $d = 3$. Consider a
random pipe network with effective fluid permeability
$k^*(p)$ exhibiting similar behavior $k^*(p) \sim k_0 (p - p_c)^{e}$,
where $e$ is the permeability critical exponent, with $e = t$
for lattices. Both $t$ and $e$ are believed to be universal;
that is, they depend only on dimension and not on the
type of lattice. Continuum models, like the so-called
Swiss cheese model, can exhibit nonuniversal behavior
with exponents different from the lattice case and $e \ne t$.
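
The threshold behavior of the lattice model is easy to observe by simulation. The following Python sketch opens each bond of a finite L x L square lattice independently with probability p and uses a union-find structure to test whether the open bonds connect the top row to the bottom row. A finite lattice only approximates the infinite-lattice threshold, and the function names, lattice size, and trial count are choices made for this example.

```python
import random

def find(parent, i):
    """Root of the cluster containing site i, with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def spans(L, p, rng):
    """True if open bonds connect the top row to the bottom row of an L x L lattice."""
    parent = list(range(L * L))

    def union(a, b):
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb

    for r in range(L):
        for c in range(L):
            site = r * L + c
            if c + 1 < L and rng.random() < p:   # bond to the right neighbor
                union(site, site + 1)
            if r + 1 < L and rng.random() < p:   # bond to the neighbor below
                union(site, site + L)
    top_roots = {find(parent, c) for c in range(L)}
    return any(find(parent, (L - 1) * L + c) in top_roots for c in range(L))

def spanning_probability(L, p, trials, seed=0):
    """Fraction of random realizations that contain a spanning cluster."""
    rng = random.Random(seed)
    return sum(spans(L, p, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    for p in (0.3, 0.45, 0.5, 0.55, 0.7):
        print(p, spanning_probability(L=30, p=p, trials=50))
```

As L grows, the estimated spanning probability sharpens toward a step located at the percolation threshold.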

3.2 Analytic Continuation and Spectral Measures 

Homogenization is where one seeks to find a homoge- 
neous medium that behaves the same macroscopically 
as a given inhomogeneous medium. The methods are 
focused on finding the effective properties of inhomo- 
geneous media such as composites. We will see that the 
spectral measure in a Stieltjes integral representation 
for the effective parameter provides a powerful tool for 
upscaling geometrical information about a composite 
into calculations of effective properties. 

We now briefly describe the analytic continuation 
method for studying the effective transport properties 
of composite materials. This method has been used 
to obtain rigorous bounds on effective transport coef- 
ficients of two-component and multicomponent com- 
posite materials. The bounds follow from the special 
analytic structure of the representations for the effec- 
tive parameters and from partial knowledge of the 
microstructure, such as the relative volume fractions of 
the phases in the case of composite media. The analytic 
continuation method was later adapted to treating the 
effective diffusivity of passive tracers in incompressible 
fluid velocity fields. 

Figure 3 The two-dimensional square lattice percolation model (a) below its percolation threshold of $p_c = \frac{1}{2}$ (panel label: $p = 1/3$) and (b) above it
(courtesy of Salvatore Torquato). (c) Divergence of the correlation length as $p$ approaches $p_c$. (d) The infinite cluster density
of the percolation model, and (e) the effective conductivity.

We consider the effective complex permittivity tensor
$\boldsymbol{\epsilon}^*$ of a two-phase random medium, although the
method applies to any classical transport coefficient.
Here, $\epsilon(x, \omega)$ is a (spatially) stationary random field in
$x \in \mathbb{R}^d$ and $\omega \in \Omega$, where $\Omega$ is the set of all geometric
realizations of the medium, which is indexed by the
parameter $\omega$ representing one particular realization,
and the underlying probability measure $P$ is compatible
with stationarity.

As in sea ice, we assume we are dealing with a
two-phase locally isotropic medium, so that the components
$\epsilon_{jk}$ of the local permittivity tensor $\boldsymbol{\epsilon}$ satisfy
$\epsilon_{jk}(x, \omega) = \epsilon(x, \omega)\delta_{jk}$, where $\delta_{jk}$ is the Kronecker
delta and

\[ \epsilon(x, \omega) = \epsilon_1 \chi_1(x, \omega) + \epsilon_2 \chi_2(x, \omega). \tag{2} \]

Here, $\epsilon_j$ is the complex permittivity for medium $j = 1, 2$,
and $\chi_j(x, \omega)$ is its characteristic function, equaling
1 for $\omega \in \Omega$ with medium $j$ at $x$, and 0 otherwise,
with $\chi_2 = 1 - \chi_1$.

When the wavelength is much larger than the scale of 
the composite microstructure, the propagation prop- 
erties of an electromagnetic wave in a given compos- 
ite medium are determined by the quasistatic limit of 
Maxwell’s equations: 

\[ \nabla \times E = 0, \qquad \nabla \cdot D = 0, \tag{3} \]

where $E(x, \omega)$ and $D(x, \omega)$ are stationary electric and
displacement fields, respectively, related by the local
constitutive equation $D(x) = \epsilon(x) E(x)$, and $e_k$ is a
standard basis vector in the $k$th direction. The electric
field is assumed to have unit strength on average,
with $\langle E \rangle = e_k$, where $\langle \cdot \rangle$ denotes ensemble averaging
over $\Omega$ or spatial averaging over all of $\mathbb{R}^d$. The effective
complex permittivity tensor $\boldsymbol{\epsilon}^*$ is defined by

\[ \langle D \rangle = \boldsymbol{\epsilon}^* \langle E \rangle, \tag{4} \]

which is a homogenized version of the local constitutive
relation $D = \epsilon E$.

For simplicity, we focus on one diagonal coefficient
$\epsilon^* = \epsilon^*_{kk}$, with $\epsilon^* = \langle \epsilon E \cdot e_k \rangle$. By the homogeneity of
$\epsilon(x, \omega)$ in (2), $\epsilon^*$ depends on the contrast parameter
$h = \epsilon_1/\epsilon_2$, and we define $m(h) = \epsilon^*/\epsilon_2$, which is a Herglotz
function that maps the upper half $h$-plane to the
upper half-plane and is analytic in the entire complex
$h$-plane except for the negative real axis $(-\infty, 0]$.

The key step in the method is obtaining a Stieltjes
integral representation for $\epsilon^*$. This integral representation
arises from a resolvent representation of the electric
field $E = s(sI - \Gamma\chi_1)^{-1} e_k$, where $\Gamma = \nabla(\Delta^{-1})\nabla\cdot$ acts
as a projection from $L^2(\Omega, P)$ onto the Hilbert space of
curl-free random fields, and $\Delta^{-1}$ is based on convolution
with the free-space Green function for the Laplacian
$\Delta = \nabla^2$. Consider the function $F(s) = 1 - m(h)$,
$s = 1/(1 - h)$, which is analytic off $[0, 1]$ in the $s$-plane.
Then, writing $F(s) = \langle \chi_1 [(sI - \Gamma\chi_1)^{-1} e_k] \cdot e_k \rangle$ yields

\[ F(s) = \int_0^1 \frac{d\mu(\lambda)}{s - \lambda}, \tag{5} \]

where $\mu(d\lambda) = \langle \chi_1 Q(d\lambda) e_k \cdot e_k \rangle$ is a positive spectral
measure on $[0, 1]$, and $Q(d\lambda)$ is the (unique) projection
valued measure associated with the bounded,
self-adjoint operator $\Gamma\chi_1$.

Figure 4 Realizations of the two-dimensional lattice percolation
model are shown in (a) and (b), and the corresponding
spectral functions (averaged over 5000 random realizations)
are shown in (c) and (d). In (c), there is a spectral gap
around $\lambda = 1$, indicating the lack of long-range order or
connectedness. The gap collapses in (d) when the percolation
threshold of $p = p_c = 0.5$ has been reached, and the system
exhibits long-range connectedness. Note the difference
in vertical scale in the graphs in (c) and (d).

Equation (5) is based on the spectral theorem for the
resolvent of the operator $\Gamma\chi_1$. It provides a Stieltjes
integral representation for the effective complex permittivity
$\epsilon^*$ that separates the component parameters
in $s$ from the complicated geometrical information contained
in the measure $\mu$. (Extensions of (5) to multicomponent
media with $\epsilon = \epsilon_1\chi_1 + \epsilon_2\chi_2 + \epsilon_3\chi_3 + \cdots + \epsilon_n\chi_n$
involve several complex variables.) Information about
the geometry enters through the moments

\[ \mu_n = \int_0^1 \lambda^n \, d\mu(\lambda) = \langle \chi_1 [(\Gamma\chi_1)^n e_k] \cdot e_k \rangle, \qquad n = 0, 1, 2, \ldots. \]

For example, the mass $\mu_0$ is given by $\mu_0 = \langle \chi_1 e_k \cdot e_k \rangle =
\langle \chi_1 \rangle = \phi$, where $\phi$ is the volume or area fraction of
material of phase 1, and $\mu_1 = \phi(1 - \phi)/d$ if the material
is statistically isotropic. In general, $\mu_n$ depends on the
$(n + 1)$-point correlation function of the medium.
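
The separation of material parameters from geometry can be illustrated with a discrete measure. The Python sketch below evaluates the Stieltjes sum for a measure made of point masses and recovers the effective permittivity from F(s) = 1 - m(h). As an assumed special case, not the measure of real sea ice, it uses a single point mass of weight phi at lambda = (1 - phi)/d, which has the two moments quoted above and for which the representation reduces to the classical Maxwell Garnett mixing formula. The function names and the permittivity values are hypothetical.

```python
def stieltjes_F(s, atoms):
    """F(s) = sum of w / (s - lam) over point masses (lam, w) of a discrete measure."""
    return sum(w / (s - lam) for lam, w in atoms)

def moment(n, atoms):
    """n-th moment of the discrete measure."""
    return sum(w * lam**n for lam, w in atoms)

def effective_permittivity(eps1, eps2, atoms):
    """eps* = eps2 * (1 - F(s)) with h = eps1/eps2 and s = 1/(1 - h); requires eps1 != eps2."""
    h = eps1 / eps2
    s = 1.0 / (1.0 - h)
    return eps2 * (1.0 - stieltjes_F(s, atoms))

def maxwell_garnett(eps1, eps2, phi, d):
    """Maxwell Garnett formula for inclusions eps1 at volume fraction phi in a host eps2."""
    return eps2 * (1.0 + d * phi * (eps1 - eps2)
                   / (eps1 + (d - 1) * eps2 - phi * (eps1 - eps2)))

if __name__ == "__main__":
    phi, d = 0.05, 3
    eps_brine, eps_ice = complex(60.0, 40.0), complex(3.15, 0.002)  # illustrative values
    atoms = [((1.0 - phi) / d, phi)]
    print(moment(0, atoms), moment(1, atoms))
    print(effective_permittivity(eps_brine, eps_ice, atoms))
    print(maxwell_garnett(eps_brine, eps_ice, phi, d))
```

The geometry enters only through the list of atoms, while the permittivities enter only through s, which mirrors the structure of (5).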

Computing the spectral measure $\mu$ for a given composite
microstructure involves first discretizing a two-phase
image of the composite into a square lattice
filled with 1s and 0s corresponding to the two phases.
The key operator $\Gamma\chi_1$, which depends on the geometry
via $\chi_1$, then becomes a self-adjoint matrix. The
spectral measure may be calculated directly from the
eigenvalues and eigenvectors of this matrix. Examples
of these spectral measures for the percolation model
on the two-dimensional square lattice are shown in
figure 4.


4 Applications to Sea Ice

4.1 Percolation Theory

Given a sample of sea ice at temperature $T$ in degrees
Celsius and bulk salinity $S$ in parts per thousand (ppt),
the brine volume fraction $\phi$ is given (approximately) by
the equation of Frankenstein and Garner:

\[ \phi = \frac{S}{1000} \left( \frac{49.185}{|T|} + 0.532 \right). \tag{6} \]


As temperature increases for fixed salinity, the volume
fraction $\phi$ of liquid brine in the ice also increases. The
inclusions become larger and more connected, as illustrated
in parts (a)-(c) of plate 6, which show images of
the brine phase in sea ice (in gold) obtained from X-ray
tomography scans of sea ice single crystals.

As the connectedness of the brine phase increases
with rising temperature, the ease with which fluid can
flow through sea ice (its fluid permeability) should
increase as well. In fact, sea ice exhibits a percolation
threshold, or critical brine volume fraction $\phi_c$, or critical
temperature $T_c$, below which columnar sea ice is
effectively impermeable to vertical fluid flow and above
which the ice is permeable, and increasingly so as temperature
rises. This critical behavior of fluid transport
in sea ice is illustrated in plate 6(d). The data on the
vertical fluid permeability $k(\phi)$ display a rapid rise
just above a threshold value of about $\phi_c \approx 0.05$ or
5%, similar to the conductivity (or permeability) in figure 3(e).
This type of behavior is also displayed by data
on brine drainage, with the effects of drainage shutting
down for brine volume fractions below about 5%.
Roughly speaking, we can refer to this phenomenon as
the on-off switch for fluid flow in sea ice. Through the
Frankenstein-Garner relation in (6), the critical brine
volume fraction $\phi_c \approx 0.05$ corresponds to a critical
temperature $T_c \approx -5$ °C, for a typical salinity of
5 ppt. This important threshold behavior has therefore
become known as the rule of fives.
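
The arithmetic behind the rule of fives can be reproduced directly from relation (6). In the Python sketch below the function names are hypothetical, and the restriction to temperatures below 0 °C is imposed here only to avoid division by zero; the sketch makes no claim about the range over which (6) is accurate.

```python
def brine_volume_fraction(T_celsius, S_ppt):
    """Brine volume fraction from relation (6), for T in deg C (negative) and S in ppt."""
    if T_celsius >= 0.0:
        raise ValueError("relation (6) is used here only for T below 0 deg C")
    return (S_ppt / 1000.0) * (49.185 / abs(T_celsius) + 0.532)

def critical_temperature(phi_c, S_ppt):
    """Temperature at which relation (6) gives the brine volume fraction phi_c."""
    return -49.185 / (1000.0 * phi_c / S_ppt - 0.532)

if __name__ == "__main__":
    print(brine_volume_fraction(-5.0, 5.0))   # close to 0.05
    print(critical_temperature(0.05, 5.0))    # close to -5 deg C
```

For a salinity of 5 ppt, a temperature of -5 °C gives a brine volume fraction of about 0.052, and a brine volume fraction of 0.05 corresponds to about -5.2 °C, consistent with the rounded values in the rule of fives.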

In view of this type of critical behavior, it is reasonable
to try to find a theoretical percolation explanation.
However, with $p_c \approx 0.25$ for the $d = 3$ cubic bond
lattice, it was apparent that key features of the geometry
of the brine microstructure in sea ice were being
missed by lattices. The threshold $\phi_c \approx 0.05$ was identified
with the critical probability in a continuum percolation
model for compressed powders that exhibit
microstructural characteristics similar to sea ice. The
identification explained the rule of fives, as well as data
on algal growth and snow-ice production. The compressed
powders shown in figure 5 were used in the
development of so-called stealthy or radar-absorbing
composites.

Figure 5 (a) A powder of large polymer spheres mixed with smaller metal spheres. (b) When the powder is compressed, its
microstructure is similar to that of the sea ice in (c). (Parts (a) and (b) are adapted from Golden, K. M., S. F. Ackley, and V. I.
Lytle. Science 18 December 1998:282 (5397), 2238-2241. Part (c) is adapted from CRREL report 87-20 (October 1987).)

When we applied the compressed powder model
to sea ice, we had no direct evidence that the brine
microstructure undergoes a transition in connectedness
at a critical brine volume fraction. This lack of
evidence was partly due to the difficulty of imaging
and quantitatively characterizing the brine inclusions
in three dimensions, particularly the thermal evolution
of their connectivity. Through X-ray computed tomography
and pore structure analysis we have now analyzed
the critical behavior of the thermal evolution of
brine connectedness in sea ice single crystals over a
temperature range from -18 °C to -3 °C. We have
mapped three-dimensional images of the pores and
throats in the brine phase onto graphs of nodes and
edges, and analyzed their connectivities as functions of
temperature and sample size. Realistic network models
of brine inclusions can be derived from porous
media analysis of three-dimensional microtomography
images. Finite-size scaling techniques largely confirm
the rule of fives, and they also confirm that sea ice
is a natural material that exhibits a strong anisotropy
in percolation thresholds.

Now we consider the application of percolation
theory to understanding the fluid permeability of sea
ice. In the continuum, the permeability and conductivity
exponents $e$ and $t$ can take nonuniversal values
and need not be equal, as in the case of the three-dimensional
Swiss cheese model. Continuum models
have been studied by mapping to a lattice with a probability
density $\psi(g)$ of bond conductances $g$. Nonuniversal
behavior can be obtained when $\psi(g)$ is singular
as $g \to 0^{+}$. However, for a lognormal conductance
distribution arising from intersections of lognormally
distributed inclusions, as in sea ice, the behavior is
universal. Thus $e \approx 2$ for sea ice.

The permeability scaling factor $k_0$ for sea ice is esti- 
mated using critical path analysis. For media with $g$ in 
a wide range, the overall behavior is dominated by a 
critical bottleneck conductance $g_c$, the smallest conduc- 
tance such that the critical path $\{g : g \ge g_c\}$ spans the 
sample. With most brine channel diameters between 
1.0 mm and 1.0 cm, spanning fluid paths have a small- 
est characteristic radius $r_c \approx 0.5$ mm, and we estimate 
$k_0$ by the pipe-flow result $r_c^2/8$. Thus, 

$$k(\phi) \sim 3(\phi - \phi_c)^2 \times 10^{-8}\ \mathrm{m}^2, \qquad \phi \to \phi_c^+. \tag{7}$$

In plate 6(f), field data with $\phi$ in [0.055, 0.15], just 
above $\phi_c \approx 0.05$, are compared with (7) and show close 
agreement. The striking result that, for sea ice, $e \approx 2$, 
the universal lattice value in three dimensions, is due to 
the general lognormal structure of the brine inclusion 
distribution function. The general nature of our results 
suggests that similar types of porous media, such as 
saline ice on extraterrestrial bodies, may also exhibit 
universal critical behavior. 
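
The arithmetic behind (7) is short enough to check directly. The following Python sketch evaluates the pipe-flow estimate and the scaling law just above the threshold. The function names and the convention of returning zero below the threshold are illustrative assumptions; the numerical values are those quoted above.

```python
def pipe_flow_k0(r_c):
    """Permeability scale k0 = r_c**2 / 8 in m^2 for a pipe of radius r_c in m."""
    return r_c ** 2 / 8.0


def permeability(phi, phi_c=0.05, r_c=0.5e-3, e=2.0):
    """Scaling estimate k(phi) ~ k0 * (phi - phi_c)**e just above phi_c."""
    if phi <= phi_c:
        # Assumption for illustration: no spanning brine path below threshold.
        return 0.0
    return pipe_flow_k0(r_c) * (phi - phi_c) ** e


k0 = pipe_flow_k0(0.5e-3)      # about 3e-8 m^2, the prefactor in (7)
k_field = permeability(0.15)   # upper end of the field-data range
```

With $r_c = 0.5$ mm the prefactor is $3.125 \times 10^{-8}$ m$^2$, which rounds to the factor of 3 in (7).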

4.2 Analytic Continuation 

4.2.1 Bounds on the Effective Complex Permittivity 

Bounds on $\epsilon^*$, or $F(s)$, are obtained by fixing $s$ in 
(5), varying over admissible measures $\mu$ (or admissible 
geometries), such as those that satisfy only 

$$\mu_0 = \phi, \tag{8}$$

and finding the corresponding range of values of $F(s)$ 
in the complex plane. Two types of bounds on $\epsilon^*$ are 
obtained. The first bound $R_1$ assumes only that the rel- 
ative volume fractions $p_1 = \phi$ and $p_2 = 1 - p_1$ of 
the brine and ice are known, so that (8) is satisfied. In 
this case, the admissible set of measures forms a com- 
pact, convex set. Since (5) is a linear functional of $\mu$, 
the extreme values of $F$ are attained by extreme points 
of the set of admissible measures, which are the Dirac 
point measures of the form $p_1\delta_z$. The values of $F$ must 
lie inside the circle $p_1/(s - z)$, $-\infty \le z \le \infty$, and the 
region $R_1$ is bounded by circular arcs, one of which is 
parametrized in the $F$-plane by 

$$C_1(z) = \frac{p_1}{s - z}, \qquad 0 \le z \le p_2. \tag{9}$$
To display the other arc, it is convenient to use the 
auxiliary function 

$$E(s) = 1 - \frac{\epsilon_1}{\epsilon^*} = \frac{1 - sF(s)}{s(1 - F(s))}, \tag{10}$$

which is a Herglotz function like $F(s)$, analytic off [0, 1]. 
Then, in the $E$-plane, we can parametrize the other 
circular boundary of $R_1$ by 

$$\hat{C}_1(z) = \frac{p_2}{s - z}, \qquad 0 \le z \le p_1. \tag{11}$$


In the $\epsilon^*$-plane, $R_1$ has vertices $V_1 = \epsilon_1/(1 - \hat{C}_1(0)) = 
(p_1/\epsilon_1 + p_2/\epsilon_2)^{-1}$ and $W_1 = \epsilon_2(1 - C_1(0)) = p_1\epsilon_1 + p_2\epsilon_2$, 
and collapses to the interval 

$$(p_1/\epsilon_1 + p_2/\epsilon_2)^{-1} \le \epsilon^* \le p_1\epsilon_1 + p_2\epsilon_2 \tag{12}$$

when $\epsilon_1$ and $\epsilon_2$ are real, which are the classical arith- 
metic (upper) and harmonic (lower) mean bounds, also 
called the elementary bounds. The complex elementary 
bounds (9) and (11) are optimal and can be attained by a 
composite of uniformly aligned spheroids of material 1 
in all sizes coated with confocal shells of material 2, and 
vice versa. These arcs are traced out as the aspect ratio 
varies. 
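
The two vertices can be checked numerically from the arcs (9) and (11). The Python sketch below assumes the definitions $F = 1 - \epsilon^*/\epsilon_2$ and $s = 1/(1 - \epsilon_1/\epsilon_2)$ from (5), which lies outside this passage. The function names and the permittivity values are hypothetical and chosen only for illustration.

```python
def s_parameter(eps1, eps2):
    """Spectral variable s = 1 / (1 - eps1/eps2); requires eps1 != eps2."""
    return 1.0 / (1.0 - eps1 / eps2)


def arc_F(z, p1, s):
    """Arc (9) in the F-plane, assuming F = 1 - eps*/eps2."""
    return p1 / (s - z)


def arc_E(z, p2, s):
    """Arc (11) in the E-plane, with E = 1 - eps1/eps*."""
    return p2 / (s - z)


def elementary_vertices(eps1, eps2, p1):
    """Return (V1, W1), the vertices of R1 mapped back to the eps*-plane."""
    p2 = 1.0 - p1
    s = s_parameter(eps1, eps2)
    V1 = eps1 / (1.0 - arc_E(0.0, p2, s))   # harmonic mean
    W1 = eps2 * (1.0 - arc_F(0.0, p1, s))   # arithmetic mean
    return V1, W1


# Illustrative (not measured) complex permittivities and brine fraction.
eps_brine, eps_ice, phi = 60.0 + 40.0j, 3.15 + 0.002j, 0.1
V1, W1 = elementary_vertices(eps_brine, eps_ice, phi)
```

For real permittivities the same function returns the endpoints of the interval (12).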

If the material is further assumed to be statistically 
isotropic, i.e., if $\epsilon^*_{ik} = \epsilon^*\delta_{ik}$, then $\mu_1 = \phi(1 - \phi)/d$ must 
be satisfied as well. A convenient way of including this 
information is to use the transformation 

$$F_1(s) = \frac{1}{p_1} - \frac{1}{sF(s)}. \tag{13}$$

The function $F_1(s)$ is, again, a Herglotz function, which 
has the representation 

$$F_1(s) = \int_0^1 \frac{d\mu^1(z)}{s - z}.$$

The constraint $\mu_1 = \phi(1 - \phi)/d$ on $F(s)$ is then trans- 
formed to a restriction of only the mass, or zeroth 
moment $\mu_0^1$, of $\mu^1$, with 

$$\mu_0^1 = p_2/(p_1 d).$$

Applying the same procedure as for $R_1$ yields a region 
$R_2$ whose boundaries are again circular arcs. When 
$\epsilon_1$ and $\epsilon_2$ are real with $\epsilon_1 \le \epsilon_2$, the region col- 
lapses to a real interval, whose endpoints are known as 
the Hashin-Shtrikman bounds. We remark that higher- 
order correlation information can be conveniently 
incorporated by iterating (13). 
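
The value of the zeroth moment quoted above can be checked by expanding for large $|s|$. The working below assumes the moment expansion of $F(s)$ implied by the integral representation (5), with $\mu_n$ the $n$th moment of $\mu$ and $\mu_0 = p_1$. It is a sketch of the intermediate steps, not a derivation taken from the text.

```latex
\begin{align*}
F(s) &= \int_0^1 \frac{d\mu(z)}{s - z}
      = \frac{\mu_0}{s} + \frac{\mu_1}{s^2} + O(s^{-3}),
      \qquad \mu_0 = p_1, \\
\frac{1}{sF(s)} &= \frac{1}{p_1}\left(1 + \frac{\mu_1}{p_1 s} + O(s^{-2})\right)^{-1}
      = \frac{1}{p_1} - \frac{\mu_1}{p_1^2 s} + O(s^{-2}), \\
F_1(s) &= \frac{1}{p_1} - \frac{1}{sF(s)}
      = \frac{\mu_1}{p_1^2 s} + O(s^{-2})
      \quad\Longrightarrow\quad
      \mu_0^1 = \frac{\mu_1}{p_1^2} = \frac{p_1 p_2}{d\,p_1^2} = \frac{p_2}{p_1 d}.
\end{align*}
```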


4.2.2 Inverse Homogenization 

It has been shown that the spectral measure $\mu$, which 
contains all geometrical information about a compos- 
ite, can be uniquely reconstructed if measurements of 
the effective permittivity $\epsilon^*$ are available on an arc 
in the complex $s$-plane. If the component parameters 
depend on frequency $\omega$ (not to be confused with real- 
izations of the random medium above), variation of 
$\omega$ in an interval $(\omega_1, \omega_2)$ gives the required data. 
The reconstruction of $\mu$ can be reduced to an inverse 
potential problem. Indeed, $F(s)$ admits a representa- 
tion through a logarithmic potential $\Phi$ of the measure 
$\mu$, 

$$F(s) = \frac{\partial \Phi}{\partial s}, \qquad \Phi(s) = \int_0^1 \ln|s - z|\, d\mu(z), \tag{14}$$

where $\partial/\partial s = \partial/\partial x - i\,\partial/\partial y$. The potential $\Phi$ satisfies 
the Poisson equation $\Delta\Phi = -\rho$, where $\rho(z)$ is a density 
on [0, 1]. A solution to the forward problem is given 
by the Newtonian potential with $\mu(dz) = \rho(z)\,dz$. The 
inverse problem is to find $\rho(z)$ (or $\mu$) given values of 
$\Phi$, $\partial\Phi/\partial n$, or $\nabla\Phi$. The inverse problem is ill-posed and 
requires regularization [IV.4 §7] to develop a stable 
numerical algorithm. 

When frequency $\omega$ varies across $(\omega_1, \omega_2)$, the com- 
plex parameter $s$ traces an arc $C$ in the $s$-plane. Let $A$ 
be the integral operator in (14), 

$$A\mu = \frac{\partial}{\partial s} \int_0^1 \ln|s - \lambda|\, d\mu(\lambda),$$

mapping the set of measures $M[0, 1]$ on the unit inter- 
val onto the set of derivatives of complex potentials 
defined on a curve $C$. To construct the solution we con- 
sider the problem of minimizing $\|A\mu - F\|^2$ over $\mu \in M$, 
where $\|\cdot\|$ is the $L^2(C)$-norm, $F(s)$ is the measured data, 
and $s \in C$. The solution does not depend continuously 
on the data, and regularization based on constrained 
minimization is needed. Instead of $\|A\mu - F\|^2$ being 
minimized over all functions in $M$, it is minimized over 
a convex subset satisfying $J(\mu) \le \beta$ for a stabilizing 
functional $J(\mu)$ and some $\beta > 0$. The advantage of 
using quadratic $J(\mu) = \|L\mu\|^2$ is the linearity of the 
corresponding Euler equation, resulting in efficiency 
of the numerical schemes. However, the reconstructed 
solution necessarily possesses a certain smoothness. 
Nonquadratic stabilization imposes constraints on the 
variation of the solution. The total variation penaliza- 
tion, as well as a nonnegativity constraint, does not 
imply smoothness, permitting more general recovery, 
including the important Dirac measures. 
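
A minimal numerical sketch of the quadratic case may help fix ideas. It assumes that the measure is discretized as point masses on a uniform grid of nodes in [0, 1], that $L$ is the identity, and that the constraint is handled in the penalized form $\|A\mu - F\|^2 + \alpha\|\mu\|^2$ with a hypothetical weight $\alpha$. The arc, the grid, and the synthetic data are invented for illustration, and NumPy is used for the linear algebra.

```python
import numpy as np


def forward_matrix(s, z):
    """Discretized operator: (A mu)(s_i) = sum_j mu_j / (s_i - z_j)."""
    return 1.0 / (s[:, None] - z[None, :])


def tikhonov_reconstruct(s, F, z, alpha):
    """Minimize ||A mu - F||^2 + alpha * ||mu||^2 over real weights mu."""
    A = forward_matrix(s, z)
    # Stack real and imaginary parts so that the unknown weights stay real.
    A_ri = np.vstack([A.real, A.imag])
    F_ri = np.concatenate([F.real, F.imag])
    n = z.size
    # Augmented least-squares system equivalent to the penalized problem.
    A_aug = np.vstack([A_ri, np.sqrt(alpha) * np.eye(n)])
    F_aug = np.concatenate([F_ri, np.zeros(n)])
    mu, *_ = np.linalg.lstsq(A_aug, F_aug, rcond=None)
    return mu


# Synthetic test case (hypothetical): a point mass 0.2 at z = 0.3.
z = np.linspace(0.0, 1.0, 21)
mu_true = np.zeros_like(z)
mu_true[6] = 0.2
theta = np.linspace(0.2, 1.2, 40)
s = 0.5 + 1.5 * np.exp(1j * theta)          # data arc C in the s-plane
F = forward_matrix(s, z) @ mu_true

mu_rec = tikhonov_reconstruct(s, F, z, alpha=1e-8)
rel_residual = (np.linalg.norm(forward_matrix(s, z) @ mu_rec - F)
                / np.linalg.norm(F))
```

The data are fitted closely, but the recovered weights are typically spread over neighboring nodes rather than concentrated at one point, which illustrates the smoothing effect of a quadratic stabilizer noted above.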

We have also solved a reduced inverse spectral 
problem exactly by bounding the volume fraction of 
the constituents, an inclusion separation parameter q, 
and the spectral gap of $\Gamma\chi_1$. We developed an algo- 
rithm based on the Möbius transformation structure 
of the forward bounds whose output is a set of alge- 
braic curves in parameter space bounding regions of 
admissible parameter values. These results advance 
the development of techniques for characterizing the 
microstructure of composite materials, and they have 
been applied to sea ice to demonstrate electromagneti- 
cally that the brine inclusion separations vanish as the 
percolation threshold is approached. 

5 Geometry of the Marginal Ice Zone 

Dense pack ice transitions to open ocean over a 
region of broken ice termed the marginal ice zone 
(MIZ), a highly dynamic region in which the ice cover 
lies close to an open ocean boundary and where 
intense atmosphere-ice-ocean interactions take place. 
The width of the MIZ is a fundamental length scale for 
polar dynamics, in part because it represents the dis- 
tance over which ocean waves and swell penetrate into 
the sea ice cover. Wave penetration can break a smooth 
ice layer into floes, meaning that the MIZ acts as a buffer 
zone that protects the stable morphology of the inner 
ice. Waves also promote the formation of pancake ice, 
as shown in plate 7. Moreover, the width of the MIZ 
is an important spatial dimension of the marine polar 
habitat and impacts human accessibility to high lati- 
tudes. Using a conformal mapping method to quan- 
tify MIZ width (see below), a dramatic 39% widening 
of the summer Arctic MIZ, based on three decades of 
satellite-derived data (1979-2012), has been reported. 

Challenges associated with objective measurement 
of the MIZ width include the MIZ’s shape, which is in 
general not geodesically convex, as illustrated by the 
shaded example in plate 8(a). Sea ice concentration ($c$) 
is used here to define the MIZ as a body of marginal ice 
($0.15 \le c \le 0.80$) adjoining both pack ice ($c > 0.80$) and 
sparse ice ($c < 0.15$). To define an objective MIZ width 
applicable to such shapes, an idealized sea ice concen- 
tration field $\psi(x, y)$ satisfying Laplace’s equation 
[III.18] within the MIZ, 

$$\nabla^2 \psi = 0, \tag{15}$$

was introduced. We use $(x, y)$ to denote a point in two- 
dimensional space, and it is understood that we are 
working on the spherical Earth. Boundary conditions 
for (15) are $\psi = 0.15$ where the MIZ borders a sparse ice 
region and $\psi = 0.80$ where the MIZ borders a pack 
ice region. The solutions to (15) for the examples in 
parts (a) and (b) of plate 8 are illustrated by colored 
shading. Any curve $\gamma$ orthogonal to the level curves 
of $\psi$ and connecting two points on the MIZ perimeter 
(a black field line through the gradient field $\nabla\psi$, as in 
plate 8(b)) is contained in the MIZ, and its length pro- 
vides an objective measure of MIZ width ($\ell$). Defined 
in this way, $\ell$ is a function of distance along the MIZ 
perimeter ($s$) from an arbitrary starting point, and this 
dependence is denoted by $\ell = \ell(s)$. Analogous appli- 
cations of Laplace’s equation have been introduced in 
medical imaging to measure the width or thickness of 
human organs. 

Derivatives in (15) were numerically approximated 
using second-order finite differences, and solutions 
were obtained in the data’s native stereographic projec- 
tion since solutions of Laplace’s equation are invariant 
under conformal mapping. For a given day and MIZ, 
a summary measure of MIZ width ($w$) can be defined 
by averaging $\ell$ with respect to distance along the MIZ 
perimeter: 

$$w = \frac{1}{L_M} \int_M \ell(s)\, ds, \tag{16}$$

where $M$ is the closed curve defining the MIZ perime- 
ter and $L_M$ is the length of $M$. Averaging $w$ over July- 
September of each available year reveals the dramatic 
widening of the summer MIZ, as illustrated in plate 8(c). 
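
To illustrate the finite-difference solution of (15), the Python sketch below solves Laplace's equation on a hypothetical idealized MIZ: a straight strip that is periodic in the along-edge direction, with sparse ice on one side and pack ice on the other. The Jacobi iteration, the grid size, and the function name are assumptions made for the example. They are not the method used for the satellite data.

```python
import numpy as np


def solve_laplace_strip(nx=21, ny=8, psi_sparse=0.15, psi_pack=0.80,
                        n_iter=5000):
    """Jacobi iteration for the five-point (second-order) Laplacian.

    Column 0 borders sparse ice and column nx-1 borders pack ice; rows
    wrap around, so the strip is periodic along the ice edge.
    """
    psi = np.full((ny, nx), 0.5 * (psi_sparse + psi_pack))
    psi[:, 0] = psi_sparse
    psi[:, -1] = psi_pack
    for _ in range(n_iter):
        north = np.roll(psi, 1, axis=0)
        south = np.roll(psi, -1, axis=0)
        # Average of the four neighbors, computed from the old iterate.
        interior = 0.25 * (psi[:, :-2] + psi[:, 2:]
                           + north[:, 1:-1] + south[:, 1:-1])
        psi[:, 1:-1] = interior
    return psi


psi = solve_laplace_strip()
```

In this geometry the level curves are straight and the field lines cross the strip at right angles, so $\ell(s)$ is constant and the average (16) reduces to the strip width.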

6 Geometry of Arctic Melt Ponds 

From the first appearance of visible pools of water, 
often in early June, the area fraction $\phi$ of sea ice cov- 
ered by melt ponds can increase rapidly to over 70% in 
just a few days. Moreover, the accumulation of water 
at the surface dramatically lowers the albedo where 
the ponds form. There is a corresponding critical drop- 
off in average albedo. The resulting increase in solar 
absorption in the ice and upper ocean accelerates melt- 
ing, possibly triggering ice-albedo feedback. Similarly, 
an increase in open-water fraction lowers albedo, thus 
increasing solar absorption and subsequent melting. 
The spatial coverage and distribution of melt ponds 
on the surface of ice floes and the open water between 
the floes thus exert primary control of ice pack albedo 
and the partitioning of solar energy in the ice-ocean 
system. Given the critical role of ice-albedo feedback 
in the recent losses of Arctic sea ice, ice pack albedo 
and the formation and evolution of melt ponds are of 
significant interest in climate modeling. 

While melt ponds form a key component of the Arc- 
tic marine environment, comprehensive observations 
or theories of their formation, coverage, and evolution 
remain relatively sparse. Available observations of melt 
ponds show that their areal coverage is highly vari- 
able, particularly for first-year ice early in the melt sea- 
son, with rates of change as high as 35% per day. Such 
Figure 6 (a) Area-perimeter data for 5269 Arctic melt 
ponds, plotted on logarithmic scales. (b) Melt pond frac- 
tal dimension $D$ as a function of area $A$, computed from 
the data in (a). Ponds corresponding to the three black 
stars in (a), from left to right, are denoted by (c), (d), 
and (e), respectively, in the bottom diagram. The transi- 
tional pond in (d) has a horizontal scale of about 30 m. 
(Adapted from Hohenegger, C., B. Alali, K. R. Steffen, D. K. 
Perovich, and K. M. Golden. 2012. Transition in the fractal 
geometry of Arctic melt ponds. The Cryosphere 6:1157-62 
(doi:10.5194/tc-6-1157-2012).) 

variability, as well as the influence of many competing 
factors controlling the evolution of melt ponds and ice 
floes, makes the incorporation of realistic treatments 
of albedo into climate models quite challenging. Small- 
and medium-scale models of melt ponds that include 
some of these mechanisms have been developed, and 
melt pond parametrizations are being incorporated 
into global climate models. 


The surface of an ice floe is viewed here as a two- 
phase composite of dark melt ponds and white snow or 
ice. The onset of ponding and the rapid increase in cov- 
erage beyond the initial threshold is similar to critical 
phenomena in statistical physics and composite mate- 
rials. It is natural, therefore, to ask if the evolution of 
melt pond geometry exhibits universal characteristics 
that do not necessarily depend on the details of the 
driving mechanisms in numerical melt pond models. 
Fundamentally, the melting of Arctic sea ice is a phase- 
transition phenomenon, where a solid turns to liquid, 
albeit on large regional scales and over a period of time 
that depends on environmental forcing and other fac- 
tors. We thus look for features of melt pond evolu- 
tion that are mathematically analogous to related phe- 
nomena in the theories of phase transitions and com- 
posite materials. As a first step in this direction, we 
consider the evolution of the geometric complexity of 
Arctic melt ponds. 

By analyzing area-perimeter data from hundreds of 
thousands of melt ponds, we have discovered an unex- 
pected separation of scales, where the pond fractal 
dimension $D$ exhibits a transition from 1 to 2 around a 
critical length scale of 100 m$^2$ in area, as shown in fig- 
ure 6. Small ponds with simple boundaries coalesce or 
percolate to form larger connected regions. Pond com- 
plexity increases rapidly through the transition region 
and reaches a maximum for ponds larger than 1000 m$^2$, 
whose boundaries resemble space-filling curves with 
$D \approx 2$. These configurations affect the complex radi- 
ation fields under melting sea ice, the heat balance of 
sea ice and the upper ocean, under-ice phytoplankton 
blooms, biological productivity, and biogeochemical 
processes. 
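
One common way to extract a fractal dimension from area-perimeter data is to assume the scaling $P \sim A^{D/2}$, so that $D$ is twice the slope of $\log P$ against $\log A$. The Python sketch below applies this to synthetic shapes. The scaling relation, the function name, and the test data are assumptions for illustration and are not taken from the pond data set.

```python
import math


def fractal_dimension(areas, perimeters):
    """Twice the least-squares slope of log P against log A (P ~ A**(D/2))."""
    xs = [math.log(a) for a in areas]
    ys = [math.log(p) for p in perimeters]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return 2.0 * sxy / sxx


# Synthetic check 1: circles have smooth boundaries, so D should be 1.
radii = [1.0, 2.0, 5.0, 10.0, 20.0]
D_circles = fractal_dimension([math.pi * r ** 2 for r in radii],
                              [2.0 * math.pi * r for r in radii])

# Synthetic check 2: perimeter proportional to area gives D = 2.
areas = [100.0, 300.0, 1000.0, 3000.0]
D_filling = fractal_dimension(areas, [0.7 * a for a in areas])
```

Applied in a sliding window of areas, an estimate of this kind would give $D$ as a function of $A$, as plotted in figure 6(b).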

Melt pond evolution also appears to exhibit a per- 
colation threshold, where one phase in a composite 
becomes connected on macroscopic scales as some 
parameter exceeds a critical value. An important exam- 
ple of this phenomenon in the microphysics of sea ice 
(discussed above), which is fundamental to the pro- 
cess of melt pond drainage, is the percolation tran- 
sition exhibited by the brine phase in sea ice, or the 
rule of fives discussed on page 697. When the brine 
volume fraction of columnar sea ice exceeds approx- 
imately 5%, the brine phase becomes macroscopically 
connected so that fluid pathways allow flow through 
the porous microstructure of the ice. Similarly, even 
casual inspection of the aerial photos in plate 9 shows 
that the melt pond phase of sea ice undergoes a perco- 
lation transition where disconnected ponds evolve into 
much larger-scale connected structures with complex 
boundaries. Connectivity of melt ponds promotes fur- 
ther melting and the breakup of floes, as well as hor- 
izontal transport of meltwater and drainage through 
cracks, leads, and seal holes. 
V.18 Numerical Weather Prediction 

Peter Lynch 


1 Introduction 

The development of computer models for numeri- 
cal simulation and prediction of the atmosphere and 
oceans is one of the great scientific triumphs of the 
past fifty years. Today, numerical weather prediction 
(NWP) plays a central and essential role in operational 
weather forecasting, with forecasts now having accu- 
racy at ranges beyond a week. There are several rea- 
sons for this: enhancements in model resolution, better 
numerical schemes, more realistic parametrizations of 
physical processes, new observational data from satel- 
lites, and more sophisticated methods of determining 
the initial conditions. In this article we focus on the fun- 
damental equations, the formulation of the numerical 
algorithms, and the variational approach to data assim- 
ilation. We present the mathematical principles of NWP 
and illustrate the process by considering some specific 
models and their application to practical forecasting. 

2 The Basic Equations 

The atmosphere is governed by the fundamental laws 
of physics, expressed in terms of mathematical equa- 
tions. They form a system of coupled nonlinear partial 
differential equations (PDEs). These equations can be 
used to predict the evolution of the atmosphere and to 
simulate its long-term behavior. 

The primary variables are the fluid velocity $\mathbf{V}$ (with 
three components, $u$ eastward, $v$ northward, and $w$ 
upward), pressure $p$, density $\rho$, temperature $T$, and 
humidity $q$. Using Newton’s laws of motion and the 
principles of conservation of energy and mass, we can 
obtain a system whose solution is well determined by 
the initial conditions. 

The central components of the system, govern- 
ing fluid motion, are the Navier-Stokes equations 
[III.23]. We write them in vector form: 

$$\frac{\partial \mathbf{V}}{\partial t} + \mathbf{V}\cdot\nabla\mathbf{V} + 2\boldsymbol{\Omega}\times\mathbf{V} + \frac{1}{\rho}\nabla p = \mathbf{F} + \mathbf{g}.$$

The equations are relative to the rotating Earth and 
$\boldsymbol{\Omega}$ is the Earth’s angular velocity. In order, the terms 
of this equation represent local acceleration, nonlin- 
ear advection, Coriolis term, pressure gradient, friction, 
and gravity. The friction term F is small in the free 
atmosphere but is crucially important in the boundary 
layer (roughly, the first 1 km above the Earth’s surface). 
The apparent gravity g includes the centrifugal force, 
which depends only on position. 

The temperature, pressure, and density are linked 
through the equation of state 

$$p = R\rho T,$$

where $R$ is the gas constant for dry air. In practice, a 
slight elaboration of this is used that takes account of 
moisture in the atmosphere. 

Energy conservation is embodied in the first law of 
thermodynamics, 

$$c_v \frac{dT}{dt} + RT\,\nabla\cdot\mathbf{V} = Q,$$

where $c_v$ is the specific heat at constant volume and 
$Q$ is the diabatic heating rate. Conservation of mass is 
expressed in terms of the continuity equation: 

$$\frac{d\rho}{dt} + \rho\,\nabla\cdot\mathbf{V} = 0.$$

Finally, conservation of water substance is expressed 
by the equation 

$$\frac{dq}{dt} = S,$$

where $q$ is the specific humidity and $S$ represents all 
sources and sinks of water vapor. 

Once initial conditions, appropriate boundary con- 
ditions, and external forcings, sources, and sinks are 


given, the above system of seven (scalar) equations pro- 
vides a complete description of the evolution of the 
seven variables $\{u, v, w, p, \rho, T, q\}$. 

For large-scale motions the vertical component of 
velocity is very much smaller than the horizontal com- 
ponents, and we can replace the vertical equation by 
a balance between the vertical pressure gradient and 
gravity. This yields the hydrostatic equation 

$$\frac{\partial p}{\partial z} + g\rho = 0.$$

Hydrostatic models were used for the first fifty years of 
NWP but nonhydrostatic models are now coming into 
widespread use. 
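
As a small worked example, the hydrostatic balance can be combined with the equation of state $p = R\rho T$ and integrated upward. The Python sketch below does this for an isothermal layer, where the exact answer is an exponential decay of pressure with height. The constant temperature, the numerical values of $R$ and $g$, the midpoint integration rule, and the function names are assumptions made for the illustration.

```python
import math

R_DRY = 287.0   # gas constant for dry air, J kg^-1 K^-1 (assumed value)
G = 9.81        # gravitational acceleration, m s^-2 (assumed value)


def density(p, T, R=R_DRY):
    """Equation of state p = R * rho * T solved for the density."""
    return p / (R * T)


def integrate_hydrostatic(p0, T, z_top, dz=10.0):
    """Integrate dp/dz = -g * rho upward with the midpoint rule (isothermal)."""
    p = p0
    for _ in range(int(round(z_top / dz))):
        slope = -G * density(p, T)
        p_mid = p + 0.5 * dz * slope
        p += dz * (-G * density(p_mid, T))
    return p


p_surface, T_iso, z_top = 101325.0, 250.0, 5000.0
p_numeric = integrate_hydrostatic(p_surface, T_iso, z_top)
p_exact = p_surface * math.exp(-G * z_top / (R_DRY * T_iso))
```

The two values agree closely, and the e-folding height $RT/g$ is a little over 7 km for these numbers.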

3 The Emergence of NWP 

The idea of calculating the changes in the weather 
by numerical methods emerged around the turn of 
the twentieth century. Cleveland Abbe, an American 
meteorologist, viewed weather forecasting as an appli- 
cation of hydrodynamics and thermodynamics to the 
atmosphere. He also identified a system of mathemat- 
ical equations, essentially those presented in section 2 
above, that govern the evolution of the atmosphere. 
This idea was developed in greater detail by the Norwe- 
gian Vilhelm Bjerknes, whose stated goal was to make 
meteorology an exact science: a true physics of the 
atmosphere. 

3.1 Richardson’s Forecast 

During World War I, Lewis Fry Richardson, an English 
Quaker mathematician, calculated the changes in the 
weather variables directly from the fundamental equa- 
tions and presented his results in a book, Weather 
Prediction by Numerical Process, in 1922. His predic- 
tion of pressure changes was utterly unrealistic, being 
two orders of magnitude too large. The primary cause 
of this failure was the inaccuracy and imbalance of 
the initial conditions. Despite the outlandish results, 
Richardson's methodology was unimpeachable, and is 
essentially the approach we use today to integrate the 
equations. 

Richardson was several decades ahead of his time. 
For computational weather forecasting to become a 
practical reality, advances on a number of fronts were 
required. First, an observing system for the tropo- 
sphere, the lowest layer of the atmosphere, extending 
to about 12 km, was established to serve the needs of 
aviation; this also provided the initial data for weather 
forecasting. Second, advances in numerical analysis led 
to the design of stable and accurate algorithms for solv- 
ing the PDEs. Third, progress in meteorological theory, 
especially the development of the quasigeostrophic 
equations and improved understanding of atmospheric 
balance, provided a means to eliminate the spurious 
high-frequency oscillations that had spoiled Richard- 
son's forecast. Finally, the invention of high-speed dig- 
ital computers enabled the enormous computational 
task of solving the equations to be undertaken. 


3.2 The ENIAC Integrations 


The first forecasts made using an automatic com- 
puter were completed in 1950 on the ENIAC (Elec- 
tronic Numerical Integrator and Computer), the first 
programmable general-purpose computer. The fore- 
casts used a highly simplified model, representing the 
atmosphere as a single layer and assuming conserva- 
tion of absolute vorticity expressed by the barotropic 
vorticity equation, 


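$$\frac{d}{dt}(\zeta + f) = 0,$$

where $\zeta$ is the vorticity of the flow and $f = 2\Omega\sin\phi$ 
is the Coriolis parameter, with $\Omega$ the angular velocity 
of the Earth and $\phi$ the latitude. The Lagrangian time 
derivative 

$$\frac{d}{dt} = \frac{\partial}{\partial t} + \mathbf{V}\cdot\nabla$$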

includes the nonlinear advection by the flow. The equa- 
tion was approximated by finite differences in space 
and time with a grid size of 736 km (at the North Pole) 
and a time step of three hours. The resulting forecasts, 
while far from perfect, were realistic and provided a 
powerful stimulus for further work. 

Baroclinic, or multilevel, models that enabled realis- 
tic representation of the vertical structure of the atmo- 
sphere were soon developed. Moreover, the simplified 
equations were replaced by more accurate primitive 
equations, that is, the equations presented in section 2 
but with the hydrostatic approximation. As these equa- 
tions simulate high-frequency gravity waves in addition 
to the motions that are important for weather, the ini- 
tial conditions must be carefully balanced. Techniques 
for ensuring this were developed. Most notable among 
these was the normal-mode initialization method: the 
flow is resolved into normal modes and modified to 
ensure that the tendencies, or rates of change, of 
the gravity wave components vanish. This suppresses 
spurious oscillations. 


4 Solving the Equations 

Analytical solution of the equations is impossible, so 
approximate methods must be employed. We consider 
methods of discretizing the spatial domain to reduce 
the PDEs to an algebraic system and of advancing the 
solution in time. 


4.1 Time-Stepping Schemes 


Let $Q$ denote a typical dependent variable, governed by 
an equation of the form 

$$\frac{dQ}{dt} = F(Q).$$

We replace the continuous-time domain $t$ by a sequence 
of discrete times $\{0, \Delta t, 2\Delta t, \ldots, n\Delta t, \ldots\}$, with the 
solution at these times denoted by $Q^n = Q(n\Delta t)$. 
If this solution is known up to time $t = n\Delta t$, the 
right-hand term $F^n = F(Q^n)$ can be computed. The 
time derivative is now approximated by a centered 
difference 

$$\frac{Q^{n+1} - Q^{n-1}}{2\Delta t} = F^n,$$

so the “forecast” value $Q^{n+1}$ may be computed from 
the old value $Q^{n-1}$ and the tendency $F^n$: 

$$Q^{n+1} = Q^{n-1} + 2\Delta t\, F^n.$$


This is called the leapfrog scheme. The process of step- 
ping forward from moment to moment is repeated a 
large number of times, until the desired forecast range 
is reached. 
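
The leapfrog recursion can be sketched in a few lines of Python. Because the scheme needs two earlier time levels, the example takes a single forward (Euler) step to get started; this starting procedure, the test equation $dQ/dt = i\omega Q$, and the function name are assumptions made for illustration.

```python
import cmath


def leapfrog(F, q0, dt, n_steps):
    """Integrate dQ/dt = F(Q) with Q^{n+1} = Q^{n-1} + 2*dt*F(Q^n)."""
    q_prev = q0
    q_curr = q0 + dt * F(q0)   # assumed forward step to start the scheme
    history = [q_prev, q_curr]
    for _ in range(n_steps - 1):
        q_next = q_prev + 2.0 * dt * F(q_curr)
        q_prev, q_curr = q_curr, q_next
        history.append(q_curr)
    return history


# Oscillation equation with omega * dt = 0.1, well inside the stable range.
omega = 1.0
trajectory = leapfrog(lambda q: 1j * omega * q, 1.0 + 0.0j, dt=0.1, n_steps=200)
q_exact_end = cmath.exp(1j * omega * 0.1 * 200)
```

For this small time step the amplitude stays close to one and the computed value remains near the exact solution, with a slowly accumulating phase error.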

The leapfrog scheme is limited by a stability criterion 
that restricts the size of the time step $\Delta t$. One way of 
circumventing this is to use an implicit scheme such as 

$$\frac{Q^{n+1} - Q^{n-1}}{2\Delta t} = \frac{F^{n-1} + F^{n+1}}{2}.$$

The time step is now unconstrained by stability, but the 
scheme requires the solution of the equation 

$$Q^{n+1} - \Delta t\, F^{n+1} = Q^{n-1} + \Delta t\, F^{n-1},$$

which is prohibitive unless $F(Q)$ is a linear function. 
Normally, implicit schemes are used only for particular 
(linear) terms of the equations. 
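
For a linear tendency $F(Q) = aQ$ the implicit equation can be solved in closed form, which makes the contrast with the leapfrog scheme easy to see. The Python sketch below uses $a = i$ and a time step of 2, twice the leapfrog limit for this problem. The test problem, the use of the exact solution for the first step, and the function names are assumptions for illustration.

```python
import cmath


def implicit_step(q_prev, a, dt):
    """Solve Q^{n+1} - dt*a*Q^{n+1} = Q^{n-1} + dt*a*Q^{n-1} for Q^{n+1}."""
    return q_prev * (1.0 + dt * a) / (1.0 - dt * a)


def leapfrog_step(q_prev, q_curr, a, dt):
    """Explicit update Q^{n+1} = Q^{n-1} + 2*dt*a*Q^n."""
    return q_prev + 2.0 * dt * a * q_curr


a, dt, n_steps = 1j, 2.0, 50
# Assumption: the exact solution supplies the second starting value.
q_start = [1.0 + 0.0j, cmath.exp(a * dt)]

q_imp = list(q_start)
q_lf = list(q_start)
for n in range(1, n_steps):
    q_imp.append(implicit_step(q_imp[n - 1], a, dt))
    q_lf.append(leapfrog_step(q_lf[n - 1], q_lf[n], a, dt))

amp_implicit = abs(q_imp[-1])   # remains of order one
amp_leapfrog = abs(q_lf[-1])    # grows without bound
```

The implicit amplitude stays at one because the factor $(1 + a\Delta t)/(1 - a\Delta t)$ has unit modulus for imaginary $a$, whereas the leapfrog solution is amplified at every step.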


4.2 Spatial Finite Differencing 

For the PDEs that govern atmospheric dynamics we 
must replace continuous variations in space by discrete 
variables. The primary way to do this is to substitute 
finite-difference approximations for the spatial deriva- 
tives. It then transpires that the stability depends on 
the relative sizes of the space and time steps. A real- 
istic solution is not guaranteed by reducing their sizes 
independently. 

We consider the simple one-dimensional wave equa- 
tion 

$$\frac{\partial Q}{\partial t} + c\,\frac{\partial Q}{\partial x} = 0,$$

where $Q(x, t)$ depends on both $x$ and $t$, and where the 
advection speed $c$ is constant. We consider the sinu- 
soidal solution $Q = Q_0 e^{ik(x - ct)}$ of wavelength $L = 
2\pi/k$. We use centered difference approximations in 
both space and time: 

$$\frac{Q_m^{n+1} - Q_m^{n-1}}{2\Delta t} + c\left(\frac{Q_{m+1}^n - Q_{m-1}^n}{2\Delta x}\right) = 0,$$

where $Q_m^n = Q(m\Delta x, n\Delta t)$. We seek a solution of 
the form $Q_m^n = Q_0 e^{ik(m\Delta x - Cn\Delta t)}$. For real $C$, this is a 
wavelike solution. However, if $C$ is complex, this solu- 
tion will behave exponentially, quite unlike the solution 
of the continuous equation. Substituting $Q_m^n$ into the 
finite-difference equation, we find that 

$$C = \frac{1}{k\Delta t} \sin^{-1}\left[\left(\frac{c\Delta t}{\Delta x}\right)\sin k\Delta x\right].$$

If the argument of the inverse sine is less than unity, $C$ 
is real. Otherwise, $C$ is complex, and the solution will 
grow with time. Thus, the condition for stability of the 
solution is 

$$\left|\frac{c\Delta t}{\Delta x}\right| \le 1.$$

This is the Courant-Friedrichs-Lewy criterion, discov- 
ered in 1928. It imposes a strong constraint on the rel- 
ative sizes of the space and time grids. The limitation 
on stability can be circumvented by means of implicit 
finite differencing. Then 

$$C = \frac{2}{k\Delta t} \tan^{-1}\left[\left(\frac{c\Delta t}{2\Delta x}\right)\sin k\Delta x\right].$$


The numerical phase speed C is always real, so the 
implicit scheme is unconditionally stable, but the cost 
is that a linear system must be solved at each time step. 
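
The stability criterion is easy to observe numerically. The Python sketch below advances the centered scheme on a periodic grid for two values of the Courant number $c\Delta t/\Delta x$, using a wave of length $4\Delta x$ so that $\sin k\Delta x = 1$. The grid size, the forward first step, and the function name are assumptions for illustration, and NumPy is used for the array shifts.

```python
import numpy as np


def advect_leapfrog(courant, n_points=64, wavenumber=16, n_steps=100):
    """Centered space and time differences on a periodic grid.

    courant is c*dt/dx; returns max |Q| after n_steps time steps.
    """
    m = np.arange(n_points)
    q_prev = np.sin(2.0 * np.pi * wavenumber * m / n_points)
    # Assumed forward (uncentered in time) first step to start the scheme.
    q_curr = q_prev - 0.5 * courant * (np.roll(q_prev, -1) - np.roll(q_prev, 1))
    for _ in range(n_steps - 1):
        q_next = q_prev - courant * (np.roll(q_curr, -1) - np.roll(q_curr, 1))
        q_prev, q_curr = q_curr, q_next
    return float(np.max(np.abs(q_curr)))


amp_stable = advect_leapfrog(courant=0.5)     # criterion satisfied
amp_unstable = advect_leapfrog(courant=1.5)   # criterion violated
```

With the criterion satisfied the amplitude stays of order one; with it violated the same wave is amplified by many orders of magnitude within a hundred steps.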


4.3 Spectral Method 

In the spectral method, each field is expanded in a 
series of spherical harmonics: 

$$Q(\lambda, \phi, t) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} Q_n^m(t)\, Y_n^m(\lambda, \phi),$$

where the coefficients $Q_n^m(t)$ depend only on time, and 
where $Y_n^m(\lambda, \phi)$ are the spherical harmonics 

$$Y_n^m(\lambda, \phi) = e^{im\lambda} P_n^m(\phi)$$

for longitude $\lambda$ and latitude $\phi$. The coefficients $Q_n^m$ of 
the harmonics provide an alternative to specifying the 
field values $Q(\lambda, \phi, t)$ in the spatial domain. When the 
model equations are transformed to spectral space they 
become a coupled set of equations (ordinary differen- 
tial equations) for the spectral coefficients $Q_n^m$. These 
are used to advance the coefficients in time, after which 
the new physical fields may be computed. 

In practice, the series expansion must be truncated 
at some point: 

$$Q(\lambda_i, \phi_j, t) = \sum_{n=0}^{N} \sum_{m=-n}^{n} Q_n^m(t)\, Y_n^m(\lambda_i, \phi_j).$$

This is called triangular truncation, and the value of N 
indicates the resolution of the model. There is a compu- 
tational grid, called the Gaussian grid, corresponding to 
the spectral truncation. 
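
The bookkeeping implied by triangular truncation can be made concrete by listing the retained index pairs. The short Python sketch below does this; the function name and the sample truncation $N = 106$ are hypothetical choices for illustration.

```python
def triangular_truncation(N):
    """List the retained (n, m) pairs: 0 <= n <= N and -n <= m <= n."""
    return [(n, m) for n in range(N + 1) for m in range(-n, n + 1)]


pairs_T2 = triangular_truncation(2)
# A hypothetical resolution: the count is (N + 1)**2 coefficients per field.
n_coeffs_T106 = len(triangular_truncation(106))
```

The number of coefficients grows as $(N + 1)^2$, so doubling $N$ roughly quadruples the size of the spectral state.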

5 Initial Conditions 

Numerical weather prediction is an initial-value prob- 
lem; to integrate the equations of motion we must spec- 
ify the values of the dependent variables at an initial 
time. The numerical process then generates the val- 
ues of these variables at later times. The initial data 
are ultimately derived from direct observations of the 
atmosphere. 

The optimal interpolation analysis method was, for 
several decades, the most popular method of automatic 
analysis for NWP. This method optimizes the combina- 
tion of information in the background (forecast) field 
and in the observations, using the statistical proper- 
ties of the forecast and observation errors to produce 
an analysis that, in a precise statistical sense, is the best 
possible analysis. 

An alternative approach to data assimilation is to find 
the analysis field that minimizes a cost function. This is 
called variational assimilation and it is equivalent to the 
statistical technique known as the maximum-likelihood 
estimate, subject to the assumption of Gaussian errors. 
When applied at a specific time, the method is called 
three-dimensional variational assimilation, or 3D-Var 
for short. When the time dimension is also taken into 
account, we have 4D-Var. 

5.1 Variational Assimilation 

The cost function for 3D-Var may be defined as the sum 
of two components: 

$$J = J_b + J_o.$$

We represent the model state by a high-dimensional 
vector $X$. The term 

$$J_b = \tfrac{1}{2}(X - X_b)^{\mathrm T} B^{-1} (X - X_b)$$

represents the distance between the model state $X$ and 
the background field $X_b$ weighted by the background 
error covariance matrix $B$. The term 

$$J_o = \tfrac{1}{2}(Y - HX)^{\mathrm T} R^{-1} (Y - HX)$$

represents the distance between the analysis and the 
observed values Y weighted by the observation error 
covariance matrix R. The observation operator H is a 
rectangular matrix that converts the background field 
into first-guess values of the observations. More gener- 
ally, the observation operator is nonlinear but, for ease 
of description, we assume here that it is linear. 

The minimum of $J$ is attained at $X = X_a$, where 

$$\nabla_X J = 0,$$

that is, where the gradient of $J$ with respect to each of 
the analyzed values is zero. Computing this gradient, 
we get 

$$\nabla_X J = B^{-1}(X - X_b) - H^{\mathrm T} R^{-1}(Y - HX).$$

Setting this to zero we can deduce the expression 

$$X = X_b + K(Y - HX_b).$$

Thus, the analysis is obtained by adding to the back- 
ground field a weighted sum of the difference between 
observed and background values. The matrix $K$, the 
gain matrix, is given by 

$$K = BH^{\mathrm T}(R + HBH^{\mathrm T})^{-1}.$$

The analysis error covariance is then given by 

$$A = (I - KH)B.$$

The minimum of the cost function is found using 
a descent algorithm such as the conjugate gradi- 
ent method [IV.11 §4.1]; 3D-Var solves the minimiza- 
tion problem directly, avoiding computation of the gain 
matrix. 
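
For a small problem the analysis can be formed explicitly from the gain matrix and checked against the gradient condition. The Python sketch below does this with NumPy for a three-variable state and two observations. All numerical values, the matrices, and the function names are invented for illustration, and the gradient of the observation term is taken with a minus sign, as follows from differentiating $J_o$.

```python
import numpy as np


def analysis_3dvar(x_b, y, H, B, R):
    """Return the analysis, gain matrix K, and analysis error covariance A."""
    K = B @ H.T @ np.linalg.inv(R + H @ B @ H.T)
    x_a = x_b + K @ (y - H @ x_b)
    A = (np.eye(len(x_b)) - K @ H) @ B
    return x_a, K, A


def cost_gradient(x, x_b, y, H, B, R):
    """Gradient of J = Jb + Jo: B^{-1}(x - x_b) - H^T R^{-1}(y - H x)."""
    return np.linalg.solve(B, x - x_b) - H.T @ np.linalg.solve(R, y - H @ x)


# Hypothetical three-point temperature state and two observations.
x_b = np.array([280.0, 285.0, 290.0])
B = np.array([[1.00, 0.50, 0.25],
              [0.50, 1.00, 0.50],
              [0.25, 0.50, 1.00]])
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])   # second observation averages two points
R = 0.25 * np.eye(2)
y = np.array([281.0, 288.5])

x_a, K, A = analysis_3dvar(x_b, y, H, B, R)
grad_at_analysis = cost_gradient(x_a, x_b, y, H, B, R)
```

The gradient vanishes at the analysis to rounding error, and the trace of $A$ is smaller than that of $B$, reflecting the information added by the observations. In operational systems the state is far too large for $K$ to be formed, which is why a descent algorithm is used instead.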

The 3D-Var method has enabled the direct assimi- 
lation of satellite radiance measurements. The error- 
prone inversion process, whereby temperatures are 
deduced from the radiances before assimilation, is thus 
eliminated. Quality control of these data is also eas- 
ier and more reliable. As a consequence, the accuracy 
of forecasts has improved markedly since the intro- 
duction of variational assimilation. The accuracy of 
medium-range forecasts is now about equal for the two 
hemispheres (see figure 1). This is due to better satel- 
lite data assimilation. Satellite data are essential for the 
Southern Hemisphere as conventional data are in such 
short supply. The extraction of useful information from 
satellite soundings has been one of the great research 
triumphs of NWP over the past forty years. 


5.2 Inclusion of the Time Dimension 

Whereas conventional meteorological observations are 
made at the main synoptic hours, satellite data are 
distributed continuously in time. To assimilate these 
data, it is necessary to perform the analysis over a 
time interval rather than for a single moment. This is 
also more appropriate for observations that are dis- 
tributed inhomogeneously in space. Four-dimensional 
variational assimilation, or 4D-Var for short, uses all 
the observations within an interval $t_0 \le t \le t_N$. The
cost function has a term $J_b$ measuring the distance to
the background field $X_b$ at the initial time $t_0$, just as in
3D-Var. It also contains a summation of terms measuring
the distance to observations at each time step $t_n$ in
the interval $[t_0, t_N]$:

$$J = J_b + \sum_{n=0}^{N} J_o(t_n),$$

where $J_b$ is defined as for 3D-Var and $J_o(t_n)$ is given
by

$$J_o(t_n) = (Y_n - H_n X_n)^T R_n^{-1} (Y_n - H_n X_n).$$

The state vector $X_n$ at time $t_n$ is generated by integration
of the forecast model from time $t_0$ to $t_n$, written
$X_n = \mathcal{M}_n(X_0)$. The vector $Y_n$ contains the observations
valid at time $t_n$.

Just as the observation operator had to be linearized
to obtain a quadratic cost function, we linearize the
model operator $\mathcal{M}_n$ about the trajectory from the
background field, obtaining what is called the tangent linear
model operator $M_n$. Then we find that 4D-Var is formally
similar to 3D-Var with the observation operator
$H$ replaced by $H_n M_n$. Just as the minimization of $J$
in 3D-Var involved the transpose of $H$, the minimization
in 4D-Var involves the transpose of $H_n M_n$, which
is $M_n^T H_n^T$. The operator $M_n^T$, the transpose of the
tangent linear model, is called the adjoint model. The control
variable for the minimization of the cost function
is $X_0$, the model state at time $t_0$, and the sequence of
analyses $X_n$ satisfies the model equations, that is, the
model is used as a strong constraint.
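
The role of the adjoint model can be illustrated with a small linear example. The sketch below assumes a toy model in which a single matrix `M` advances the state by one time step, so that the tangent linear operator to time t_n is the n-th power of `M`; the sizes, the matrices, and the factor of one half applied to every term of the cost are assumptions made for this illustration. The gradient with respect to the initial state is accumulated by sweeping backward in time with the transpose of `M`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_state, n_obs, N = 3, 2, 4   # hypothetical sizes; observation times t_0, ..., t_N

M = np.eye(n_state) + 0.1 * rng.standard_normal((n_state, n_state))  # one model step
H = rng.standard_normal((n_obs, n_state))   # observation operator (same at all times)
B_inv = np.eye(n_state)                     # inverse background error covariance
R_inv = 2.0 * np.eye(n_obs)                 # inverse observation error covariance
x_b = rng.standard_normal(n_state)          # background state at t_0
ys = rng.standard_normal((N + 1, n_obs))    # observations Y_0, ..., Y_N


def trajectory(x0):
    """Integrate the (linear) forecast model forward from x0."""
    xs = [x0]
    for _ in range(N):
        xs.append(M @ xs[-1])
    return xs


def cost(x0):
    """Background term plus observation terms along the trajectory."""
    d = x0 - x_b
    J = 0.5 * d @ B_inv @ d
    for x_n, y_n in zip(trajectory(x0), ys):
        r = y_n - H @ x_n
        J += 0.5 * r @ R_inv @ r
    return J


def gradient(x0):
    """Gradient of the cost with respect to x0 via a backward adjoint sweep."""
    xs = trajectory(x0)
    adj = np.zeros(n_state)
    for n in range(N, -1, -1):
        adj = adj + H.T @ R_inv @ (H @ xs[n] - ys[n])  # forcing from time t_n
        if n > 0:
            adj = M.T @ adj                            # one step of the adjoint model
    return B_inv @ (x0 - x_b) + adj


x0 = np.array([0.3, -0.2, 0.5])
print("cost:", cost(x0))
print("gradient:", gradient(x0))
```

One forward integration and one backward adjoint integration give the full gradient, whatever the dimension of the state, which is what makes descent methods feasible for this problem.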

The 4D-Var method finds initial conditions $X_0$ such
that the forecast best fits the observations within the 
assimilation interval. This removes an inherent disad- 
vantage of optimal interpolation and 3D-Var, where all 
observations within a fixed time window (typically of 
six hours) are assumed to be valid at the analysis time. 
The introduction of 4D-Var at the European Centre 
for Medium-Range Weather Forecasts (ECMWF) led to 
a significant improvement in the quality of operational 
medium-range forecasts. 




Figure 1 Anomaly correlation (%) of 500 hPa geopotential height: twelve-month running mean (©ECMWF). 


6 Forecasting Models 

Operational forecasting today is based on output from 
a suite of computer models. Global models are used for 
predictions of several days ahead, while shorter-range 
forecasts are based on regional or limited-area models. 

6.1 The ECMWF Global Model

As an example of a global model we consider the inte- 
grated forecast system (IFS) of the ECMWF (which is 
based in Reading, in the United Kingdom). The ECMWF 
produces a wide range of global atmospheric and 
marine forecasts and disseminates them on a regu- 
lar schedule to its thirty-four member and cooperat- 
ing states. The primary products are deterministic fore- 
casts for the atmosphere out to ten days ahead, based 
on a high-resolution model, and probabilistic forecasts, 
extending to a month, made using a reduced resolution 
and an ensemble of fifty-one model runs. 

The basis of the NWP operations at the ECMWF is the 
IFS. It uses a spectral representation of the meteoro- 
logical fields. The IFS system underwent major resolu- 
tion upgrades in 2006 and in 2010. Table 1 compares 
the spatial resolutions of the three model cycles, indi- 
cating the substantial improvements in model resolu- 
tion in recent years. The truncation of the deterministic 
model is now T1279; that is, the spectral expansion is 


Table 1 Upgrades to the ECMWF IFS in 2006 and 2010. 
The spectral resolution is indicated by the triangular trun- 
cation number, and the effective resolution of the associ- 
ated Gaussian grid is indicated. The number of model lev- 
els, or layers used to represent the vertical structure of the 
atmosphere, is also given. 



                       Before 2006   2006-9   After 2009
Spectral truncation    T511          T799     T1279
Effective resolution   39 km         25 km    16 km
Model levels           60            91       137


terminated at total wave number 1279. This is equiv- 
alent to a spatial resolution of 16 km. The number of 
model levels in the vertical has recently been increased 
to 137. The new Gaussian grid for the IFS has about 
$2 \times 10^6$ points. With 137 levels and five primary prognostic
variables at each point, about $1.2 \times 10^9$ numbers
are required to specify the atmospheric state at
a given time. That is, the model has about a billion 
degrees of freedom. The computational task of making 
forecasts with such high resolution is truly formidable. 
The ECMWF carries out its operational program using 
a powerful and complex computer system. At the heart 
of this system is a Cray XC30 high-performance computer,
comprising some 160 000 processors, with a sustained
performance of over 200 teraflops ($2 \times 10^{14}$
floating-point operations per second).
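
The link between truncation number and grid spacing can be illustrated with a short calculation. The sketch below assumes one common rule of thumb, namely that the effective resolution is half the shortest resolved wavelength on a great circle, πa/N for Earth radius a and truncation N. The article does not state how its figures were obtained, so this rule and the radius used are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def effective_resolution_km(truncation):
    """Half the shortest resolved wavelength for triangular truncation T<truncation>."""
    return math.pi * EARTH_RADIUS_KM / truncation


for n in (511, 799, 1279):
    print(f"T{n}: about {effective_resolution_km(n):.0f} km")
```

Under this assumption the three truncations give roughly 39 km, 25 km, and 16 km, in agreement with the values quoted in table 1.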

6.2 Mesoscale Modeling

Short-range forecasting requires detailed guidance that 
is updated frequently. Many national meteorological 
services run limited-area models with high resolution 
to provide such forecast guidance. These models per- 
mit a free choice of geographical area and spatial res- 
olution, and forecasts can be run as frequently as 
required. Limited-area models make available a com- 
prehensive range of outputs, with a high time resolu- 
tion. Nested grids with successively higher resolution 
can be used to provide greater local detail. 

The Weather Research and Forecasting Model is a 
next-generation mesoscale NWP system developed in a 
partnership involving American national agencies (the 
National Centers for Environmental Prediction and the 
National Center for Atmospheric Research) and uni- 
versities. It is designed to serve the needs of both 
operational forecasting and atmospheric research. The 
Weather Research and Forecasting Model is suitable for 
a broad range of applications, from meters to thou- 
sands of kilometers, and it is currently in operational 
use at several national meteorological services. 1 

6.3 Ensemble Prediction 

The chaotic nature of atmospheric flow is now well 
understood. It imposes a limit on predictability, as 
unavoidable errors in the initial state grow rapidly and 
render the forecast useless after some days. As a result 
of our increased understanding of the inherent diffi- 
culties of making precise predictions, there has been 
a paradigm shift in recent years from deterministic 
to probabilistic prediction. A forecast is now consid- 
ered incomplete without an accompanying error bar, 
or quantitative indication of confidence. 

The most successful way of producing a probabilis- 
tic prediction is to run a series, or ensemble, of fore- 
casts, each starting from a slightly different initial state 
and each randomly perturbed during the forecast to 
simulate model errors. The ensemble of forecasts is 
used to deduce probabilistic information about future 
changes in the atmosphere. Since the early 1990s this 
systematic method of providing an a priori measure 
of forecast accuracy has been operational at both the 
ECMWF and at the National Centers for Environmen- 
tal Prediction in Washington. In the ECMWF’s ensem- 
ble prediction system, an ensemble of fifty-one fore- 
casts is performed, each having a resolution half that 


1. Full details of the system are available at www.wrf-model.org. 


of the deterministic forecast. Probability forecasts for a 
wide range of weather events are generated and dissem- 
inated for use in the operational centers, and they have 
become the key tools for medium-range prediction. 
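
The idea behind an ensemble can be conveyed with a toy chaotic system. The sketch below is purely illustrative: it replaces the atmospheric model by the logistic map, perturbs only the initial state (no model-error perturbations are applied during the run), and uses an arbitrary perturbation size and event threshold. It shows how the spread of fifty-one initially almost identical forecasts grows until only a probabilistic statement is meaningful.

```python
import numpy as np

rng = np.random.default_rng(42)
N_MEMBERS, N_STEPS = 51, 40   # ensemble size as in the text; step count is arbitrary


def step(x):
    """Logistic map, standing in for a chaotic forecast model (not an atmospheric model)."""
    return 4.0 * x * (1.0 - x)


# Each member starts from a slightly different initial state.
members = 0.3 + 1e-6 * rng.standard_normal(N_MEMBERS)
spread = [members.std()]
for _ in range(N_STEPS):
    members = step(members)
    spread.append(members.std())

# Probabilistic forecast of a hypothetical event: the state exceeds 0.5.
prob_event = np.mean(members > 0.5)

print(f"initial spread: {spread[0]:.2e}")
print(f"final spread:   {spread[-1]:.2e}")
print(f"forecast probability of event: {prob_event:.2f}")
```

Early in the run the members agree closely and the forecast is effectively deterministic; once the spread has saturated, the fraction of members showing an event is the natural forecast product.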

7 Verification of ECMWF Forecasts 

Forecast accuracy has improved dramatically in recent 
decades. This can be measured by the anomaly corre- 
lation. The anomaly is the difference between a fore- 
cast value and the corresponding climate value, and 
the agreement between the forecast anomaly and the 
observed anomaly is expressed as the anomaly correla- 
tion. The higher this score the better; by general agree- 
ment, values in excess of 60% imply skill in the forecast. 
In figure 1, the twelve-month running mean anomaly 
correlations (in percentages) of the three-, five-, seven-, 
and ten-day 500 hPa height forecasts are shown for 
the extratropical Northern Hemisphere and Southern 
Hemisphere. The lines above each shaded region are 
for the Northern Hemisphere and the lines below are 
for the Southern Hemisphere, with the shading showing 
the difference in scores between the two. 
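
The score can be sketched in a few lines. The version below assumes the centered form of the anomaly correlation, in which the mean anomaly over the verification points is removed before correlating; operational definitions may differ in details such as area weighting, and the height values used here are invented for illustration.

```python
import numpy as np


def anomaly_correlation(forecast, observed, climate):
    """Centered anomaly correlation, in percent, over a set of verification points."""
    climate = np.asarray(climate, dtype=float)
    f_anom = np.asarray(forecast, dtype=float) - climate   # forecast anomaly
    o_anom = np.asarray(observed, dtype=float) - climate   # observed anomaly
    f_anom = f_anom - f_anom.mean()
    o_anom = o_anom - o_anom.mean()
    return 100.0 * np.sum(f_anom * o_anom) / np.sqrt(np.sum(f_anom**2) * np.sum(o_anom**2))


# Hypothetical 500 hPa heights (in meters) at four points.
climate = np.array([5500.0, 5520.0, 5540.0, 5560.0])
observed = climate + np.array([30.0, -20.0, 10.0, -40.0])
forecast = climate + np.array([25.0, -10.0, 5.0, -30.0])

score = anomaly_correlation(forecast, observed, climate)
print(f"anomaly correlation: {score:.1f}%")
```

A forecast whose anomalies are proportional to the observed ones scores 100%, while a forecast that simply repeats climatology has no anomaly pattern to correlate and therefore no skill by this measure.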

The plots in figure 1 show a continuing improvement 
in forecast accuracy, especially for the Southern Hemi- 
sphere. By the turn of the millennium, the accuracy was 
comparable for the two hemispheres. Predictive abil- 
ity has improved steadily over the past thirty years, 
and there is now accuracy out to eight days ahead. 
This record is confirmed by a wealth of other data. Pre- 
dictive skill has been increasing by about one day per 
decade, and there are reasons to hope that this trend 
will continue for several more decades. 

8 Applications of NWP 

NWP models are used for a wide range of applica- 
tions. Perhaps the most important purpose is to pro- 
vide timely warnings about weather extremes. Great 
financial losses can be caused by gales, floods, and 
other anomalous weather events. The warnings that 
result from NWP guidance can greatly diminish losses 
of both life and property. Transportation, energy con- 
sumption, construction, tourism, and agriculture are 
all sensitive to weather conditions. There are expec- 
tations from all these sectors of increasing accu- 
racy and detail in short-range forecasts, as decisions 
with heavy financial implications must continually be 
made. 

NWP models are used to generate special guidance 
for the marine community. Predicted winds are used to 
drive wave models, which predict sea and swell heights
and periods. Prediction of road ice is performed by 
specially designed models that use forecasts of tem- 
perature, humidity, precipitation, cloudiness, and other 
parameters to estimate the conditions on the road sur- 
face. Trajectories are easily derived from limited-area 
models. These are vital for modeling pollution drift, for 
nuclear fallout, smoke from forest fires, and so on. Avi- 
ation benefits significantly from NWP guidance, which 
provides warnings of hazards such as lightning, icing, 
and clear-air turbulence. 

9 The Future 

Progress in NWP over the past sixty years can be accu- 
rately described as revolutionary. Weather forecasts 
are now consistently accurate and readily available. 
Nevertheless, some formidable challenges remain. Sud- 
den weather changes and extremes cause much human 
hardship and damage to property. These rapid devel- 
opments often involve intricate interactions between 
dynamical and physical processes, both of which vary 
on a range of timescales. The effective computational 
coupling between the dynamical processes and the 
physical parametrizations is a significant challenge. 
Nowcasting is the process of predicting changes over 
periods of a few hours. Guidance provided by current 
numerical models occasionally falls short of what is 
required to take effective action and avert disasters. 
Greatest value is obtained by a systematic combina- 
tion of NWP products with conventional observations, 
radar imagery, satellite imagery, and other data. But 
much remains to be done to develop optimal now- 
casting systems, and we may be optimistic that future 
developments will lead to great improvements in this 
area. 

At the opposite end of the timescale, the chaotic 
nature of the atmosphere limits the validity of deter- 
ministic forecasts. Interaction between the atmosphere 
and the ocean becomes a dominant factor at longer 
forecast ranges, as does coupling to sea ice [V.17].
Also, a more accurate description of aerosols and trace 
gases should improve long-range forecasts. Although 
good progress in seasonal forecasting for the tropics 
has been made, the production of useful long-range 
forecasts for temperate regions remains to be tack- 
led by future modelers. Another great challenge is the 
modeling and prediction of climate change, a matter of 
increasing importance and concern. 


Further Reading 

Lynch, P. 2006. The Emergence of Numerical Weather Pre- 
diction: Richardson’s Dream. Cambridge: Cambridge Uni- 
versity Press. 

1 Introduction

The general public’s appreciation of the danger of 
tsunamis has soared since the Indian Ocean tsunami 
of December 26, 2004, killed more than 200 000 peo- 
ple. Several other large tsunamis have occurred since 
then, including the devastating March 11, 2011, Great 
Tohoku tsunami generated off the coast of Japan. The 
international community of tsunami scientists has also 
grown considerably since 2004, and an increasing num- 
ber of applied mathematicians have contributed to the 
development of better models and computational tools 
for the study of tsunamis. In addition to its impor- 
tance in scientific studies and public safety, tsunami 
modeling also provides an excellent case study to illus- 
trate a variety of techniques from applied and com- 
putational mathematics. This article combines a brief 
overview of tsunami science and hazard mitigation 
with descriptions of some of these mathematical tech- 
niques, including an indication of some challenging 
problems of ongoing research. 

The term “tsunami” is generally used to refer to any 
large-scale anomalous motion of water that propagates 
as a wave in a sizable body of water. Tsunamis differ 
from familiar surface waves in several ways. Typically, 
the fluid motion is not confined to a thin layer of water 
near the surface, as it is in wind-generated waves. Also 
the wavelength of the waves is much longer than nor- 
mal: sometimes hundreds of kilometers. This is orders 
of magnitude larger than the depth of the ocean (which 
is about 4000 m on average), and tsunamis are there- 
fore also sometimes referred to as “long waves” in the 
scientific literature. In the past, tsunamis were often 
called “tidal waves” in English because they share some 
characteristics with tides, which are the visible effect 
of very long waves propagating around the Earth. How- 
ever, tsunamis have nothing to do with the gravitational 
(tidal) forcing that drives the tides, and so this term is 
misleading and is no longer used. The Japanese word 
“tsunami” means “harbor wave,” apparently because 
sailors would sometimes return home to find their 



V.19. Tsunami Modeling 




harbor destroyed by mysterious waves they did not 
observe while at sea. Strong currents and vortices in 
harbors often cause extensive damage to ships and 
infrastructure even when there is no onshore inunda- 
tion. Although the worst effects of a tsunami are often 
observed in harbors, the effects can be devastating in 
any coastal region. Because tsunamis have such a long 
wavelength, they frequently appear onshore as a flood 
that can continue flowing inward for tens of minutes 
or even hours before flowing back out. The flow veloc- 
ities can also be quite large, with the consequence that 
even a tsunami wave with an amplitude of less than 
a meter can sweep people off their feet and do con- 
siderable damage to structures. Tsunamis arising from 
large earthquakes often result in flow depths greater 
than a meter, particularly along the coast closest to the 
earthquake, where run-up can reach tens of meters. 

Tsunamis are generated whenever a large mass of 
water is rapidly displaced, either by the motion of the 
seafloor due to an earthquake or submarine landslide, 
or when a solid mass enters the water from a land- 
slide, volcanic flow, or asteroid impact. The largest 
tsunamis in recent history, such as the 2004 and 2011 
events mentioned above, were all generated by mega- 
thrust subduction zone earthquakes at the boundary 
of oceanic and continental plates. Offshore from many 
continents there is a subduction zone where plates are 
converging. The denser material in the oceanic plate 
subducts beneath the lighter continental crust. Rather 
than sliding smoothly, stress builds up at the inter- 
face and is periodically released when one plate sud- 
denly slips several meters past the other, causing an 
earthquake during which the seafloor is lifted up in 
some regions and depressed in others. All of the water 
above the seafloor is lifted or falls along with it, cre- 
ating a disturbance on the sea surface that propa- 
gates away in the form of waves. See figure 1 for an 
illustration of tsunami generation and figure 2 for a 
numerical simulation of waves generated by the 2011 
Tohoku earthquake off the coast of Japan. This arti- 
cle primarily concerns tsunamis caused by subduction 
zone earthquakes since they are a major concern in risk 
management and have been widely studied. 

2 Mathematical Models and 
Equations of Motion 

Tsunamis are modeled by solving systems of partial 
differential equations (PDEs) arising from the theory 
of fluid dynamics. The motion of water can be very 

Figure 1 Illustration of the generation of a tsunami by a
subduction zone earthquake. (a) Overall, a tectonic plate
descends, or “subducts,” beneath an adjoining plate, but
it does so in a stick-slip fashion. (b) Between earthquakes
the plates slide freely at great depth, where hot and ductile,
but at shallow depth, where cool and brittle, they stick
together; slowly squeezed, the overriding plate thickens.
(c) During an earthquake the leading edge of the overriding
plate breaks free, springing seaward and upward; behind,
the plate stretches and its surface falls; the vertical displacements
set off a tsunami. (Image courtesy of Brian Atwater
and taken from Atwater, B., et al. 2005. The Orphan
Tsunami of 1700—Japanese Clues to a Parent Earthquake in
North America. Washington, DC: University of Washington
Press.)

Figure 2 Propagation of the tsunami arising from the March 11, 2011, Tohoku earthquake off the coast of Japan at four
different times: (a) 20 minutes, (b) 30 minutes, (c) 1 hour, and (d) 2 hours after the earthquake. Waves propagate away from 
the source region with a velocity that varies with the local depth of the ocean. Contour lines show sea surface elevation 
above sea level, in increments of 0.2 m (in (a) and (b): at early times) and 0.1 m (in (c) and (d): at later times). There is a wave 
trough behind the leading wave peak shown here, but for clarity the contours of elevation below sea level are not shown. 


well modeled by the Navier-Stokes equations [III.23]
for an incompressible viscous fluid. However, these 
are rarely used directly in tsunami modeling since 
they would have to be solved in a time-varying three- 
dimensional domain, bounded by a free surface at the 


top and by moving boundaries at the edges of the 
ocean as the wave inundates or retreats at the shore- 
line. Fortunately, for most tsunamis it is possible to use 
“depth-averaged” systems of PDEs, obtained by inte- 
grating in the (vertical) z-direction to obtain equations 
in two space dimensions (plus time). In these formulations,
the depth of the fluid at each point is modeled
by a function h(x,y,t) that varies with location
and time. The velocity of the fluid is described
by two functions u(x,y,t) and v(x,y,t) that repre- 
sent depth-averaged values of the velocity in the x- 
and y-directions, respectively. In addition to a reduc- 
tion from three to two space dimensions, this elimi- 
nates the free surface boundary in z; the location of 
the sea surface is now determined directly from the 
depth h(x, y, t). These equations are solved in a time- 
varying two-dimensional xy-domain since the moving 
boundaries at the shoreline must still be dealt with. 

A variety of depth-averaged equations can be derived, 
depending on the assumptions made about the flow. 
For large-scale tsunamis, the so-called shallow-water 
equations [III.27] (also called the Saint Venant or long- 
wave equations) are frequently used and have been 
shown to be very accurate. The assumption with these 
equations is that the fluid is sufficiently shallow rela- 
tive to the wavelength of the wave being studied. This is 
generally true for tsunamis generated by earthquakes, 
where the wavelength is typically 10-100 times greater 
than the ocean depth. 

The two-dimensional shallow-water equations have 
the form 


$$
\begin{aligned}
h_t + (hu)_x + (hv)_y &= 0, \\
(hu)_t + (hu^2 + \tfrac{1}{2} g h^2)_x + (huv)_y &= -g h B_x, \\
(hv)_t + (huv)_x + (hv^2 + \tfrac{1}{2} g h^2)_y &= -g h B_y,
\end{aligned}
\tag{1}
$$

where subscripts denote partial derivatives, e.g., $h_t = \partial h/\partial t$.
In addition to the variables $h$, $u$, $v$ already introduced,
which are each functions of $(x,y,t)$, these equations
involve the gravitational acceleration $g$ and the topography
or seafloor bathymetry (underwater topography),
which is denoted by $B(x,y)$. Typically, $B = 0$ represents
sea level, while $B > 0$ is onshore topography and
$B < 0$ represents seafloor bathymetry. Water is present
wherever $h > 0$, and $\eta(x,y,t) = h(x,y,t) + B(x,y)$
is the elevation of the water surface. See figure 3 for
a diagram in one space dimension. During an earthquake,
$B$ should also be a function of $t$ in the region
where the seafloor is deforming. In practice it is often
sufficient to include this deformation in $B(x,y)$, while
the initial conditions for the depth $h(x,y,0)$ are based
on the undeformed topography. The seafloor deformation
then appears instantaneously in the initial surface
$\eta(x,y,0)$, which initializes the tsunami. In the remainder
of this article, the term topography will be used for
both $B > 0$ and $B < 0$ for simplicity.



Figure 3 Illustration showing the notation 
used in the shallow-water equations (1). 


If $B(x,y) < 0$ is constant (a flat bottom), then the
“source terms” on the right-hand side of these equations
drop out and the equations model the conservation
of mass ($h$) and momentum ($hu$, $hv$). Over a
varying bottom, mass is still conserved but momentum
is affected by the terrain, as seen in the reflection of
waves at a shoreline, for example, and in partial reflection
when a wave interacts with underwater features.
The term $\tfrac{1}{2} g h^2$ appearing in the momentum equations
is the depth-averaged “hydrostatic pressure” in a column
of water of depth $h$. (This and all other terms in
(1) should in fact also involve the fluid density $\rho$, but
this cancels out everywhere if the density is assumed
to be constant.)
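
The conservation property over a flat bottom can be observed in a small one-dimensional computation. The following is a minimal sketch, assuming a periodic domain, a flat bottom (so that the source terms vanish), and the first-order Lax-Friedrichs scheme, which is chosen here only for brevity and is not put forward as the method used in practical tsunami codes. The domain size, the initial hump, and the number of steps are hypothetical.

```python
import numpy as np

g = 9.81                      # gravitational acceleration, m/s^2
nx, length = 400, 400e3       # number of cells and domain length in meters (hypothetical)
dx = length / nx
x = (np.arange(nx) + 0.5) * dx

# Ocean 4000 m deep with a 1 m hump on the surface, initially at rest.
h = 4000.0 + 1.0 * np.exp(-((x - length / 2) / 20e3) ** 2)
hu = np.zeros(nx)


def flux(h, hu):
    """Fluxes of mass and momentum for the one-dimensional equations."""
    return hu, hu**2 / h + 0.5 * g * h**2


def lax_friedrichs_step(h, hu, dt):
    """Advance (h, hu) by one time step on a periodic grid."""
    fh, fhu = flux(h, hu)

    def update(q, f):
        average = 0.5 * (np.roll(q, -1) + np.roll(q, 1))
        return average - dt / (2 * dx) * (np.roll(f, -1) - np.roll(f, 1))

    return update(h, fh), update(hu, fhu)


mass_initial = h.sum() * dx
dt = 0.9 * dx / np.sqrt(g * h.max())   # time step limited by the long-wave speed
for _ in range(60):
    h, hu = lax_friedrichs_step(h, hu, dt)
mass_final = h.sum() * dx

print("relative change in mass:", abs(mass_final - mass_initial) / mass_initial)
print("maximum surface elevation now:", h.max() - 4000.0)
```

Because the update is written in terms of flux differences, the total mass changes only at the level of rounding error, while the initial hump splits into two waves traveling in opposite directions.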

The equations (1) are a nonlinear system of equa- 
tions of hyperbolic type. Hyperbolic PDEs frequently 
arise when waves are modeled mathematically; the pro- 
totype is the wave equation [III.31] itself. The ampli- 
tude of a tsunami in the deep ocean is generally very 
small relative to the water depth, typically less than a 
meter even for a large megathrust tsunami. Away from 
the coast these equations could be approximated by 
linearized equations with variable coefficients arising 
from the varying topography. Near the shore, however, 
the amplitude of the wave is large relative to the depth 
of the fluid and the full nonlinear equations must be 
used to accurately model the interaction of a tsunami 
with the nearshore topography and the onshore inun- 
dation that occurs. Solutions to nonlinear hyperbolic 
PDEs can become discontinuous if a shock [II.30] develops.
In the case of the shallow-water equations, a shock
is also called a “hydraulic jump” and is a mathemati- 
cal idealization of a thin zone in which the depth and 
velocity both undergo rapid transitions from one value 
to another. Such regions frequently appear as a tur- 
bulent wave front (sometimes called a turbulent bore) 
once the tsunami moves into sufficiently shallow water. 
The shallow-water equations do not model this turbu- 
lent zone directly, but they are frequently adequate to 
capture important quantities such as the depth and 
fluid velocities behind the bore and its propagation
speed. 

3 Uses of Tsunami Modeling 

The PDEs describing a tsunami cannot be solved 
exactly, in general, and numerical methods must there- 
fore be used to simulate the propagation and inunda- 
tion of a tsunami. A brief description of how this might 
be done and some of the challenges that arise is given 
in section 4, but first we motivate the need for numer- 
ical models by describing some common uses of such 
models. 

3.1 Real-Time Warning Systems 

One natural use of a numerical model is to assist 
in issuing warnings in real time as a tsunami propa- 
gates across the ocean and to determine which coastal 
regions should be evacuated. There are many chal- 
lenges to doing this quickly and accurately. Accurate 
assessment is critical not only to ensure that people in 
areas that are at risk are properly warned but also to 
avoid triggering evacuation in areas where it is not nec- 
essary, which can itself cause loss of life, have a serious 
financial impact, and decrease the likelihood that atten- 
tion will be paid to future warnings. For a subduction 
zone megathrust earthquake it is often impossible to 
issue tsunami warnings quickly enough for areas along 
the nearby coastline. The tsunami may arrive in less 
than an hour, and it is critical that residents understand 
the need to move to high ground when a major earth- 
quake occurs. On the other hand, across the ocean the 
earthquake itself is not felt and so provides no direct 
indication of an impending tsunami, but several hours 
are available in which to perform simulations and issue 
warnings. 
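
A rough estimate shows why several hours are available for distant coasts. The sketch below assumes the standard long-wave speed for the linearized shallow-water equations, the square root of g times the depth, which is not derived in this passage; it uses the average ocean depth of about 4000 m quoted in the introduction and a hypothetical crossing distance of 8000 km.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def long_wave_speed(depth_m):
    """Long-wave speed sqrt(g h) in meters per second (linear theory, assumed)."""
    return math.sqrt(G * depth_m)


def travel_time_hours(distance_km, depth_m):
    """Travel time over a given distance at uniform depth."""
    return distance_km * 1000.0 / long_wave_speed(depth_m) / 3600.0


speed = long_wave_speed(4000.0)
hours = travel_time_hours(8000.0, 4000.0)   # hypothetical ocean crossing
print(f"speed: {speed:.0f} m/s ({speed * 3.6:.0f} km/h)")
print(f"travel time: {hours:.1f} hours")
```

At this depth the wave travels at roughly 200 m/s, so a crossing of several thousand kilometers takes on the order of ten hours, whereas a coast a hundred kilometers or so from the source is reached in well under an hour.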

3.2 Tsunami Source Inversion 

To perform tsunami simulations it is necessary to esti- 
mate the source, i.e., the deformation of the sea floor 
that generates the tsunami, since this determines the 
initial conditions that are used to numerically solve 
the PDEs modeling tsunami propagation and inunda- 
tion. There is generally no way to measure this directly, 
and so some form of inverse problem [IV.15] must be
solved to obtain an estimate of the deformation based 
on measurements that can be made, such as seismome- 
ter recordings of the earthquake or measurements of 
the tsunami itself. Initial estimates of the location and 


magnitude of an earthquake generally come from ana- 
lyzing recordings of seismic waves, which are compres- 
sion and shear waves that travel through the Earth with 
much higher velocity than tsunamis and that are rou- 
tinely recorded at hundreds of seismometers widely 
scattered around the world. From the measured wave- 
forms at many locations it is possible to construct an 
estimate of how the Earth must have moved to pro- 
duce this set of data. This relies ultimately on solv- 
ing an inverse problem for the PDEs modeling wave 
motion in elastic materials. Seismic inversions gener- 
ally estimate the slip of the Earth along the earthquake 
fault, which may be tens of kilometers below the sea 
floor. Converting this slip on the fault plane to defor- 
mation of the sea floor requires solving another elastic- 
ity problem, whose solution is often approximated by 
the Okada model. This is based on the Green function 
for the deformation of the boundary of an elastic half- 
space caused by a delta function dislocation. Integrat- 
ing this over a finite-sized patch of a fault plane gives 
an estimate of the resulting seafloor displacement. 

While the results of seismic inversions are invaluable 
in modeling tsunamis, performing an accurate inver- 
sion requires collecting and processing a large amount 
of data and this may not be feasible in real time. In 
order to gather better information about tsunamis as 
they propagate, a number of pressure gauges have 
recently been deployed on the seafloor that are able to 
measure water pressure extremely accurately. From the 
hydrostatic pressure it is possible to estimate the depth 
of the water at these locations with enough precision to 
capture variations due to a tsunami passing by. Direct 
measurement of a tsunami at one or more of these 
gauges can then be combined with seismic models iden- 
tifying the approximate source location and geophysi- 
cal knowledge of the faults that are most likely to pro- 
duce tsunamis. This information, together with accu- 
rate tsunami propagation models, allows an inverse 
problem to be solved that in turn enables us to estimate 
the seafloor deformation that caused the tsunami more 
accurately and quickly than is possible using seismic 
information alone. 
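
The precision demanded of such gauges can be estimated from the hydrostatic relation. The sketch below assumes a constant seawater density and ignores atmospheric pressure and all dynamic effects; the density value, the gauge depth, and the 10 cm wave amplitude are illustrative assumptions.

```python
RHO_SEAWATER = 1025.0   # kg/m^3 (assumed constant)
G = 9.81                # gravitational acceleration, m/s^2


def pressure_from_depth(depth_m):
    """Hydrostatic pressure in pascals at the bottom of a water column."""
    return RHO_SEAWATER * G * depth_m


def depth_from_pressure(pressure_pa):
    """Depth of the water column inferred from the bottom pressure."""
    return pressure_pa / (RHO_SEAWATER * G)


p_calm = pressure_from_depth(4000.0)
p_wave = pressure_from_depth(4000.1)   # a hypothetical 10 cm tsunami passes overhead
delta_p = p_wave - p_calm
relative_change = delta_p / p_calm

print(f"pressure change: {delta_p:.0f} Pa")
print(f"relative change: {relative_change:.1e}")
```

A 10 cm wave in 4000 m of water changes the bottom pressure by about a thousand pascals, a relative change of only a few parts in a hundred thousand, which indicates how accurate the pressure measurement has to be.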

3.3 Hazard Modeling and Mitigation 

Real-time simulations of tsunamis are used to issue 
warnings, but tsunami modeling has many ongoing 
uses beyond this. Protecting communities requires ade- 
quate planning long before a tsunami takes place, and 
tsunami models are used to simulate the effect of 





tsunamis arising from hypothetical earthquake events. 
The results of such models can be used to determine 
what regions of a community are most at risk and what 
regions can be designated as safe zones for evacuation. 
Modeling the arrival time and pattern of the waves can 
be used in connection with traffic-flow models of evac- 
uation. Some communities in tsunami-prone regions 
build sea walls or gates that can be closed for pro- 
tection against tsunamis, or build “vertical evacuation 
structures” in regions where there is no easily accessi- 
ble high ground for large-scale evacuation. These struc- 
tures may take the form of multiuse buildings built 
to withstand tsunamis and tall enough that the upper 
floors are safe havens, or they may consist of large 
berms that form artificial high ground. Designing such 
structures requires modeling the flow depth and often 
also the fluid velocities of hypothetical tsunamis. 

Of course it is impossible to know exactly what the 
seafloor deformation will be for future earthquakes, 
but quite a bit is known about the major subduction 
zones and the likely locations and magnitudes of large 
earthquakes based on geology and past history. There 
is always a question of how large a tsunami one should 
design for. Sometimes an estimate of the “credible 
worst case” tsunami for that location is used, but this 
may correspond to an event with very low probability 
of occurrence that would require enormous expendi- 
ture to protect against — money that might be better 
spent protecting against more likely events at addi- 
tional locations. To better understand these trade-offs, 
there has recently been increased interest in probabilis- 
tic tsunami hazard assessment, in which a set of possi- 
ble events are assigned probabilities or an entire spec- 
trum of possible events is assigned some probability 
density function, typically over a very high-dimensional
stochastic space. The goal is then to obtain from this 
a probabilistic description of the resulting inundation 
patterns, flow depths, velocities, etc. This is a form of 
uncertainty quantification [II.34], a rapidly growing
field of importance in many areas of computational
science where simulations are based on many uncertain 
inputs and the goal is a probabilistic description of the 
resulting outputs rather than a single simulation result. 
Applied mathematicians and statisticians have a large 
role to play in the development of new techniques to 
efficiently solve these problems. 

3.4 The Study of Past Tsunamis and Earthquakes 

Another major use of tsunami modeling is the study 
of past tsunamis. A wealth of data has been collected
following recent tsunami events by “tsunami survey
teams” that measure inundation and runup along 
affected coasts. There are also data available from 
seafloor pressure gauges, tide gauges along the coast, 
and other data-collection facilities. Models of the sea- 
floor deformation produced by solving the source 
inversion problem can then be used as initial data for 
tsunami models and the computed results compared 
with measurements. Such studies are important in ver- 
ifying that a tsunami model gives a sufficiently accurate 
approximation to a real tsunami that it can be used with 
confidence for warning or hazard mitigation purposes. 
Validated models are also used in performing tsunami 
source inversion to estimate the seafloor deformation, 
and this can give additional insight into the earthquake 
mechanism that is useful to seismologists. Tsunami 
models can also help explain unusual features of past 
events by providing a laboratory for exploring the fluid 
dynamics taking place during the event. 

Tsunami models can also help reconstruct events 
that happened in the more distant past, for which 
there are no pressure gauge or tide gauge data and 
perhaps only limited historical records of the regions 
inundated, or no human records in the case of pre- 
historic events or those that occurred on uninhabited 
coastlines. Luckily, for many events a geological record 
of the tsunami inundation is recorded in the form of 
tsunami deposits. As a tsunami approaches shore it typ- 
ically becomes turbulent and picks up sediment from 
the seafloor, such as sand and marine microorganisms. 
This material is carried inland during the flooding stage 
and typically settles out of the flow as the flow decel- 
erates and reverses, leaving behind a layer of deposits, 
often far inland. In tsunami-prone areas there are often 
many layers of tsunami deposits that have been built 
up over thousands of years, separated by layers of 
soil that slowly build up between tsunamis. Core sam- 
ples or trenches can reveal many past events that can 
often be dated using radiocarbon dating of organic 
matter or interspersed tephra layers from known vol- 
canic eruptions. The study of tsunami deposits is a 
major source of information about the magnitude, loca- 
tion, and recurrence times of past earthquakes. This 
information is critical in developing probabilistic mod- 
els of tsunami or earthquake hazards, as well as to 
obtaining a better scientific understanding of earth- 
quake processes. Numerical tsunami models can be 
used to help identify the location and magnitude of 
seafloor deformation that would lead to the patterns 
of inundation recorded by tsunami deposits. Models 
that include sediment erosion, transport, and deposition
are also being used to better understand the fluid
dynamics of the creation of tsunami deposits, and this 
will ultimately lead to more accurate descriptions of 
the tsunamis that caused observed deposits. 

4 Numerical Modeling 

Systems of nonlinear PDEs such as the nonlinear 
shallow-water equations (1) typically cannot be solved 
exactly except for very simple cases: a one-dimensional 
wave on a linear beach for example. Realistic tsunami 
modeling always relies on numerical solution of the 
PDEs. This requires discretizing the equations in some 
manner: replacing the differential equations describing
the continuum solution (defined for all (x, y, t) in
some domain) by a finite set of discrete algebraic equa- 
tions whose solution can be computed in finite time on 
a computer. There are many ways to do this, and gen- 
eral discussions of numerical solution of PDEs are given 
in NUMERICAL SOLUTION OF PARTIAL DIFFERENTIAL 
EQUATIONS [IV.13]. 

Finite-difference methods are often used, in which a 
discrete grid is introduced consisting of a finite number
of grid points $(x_i, y_j)$ covering the domain, and the
solution is approximated only at these points at a discrete
set of times $t_0, t_1, t_2, \ldots$. Derivatives in the PDE
are replaced by finite-difference approximations based
on the approximate solution at neighboring grid points, 
giving a discrete set of algebraic equations that can 
be solved on a computer. Another popular approach 
is to use a finite-volume method, in which the domain 
is subdivided into a finite number of grid cells and 
the approximate solution consists of average values of 
the solution over each grid cell. Integrating the PDEs 
over a grid cell gives an expression for the time deriva- 
tive of the cell average that can be used to update the 
cell averages from one time $t_n$ to the next time $t_{n+1}$.
To obtain better accuracy, methods in which the solu- 
tion on each grid cell is approximated by a polyno- 
mial rather than by only the cell average (which can 
be interpreted as a constant function, or polynomial of 
degree 0, over each cell) are sometimes used. In this 
case, the higher-order coefficients of each polynomial 
must be updated from one time step to the next. A 
method of this type that has recently become popu- 
lar for tsunami modeling is the discontinuous Galerkin 
method, in which the piecewise polynomial function 
obtained from the polynomials defined on each cell is 
not assumed to be continuous at the interface between
one cell and its neighbor. The term “Galerkin” refers
to a finite-element approach to deriving equations for 
evolving the polynomial coefficients in time. 
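
As a minimal sketch of the finite-volume idea described above, the following Python code advances the cell averages of the one-dimensional shallow-water equations over a flat bottom by one time step. The local Lax-Friedrichs (Rusanov) interface flux, the zero-gradient boundary treatment, and the names `flux` and `step` are assumptions chosen for brevity; they are not taken from any particular tsunami code, and production codes use more accurate fluxes and handle variable bathymetry.

```python
import numpy as np

G = 9.81  # gravitational acceleration (m/s^2)

def flux(h, hu):
    """Physical flux of the 1D shallow-water equations over a flat bottom."""
    u = hu / h
    return np.array([hu, hu * u + 0.5 * G * h**2])

def step(h, hu, dx, dt):
    """Advance the cell averages (h, hu) by one time step of size dt."""
    q = np.array([h, hu])
    # One ghost cell on each side, copied from the edge cell (zero gradient).
    q = np.concatenate([q[:, :1], q, q[:, -1:]], axis=1)
    ql, qr = q[:, :-1], q[:, 1:]  # states to the left and right of each interface
    # Largest local wave speed |u| + sqrt(g h) at each interface.
    speed = np.maximum(np.abs(ql[1] / ql[0]) + np.sqrt(G * ql[0]),
                       np.abs(qr[1] / qr[0]) + np.sqrt(G * qr[0]))
    # Local Lax-Friedrichs numerical flux (an assumed, deliberately simple choice).
    f = 0.5 * (flux(*ql) + flux(*qr)) - 0.5 * speed * (qr - ql)
    # Each cell average changes by the net flux through its two edges.
    q_new = q[:, 1:-1] - dt / dx * (f[:, 1:] - f[:, :-1])
    return q_new[0], q_new[1]
```

Because each interface flux is added to one cell and subtracted from its neighbor, the total mass changes only through the boundary fluxes, which is the discrete counterpart of integrating the PDEs over a grid cell.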

4.1 Nonlinearity and Shock Formation 

A prominent feature of nonlinear hyperbolic PDEs is 
that shocks can form in the solutions: shocks are dis- 
continuities in the depth and velocity that can arise 
even from smooth initial conditions. As mentioned in 
section 2, these correspond to hydraulic jumps or bores 
that are seen in tsunamis as they approach the shore. 
Sharp discontinuities are only an approximation of the 
true behavior, but they often give a good approximation 
of the flow. Incorporating more accurate fluid dynamics 
models would lead to systems of PDEs that are much 
more computationally expensive to solve. 

The presence of discontinuities in the solution can 
lead to difficulties in solving the PDEs numerically, 
since derivatives are infinite at a point of discontinu- 
ity, and finite-difference approximations to derivatives 
generally diverge. This has led to the increased popu- 
larity of both finite-volume and discontinuous Galerkin 
methods, which are better able to robustly capture dis- 
continuities in the solution. Methods designed to do 
this well are often called shock-capturing methods. 

4.2 Inundation and the Moving Shoreline 

Another computational challenge in modeling tsuna- 
mis, or any other geophysical flow over topography, is 
the need to handle the moving boundary of the flow at 
the shoreline. Many early tsunami models did not cap- 
ture this moving boundary at all. Instead, the equations 
were solved over a fixed domain defined by the origi- 
nal shoreline with some boundary conditions imposed 
at this fixed boundary, such as an impermeable wall. 
While this approach could not be used to model inun- 
dation directly, it could still give some indication of 
the tsunami runup based on recording the depths and 
velocities along this wall boundary. Other mathemat- 
ical or physical models were then used to estimate 
inundation from these values. 

Most recently developed tsunami models attempt to 
model inundation directly. For simple problems it may 
be possible to use a grid that moves with time so that 
one edge of the grid is always along the shoreline. For 
realistic problems this is generally infeasible, since the 
shoreline can be very complex and can break into pieces 
as islands or isolated pools of water form. Most tsunami 
models instead use a fixed grid and implement some 
form of wetting and drying algorithm to keep track of
which grid points or cells are dry ($h = 0$) and which
are wet ($h > 0$). Standard approaches to approximating
the PDEs typically break down near the shoreline, and 
it is a major challenge in developing tsunami models 
to deal with this phenomenon robustly and accurately, 
particularly since this is often the region of primary 
interest in terms of the model results. 

4.3 Mesh Refinement 

Another challenge arises from the vast differences in 
spatial scale between the ocean over which a tsunami 
propagates and a section of coastline such as a harbor 
where the solution is of interest. In the shoreline region 
it may be necessary to have a fine grid, with perhaps
10 m or less between grid points, in order to resolve 
the flow at a scale that is useful. It is clearly impracti- 
cal, and luckily also unnecessary, to resolve the entire 
ocean to this resolution. The wavelength of a tsunami 
is typically more than 100 km, so the grid point spac- 
ing in the ocean can be more like 1-10 km. Moreover, 
we need even lower resolution over most of the ocean, 
particularly before the tsunami arrives. 

To deal with the variation in spatial scales, virtually 
all tsunami codes use unequally spaced grids, often by 
starting with a coarse grid over the ocean and then 
refining portions of the grid to higher resolution where 
needed. Some models only use static refinement, in 
which the grid does not change with time but has finer 
grids in regions of interest along the coast. Other com- 
puter codes use adaptive mesh refinement, in which 
the regions of refinement change with time to adapt to 
the evolving solution. For example, areas of refinement 
might be used to follow the propagating wave with a 
finer grid than is used over the rest of the ocean, and 
additional levels of refinement added near the coastal 
region of interest only when the tsunami is approaching 
shore. 

A related issue is the choice of time steps for advanc- 
ing the solution. Stability conditions generally require 
that the time step multiplied by the maximum wave 
speed should be no greater than the width of a grid 
cell. This is because the explicit methods that are typ- 
ically used for solving hyperbolic PDEs, such as the 
shallow-water equations, update the solution in each 
grid cell based only on data from the neighboring cells 
in each time step. If a wave can propagate more than 
one grid cell in one time step, then the method becomes 
unstable. This necessary condition for stability is called
the CFL condition [V.18 §4.2], after fundamental work
on the convergence of numerical methods by Courant,
Friedrichs, and Lewy in the 1920s. For the shallow-water
equations the wave speed is $\sqrt{gh}$, which varies
dramatically from the shoreline, where $h \approx 0$, to the
deepest parts of the ocean, where $h$ can reach 10 000 m.
Additional difficulties arise in implementing an adap- 
tive mesh refinement algorithm: if the grid is refined 
in part of the domain by a factor of ten, say, in each 
spatial dimension, then typically the time step must 
also be decreased by the same factor. Hence, for every 
time step on the coarse grid it is necessary to take ten 
time steps on the finer grid, and information must be 
exchanged between the grids to maintain an accurate 
and stable solution near the grid interfaces. 
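
The time-step restriction discussed above can be sketched in a few lines of Python. The function name `cfl_time_step`, the safety factor `cfl=0.9`, and the use of $|u| + \sqrt{gh}$ as the maximum wave speed are illustrative assumptions rather than the conventions of any specific code.

```python
import math

G = 9.81  # gravitational acceleration (m/s^2)

def cfl_time_step(depths, velocities, dx, cfl=0.9):
    """Largest time step for which no wave crosses more than `cfl` grid cells.

    depths and velocities are sequences of h (m) and u (m/s) over the wet cells,
    and dx is the grid spacing (m).
    """
    max_speed = max(abs(u) + math.sqrt(G * h) for h, u in zip(depths, velocities))
    return cfl * dx / max_speed
```

With a depth of 4000 m and a 1 km grid this gives a time step of roughly 4.5 s. Refining the grid by a factor of ten reduces the allowable step by the same factor, which is why ten fine-grid steps are needed for each coarse-grid step.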

4.4 Dispersive Terms 

In some situations, tsunamis are generated with short 
wavelengths that are not sufficiently long relative to 
the fluid depth for the shallow-water equations to be 
valid. This most frequently happens with smaller local- 
ized sources such as a submarine landslide rather than 
with large-scale earthquakes. In this case it is often still 
possible to use depth-averaged two-dimensional equa- 
tions, but the equations obtained typically include addi- 
tional terms involving higher-order derivatives. These 
are generally dispersive terms that can better model the 
observed effect that waves with different wavelengths 
propagate at different speeds. 
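
To illustrate when the shallow-water approximation loses accuracy, the sketch below compares the shallow-water speed $\sqrt{gh}$ with the phase speed from the dispersion relation $\omega^2 = gk\tanh(kh)$ of standard linear water-wave theory. That relation is not derived in this article and is brought in here as an assumption; the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration (m/s^2)

def phase_speed(wavelength, depth):
    """Phase speed omega/k from linear theory, omega^2 = g k tanh(k h)."""
    k = 2.0 * math.pi / wavelength
    return math.sqrt(G * math.tanh(k * depth) / k)

def shallow_water_speed(depth):
    """Nondispersive long-wave speed sqrt(g h)."""
    return math.sqrt(G * depth)

def speed_ratio(wavelength, depth):
    """Ratio of the dispersive phase speed to the shallow-water speed."""
    return phase_speed(wavelength, depth) / shallow_water_speed(depth)
```

In 4 km of water, a wave with a 100 km wavelength travels within about 1% of $\sqrt{gh}$, whereas a 5 km wave, such as a localized landslide might generate, travels at less than half that speed.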

The introduction of higher-order derivatives typi- 
cally requires the use of implicit methods to efficiently 
solve the equations, since the stability constraint for 
an explicit method generally requires a time step that 
is much smaller than is desirable. Implicit methods 
result in an algebraic system of equations that need 
to be solved at each time step, coupling the solu- 
tion at all grid points. This is typically much more 
time-consuming than an explicit method. 

Further Reading 

See Bourgeois (2009) for a recent survey of tsunami 
sedimentology and Geist et al. (2009) for a general 
introduction to probabilistic modeling of tsunamis. 
Some detailed descriptions of numerical methods for 
tsunami simulation can be found, for example, in 
Giraldo and Warburton (2008), Grilli et al. (2007), Kowa- 
lik et al. (2005), LeVeque et al. (2011), Lynett et al. 
(2002), and Titov and Synolakis (1998). The use of 
V.20 Shock Waves

C. J. Chapman 


1 Introduction 

A shock wave is a sudden or even violent change 
in pressure occurring in a thin layer-like region of a 
continuous medium such as air. Other properties of 
the medium also change suddenly: notably the veloc- 
ity, temperature, and entropy. In a strong shock, pro- 
duced for example by a spacecraft while still in the 
Earth’s atmosphere, the increase in temperature is 
great enough to break up the molecules of the air and 
produce ionization. Shock waves that are “weak” in the 
mathematical sense (i.e., those that produce a pressure 
change that is only a minute fraction of atmospheric 
pressure) are, unfortunately, extremely strong in the
subjective sense because of the exquisite sensitivity
of our hearing. This fact is of great importance for 
civil aviation in limiting the possible routes of super- 
sonic aircraft, which produce shock waves with focus- 
ing properties and locations that depend sensitively on 
atmospheric conditions. 

In the modeling of a shock wave, the thin region of 
rapid variation may nearly always be replaced by a sur- 
face across which the properties of the medium are 
regarded as discontinuous. The governing equations 
of motion, representing conservation of mass, momen- 
tum, and energy, still apply if they are expressed in 
integrated form, and their application to a region con- 
taining the surface of discontinuity leads to simultane- 
ous algebraic equations relating the limiting values of 
physical quantities on opposite sides of this surface. 
In addition, the entropy of the medium must increase 
on passage through the shock. The resulting algebraic 
relations, in conjunction with the partial differential 
equations of motion applied everywhere except on the 
shock surface, suffice for the solution of many impor- 
tant practical problems. For example, they determine 
the location and speed of propagation of the shock 
wave, which are usually not given in advance but have 
to be determined mathematically as part of the process 
of solving a problem. 

The theory of shock waves has important military 
applications, especially to the design of high-speed mis- 
siles and the properties of blast waves produced by 
explosions. On occasion, this has stimulated world- 
class mathematicians (and even philosophers) to make 
contributions to the subject that have turned out to 
be enduringly practical. Ernst Mach, who with Peter 
Salcher in the 1880s was the first person to photograph 
a shock wave, explained the properties of the high- 
speed bullets that had been used in the Franco-Prussian 
war (1870-71), and in World War II John von Neumann 
analyzed the effect of blast waves on tanks and build- 
ings, obtaining the surprising result that a shock wave 
striking obliquely exerts a greater pressure than if it 
strikes head-on. Richard von Mises and Lev Landau also 
worked on shock waves in this period, as well as fluid 
and solid mechanics “full-timers” such as Geoffrey I. 
Taylor and Theodore von Karman. 

2 Mathematical Theory 

2.1 Mass, Momentum, and Energy 

For definiteness, consider a shock wave in air. First we 
present the equations that apply to a normal shock, 
through which the air flows at right angles (see also
shocks [II.30]). Later we consider an oblique shock, for
which arbitrary orientations (within wide limits) are
possible between the incoming and outgoing air, and
the shock itself.

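Three jump conditions across a normal shock are

$$\rho_1 u_1 = \rho_2 u_2,$$
$$p_1 + \rho_1 u_1^2 = p_2 + \rho_2 u_2^2,$$
$$e_1 + \frac{p_1}{\rho_1} + \tfrac{1}{2} u_1^2 = e_2 + \frac{p_2}{\rho_2} + \tfrac{1}{2} u_2^2.$$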

The subscripts 1 and 2 here refer to opposite sides of 
the shock, and the frame of reference is that in which 
the shock is at rest. The variables are the densities 
pi, p2', the velocity components U\, M2; the pressures 
p 1 , P 2 ; and the internal energies per unit mass e\, ez- 
All quantities may be taken to be positive, with the air 
flowing in from side 1 and out from side 2. 

The first of the above equations represents conserva- 
tion of mass, since the left-hand side is the mass of air 
per unit area per unit time flowing into the shock and 
the right-hand side is the corresponding quantity flowing
out. The second equation accounts for the change
in the momentum of air per unit volume from $\rho_1 u_1$
to $\rho_2 u_2$. This change occurs at a rate proportional to
the velocity, giving the terms $\rho_1 u_1^2$, $\rho_2 u_2^2$, and is driven
by the net force arising from the jump in pressure,
giving the terms $p_1$, $p_2$. The third equation accounts
for the change in the energy of the air per unit mass
from $e_1 + \tfrac{1}{2} u_1^2$ to $e_2 + \tfrac{1}{2} u_2^2$, comprising internal energy
and kinetic energy. The forces that produce this change
in energy come from the pressure and give the terms
$p_1/\rho_1$, $p_2/\rho_2$. It is convenient to combine each internal
energy and pressure term into a single quantity:
the enthalpy. Then, in terms of the enthalpies $h_1$, $h_2$,
defined by $h_1 = e_1 + p_1/\rho_1$ and $h_2 = e_2 + p_2/\rho_2$, the
energy equation is

$$h_1 + \tfrac{1}{2} u_1^2 = h_2 + \tfrac{1}{2} u_2^2.$$

2.2 Entropy 

A fourth jump condition across a shock concerns the 
entropy, which is a measure of the energy in the dis- 
ordered molecular motions that is not available to per- 
form work at the macroscopic scale. The condition is 
that the entropy of the material passing through the 
shock must increase. Although the entropy condition 
is “only” an inequality, it is of fundamental importance 
in determining the type of shocks that can or cannot 
occur. For example, the condition shows that, in almost
all materials, a rarefaction shock (at which the pressure
suddenly falls) is impossible. A shock in air therefore
has the property that on passing through it the
air undergoes an increase in pressure, density, temper- 
ature, internal energy, and enthalpy and a decrease in 
velocity. 

For air in conditions that are not too extreme, the 
entropy per unit mass per degree on the absolute temperature
scale is proportional to $\ln(p/\rho^\gamma)$, where $\gamma$ is
the ratio of specific heats, which is approximately a constant.
Thus, with the sign convention that $u_2$ and $u_1$ in
the jump conditions are positive, the entropies $S_2$ and
$S_1$ on opposite sides of the shock satisfy the condition
$S_2 > S_1$, so that

$$\frac{p_2}{\rho_2^\gamma} > \frac{p_1}{\rho_1^\gamma}.$$

2.3 The Rankine-Hugoniot Relation 

Simple algebraic manipulation of the jump conditions 
gives 

$$h_2 - h_1 = \frac{1}{2}\left(\frac{1}{\rho_1} + \frac{1}{\rho_2}\right)(p_2 - p_1).$$

This is an especially useful relation because $h_2$ and
$h_1$ are functions of $(p_2, \rho_2)$ and $(p_1, \rho_1)$, respectively.
Thus, if $(p_1, \rho_1)$ are fixed, representing known upstream
conditions, the Rankine-Hugoniot relation gives
the shock adiabatic, relating $p_2$ to $\rho_2$, i.e., the pressure
and density downstream.
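
The "simple algebraic manipulation" can be spelled out as follows. This is one possible route, written in LaTeX (it assumes the `amsmath` package) and using only the three jump conditions with the enthalpy $h = e + p/\rho$; the symbol $m$ for the mass flux is introduced here for convenience.

```latex
\begin{align*}
m &:= \rho_1 u_1 = \rho_2 u_2
  && \text{(mass flux, first jump condition)} \\
p_2 - p_1 &= \rho_1 u_1^2 - \rho_2 u_2^2 = m\,(u_1 - u_2)
  && \text{(second jump condition)} \\
h_2 - h_1 &= \tfrac{1}{2}\bigl(u_1^2 - u_2^2\bigr)
           = \tfrac{1}{2}\,(u_1 - u_2)(u_1 + u_2)
  && \text{(energy equation)} \\
u_1 + u_2 &= m\left(\frac{1}{\rho_1} + \frac{1}{\rho_2}\right)
  && \text{(since } u_i = m/\rho_i\text{)} \\
h_2 - h_1 &= \tfrac{1}{2}\, m\,(u_1 - u_2)\left(\frac{1}{\rho_1} + \frac{1}{\rho_2}\right)
           = \tfrac{1}{2}\left(\frac{1}{\rho_1} + \frac{1}{\rho_2}\right)(p_2 - p_1).
\end{align*}
```

The velocities have been eliminated entirely, which is why the result involves only thermodynamic quantities on the two sides of the shock.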

2.4 The Mach Number 

A basic property of a normal shock, deducible from 
the four jump conditions, is that the flow into it is 
supersonic, i.e., faster than the speed of sound, and the 
flow out of it is subsonic, i.e., slower than the speed of 
sound. This provides a strong hint that we should use 
variables based on the incoming and outgoing Mach 
numbers, $M_1$ and $M_2$, defined as the ratio of the flow
speeds to the local speed of sound, denoted by $c_1$ and
$c_2$. Thus

$$M_1 = \frac{u_1}{c_1} > 1, \qquad M_2 = \frac{u_2}{c_2} < 1.$$

We also have available the equation of state for air, 
$p = \rho R T$, and the innumerable formulas of thermodynamics,
from which we select merely

$$c^2 = \gamma R T = \gamma p/\rho = (\gamma - 1)h.$$

Here R is the gas constant, T is absolute temperature, 
and the formulas apply with subscript 1 or subscript 
2 (but the same values of $\gamma$ and $R$) on each side of
the shock. We now have simultaneous equations in
abundance, and a little algebra starting from the jump
conditions gives

$$M_2^2 = \frac{1 + \tfrac{1}{2}(\gamma - 1) M_1^2}{\gamma M_1^2 - \tfrac{1}{2}(\gamma - 1)}.$$

2.5 Pressure and Temperature Rise 


With the aid of the relation between $M_1^2$ and $M_2^2$, the
change in any quantity at a normal shock may readily
be expressed in terms of $M_1^2$ and $\gamma$ alone. Such a
representation is highly practical in calculations. For
example, an important aspect of a shock wave is that
it can produce considerable increases in pressure and
temperature. But how do these increases depend on the
Mach number of the incoming flow? The answer for a
shock wave in air is

$$\frac{p_2}{p_1} = 1 + \frac{2\gamma}{\gamma + 1}\,(M_1^2 - 1)$$

and

$$\frac{T_2}{T_1} = \frac{\left(1 + \tfrac{1}{2}(\gamma - 1) M_1^2\right)\left(\gamma M_1^2 - \tfrac{1}{2}(\gamma - 1)\right)}{\tfrac{1}{4}(\gamma + 1)^2 M_1^2}.$$

The functional form of this last equation is important 
for modeling purposes. Because the specific-heat ratio 
$\gamma$ is of “order one” (it is approximately 1.4 for atmospheric
air), the equation shows that at high Mach numbers
the absolute temperature ratio $T_2/T_1$ scales with
$M_1^2$. Thus, during the acceleration phase of the flight of
a hypersonic vehicle, the simple equation of state we
have been using, $p = \rho R T$, ceases to apply long before
operating speed is attained. Physical processes beyond 
the reach of this “ideal gas equation” must be included, 
and the subject of shock waves rapidly starts to overlap 
with the chemistry of the breakup of the molecules and 
with the physics of ionized gases. Applied mathemati- 
cians such as Sir James Lighthill and John F. Clarke have 
played an important role in developing the required 
theory of such strong shock waves. 
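
The normal-shock formulas of sections 2.4 and 2.5 are straightforward to evaluate numerically. The following Python sketch assumes an ideal gas with constant $\gamma$ throughout; the function name `normal_shock` and the choice of returned quantities are illustrative.

```python
import math

def normal_shock(M1, gamma=1.4):
    """Return (M2, p2/p1, T2/T1, rho2/rho1) for a normal shock in an ideal gas."""
    if M1 <= 1.0:
        raise ValueError("inflow to a normal shock must be supersonic (M1 > 1)")
    a = 1.0 + 0.5 * (gamma - 1.0) * M1**2
    b = gamma * M1**2 - 0.5 * (gamma - 1.0)
    M2 = math.sqrt(a / b)                                   # outgoing Mach number
    p_ratio = 1.0 + 2.0 * gamma / (gamma + 1.0) * (M1**2 - 1.0)
    T_ratio = a * b / (0.25 * (gamma + 1.0)**2 * M1**2)
    rho_ratio = p_ratio / T_ratio                           # from p = rho R T on each side
    return M2, p_ratio, T_ratio, rho_ratio
```

For $M_1 = 2$ and $\gamma = 1.4$ this gives $M_2 \approx 0.577$, $p_2/p_1 = 4.5$, and $T_2/T_1 = 1.6875$. The growth of $T_2/T_1$ like $M_1^2$ at large Mach number is what eventually invalidates the ideal-gas assumption built into the function.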


3 Oblique Shock Waves 

3.1 Usefulness in Modeling 

Oblique shock waves provide a versatile tool for mod- 
eling the flow around a supersonically moving body 
of any shape. The reason for this is that the local 
change in flow direction produced by the surface of the 
body always produces an oblique shock somewhere in 
the flow. A concave corner, for example, produces an 
oblique shock attached to the corner, and the apex of a 
forward-facing wedge produces oblique shocks on each 
side of the wedge, attached to the apex. 



Figure 1 Weak shock waves (Mach surfaces) 
in a supersonic nozzle. 


Moreover, a smooth concave part of the body sur- 
face produces a flow field in which disturbances prop- 
agate on surfaces whose orientation is that of oblique 
shocks of vanishingly small strength. These Mach sur- 
faces come to a focus a short distance away from a 
concave boundary to produce a full-strength oblique 
shock. In general, a Mach surface is a surface on which 
a weak disturbance in pressure and associated quan- 
tities can propagate in accordance with the equations 
of acoustics as a Mach wave. The theory of weak shock 
waves is subsumed under acoustic theory in such a way 
that a weak shock corresponds to a characteristic sur- 
face of the acoustic wave equation. Shocks and charac- 
teristic surfaces do not in general coincide, but they do 
in the limit of zero shock strength. 

Oblique shock waves are fundamental to the mod- 
eling of shock wave intersections and reflections. In a 
fluid, shock intersections produce vorticity, localized 
in surfaces called slip surfaces, across which there is 
a jump in the tangential component of velocity. Shock 
reflections can be of different types, depending on the 
orientation and strength of the incoming shock. A com- 
mon type is the Mach reflection, in which a nearly 
straight shock wave, known as the stem, extends from 
the body surface to a triple intersection of shocks, from 
which there also emerges a slip surface. 

Many excellent photographs of oblique shock wave 
patterns may be found in Van Dyke (1982). Figures 1 
and 2 are examples of science as art. 

3.2 Flow Deflection 

The most basic question one can ask about an oblique 
shock is the following: if the incoming flow is at Mach 
number $M_1$ and the flow is deflected by an angle $\theta$ from
its original direction, what are the possible oblique
shock angles $\phi$ that can bring this deflection about?
The angle $\phi$ is that between the shock surface and
the incoming flow, so that a normal shock corresponds
to $\theta = 0$ and $\phi = \pi/2$. The question is readily
answered by starting with the formulas for a normal
shock and superposing a uniform flow parallel to the
shock surface. A little algebra and trigonometry give

$$\tan\theta = \frac{(M_1^2 \sin^2\phi - 1)\cot\phi}{1 + \left(\tfrac{1}{2}(\gamma + 1) - \sin^2\phi\right) M_1^2}.$$

Figure 2 Reflection of a shock wave by a wedge, and
rolling up of the vortex sheets (slip surfaces) produced.

At a single value of $M_1$, the graph of $\theta$ against $\phi$ rises
from $\theta = 0$ at $\phi = \sin^{-1}(1/M_1)$ to a maximum
$\theta = \theta_{\max}(M_1)$ at a higher value of $\phi$, before falling back
to $\theta = 0$ at $\phi = \pi/2$. Thus the shock angle and flow
deflection are limited to definite intervals. The angle
$\sin^{-1}(1/M_1)$ is the Mach angle and gives the Mach wave
described above.

As $M_1$ is varied, there is a maximum possible value
of $\theta_{\max}(M_1)$, attained in the limit of large $M_1$, and there
is a corresponding value of $\phi$. For air in conditions that
are not too extreme, with a ratio of specific heats close
to $\gamma = 1.4$, we obtain the useful result that the greatest
possible flow deflection at an oblique shock is 46°, and 
the corresponding shock angle is 68°. 
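
The limiting values quoted above can be checked numerically from the relation $\tan\theta = (M_1^2\sin^2\phi - 1)\cot\phi \,/\, \bigl(1 + (\tfrac{1}{2}(\gamma+1) - \sin^2\phi)M_1^2\bigr)$. The sketch below simply scans shock angles between the Mach angle and $\pi/2$. The function names and the brute-force search on a uniform grid are assumptions made for illustration; a root-finder applied to $d\theta/d\phi = 0$ would serve equally well.

```python
import numpy as np

def deflection(phi, M1, gamma=1.4):
    """Flow deflection angle theta (radians) for shock angle phi (radians)."""
    s2 = np.sin(phi)**2
    numerator = (M1**2 * s2 - 1.0) / np.tan(phi)
    denominator = 1.0 + (0.5 * (gamma + 1.0) - s2) * M1**2
    return np.arctan(numerator / denominator)

def max_deflection(M1, gamma=1.4, n=200001):
    """Return (theta_max, phi at theta_max) in radians by scanning phi."""
    # Scan from the Mach angle to pi/2, dropping the endpoints where theta = 0.
    phi = np.linspace(np.arcsin(1.0 / M1), 0.5 * np.pi, n)[1:-1]
    theta = deflection(phi, M1, gamma)
    i = np.argmax(theta)
    return theta[i], phi[i]
```

Calling `max_deflection(1.0e3)` as a stand-in for the large-$M_1$ limit gives a deflection close to 45.6° at a shock angle close to 67.8°, consistent with the rounded values of 46° and 68° for $\gamma = 1.4$.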

For given $M_1$, and a deflection angle in the allowable
range, there are two possible shock angles $\phi$. The
smaller angle gives the weaker shock, and, except in a
narrow range of deflection angles just below $\theta_{\max}(M_1)$,
a supersonic outflow; the outflow from the stronger 
shock, at the larger angle, is always subsonic. Thus, 
although the inflow must be supersonic, the outflow 
may be subsonic or supersonic, depending on the magnitude
of the equivalent superposed uniform flow on a
normal shock referred to above. The ultimate explanation
of such matters is the entropy condition. Whether
or not the outflow is supersonic is important in prac- 
tice because a supersonic outflow may contain further 
shock waves. Which shock pattern actually occurs may 
depend on the downstream conditions. 

4 Dramatic Examples of Shock Waves 

4.1 Asteroid Impact 

The Earth has a long memory: the shock wave pro- 
duced by the asteroid that extinguished the dinosaurs 
(and hence led to human life, via the rise of mammals) 
left a signature that is still visible in the Earth’s crust. 
The asteroid struck Earth 66 million years ago near 
Chicxulub in the Yucatan Peninsula in Mexico, produc- 
ing a crater 180 km in diameter. Besides generating a 
tsunami [V.19] in the oceans, which mathematically 
speaking evolves into a type of shock wave near the 
shore, it generated shock waves in the rocks of the 
Earth’s crust. These shock waves propagated around 
the Earth, triggering an abundance of violent events, 
such as earthquakes and the eruption of volcanoes. 

Geologists have determined which metamorphic fea- 
tures of rock can be produced only by the intense pres- 
sure in the shock waves from a meteorite or aster- 
oid, one example being the patterns of closely spaced 
parallel planes that are visible under a microscope in 
grains of shocked quartz. The mathematical theory of 
shock waves in solids, such as metal and rock, is highly 
developed. 

4.2 The Trinity Atomic Bomb Test 

The world's first nuclear explosion took place in July 
1945, in the desert of New Mexico in the United 
States, when the U.S. army tested an atomic bomb 
with the code name Trinity. The explosion was filmed,
and the film, which showed the hemispherical shock
wave rising into the atmosphere, was released not long 
afterward. 

For all the complexity of the physical processes and 
fluid dynamics in the explosion, it is remarkable that a 
simple scaling law describes to high accuracy the radius 
r of the shock wave as a function of time t after deto- 
nation, in which the only parameters are the explosion 
energy $E$ and the initial air density $\rho$. The scaling law is

$$r = C\left(\frac{E t^2}{\rho}\right)^{1/5},$$

where $C$ is a constant close in value to 1. With the aid of
this scaling law, and a logarithmic plot, G. I. Taylor was 
able to determine the energy of the explosion rather 
accurately using only the film as data. 
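
The idea behind this estimate can be sketched in Python, assuming the scaling law has the dimensionally determined form $r = C(Et^2/\rho)^{1/5}$. On a logarithmic plot, $\ln r - \tfrac{2}{5}\ln t$ is then a constant from which $E$ follows. The function names are hypothetical, and any radius-time values passed in for experimentation should be treated as synthetic, not as measurements from the Trinity film.

```python
import numpy as np

def blast_radius(t, E, rho, C=1.0):
    """Shock radius at time t from the assumed law r = C (E t^2 / rho)^(1/5)."""
    return C * (E * t**2 / rho) ** 0.2

def estimate_energy(t, r, rho, C=1.0):
    """Estimate E from arrays of times t and radii r with the 2/5 slope held fixed."""
    # ln r - (2/5) ln t should equal the constant ln C + (1/5) ln(E / rho).
    intercept = np.mean(np.log(r) - 0.4 * np.log(t))
    return rho * np.exp(5.0 * intercept) / C**5
```

Because the energy enters through the fifth power of the intercept, a small error in the measured radii is amplified roughly fivefold in the estimate of $E$, so the accuracy of the result depends strongly on the quality of the radius data.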

4.3 Shock Wave Lithotripsy 

A medical treatment for a kidney stone is to break up 
the stone by applying a pulsed sound wave of high 
intensity that is brought to a focus at the stone. The 
focusing causes the pulses to become shock waves, in 
which the sudden rise in pressure provides the required 
disintegrative force on the stone. The machine used 
is called a lithotripter; it is designed so that the fre- 
quency and intensity of the shock waves can be con- 
trolled and varied as the treatment progresses. A water 
bath is applied to the patient’s back, so that the pulse 
propagates through water and then tissue. 
V.21 Turbulence

Julian C. R. Hunt 


1 An Introduction to the Physical and 
Mathematical Aspects of Turbulence 

Turbulence in artificially and naturally generated fluid 
flows consists of a wide range of random unsteady eddy 
motions that differ over varying sizes and frequencies, 
examples being those observed in the wake of a model 
building in a wind tunnel and in clouds (figure 1). In 
the eddies with the greatest energy, the velocity fluctuations
are of order $u_0$ and their typical size is of order
$L_0$ (figure 2).

Turbulent flows differ greatly from laminar flows 
[IV.28], which are stable and predictable even if they are 
unsteady. The two kinds of flow depend differently on 
the kinematic viscosity $\nu$, which is a property of fluids
that is independent of the velocity and is determined by
very small molecular motions on scales unrelated to the
scale of the flow. The study of turbulence begins with
defining, mathematically and conceptually, first the statistical
framework for the kinematics and dynamics of
random flow fields and then the mechanisms for transition
processes between laminar and turbulent flows,
which occur when the dimensionless Reynolds number
$\mathrm{Re} = u_0 L_0/\nu$ is large enough for the fluctuating viscous
stresses to be small. The goal of turbulence research is
not, realistically, to find an overall theory but rather
to describe and explain characteristics, patterns, and
statistical properties found in different types of turbulent
flows or particular regions of turbulent flows.
Remarkably, the smallest-scale eddy motions have an
approximately universal statistical structure.

Figure 1 (a) Random edges of turbulent clouds. (b) Turbulence
in the wake of a model building in a wind tunnel,
showing variations in random motions of fluid particles.
(Courtesy of D. Hall.)

Figure 2 Measurements of turbulent velocities in a wind
tunnel at $\mathrm{Re} \sim 10^5$ (courtesy of Z. Warhaft: taken from
“Passive scalars in turbulent flows” (Annual Review of Fluid
Mechanics 32:203-40, 2000)).

An equally significant but more practical result is
the Prandtl-Karman theory, which states that the mean
velocity near all kinds of resistive surfaces has a
logarithmic profile. Turbulence concepts apply in engineering
and environmental flows, as well as in oceans
and planetary atmospheres, where the eddy sizes range 
over factors of hundreds to thousands, while in the 
interior of the Earth, on other planets, in stars, and even 
in the far constellations, the range can extend to fac- 
tors of millions or more. Partially turbulent motions 
are observed in the flows along air passages and in 
the larger blood vessels of humans and large mam- 
mals, where they greatly influence the vital transport 
of liquids, gases, and particles. 

The fluid media in which turbulence occurs may 
consist of liquids, gases, or multiphase mixtures with 
droplets, bubbles, or particles. Some complex fluid 
media behave like fluids only under certain conditions 
and for limited periods of time, examples being low- 
density ionized gases in space or mixtures of solids and 
fluids in mud and volcanoes. 

In the early twentieth century, leading scientists and
mathematicians including Ludwig Prandtl, G. I. Taylor,
Theodore von Kármán, and Lewis Fry Richardson
established the systematic study of turbulence based on the
principles of fluid dynamics and statistical analysis that
had been developed during this period. The combination
of these mathematical approaches, together with
limited experimental results and hypotheses based on
the ideas of statistical physics, led to Andrey Kolmogorov
and Alexander Obukhov’s general statistical
theory in 1941. New measurement technologies and
greater computational speed and capacity have shown
how eddy structure changes at very high Reynolds
number ($\mathrm{Re} \gtrsim 10^5$), and this is leading to revision
and clarification of physical concepts and statistical
models.

There are differences between turbulence and other 
kinds of random motions in fluids, such as waves on 
liquid surfaces or the chaotic motions of colliding solid 
particles. In turbulence, localized flow patterns adjust 
faster to local conditions than typical wave motions. 

The aim of this article is to show how mathemati- 
cally based research into turbulent flows, with the aid 
of experiments and computations, has led to methods 
for exploring the qualitative and quantitative aspects 
of turbulence. Key physical and statistical results are 
explained. 

2 Randomness and Structure 

The mathematical concepts and terminology that are 
used to measure, describe, and analyze the random 
fields that occur in turbulent flows are the same as 
those used for other random processes. The turbulent
velocity field, $v^*$, is broadly like other continuous
three-dimensional random processes. It and other
variables at given points in space ($\mathbf{x}$) and time ($t$)
cannot be exactly predicted when the Reynolds number Re
exceeds a critical value for the region of flow where
turbulence exists, $D_T$ (this may be a subregion of the
whole flow).

A large number of experimental or computational
trials are needed to calculate for a given flow the
probability, defined as $\mathrm{pr}(v^*)\,dv^*$, of a single component
$v^*$ of the variable lying between two values $v^*$ and
$v^* + dv^*$. The mean value of $v^*$ is

\[ \langle v^* \rangle = \int_{-\infty}^{\infty} v^*\,\mathrm{pr}(v^*)\,dv^*. \]

The fluctuation of $v^*$ relative to its mean value is

\[ v = v^* - \langle v^* \rangle. \]

These “ensemble” statistical properties are mainly 
expressed in terms of the moments, correlations, and 
spectra of the fluctuation v, but particular features 
of the flow can be defined by calculating conditional 
statistics when variables lie within certain ranges or 
have particular properties. 

The $m$th moment of the fluctuation is defined as

\[ M^{(m)} = \langle v^m \rangle = \langle (v^* - \langle v^* \rangle)^m \rangle. \]

Thus $M^{(1)} = \langle v \rangle = 0$ and $M^{(2)} = \langle v^2 \rangle = v'^2$, the
variance of $v$ ($v'$ is the standard deviation), which defines
the width of the distribution and, broadly, the magnitude
of fluctuations. The skewness Sk, which is defined
by $\mathrm{Sk} = M^{(3)}/v'^3$, indicates whether large fluctuations
are positive or negative (see figure 2, where Sk = 0).
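
As a minimal sketch of these definitions, the following Python fragment estimates the mean, variance, and skewness from a finite set of samples, replacing the ensemble average by a plain arithmetic mean. The function name and the sample values are illustrative assumptions, not data from any experiment.

```python
import math

def fluctuation_moments(samples):
    """Return (mean, variance, skewness) for a list of samples of v*."""
    n = len(samples)
    mean = sum(samples) / n                  # <v*>
    fluct = [s - mean for s in samples]      # v = v* - <v*>
    m2 = sum(v ** 2 for v in fluct) / n      # M^(2) = v'^2
    m3 = sum(v ** 3 for v in fluct) / n      # M^(3)
    std = math.sqrt(m2)                      # v', the standard deviation
    skewness = m3 / std ** 3                 # Sk = M^(3) / v'^3
    return mean, m2, skewness

# Made-up samples: many small negative values and a few large positive ones.
samples = [-1.0] * 8 + [4.0] * 2
mean, variance, skewness = fluctuation_moments(samples)
```

With these samples the most frequent value is negative while the skewness is positive, which is the situation described for vertical velocities in cumulus clouds.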

In many types of turbulent flows, such as in a wind 
tunnel, the statistics do not change with time and are 
spatially homogeneous (in one or more directions). 
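Time averages taken over periods between $t_0$ and $t_0 + T$
are then used to calculate mean or higher $m$th
moments, denoted by overbars. For example,

\[ \overline{(v^*)^m}^{\,T}(\mathbf{x}) = \frac{1}{T}\int_{t_0}^{t_0+T} [v^*(\mathbf{x}, t)]^m\,dt. \]

Similarly, spatial averages can be taken over distances
between two points, $x_{10}$ and $x_{10} + X_1$:

\[ \overline{(v^*)^m}^{\,X_1}(x_2, x_3, t) = \frac{1}{X_1}\int_{x_{10}}^{x_{10}+X_1} [v^*(x_1, x_2, x_3, t)]^m\,dx_1. \]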

In “ergodic” flows the time and/or space averages are 
well defined and are the same for all experiments (e.g., 
by turning on and off the flow). The space and time
means ($m = 1$) are equal to the ensemble mean $\langle v^* \rangle$.

However, there are also well-defined “non-ergodic” 
turbulent flows, where time means in one experiment 
differ significantly from those in other experiments in 
the same flow. In these types of flows, a number of dif- 
ferent, but persistent, large-scale flow patterns occur, 
given the same initial or boundary conditions. Such 
nonuniqueness is found in flows in containers with one 
or more axes of symmetry, for example, in the decay of 
swirling flows in ellipsoidal containers. In such cases, 
the ensemble mean does not correspond to the space/ 
time average in any given flow. 

In most turbulent flows, $\mathrm{pr}(v)$ has a single maximum
(or mode) where $v = \hat{v}$, which is the most frequent
value of the fluctuations. Where Sk > 0, as in the vertical
velocity in cumulus clouds (figure 1), there is a
greater probability of large values of positive $v$ than of
large negative values. But since negative values are the most
frequent, $\hat{v} < 0$.

When studying probability distributions it is usual
to start by comparing them with $\mathrm{pr}(v)$ for the most
general and least repeatable type of random variable
that can be conceived: namely, the sum of an infinite
number of independent random variables. This is the
Gaussian or normal distribution:

\[ p_G(v) = \frac{1}{\sqrt{2\pi}\,v'} \exp\left(-\frac{(v - \langle v \rangle)^2}{2v'^2}\right). \]

For many of the random variables observed in turbulent
flows this is a useful approximation. But if $\mathrm{pr}(v)$
differs even slightly from $p_G(v)$, this is physically significant
because it indicates repeatability and structure.


The average sizes of turbulent eddies with different
levels of energy can be estimated from statistical
measurements of space-time two-point correlations,
defined as the average of the product of the velocities
at two points along a line in the $x_1$-direction. The
dimensionless form for homogeneous turbulence is

\[ C(r_1, x_1, t) = \frac{\langle v(x_1, t)\,v(x_1 + r_1, t) \rangle}{\langle v^2 \rangle(x_1, t)}. \]

Integrals of $C$ define integral timescales and length
scales,

\[ T = \int_0^\infty C\,dt, \qquad L = \int_0^\infty C\,dr_1. \]
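
The two definitions above can be sketched numerically as follows. This is a hedged illustration: it assumes a periodic, uniformly sampled record, replaces the ensemble average by a spatial average, and integrates from zero separation. The function names and the exponential model correlation are assumptions made for the example.

```python
import math

def correlation(v, lag):
    """Normalized two-point correlation <v(x) v(x+r)> / <v^2> for a periodic
    record v on a uniform grid; lag is the separation in grid points."""
    n = len(v)
    num = sum(v[i] * v[(i + lag) % n] for i in range(n)) / n
    den = sum(x * x for x in v) / n
    return num / den

def integral_scale(corr, dr):
    """Trapezoidal estimate of the integral of C over r from sampled values."""
    return dr * (sum(corr) - 0.5 * (corr[0] + corr[-1]))

# Check 1: a single sine wave decorrelates at a quarter of its wavelength
# and is perfectly anticorrelated at half a wavelength.
n = 400
wave = [math.sin(2 * math.pi * i / n) for i in range(n)]
c_quarter = correlation(wave, n // 4)
c_half = correlation(wave, n // 2)

# Check 2: a model correlation C(r) = exp(-r / L_true) has integral scale L_true.
L_true, dr = 2.0, 0.01
model = [math.exp(-k * dr / L_true) for k in range(4000)]
L_est = integral_scale(model, dr)
```

The same `integral_scale` helper applies to a time record if `dr` is replaced by the sampling interval, giving the integral timescale instead of the length scale.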


Turbulence structure and statistics are also described
in terms of characteristic random modes. Consider a
variable $v(x)$, expressed as a sum of the products of
space-filling, nonrandom modes $\phi_n(x)$ (e.g., sinusoidal
waves with wave numbers $k_n = 2\pi n/X_1$) with random
coefficients $a_n$:

\[ v(x) = \sum_{n=1}^{N} a_n \phi_n(x). \]

The modes are assumed to be orthogonal to each other
over the space $X_L$ to $X_U$, i.e.,

\[ \int_{X_L}^{X_U} \phi_m(x)\,\phi_n(x)\,dx = \delta_{mn}, \]

where $\delta_{mn}$ is the Kronecker delta [I.2 §2, table 3].
The spectrum $E^{(v)}(k_n)$ of $v(x)$ is defined as the mean
square of the amplitude coefficients:

\[ \langle a_n^2 \rangle = E^{(v)}(k_n). \]

The sum of the energy of the modes is equal to the
variance, $\sum_n E^{(v)}(k_n) = v'^2$. The Karhunen-Loève and
Wiener-Khintchine theorems show how the correlation
$C$ and the spectrum are linked. For example, fluctuations
correlated over small distances $r_1$ ($\ll L$) also
contribute most of the energy over this length scale, i.e.,
$E(k_n)$, where $k_n \sim L/r_1$.
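
The modal expansion can be illustrated with a short Python sketch. It assumes orthonormal sine modes on a unit interval and a single made-up realization with fixed coefficients, so the "spectrum" here is simply $a_n^2$ rather than an ensemble mean. All names and values are illustrative.

```python
import math

X1 = 1.0        # domain length (assumed)
N_MODES = 8     # number of modes retained
N_GRID = 2000   # midpoint-rule quadrature points

def phi(n, x):
    """Orthonormal sine mode on [0, X1] with wave number k_n = 2*pi*n/X1."""
    return math.sqrt(2.0 / X1) * math.sin(2 * math.pi * n * x / X1)

def integrate(f):
    """Midpoint-rule integral of f over [0, X1]."""
    dx = X1 / N_GRID
    return sum(f((i + 0.5) * dx) for i in range(N_GRID)) * dx

# A made-up field built from known coefficients (fixed here, not random).
a_true = {1: 0.8, 3: -0.5, 5: 0.2}

def v(x):
    return sum(a * phi(n, x) for n, a in a_true.items())

# Recover each coefficient by projection: a_n = integral of v * phi_n dx.
a = {n: integrate(lambda x, n=n: v(x) * phi(n, x)) for n in range(1, N_MODES + 1)}

energy_in_modes = sum(c * c for c in a.values())   # sum over n of a_n^2
energy_in_field = integrate(lambda x: v(x) ** 2)   # integral of v^2 dx
```

Because the modes are orthonormal and the domain has unit length, the two energies agree, which is the single-realization counterpart of the statement that the modal energies sum to the variance.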

In an infinite space, $E(k_n)$ tends to a continuous spectrum
$E(k)$ as $N \to \infty$. Note that the spectra of velocity
gradients $\partial v/\partial x = v_x$, i.e., $E^{(v_x)}(k)$, are equal to
$k^2 E(k)$, so that $E^{(v_x)}(k)$ is greatest where $k$ is large and
eddies are small.

Sometimes there is a “spike” in the spectrum where
energy is concentrated at a particular wave number $k_n$
or frequency $\omega_n$, such as vortex shedding downwind
of an aircraft wing.

In turbulence, the fluctuating velocity is a three-dimensional
vector $\mathbf{v}$ in three-dimensional space with
three components, $\mathbf{v} = (v_1, v_2, v_3)(\mathbf{x}, t)$, the total
variance being $v'^2 = \langle v_j v_j \rangle$ (summing over $j$ where indices
are repeated).

The anisotropy of turbulence variables can be expressed
independently of the frame of reference in
terms of the variances of the components and of
the correlations between the components in different
directions, e.g., $\langle v_i v_j \rangle / v'^2$.


3 Dynamics in a Statistical Framework 


Based on Newton's second law, the rate of change of
momentum of a fluid element (which is much larger
than any molecules, but much smaller than the scale
of fluid motions), with density $\rho^*(\mathbf{x}^*, t^*)$ and moving
with a velocity $\mathbf{v}^*(\mathbf{x}^*, t^*)$, is equal to the molecular
forces acting on it caused by the gradients of fluid pressure
$p^*(\mathbf{x}^*, t^*)$, viscous stresses $\tau^*_{ij}$, and a specified
body force $\mathbf{F}^*(\mathbf{x}^*, t^*)$.

The momentum and continuity equations can be
expressed in nondimensional forms by expressing distances
in relation to a fixed overall scale $L^*$, velocity
vectors in terms of a reference velocity $v_0^*$, and the time
variation in terms of $L^*/v_0^*$. From the Navier-Stokes
equations [III.23],

\[ \frac{Du_i}{Dt} = \frac{\partial u_i}{\partial t} + u_j \frac{\partial u_i}{\partial x_j} = -\frac{\partial}{\partial x_i}(P + p) + \frac{1}{\mathrm{Re}} \nabla^2 u_i + F_i + f_i, \tag{1} \]

where $x_i = x_i^*/L^*$, $t = t^* v_0^*/L^*$, $u_i = v_i^*/v_0^*$, and the
mean and fluctuating values for $p^*$, $F_i^*$ are

\[ P + p = \frac{p^*}{\rho_0 (v_0^*)^2}, \qquad F_i + f_i = \frac{F_i^*}{\rho_0 (v_0^*)^2 / L^*}. \]

Since the densities of fluid elements change little
in most natural and engineering turbulent flows, the
fluctuating velocity field is not divergent, i.e.,
$\nabla \cdot \mathbf{u} = \partial u_j/\partial x_j = 0$.

Note that the equations now contain the single nondimensional
Reynolds number $\mathrm{Re} = v_0^* L^*/\nu$, so that the
effect of varying the velocity scale, the length scale,
or the kinematic viscosity (by the same amount) is the
same as varying the single parameter Re. For turbulent
flows, $v_0^*$ may be taken as the (dimensional) standard
deviation $v'^*$.

To study the flow field’s statistical properties, the
nondimensional velocity and pressure are expressed in
terms of their mean and fluctuating values, $U_i(\mathbf{x}, t)$ and
$u_i(\mathbf{x}, t)$ (see section 2).


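Taking the divergence of (1) gives

\[ \nabla^2 (P + p) = -\left(S^2 - \tfrac{1}{2}\omega^2\right) + \frac{\partial (F_i + f_i)}{\partial x_i}, \]

where $\boldsymbol{\omega} = \nabla \times \mathbf{u}$ is the vorticity vector, $\omega = |\boldsymbol{\omega}|$, and

\[ S^2 = \Sigma_{ij}\Sigma_{ij}, \qquad \Sigma_{ij} = \frac{1}{2}\left(\frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i}\right). \]

Thus, the pressure, which is determined by the divergence
of the inertial and body force terms, has maximum
or minimum values where the strain $S$ is greater
than or less than $\omega/\sqrt{2}$.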
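
The split of the inertial source term into strain and vorticity parts rests on the algebraic identity $(\partial u_i/\partial x_j)(\partial u_j/\partial x_i) = \Sigma_{ij}\Sigma_{ij} - \tfrac{1}{2}\omega^2$. The sketch below checks it numerically for one arbitrary velocity-gradient matrix. The matrix entries and function names are illustrative assumptions.

```python
def strain_and_vorticity(grad):
    """grad[i][j] = du_i/dx_j as 3x3 nested lists.
    Returns (S2, omega2) with S2 = Sigma_ij Sigma_ij and omega2 = |curl u|^2."""
    sigma = [[0.5 * (grad[i][j] + grad[j][i]) for j in range(3)] for i in range(3)]
    s2 = sum(sigma[i][j] ** 2 for i in range(3) for j in range(3))
    omega = (grad[2][1] - grad[1][2],
             grad[0][2] - grad[2][0],
             grad[1][0] - grad[0][1])
    omega2 = sum(w * w for w in omega)
    return s2, omega2

def inertial_source(grad):
    """(du_i/dx_j)(du_j/dx_i): what the divergence of the advective
    acceleration reduces to when the flow is incompressible."""
    return sum(grad[i][j] * grad[j][i] for i in range(3) for j in range(3))

# An arbitrary trace-free (incompressible) velocity gradient, for illustration.
grad = [[1.0, 2.0, -0.5],
        [0.3, -0.4, 1.5],
        [0.7, -1.2, -0.6]]
s2, omega2 = strain_and_vorticity(grad)
lhs = inertial_source(grad)
rhs = s2 - 0.5 * omega2

# Pure rotation about the x3-axis: zero strain, so the source term is negative
# and the Laplacian of the pressure is positive (a pressure minimum).
rotation = [[0.0, -1.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 0.0, 0.0]]
s2_rot, omega2_rot = strain_and_vorticity(rotation)
```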

For incompressible fluids the viscous stresses, $\tau_{ij}$,
are proportional to $\Sigma_{ij}$: $\tau_{ij} = (2/\mathrm{Re})\Sigma_{ij}$. Note that $\tau_{ij}$
is zero both in pure rotational flows (i.e., $\omega \neq 0$ and
$S = 0$), such as at the center of a vortex, and in inviscid
flows, where $\mathrm{Re} \to \infty$. But the local viscous force at a
point in the flow, which is equal to the stress gradient,
$\partial \tau_{ij}/\partial x_j = -\mathrm{Re}^{-1}(\nabla \times \boldsymbol{\omega})_i$, is zero in pure straining
flows (i.e., $\omega = 0$) and where the vorticity is uniform.

The dissipation of energy is given by $\varepsilon = \Sigma_{ij}\tau_{ij}$.
It is proportional to $\mathrm{Re}^{-1}$ and is also zero locally for
pure rotational flows, but its total integral in an isolated
flow region $D_T$ is proportional to $\omega^2$, i.e.,
$\mathcal{E} = (2/\mathrm{Re}) \int \omega^2\,dV$.

Taking the curl of (1) eliminates the pressure gradient
and leads to an equation for the vorticity. In
dimensionless vector terms,

\[ \frac{D\boldsymbol{\omega}}{Dt} = (\boldsymbol{\omega} \cdot \nabla)\mathbf{u} + \frac{1}{\mathrm{Re}} \nabla^2 \boldsymbol{\omega} + \nabla \times (\mathbf{F} + \mathbf{f}), \tag{2} \]

which shows how the inertial straining term leads to
high and low variations in $\boldsymbol{\omega}$. The main effect of viscosity
in (2) is to “diffuse” vorticity from regions of high
vorticity to regions of low vorticity, especially from/to
boundaries. If $\mathrm{Re} \ll 1$, vorticity is diffused and there
are no sharp peaks in vorticity.

The body force affects the vorticity and, from
there, the velocity field only if it is rotational; if, for
example, the force is determined by a potential $\Phi$, so that
$\mathbf{F} + \mathbf{f} = \nabla\Phi$, it affects only the pressure.

The momentum equation (1) defines how the mean
and fluctuating components of the turbulent flow field,
denoted by $U_i$ and $u_i$, are connected dynamically.
From there, the ensemble average of the rate of
change of the turbulent kinetic energy of the fluctuations
$K = \tfrac{1}{2}\langle u_i u_i \rangle$ is

\[ \frac{\partial K}{\partial t} + U_i \frac{\partial K}{\partial x_i} = -\langle u_i u_j \rangle \frac{\partial U_i}{\partial x_j} + \frac{1}{\mathrm{Re}} \langle u_i \nabla^2 u_i \rangle - \frac{\partial}{\partial x_j}\left(\langle u_j p \rangle + \tfrac{1}{2}\langle u_k u_k u_j \rangle\right) + \langle u_i f_i \rangle, \tag{3} \]

or, schematically, $\{A\} = \{P\} - \{\varepsilon\} + \{E_T\} + \{F\}$. Here,
$\{A\}$ is the time variation and advection of $K$; $\{P\}$ is the
production of $K$ by interactions between the fluctuations
and gradients of the mean velocity; $\{\varepsilon\}$ is the viscous
dissipation by the smallest eddies, determined by
the correlation between $u_i$ and viscous stresses; $\{E_T\}$
is the transfer of energy by the fluctuations, i.e., by gradients
of the cross correlations of $u_j$ and $p$, and by gradients
of third-order moments; and $\{F\}$ is the rate of
production of $K$ by the forcing term, which is equal to
the correlation between $u_i$ and $f_i$.

For homogeneous flows without mean gradients or
fluctuating body forces, (3) simply shows that $K$ decays
in proportion to the dissipation:

\[ \frac{dK}{dt} = -\langle \varepsilon \rangle. \tag{4} \]

This process in a wake is depicted in figure 1(b).

4 Transition to Fully Developed Turbulence 

The transition from laminar flow to well-developed tur- 
bulence may take place in part of a flow, such as over 
an aircraft wing, or in the whole flow, such as in a pipe. 
The turbulence may develop for a certain time and then 
decay back into laminar flow. 

There are different stages in the transition to tur- 
bulence, which evolves from growing small-amplitude 
fluctuations, depending on the boundary conditions 
and the Reynolds number, Re. One class of initially
transitional fluctuations is “modal,” when all the components
in different directions of the flow grow at
the same rate within a defined flow domain $D_T$. The
most unstable modes have a defined wavelength and
frequency, so that the initial energy spectrum may
be in a “spike.” By contrast, in “nonmodal” fluctuations,
different components have different rates of
growth despite interacting with each other (see fluid
mechanics [IV.28 §8]).

At the beginning of the transition process (in space
or time, or as Re increases), the dynamics are linear.
In shear layers with velocity profile $U(x_3)$ with
an inflection point where $d^2U/dx_3^2 = 0$, the undulating
“Kelvin-Helmholtz” modal fluctuations $u$ are exponentially
growing at a rate defined by $\sigma$, where
$u \propto \exp(\sigma t)$. In other parts of the flow, fluctuations may
not exist. In swirling flow between circular cylinders,
the axisymmetric vortex-like modal fluctuations of G. I.
Taylor extend across the whole flow. These modal fluctuations
can grow only when Re exceeds the critical
value $\mathrm{Re}_{\mathrm{crit}} = \Delta U_0 L_0/\nu$, where $\Delta U_0$ is the change in
mean velocity across the flow and $L_0$ is the length scale.
$\mathrm{Re}_{\mathrm{crit}}$ is independent of the magnitude of $u$ provided
it is small. Different kinds of nonlinear development
occur as these fluctuations grow or change into other
modal forms.

Some basic laminar flows $U_i(\mathbf{x}, t)$, such as shear flow
between planes or in the center of a vortex, are stable
to small modal disturbances, but low-amplitude
nonmodal disturbances can grow in these flows if Re
exceeds a critical value, $\mathrm{Re}'_{\mathrm{crit}}$, defined by root mean
square fluctuations. In a shear flow, arbitrary initial
disturbances are transformed into well-defined eddy 
structures such as the elongated streaks of high and 
low streamwise fluctuations. The spectra change from 
a narrow to a broad band shape as modal and nonmodal 
fluctuations develop and become nonlinear. 

The transition process affects the structure of the 
fluctuations and their spectral energy distribution, 
ranging from large scales to the smallest viscously 
affected eddy motions. Typically, the changes occur on 
a timescale of the most energetic eddies, so that inter- 
actions take place between all the scales down to the 
smallest. In general, whatever the initial transition pro- 
cess, and whatever the boundary conditions, all types 
of turbulence retain some of their initial characteristic 
features over extended periods, as is discussed in the 
next section. 

However, most types of “well-developed turbulence” 
also have a number of common features so long as the 
Reynolds number exceeds its critical value for transition,
$\mathrm{Re}_{\mathrm{trans}}$. There may be multiple values of $\mathrm{Re}_{\mathrm{trans}}$
above which further changes in the structure can occur.

5 Homogeneous Turbulence 

The basic mechanisms and statistical structure of tur- 
bulence are best studied where the velocity fluctua- 
tions are homogeneous and where they are not affected 
by boundaries or the eddy structures generated by 
large-scale gradients of the mean and fluctuating veloc- 
ities. Nevertheless, even homogeneous turbulent flows 
depend on the initial or continuing forcing. 

Turbulent flows consist of random fluctuations and 
eddies, i.e., local regions of vortical flow confined 
within moving envelopes. At high Reynolds number, 
the boundaries of flow regions like clouds tend to be 
sharply defined, with high local gradients of velocity. 
Nonlinear fluctuations and eddies evolve internally and 
interact with other eddies.

In a fixed frame of reference (or one moving with
the mean velocity), the one-dimensional spatial spectrum
$E_{11}(k_1)$ for wave number $k_1$ in the $x_1$-direction
is always finite at $k_1 = 0$, i.e., $E_{11}(0) = \langle u_1^2 \rangle L_x/\pi$.
The integral scale $L_x$ defines the size of the eddies
containing most of the energy.

Also, $E_{11}(k_1)$ tends monotonically to zero as $k_1 L_x \to \infty$,
which implies that the smallest eddies on average
have small velocities. Over the range $0 < k_1 L_x \ll 1$, the
spectrum remains of the order of $E_{11}(0)$. For most
well-developed flows, $E_{11}(k_1)$ has a single maximum where
$k_1 L_x \sim 1$.

The “Eulerian” frequency spectrum for the $x_1$ component,
denoted by $E^{E}_{11}(\omega)$, is broadly similar to the
spatial spectrum, and it defines the integral timescale
$T_E = (2\pi u_1')^{-1} E^{E}_{11}(0)$, where $u_1' = \langle u_1^2 \rangle^{1/2}$. In shear flows with
“coherent structures,” $T_E$ can be significantly larger.

At high Reynolds number, over smaller scales and
higher frequencies, the frequency and spatial spectra
are proportional to each other,

\[ E^{E}_{11}(\omega) \sim (1/u_1')\,E_{11}(k_1), \]

where $k_1 = \omega/u_1'$. These results demonstrate how large
eddies advect small eddies randomly at speeds approximately
equal to the root mean square velocity $u_1'$,
which explains why $T_E \sim L_x/u_1'$.

The time dependence of the velocity or displacement
$X_i(t; X_i(t_0))$ of a fluid particle released at $X_i(t_0)$
over numerous experiments defines the “Lagrangian”
time-dependent statistics of its velocity $u_i(t; X_i(t_0)) = u_i(t)$.
The auto-correlation (for stationary, homogeneous
turbulence) of the velocity over a time interval
$\tau$, defined as $C^{L}(\tau) = \langle u_i(t)\,u_i(t + \tau) \rangle$, where
$C^{L}(0) = \langle u^2 \rangle$, also defines the Lagrangian timescale
$T_L = (u')^{-2} \int_0^\infty C^{L}(\tau)\,d\tau$. $T_L$, which is the timescale for
the velocity of a fluid particle as it is carried around a
large eddy, is of the same order as $T_E$ and also the time
for pairs of particles to separate. Thus $T_E$ is also the
“interaction” time, $T_I$, for the large eddies to interact
with each other.

The decay of homogeneous turbulence with time
depends on the mean loss of energy by viscous dissipation,
$\langle \varepsilon \rangle$, as shown by (4).

Since $\varepsilon$ is proportional to the square of the velocity
gradients, which are greatest for the smallest scales,
the mean value is related to the energy spectrum by

\[ \langle \varepsilon \rangle = \frac{1}{\mathrm{Re}} \int_0^\infty k^2 E(k)\,dk. \]

At high Re, these dissipative eddies are quasilaminar
thin shear flows and vortices, with typical velocities
$u_0$ and length scale $L_0$, and characteristic thickness $\ell$
of order $L_0 \mathrm{Re}^{-1/2}$. Consequently, the dissipation rate
within the layer, $\varepsilon$, is of order $u_0^3/L_0$, which is independent
of the value of Re. Outside these layers, $\varepsilon$ is
very small. Even smaller-scale scattered vortices contribute
significantly to $\langle \varepsilon \rangle$ (see below). High-resolution
computer simulations demonstrate that

\[ \langle \varepsilon \rangle = C_\varepsilon K^{3/2}/L_x, \tag{5} \]

where $C_\varepsilon$ decreases as Re increases. When Re is above
about $10^4$, $C_\varepsilon$ reaches a constant value of order unity,
depending slightly on the initial form of the turbulence.

The turbulence is fully developed at time $t_0$ and
then decays with a self-similar structure for $t > t_0$.
The integral timescales increase in proportion to the
development time $t - t_0$, i.e.,
$T_E \sim T_L \sim L_x/K^{1/2} \sim (t - t_0)$.

Combining the above equations leads to the decay
law of turbulence,

\[ \frac{dK}{dt} = -\frac{A_d K}{t - t_0}, \]

where $A_d$ has a constant value for each type of turbulence
as it decays, usually in the range 1.1-1.3. The
turbulence energy $K$ and the length scale $L_x$ then have
related power-law variations in terms of their values,
$K_d$ and $L_{xd}$, at time $t_d > t_0$:

\[ \frac{K}{K_d} = \left(\frac{t - t_0}{t_d - t_0}\right)^{-A_d}, \qquad \frac{L_x}{L_{xd}} = \left(\frac{t - t_0}{t_d - t_0}\right)^{1 - A_d/2}. \]

Since $A_d > 1$, the Reynolds number of the turbulence
decreases from its initial value $\mathrm{Re}_0$ at $t = t_0$ in proportion
to $K^{1/2} L_x \propto (t - t_0)^{1 - A_d}$. Eventually, the local
value of Re becomes so small that the eddy motions are
smeared out by viscous stresses and the energy decays
rapidly in proportion to $\exp(-k^2 (t - t_0)/\mathrm{Re}_0)$.
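
The power-law decay can be checked against a direct numerical integration of the decay law $dK/dt = -A_d K/(t - t_0)$. The sketch below uses the classical fourth-order Runge-Kutta scheme. The parameter values ($t_0 = 0$, $t_d = 1$, $A_d = 1.2$) and the function names are assumptions chosen only for illustration.

```python
def decay_energy(t, t0, td, Kd, Ad):
    """Power-law solution K = Kd * ((t - t0) / (td - t0)) ** (-Ad)."""
    return Kd * ((t - t0) / (td - t0)) ** (-Ad)

def integrate_decay(t_end, t0, td, Kd, Ad, steps=20000):
    """Integrate dK/dt = -Ad * K / (t - t0) from td to t_end with RK4."""
    def rhs(t, K):
        return -Ad * K / (t - t0)
    h = (t_end - td) / steps
    t, K = td, Kd
    for _ in range(steps):
        k1 = rhs(t, K)
        k2 = rhs(t + 0.5 * h, K + 0.5 * h * k1)
        k3 = rhs(t + 0.5 * h, K + 0.5 * h * k2)
        k4 = rhs(t + h, K + h * k3)
        K += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t += h
    return K

# Illustrative values (assumptions): t0 = 0, reference time td = 1, Ad = 1.2.
t0, td, Kd, Ad = 0.0, 1.0, 1.0, 1.2
K_numeric = integrate_decay(10.0, t0, td, Kd, Ad)
K_power_law = decay_energy(10.0, t0, td, Kd, Ad)

# Ratio of the turbulence Reynolds number at t = 10 to its value at td,
# using the proportionality to (t - t0) ** (1 - Ad).
re_ratio = ((10.0 - t0) / (td - t0)) ** (1 - Ad)
```

For any $A_d > 1$ the ratio `re_ratio` is below one, which is the slow fall in Reynolds number that eventually lets viscous stresses smear out the eddies.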

The rate of change of the energy spectrum $E(k)$ for
eddies of scale $k^{-1}$ is equal to the net inertial transfer of
energy to eddies on this scale, denoted by $-\partial\Pi(k)/\partial k$,
minus the spectrum of dissipation of energy $\varepsilon(k)$ at
wave number $k$:

\[ \frac{\partial E(k)}{\partial t} + \frac{\partial \Pi(k)}{\partial k} = -\varepsilon(k). \]

The growth of the integral scale in decaying turbulence
results from the net upscale transfer of energy
to eddies larger than $L_x$, where $\varepsilon(k)$ is negligible. For
eddy scales less than $L_x$, downscale transfer exceeds
upscale transfer, as a result of negative straining (i.e.,
$\partial u/\partial x < 0$) in eddies larger than $k^{-1}$ amplifying gradients
$(\partial u/\partial x)^2$ associated with scales less than $k^{-1}$.
Since there is a greater probability that large velocity
gradients are negative, the normalized skewness of
$\partial u/\partial x$,

\[ \mathrm{Sk}_{\partial u/\partial x} = \frac{\langle (\partial u/\partial x)^3 \rangle}{\langle (\partial u/\partial x)^2 \rangle^{3/2}}, \]

has a finite negative value, typically $-0.5 \pm 0.2$.

For the smallest-scale motions with typical velocity
$u_{\mathrm{visc}}$ and large velocity gradients (where $k \sim k_{\mathrm{visc}} \gg 1/L_x$),
the transfer of energy is balanced by the viscous
dissipation of energy, i.e., $\partial\Pi/\partial k \sim \varepsilon(k) \sim \mathrm{Re}^{-1} k_{\mathrm{visc}}^2 E(k_{\mathrm{visc}})$,
so that $\partial E(k)/\partial t$ is small. These eddy
motions must also be energetic and large enough for
inertio-viscous interactions with larger scales to sustain
the fluctuations, i.e., their local Reynolds number
must be large enough that $\mathrm{Re}\,u_{\mathrm{visc}}/k_{\mathrm{visc}} \sim 1$.

In the inertial range, where $k \ll k_{\mathrm{visc}}$, since the dissipation
and energy decay terms are small, the inertial
transfer gradient term is very small, i.e., $\partial\Pi/\partial k \ll u'^3$,
from which $\Pi(k) \approx \text{const.} = \Pi_I$. By integrating
the spectrum equation over the inertio-viscous range
between $k_{\mathrm{visc}}$ and $\infty$, it follows that the mean dissipation
over all wave numbers is equal to $\Pi_I$, i.e.,
$\int_0^\infty \varepsilon(k)\,dk = \langle \varepsilon \rangle = \Pi_I$.

These equations define the viscous or micro-length
scales in terms of Re, since $\langle \varepsilon \rangle \sim 1$. Thus,
$\ell_{\mathrm{visc}} = 1/k_{\mathrm{visc}} \sim \mathrm{Re}^{-3/4}$, which may be 0.1% of the integral
scales $L_x$ for large-scale or high-speed flows. From this,
we infer that the typical eddy velocities in the viscous
range are $u_{\mathrm{visc}} \sim (\langle \varepsilon \rangle/\mathrm{Re})^{1/4} \sim \mathrm{Re}^{-1/4} u'$.
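
These order-of-magnitude estimates are easy to evaluate. The sketch below takes all unknown prefactors as one and measures lengths and velocities in units of $L_x$ and $u'$, both of which are assumptions of the example rather than results from the text.

```python
def viscous_scales(re, u_rms=1.0, L_x=1.0):
    """Order-of-magnitude viscous length and velocity scales, with all
    prefactors taken as 1 (an assumption for illustration)."""
    ell_visc = L_x * re ** -0.75   # ell_visc ~ Re^(-3/4) L_x
    u_visc = u_rms * re ** -0.25   # u_visc ~ Re^(-1/4) u'
    return ell_visc, u_visc

ell_visc, u_visc = viscous_scales(1.0e4)

# Local Reynolds number of the viscous-range eddies, Re * u_visc * ell_visc.
local_re = 1.0e4 * u_visc * ell_visc
```

At $\mathrm{Re} = 10^4$ the viscous length comes out at one thousandth of the integral scale, matching the 0.1% figure quoted above, and the local Reynolds number of these eddies is of order one.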

At very high Reynolds number, highly dissipative
fluctuations ($\varepsilon \gg \langle \varepsilon \rangle$) and large velocities of order
$u'$ are generated intermittently on the length scale
$\ell_{\mathrm{visc}}$ within thin shear layers of thickness $\ell$. But they
contribute little to the overall values of $C_{11}$ and $E(k)$.

Kolmogorov and Obukhov used this statistical-physics
analysis (inspired by the poetic description of eddy
motions and downscale energy transfer by Richardson)
to hypothesize that the cross-correlation $C_{11}(r)$ of fluctuating
velocities $u_1$ over small distances in the inertial
range where $\ell_{\mathrm{visc}} \ll r \ll L_x$ is determined by $\Pi_I = \langle \varepsilon \rangle$.
From this, using dimensional analysis (or local scaling)
we obtain $\hat{C}_{11}(r) = \langle u_1^2 \rangle - C_{11}(r) = \alpha \langle \varepsilon \rangle^{2/3} r^{2/3}$.

Experiments show that the coefficient $\alpha \sim 1.0$. This
similarity extends to correlations of third moments of
velocity differences. Consequently, for these scales the
energy spectrum $E(k)$ in the inertial range also has
a self-similar form, $E(k) = \alpha_k \langle \varepsilon \rangle^{2/3} k^{-5/3}$, where the
coefficient $\alpha_k = \alpha/3$.

The statistics of the random displacements of fluid
particles (released at time $t_0$ from $X_1 = 0$) in the
$x_1$-direction determine how particles are dispersed, and
the distance between them increases with time $t - t_0$.
Since $dX_1/dt = u_1(t; X_1(t_0) = 0)$, the mean square
displacement is

\[ \langle X_1^2 \rangle = 2 \int_0^{t - t_0} (t - t_0 - \tau)\,C^{L}(\tau)\,d\tau. \]

Properties of the auto-correlation show that the initial
dispersion at small time depends on the whole spectrum,
i.e., for $(t - t_0) \ll T_L$, $\langle X_1^2 \rangle \sim u'^2 (t - t_0)^2$.
Later, large scales determine the dispersion (as in any
diffusion process):

\[ \langle X_1^2 \rangle \sim 2u'^2 T_L (t - t_0). \tag{6} \]
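
The two limits can be illustrated numerically with G. I. Taylor's classical expression for single-particle dispersion, $\langle X_1^2 \rangle = 2\int_0^{t}(t - \tau)\,C(\tau)\,d\tau$ with $t_0 = 0$. The sketch below assumes an exponential Lagrangian auto-correlation, $C(\tau) = u'^2 \exp(-\tau/T_L)$. That model, the function name, and the parameter values are assumptions for illustration only.

```python
import math

def mean_square_displacement(t, u_rms, T_L, steps=20000):
    """Midpoint-rule evaluation of 2 * integral_0^t (t - tau) C(tau) dtau
    for the assumed model C(tau) = u_rms^2 * exp(-tau / T_L)."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h
        total += (t - tau) * u_rms ** 2 * math.exp(-tau / T_L)
    return 2.0 * total * h

u_rms, T_L = 1.0, 1.0
short = mean_square_displacement(0.01, u_rms, T_L)    # compare with u'^2 t^2
long_ = mean_square_displacement(200.0, u_rms, T_L)   # compare with 2 u'^2 T_L t
```

For times much shorter than $T_L$ the result follows the ballistic form $u'^2 t^2$, and for times much longer than $T_L$ it approaches the diffusive form (6).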

The rate of spreading of pairs of particles released at
$t = t_0$ and located at $X^{(a)}(t)$ and $X^{(b)}(t)$ is $d(\Delta X)/dt$,
where $\Delta X = |X^{(a)} - X^{(b)}|$. This rate is mostly determined
by the eddy motions on the scale of the separation
distance $\Delta X$. Thus, in the inertial range scales,
i.e., $\ell_{\mathrm{visc}} < \Delta X < L_x$, $d(\Delta X)/dt \sim \langle \varepsilon \rangle^{1/3} (\Delta X)^{1/3}$, from
which $\langle (\Delta X)^2 \rangle \sim \langle \varepsilon \rangle (t - t_0)^3$. Over longer time intervals
and larger separations the particle motions become
decorrelated. Then $\langle (\Delta X)^2 \rangle$ is given by twice the single
particle dispersion in (6). In this random eddying,
some pairs of particles stay close together over considerable
distances; this explains why odors can often be
detected over large distances.
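
The cubic growth of the pair separation in the inertial range follows from the rate estimate by separating variables. As a short worked step, assuming the initial separation is negligible compared with the separation at time $t$:

```latex
\frac{d(\Delta X)}{dt} \sim \langle\varepsilon\rangle^{1/3} (\Delta X)^{1/3}
\;\Longrightarrow\;
(\Delta X)^{-1/3}\, d(\Delta X) \sim \langle\varepsilon\rangle^{1/3}\, dt
\;\Longrightarrow\;
\tfrac{3}{2} (\Delta X)^{2/3} \sim \langle\varepsilon\rangle^{1/3} (t - t_0)
\;\Longrightarrow\;
(\Delta X)^{2} \sim \langle\varepsilon\rangle\, (t - t_0)^{3}.
```

The numerical factor $(2/3)^3$ that appears on cubing is dropped, since the estimate is only an order-of-magnitude scaling.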

6 Physical and Computational Aspects 
of Inhomogeneous Turbulence 

Most types of inhomogeneous turbulent flows in engineering
and geophysics, as well as in laboratory experiments,
are influenced by rigid or flexible boundaries,
or by interfaces between different regions of turbulent
flow, such as those at the edges of wakes, jets, and
shear layers. The overall width of each of these flows
is denoted by $\Lambda$. There may or may not be a gradient of
the mean velocity, e.g., where $\partial U_1/\partial x_3 \neq 0$.

The eddy motions and dynamics in these flows can 
be represented schematically on a “state-space” map of 
relative timescales and length scales, as a guide to mod- 
els, statistical analyses, and computational methods. 

The x-axis of the map, $\hat{T}$, is the ratio of the Lagrangian
timescale $T_L$ to the imposed distortion time $T_{\mathrm{dis}}$
from when turbulence is generated or distorted. The
y-axis is $\hat{L}$, the ratio of the spatial integral scale $L_x$ to the
overall width $\Lambda$.

When $\hat{T}$ and $\hat{L}$ are significantly less than unity, the
turbulence is in local equilibrium and is determined by
local dynamics, and by slow, large-scale random forcing
such as that in engines.

In free shear flows (jets, wakes, and shear layers)
the turbulence lies between these limits in the
quasi-equilibrium zone, where $\hat{T}$ and $\hat{L}$ are both of order one,
even though the energy and the eddy structure may be
changing significantly.

Models of the mean flow $U_i(x_j, t)$ and variances or
eddy shear stress derived from $\langle u_i u_j \rangle$ of such turbulent
flows are based on the momentum equation (1) and
on approximate relations between $U_i$ and $\langle u_i u_j \rangle$. For
quasi-equilibrium turbulence, $\langle u_i u_j \rangle$ is proportional to
the gradients of the mean velocity,

\[ -\langle u_i u_j \rangle = \nu_e \left(\frac{\partial U_i}{\partial x_j} + \frac{\partial U_j}{\partial x_i}\right) \quad (i \neq j), \]

where $\nu_e$ is the eddy viscosity, whose order of magnitude
is $L_x \sqrt{K}$, where the kinetic energy and length scale
are determined by physical arguments, (3) and (5).

In some flows, $L_x$ is prescribed. The mean flow and $K$
are then derived from the coupled equations for $U_i$ and
$K$ and from initial and boundary conditions. Generally,
however, $L_x$ is calculated from (5) using the approximate
equation for $\langle \varepsilon \rangle$ in terms of $U$ and $K$. Then $U_i$, $K$,
and $\langle \varepsilon \rangle$ are calculated using up to twelve coupled partial
differential equations. This is the widely used $K$-$\varepsilon$
computational method.

In nonequilibrium turbulent flows, or in inhomogeneous
regions of turbulent flows (such as near obstacles)
when $\hat{T} > 1$ and/or $\hat{L} > 1$, quasilinear
rapid-distortion analysis or time-dependent numerical solutions
demonstrate the sensitive effects of turbulence
structure.

Now consider the basic types of inhomogeneous turbulence.
In a uniform shear flow when $U(x_3, t) = U_0 + S x_3$,
the energy equation (3) reduces to $\{A\} = \{P\} - \{\varepsilon\}$,
because the turbulence is quasihomogeneous, i.e., $\hat{L} \ll 1$.
When $S \gg u'/L_x$, the turbulent energy $u'^2$ and the
length scale $L_x$ both increase with time. Generally, the
effect of dissipation is to reduce the growth rate.

The mean shear distorts the eddies into elongated
accelerating and decelerating “streaks,” with length
comparable to $L_x$. The shear also limits the length scale
$L_x^{(3)}$ of the vertical fluctuations and determines the
momentum transport across the shear. Thus, following
Prandtl’s concept, the eddy viscosity $\nu_e \sim u' L_x^{(3)} \sim u'^2/S$.
The spectra for these anisotropic flows are distorted
but broadly similar to the spectra for homogeneous
turbulence.

Shear free boundary layers occur where velocity fluctuations,
with root mean square velocity $u'$ and length
scale $L_x$, vary in the $x_3$-direction and the mean velocity
gradient, $\partial U_1/\partial x_3$, is zero. These layers occur in
stirred or heated flows near rigid surfaces or near
interfaces between gas and liquid flows. Here, eddy
transport and forcing balance finite dissipation, i.e.,
$0 = \{E_T\} + \{F\} - \{\varepsilon\}$. The normal component of the
turbulence is blocked by the interface at $x_3 = 0$, i.e.,
$L_x$ is reduced and $u_3' \sim \langle \varepsilon \rangle^{1/3} x_3^{1/3}$. Very close to
resistive interfaces, thin viscous or roughness sublayers
form with thickness $\ell_s$, so that the length scale
$L_x^{(3)} \sim (x_3 - \ell_s)$.

At the edges of clouds, jets, etc., randomly moving
thin interfacial shear layers with thickness $\ell_i$ form,
centered, for example, at $x_3 = z_i(x_1, x_2, t)$. The local
energy balance, $0 = \{P\} + \{E_T\} - \{\varepsilon\}$, shows that the
local gradients of turbulence amplify the tangential
components of vorticity while keeping the thickness
of the layer much less than the integral scale $L_x$.

Turbulent shear boundary layers with thickness $\Lambda$
flow over resistive surfaces at $x_3 = 0$, where
$U_1 = u_i = 0$. Outside the boundary layers, there are uniform
flows with velocity $U_0$ in the $x_1$-direction. In the surface
layer, where $x_3 \ll \Lambda$, the eddy structure is similar
to that in uniform shear flow. But the energy equation
reduces to a balance between production and dissipation,
$0 = \{P\} - \{\varepsilon\}$, because the energy here is not
increasing ($\{A\} = 0$). As in shear free boundary layers,
the blocking effect of the viscous/rough surface
determines the length scale $L_x^{(3)} \sim (x_3 - \ell_s)$. Using
this length scale and the estimated relations between
characteristic velocity fluctuations $u_*$ near the surface
(which are of order $u' \sim \sqrt{K}$), the gradient of the mean
velocity is derived, namely, $\partial U_1/\partial x_3 = u_*/(\kappa x_3)$,
where Kármán’s constant coefficient, $\kappa \approx 0.4$, depends
on the general relation between $u_*$, $x_3$, and $\langle \varepsilon \rangle(x_3)$.
Here, $u_*^2$ is equal to the mean shear stress at the
surface.

Integrating from the sublayer, where the boundary condition at
the resistive surface is $U_1 \to 0$ as $x_3 \to 0$, leads to the
Kármán-Prandtl profile outside the viscous sublayer:

    $U_1(x_3) \sim \frac{u_*}{\kappa} \log\left(\frac{x_3}{\ell_s}\right)$ for $x_3 > \ell_s$,

where $\ell_s \sim (\mathrm{Re}^{-1})/u_*$.

Close to a smooth surface, where $x_3 \ll \ell_s$, the turbulence
is damped and the profile is determined by viscous stresses
only. The profile at the bottom of the viscous sublayer is

    $U_1(x_3) = x_3 \,\mathrm{Re}\, u_*^2$, where $\ell_s > x_3 > 0$.

In the upper part of the turbulent shear boundary layer the
eddies are affected by the mean shear flow and by interactions
with the thin shear layer of the undulating interface. The mean
“jump” velocity across the interface, $\Delta U$, is found to be
of the order of $u'$, as occurs in interfacial shear layers. When
averaged, the mean velocity profile is a smooth adjustment from
the surface layer to the free stream velocity, where
$x_3 \sim \Lambda$.

Evidently, different types of thin layers affect the structure
of most turbulent flows and, therefore, their sensitivity to
factors such as buoyancy, or changes to the external mean flow.

Further Reading 

Davidson, P. A. 2004. Turbulence: An Introduction for Scien-
Tennekes, H., and J. L. Lumley. 1972. A First Course in

Part VI
Example Problems 


VI.1 Cloaking

Kurt Bryan and Tanya Leise 


1 Imaging 

An object is cloaked if its presence cannot be detected 
by an observer using electromagnetic or other forms 
of imaging. In fact, the observer should not notice that 
cloaking is even occurring. Cloaking has a long history 
in science fiction, but recent developments have put the 
idea on a firmer mathematical and physical basis. 

Suppose that an observer seeks to image some bounded region
$\Omega$ in space. This is to be accomplished by injecting energy
into $\Omega$ from the outside, then observing the response: that
is, the energy that comes out of $\Omega$. An example of this
process is radar: the injected energy consists of
electromagnetic waves and the observed response consists of the
waves reflected by objects in the region. Many other types of
energy can be used to form images, for example, acoustic
(sonar), mechanical, electrical, or thermal. In each case energy
is injected into $\Omega$, response data is collected, and from
this information an image may be formed by solving an inverse
problem. To cloak an object the relevant physical properties of
$\Omega$ must be altered so that the energy “flows around” the
object, as if the object were not there. The challenge is to do
this in a way that is mathematically rigorous and physically
implementable. One successful approach to cloaking is based on
the idea of transformation optics, which we will now discuss in
the context of impedance imaging.

In impedance imaging, the bounded region $\Omega \subset \mathbb{R}^n$
to be imaged consists of an electrically conductive medium, with
$n = 2$ or $n = 3$. Let the vector $x$ represent position in a
Cartesian coordinate system, let $u = u(x)$ represent the
electrical potential inside $\Omega$, and let $J = J(x)$ represent
the electric current flux in $\Omega$. We assume a linear relation
$J = -\sigma\nabla u$ (a form of Ohm’s law), where for each $x$ the
quantity $\sigma = \sigma(x)$ is a symmetric positive-definite
$n \times n$ matrix. The matrix $\sigma$ is called the conductivity
of $\Omega$ and dictates how variations in the potential induce
current flow. If $\sigma = \gamma I$ for some scalar function
$\gamma(x) > 0$ ($I$ is the $n \times n$ identity matrix), then the
conductivity is said to be isotropic: there is no preferred
direction for current flow. Otherwise, $\sigma$ is said to be
anisotropic. In impedance imaging the goal is to recover the
internal conductivity of $\Omega$ using external measurements.

Specifically, to image a region $\Omega$, the observer applies an
electric current flux density $g$ with
$\int_{\partial\Omega} g \, ds = 0$ to the boundary $\partial\Omega$.
If charge is conserved inside $\Omega$, then $\nabla \cdot J = 0$ in
$\Omega$ and the potential $u$ satisfies the boundary-value
problem

    $\nabla \cdot \sigma \nabla u = 0$ in $\Omega$,   (1)

    $(\sigma \nabla u) \cdot n = g$ on $\partial\Omega$.   (2)

Here $n$ denotes an outward unit normal vector field on
$\partial\Omega$, and we assume that each component of the matrix
$\sigma$ is suitably smooth. The boundary-value problem (1), (2)
has a solution $u$ that is uniquely defined up to an arbitrary
additive constant. This solution depends on $g$, of course, but
also on the conductivity $\sigma(x)$. By applying different
current inputs $g$ and measuring the resulting potential
$f = u|_{\partial\Omega}$, an observer builds up information about
the conductivity $\sigma$. If $\sigma = \gamma I$ is isotropic, then
knowledge of the response $f$ for every input $g$ uniquely
determines $\gamma(x)$; that is, the observer can “image” an
isotropic conductivity with this type of input current/measured
voltage data. However, a general anisotropic conductivity cannot
be uniquely determined from this type of data, opening the way
for cloaking.

2 Cloaking: First Ideas 

Suppose a region $\Omega$ has isotropic conductivity $\sigma = I$,
so $\gamma = 1$. We wish to hide an object with different
conductivity inside $\Omega$ in such a way that the object is
invisible to an observer using impedance imaging. One approach
is to remove the conductive material from some subregion
$D \subset \Omega$, where $D \cap \partial\Omega = \emptyset$, thereby
creating a nonconductive hole for hiding an object. For example,
take $D$ to be a ball $B_\rho(p)$ of radius $\rho$ centered at
$p \in \Omega$. In this case the boundary-value problem (1), (2)
must be amended with the boundary condition
$\nabla u \cdot n = 0$ on $\partial D$, since current cannot flow
over $\partial D$. Unfortunately, this means that the hole $D$ is
likely to be visible to the observer, for the region
$\Omega \setminus D$ typically yields a different potential
response on $\partial\Omega$ than does the region $\Omega$. The
difference in input-response mappings for $\Omega$ versus
$\Omega \setminus B_\rho(p)$ grows like $O(\rho^n)$ with the radius
$\rho$ of the hole (measured via an operator norm). The
visibility of the hole is therefore proportional to its area or
volume. To hide something nontrivial, we need $\rho$ to be large,
but the observer can then easily detect it.
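The $O(\rho^n)$ growth can be checked in the one case for which this article quotes a closed form: the caption of figure 1 gives the boundary potential $f(\theta) = (\cos\theta + \sin\theta)(1 + \rho^2)/(1 - \rho^2)$ for the unit disk with a concentric insulated hole and input $g(\theta) = \cos\theta + \sin\theta$. The following is a minimal sketch under that assumption only; the function names are illustrative, and a single input mode is not the operator norm mentioned in the text.

```python
def boundary_gain(rho: float) -> float:
    """Ratio f/g on r = 1 for g = cos(theta) + sin(theta) on the unit disk
    with a concentric nonconductive hole of radius rho (figure 1 caption)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return (1.0 + rho**2) / (1.0 - rho**2)


def visibility(rho: float) -> float:
    """Deviation of the gain from its empty-disk value (rho = 0)."""
    return boundary_gain(rho) - boundary_gain(0.0)


if __name__ == "__main__":
    for rho in (0.5, 0.1, 0.01):
        dev = visibility(rho)
        print(f"rho = {rho:5.2f}  deviation = {dev:.6f}  "
              f"deviation / rho^2 = {dev / rho**2:.4f}")
```

For this mode the deviation equals $2\rho^2/(1 - \rho^2)$, so the printed ratio approaches 2 as $\rho$ shrinks, consistent with $O(\rho^n)$ for $n = 2$.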

3 Cloaking via a Transformation 

Let us consider the special case in which $\Omega$ is the unit
disk in $\mathbb{R}^2$. We will show how to make a large
nonconductive hole in $\Omega$, say a hole $B_{1/2}(0)$ of radius
$\tfrac12$ centered at the origin, essentially undetectable to an
observer. We do this by “wrapping” the hole with a carefully
designed layer of anisotropic conductor. Let $\Omega_\rho$ denote
$\Omega \setminus B_\rho(0)$ and let $r$ denote distance from the
origin. Choose $\rho \in (0, \tfrac12)$ and let $\phi$ be a smooth
invertible mapping from $\Omega_\rho$ to $\Omega_{1/2}$ with smooth
inverse, with the properties that $\phi$ maps $r = \rho$ to
$r = \tfrac12$ and $\phi$ fixes a neighborhood
$\tfrac12 < \rho_0 < r \le 1$ of the outer boundary $r = 1$. Such
mappings are easily constructed. Let $y = \phi(x)$ and define a
function $v(y) = u(\phi^{-1}(y))$ on $\Omega_{1/2}$. The function
$v$ satisfies the boundary-value problem

    $\nabla \cdot \sigma \nabla v = 0$ in $\Omega_{1/2}$,

    $\nabla v \cdot n = g$ on $\partial\Omega$,

    $(\sigma \nabla v) \cdot n = 0$ on $r = \tfrac12$,

where $\sigma$ is the symmetric positive-definite matrix

    $\sigma(y) = \frac{D\phi(x)\,(D\phi(x))^T}{|\det(D\phi(x))|}$

and $D\phi$ is the Jacobian of $\phi$. Also, because $\phi$ fixes
$r = 1$, we have $v = u$ on the outer boundary.

Figure 1 Comparison of flow lines of the current
$J = -\sigma\nabla u$ on annuli $\Omega_\rho \subset \mathbb{R}^2$,
with input flux
$g(\theta) = (\sigma\nabla u) \cdot n|_{r=1} = \cos\theta + \sin\theta$
(in polar coordinates). (a) Empty disk. (b) Uncloaked
$\Omega_{1/10}$. (c) Uncloaked $\Omega_{1/2}$. (d) Cloaked
$\Omega_{1/2}$. The regions in (a)-(c) have isotropic conductivity
1, while the near-cloaked annulus in (d) has conductivity
$\sigma$ given by the transformation described in section 3. For
(a), (b), and (c), the resulting boundary potential is
$f(\theta) = (\cos\theta + \sin\theta)(1 + \rho^2)/(1 - \rho^2)$,
with $\rho = 0$, $\rho = \tfrac{1}{10}$, and $\rho = \tfrac12$,
respectively, so that these regions will be distinguishable to
an observer. In (d), the near-cloaked annulus has anisotropic
conductivity $\sigma$ corresponding to $\rho = \tfrac{1}{10}$, for
which the mapping between $f$ and $g$ is identical to that in
(b).

The quantity $\sigma$ may be interpreted as a conductivity
consisting of an anisotropic shell around the hole, but with
$\sigma = I$ near $\partial\Omega$. For any input current $g$, the
potential $v$ on the region $\Omega_{1/2}$ with conductivity
$\sigma$ has the same value on $r = 1$ as the potential $u_\rho$
on $\Omega_\rho$. If $\rho \approx 0$, then
$u_\rho - u_0 = O(\rho^2)$, where $u_0$ is the potential on the
region $\Omega$ with no hole. The anisotropic conductivity
$\sigma$ thus has the effect of making the hole $B_{1/2}(0)$
appear to be a hole of radius $\rho$, effectively cloaking the
larger hole if $\rho \approx 0$ (see figure 1). In the limit
$\rho \to 0^+$, the resulting cloaking conductivity is singular
but cloaks perfectly.
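To make the construction concrete, the sketch below evaluates the transformed conductivity $D\phi\,(D\phi)^T/|\det D\phi|$ numerically for one particular radial map. The map used here is an assumption made for illustration, not the one behind figure 1: it is piecewise linear in $r$ (so only continuous, not smooth, at $r = \rho_0$), and the values `RHO = 0.1` and `RHO0 = 0.75` are hypothetical.

```python
import numpy as np

RHO = 0.1    # hypothetical radius of the small hole that the observer "sees"
RHO0 = 0.75  # hypothetical radius beyond which phi is the identity


def h(r: float) -> float:
    """Radial profile: [RHO, RHO0] is stretched linearly onto [1/2, RHO0];
    for r >= RHO0 the map is the identity."""
    if r >= RHO0:
        return r
    return 0.5 + (RHO0 - 0.5) * (r - RHO) / (RHO0 - RHO)


def phi(x: np.ndarray) -> np.ndarray:
    """Radial map phi(x) = h(|x|) x / |x| from Omega_rho to Omega_{1/2}."""
    r = np.hypot(x[0], x[1])
    return h(r) * x / r


def jacobian(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Central-difference approximation of D phi at x."""
    jac = np.empty((2, 2))
    for j in range(2):
        step = np.zeros(2)
        step[j] = eps
        jac[:, j] = (phi(x + step) - phi(x - step)) / (2.0 * eps)
    return jac


def cloak_conductivity(x: np.ndarray) -> np.ndarray:
    """Conductivity at y = phi(x): D phi (D phi)^T / |det D phi|."""
    jac = jacobian(x)
    return jac @ jac.T / abs(np.linalg.det(jac))


if __name__ == "__main__":
    sigma = cloak_conductivity(np.array([0.3, 0.2]))
    print("sigma inside the shell:\n", sigma)
    print("eigenvalues:", np.linalg.eigvalsh(sigma))
    print("sigma near the boundary:\n", cloak_conductivity(np.array([0.9, 0.0])))
```

Inside the shell the two eigenvalues are reciprocal (in two dimensions the formula always gives $\det\sigma = 1$), which is the anisotropy the text describes; outside $r = \rho_0$ the sketch returns the identity, matching $\sigma = I$ near $\partial\Omega$.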

4 Generalizations 

The transformation optics approach to cloaking is not 
limited to impedance imaging. Many forms of imag- 
ing with physics governed by a suitable partial dif- 
ferential equation may be amenable to cloaking. The 
key is to define a suitable mapping $\phi$ from the region
containing something to be hidden to a region that looks
“empty.” Under the appropriate change of variable, one obtains a
new partial differential equation that describes how to cloak
the region, and the coefficients in this new partial
differential equation may possess a reasonable physical
interpretation (e.g., an anisotropic conductivity). The
transformation thus prescribes the physical properties that the
medium must possess to direct the flow of energy around an
obstacle and so cloak it. The construction of materials with the
required properties is an active and very challenging area of
research.

Further Reading 

Bryan, K., and T. Leise. 2010. Impedance imaging, inverse
problems, and Harry Potter's cloak. SIAM Review 52:359-77.

Greenleaf, A., Y. Kurylev, M. Lassas, and G. Uhlmann. 2009. 
Cloaking devices, electromagnetic wormholes, and trans- 
formation optics. SIAM Review 51:3-33. 

Kohn, R. V., H. Shen, M. S. Vogelius, and M. I. Wein- 
stein. 2008. Cloaking via change of variables in electric 
impedance tomography. Inverse Problems 24:1-21. 

VI.2 Bubbles 

Andrea Prosperetti 

Ordinarily, the term “bubble” designates a mass of gas 
and/or vapor enclosed in a different medium, most 
often a liquid. Soap bubbles and the bubbles in a 
boiling pot are very familiar examples, but bubbles 
occur in many, and very diverse, situations of great 
importance in science and technology: air entrainment 
in breaking waves, with implications for the acidifi- 
cation of the oceans; water oxygenation, with conse- 
quences for, for example, marine life and water purifi- 
cation; volcanic eruptions; vapor generation, e.g., for 
heat transfer, power generation, and distillation; cavi- 
tation and cavitation damage, e.g., in hydraulic machin- 
ery and on ship propellers; medicine, e.g., in decom- 
pression sickness (the “bends”), kidney stone fragmen- 
tation (lithotripsy), plaque removal in dentistry, blood 
flow visualization, and cancer treatment; beverage car- 
bonation; curing of concrete; bread making; and many 
others. There is therefore a very extensive literature on 
this subject across a wide variety of fields. 

It is often the case that liquid masses in an immiscible 
liquid are also referred to as bubbles, rather than, more 
properly, “drops.” This is more than a semantic issue as 
the most characteristic (and, indeed, defining) feature
of bubbles is their large compressibility. When a bub- 
ble contains predominantly vapor (i.e., a gas below 
its critical point), condensation and evaporation are 
strong contributors to volume changes, which can be 
so extreme as to lead to the complete disappearance 
of the bubble or, conversely, to its explosive growth. 
Bubbles containing predominantly an incondensible gas are
usually less compressible but still far more
so than the surrounding medium. Some unexpectedly 
large effects are associated with this compressibility 
because, in a sense, a bubble represents a singularity 
for the host liquid. 

To appreciate this feature one can look at the sim- 
plest mathematical model, in which the bubble is 
spherical with a time-dependent radius R(t) and is 
surrounded by an infinite expanse of incompressible 
liquid. The dynamics of the bubble volume is governed 
by the so-called Rayleigh-Plesset equation, which, after 
neglecting surface tension and viscous effects, takes 
the form 

    $R \frac{dU}{dt} + \frac{3}{2} U^2 = \frac{1}{\rho}(p_i - p_\infty)$.   (1)

Here, $U = dR/dt$, $\rho$ is the liquid density, and $p_i$,
$p_\infty$ are the bubble’s internal pressure and the ambient
pressure (i.e., the pressure far from the bubble). If these two
pressures can be approximated as constants, this equation has an
energy first integral, which, for a vanishing initial velocity,
is given by

    $\frac{dR}{dt} = \pm\left\{\frac{2\,|p_i - p_\infty|}{3\rho}\,\left|1 - \frac{R^3(0)}{R^3(t)}\right|\right\}^{1/2}$,

with the upper sign for growth ($p_i > p_\infty$, $R(0) \le R(t)$)
and the lower sign for collapse ($p_i < p_\infty$,
$R(0) \ge R(t)$). In this latter case, by the time that $R(t)$
has become much smaller than $R(0)$, this expression would
predict $U \propto -R^{-3/2}$, which diverges as $R(t) \to 0$. Of
course, many physical effects that are neglected in this simple
model (particularly the ultimate increase of $p_i$ but also
liquid compressibility, loss of sphericity, viscosity, and 
others) prevent an actual divergence from occurring, 
but, nevertheless, this feature is qualitatively robust 
and responsible for the unexpected violence of many 
bubble phenomena. 
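The $R^{-3/2}$ divergence can be seen numerically. The sketch below evaluates the first integral of (1) in the form $U^2 = \tfrac{2}{3}\,(p_i - p_\infty)\,\rho^{-1}\,(1 - R^3(0)/R^3)$, which follows from (1) for constant pressures and zero initial velocity. The numerical values (a 1 mm bubble in water with an internal pressure of 2300 Pa) are hypothetical and chosen only for illustration.

```python
import math


def wall_speed(R: float, R0: float, p_i: float, p_inf: float, rho: float) -> float:
    """Signed dR/dt from the energy first integral of (1), assuming constant
    pressures and zero initial velocity: positive for growth, negative for collapse."""
    dp = p_i - p_inf
    magnitude = math.sqrt(2.0 * abs(dp) / (3.0 * rho) * abs(1.0 - (R0 / R) ** 3))
    return math.copysign(magnitude, dp)


if __name__ == "__main__":
    # Hypothetical values: initial radius, internal and ambient pressure, density.
    R0, p_i, p_inf, rho = 1.0e-3, 2300.0, 101325.0, 998.0
    for fraction in (0.5, 0.1, 0.01, 0.0025):
        U = wall_speed(fraction * R0, R0, p_i, p_inf, rho)
        print(f"R/R0 = {fraction:7.4f}   U = {U:12.2f} m/s")
```

Once $R \ll R(0)$, reducing the radius by a factor of 4 multiplies the speed by very nearly $4^{3/2} = 8$, which is the scaling quoted in the text. The very large speeds printed at the smallest radii are an artifact of the idealized model, for the reasons the text goes on to list.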


The approximation $p_i \simeq$ const. is reasonable for much of
the lifetime of a bubble that contains mostly a low-density
vapor. A frequently used model for an incondensible gas bubble
assumes a polytropic pressure-volume relation of the form
$p_i \times (\mathrm{volume})^\kappa$ = const., with $\kappa$ a number
between 1 (for an isothermal bubble) and the ratio of the gas
specific heats (for an adiabatic bubble). The internal pressure
$p_i$ in the simple model (1) is then replaced by
$p_i = p_{i0}(R_0/R)^{3\kappa}$ for some reference values $R_0$
and $p_{i0}$.


The interaction of bubbles with pressure disturbances, such as
sound, is often of interest, e.g., in underwater sound
propagation, medical ultrasonics, ultrasonic cleaning, and other
fields. Provided the wavelength is much larger than $R$ (which
is the case in the vast majority of situations of interest), the
effect of a sound field with frequency $f$ can be accounted for
in the model (1) by setting
$p_\infty = p_0 + p_A \cos(2\pi f t)$, where $p_0$ is the static
pressure and $p_A$ the sound pressure amplitude at the bubble
location. With this adaptation and $p_i$ given by the polytropic
model, (1) describes strongly nonlinear oscillations and gives
an explanation of the richness of the acoustic spectrum
scattered by bubbles. In the limit of small-amplitude
oscillations, it is easy to deduce from (1) an expression for
the resonance frequency $f_0$ of the bubble, namely,
$2\pi f_0 R_0 = \sqrt{3\kappa p_0/\rho}$, which, for an air-water
system at normal pressure, gives approximately
$f_0 R_0 \simeq 3$ kHz × mm.
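The quoted figure of about 3 kHz × mm can be reproduced directly from $2\pi f_0 R_0 = \sqrt{3\kappa p_0/\rho}$. The sketch below assumes illustrative property values that are not given in the text: $p_0 = 101325$ Pa, $\rho = 998$ kg/m³ for water, and $\kappa = 1.4$ (adiabatic air) or $\kappa = 1$ (isothermal).

```python
import math


def resonance_frequency(R0: float, p0: float = 101325.0,
                        rho: float = 998.0, kappa: float = 1.4) -> float:
    """Small-amplitude resonance frequency f0 (Hz) of a gas bubble of
    equilibrium radius R0 (m), from 2*pi*f0*R0 = sqrt(3*kappa*p0/rho)."""
    return math.sqrt(3.0 * kappa * p0 / rho) / (2.0 * math.pi * R0)


if __name__ == "__main__":
    R0 = 1.0e-3  # a 1 mm bubble, chosen for illustration
    for kappa in (1.0, 1.4):
        f0 = resonance_frequency(R0, kappa=kappa)
        print(f"kappa = {kappa:.1f}:  f0 = {f0 / 1e3:.2f} kHz,  "
              f"f0 * R0 = {f0 * R0:.2f} kHz x mm")
```

Both limits of $\kappa$ give a product $f_0 R_0$ close to 3 kHz × mm, so the rule of thumb is insensitive to the thermal behavior of the gas.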

The gas in an oscillating gas bubble alternately cools 
and heats up. The phase mismatch between the result- 
ing heat exchanges with the liquid and the driving 
sound pressure is responsible for major energy losses, 
which, for bubbles larger than a few microns, domi- 
nate over other losses (e.g., those due to a moderate 
viscosity or sound reradiation). This energy loss is the 
cause of the strikingly dull sound that a glass full of 
carbonated beverage emits when struck with a solid 
object. If the sound intensity (e.g., that produced by a 
transducer in a resonant system) is sufficiently strong, 
then the increase in the temperature of the gas can 
be so large as to give rise to exotic chemical reactions 
(sonochemistry) and even to the generation of a plasma, 
which emits brief flashes of light synchronous with the 
periodic collapses of the bubble (sonoluminescence). 

Bubbles containing predominantly vapor can form 
due to a temperature increase, as in boiling, but also 
as a result of a lowering of local pressure, as in cav- 
itation. Such bubbles are much more unstable than 
bubbles that contain a significant amount of inconden- 
sible gas. They may keep growing and be ultimately 
removed by buoyancy, as in fully developed boiling, or 
they might violently collapse as soon as they encounter 
cooler liquid (hence the hissing noise of a pot that is 
just about to boil) or the pressure recovers. 

While the spherical model is useful to understand 
many aspects of bubble phenomena, in many cases it is 
an oversimplification. For example, the spherical shape 
can be unstable when a bubble compresses because sur- 
face perturbations grow in amplitude as they are con- 
fined into a smaller and smaller surface or as the bubble 
contents begin to push out to limit the collapse velocity, thus
giving rise to a Rayleigh-Taylor unstable situation. Spherical
instability is strongly favored in the neighborhood of a solid;
the collapsing bubble develops an involuted shape with a liquid
jet that traverses
the bubble and is directed against the solid surface. 
The velocity of the jets thus formed can reach several 
tens of meters per second and can contribute strongly 
to the effects of cavitation, both undesirable (metal 
fatigue and failure, vibration, noise) and desirable (den- 
tal plaque removal, kidney stone comminution, clean- 
ing of jewelry and small electronic components). 

Surface tension is the physical process that tends 
to keep bubbles spherical against the action of other 
agents, such as the instabilities just mentioned, but 
also gravity and translational motion, as in the case of 
the buoyant rise of bubbles in a liquid. Deformation 
due to gravity is often quantified by the Eötvös number
$\mathrm{Eo} = \rho d^2 g/\sigma$ (where $d$ is the diameter of a
sphere of equal volume; $g$ is the acceleration due to gravity;
and $\sigma$ is the surface tension coefficient), which expresses
the balance between gravity and surface tension. The effect of
translation is quantified by the Weber number
$\mathrm{We} = \rho u^2 d/\sigma$ (where $u$ is the translational
velocity with respect to the liquid), which expresses the
balance between inertia and surface tension. It is also a
common observation that bubbles often fail to ascend 
along a rectilinear path: the flattening of the bubble into 
a roughly ellipsoidal shape causes a loss of stability 
of the rectilinear trajectory, a phenomenon known as 
Leonardo’s paradox. 
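As a rough numerical illustration of the two dimensionless groups, the sketch below evaluates $\mathrm{Eo} = \rho d^2 g/\sigma$ and $\mathrm{We} = \rho u^2 d/\sigma$ for an air bubble in water. All input values ($\rho = 998$ kg/m³, $\sigma = 0.072$ N/m, $d = 2$ mm, $u = 0.25$ m/s) are hypothetical and serve only to show the orders of magnitude involved.

```python
def eotvos_number(rho: float, d: float, sigma: float, g: float = 9.81) -> float:
    """Eo = rho * d**2 * g / sigma: gravity relative to surface tension."""
    return rho * d**2 * g / sigma


def weber_number(rho: float, u: float, d: float, sigma: float) -> float:
    """We = rho * u**2 * d / sigma: inertia relative to surface tension."""
    return rho * u**2 * d / sigma


if __name__ == "__main__":
    rho, sigma = 998.0, 0.072   # assumed liquid density and surface tension
    d, u = 2.0e-3, 0.25         # assumed bubble diameter and rise velocity
    print(f"Eo = {eotvos_number(rho, d, sigma):.3f}")
    print(f"We = {weber_number(rho, u, d, sigma):.3f}")
```

Because Eo grows like $d^2$, doubling the diameter quadruples the relative importance of gravity, which is why larger bubbles depart more visibly from a spherical shape.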

In a boiling pot or in a champagne glass, bubbles 
are often seen to originate from preferred spots rather 
than randomly over the entire surface of the container 
or in the bulk of the liquid. The reason for this phe- 
nomenon is that intermolecular forces (the origin of the 
macroscopic surface tension) make it very difficult to 
nucleate (i.e., expand) a bubble starting from a “molec- 
ular hole.” In the vast majority of cases bubbles orig- 
inate from preexisting micron-size nuclei, mostly con- 
sisting of gas pockets stabilized somehow, either on 
floating particles or, more frequently, on the surface 
of the container. The additional pressure necessary to 
expand these nuclei to visible size against surface tension
causes, for example, the boiling incipience of water at
atmospheric pressure to occur at a temperature a few degrees
higher than 100 °C or a carbonated beverage not to rush in a
foamy mess out of a newly opened can
(provided it has not been shaken beforehand, an action 
that injects a multitude of nuclei into the liquid). 

When a bubble detaches from a nozzle, or the jet in 
a bubble collapsing near a wall strikes the other side of 
the bubble surface, or a bubble is entrapped in a liquid
due to a surface disturbance (e.g., an impacting drop, a
breaking wave), the liquid surface undergoes a topolog- 
ical change that, in a macroscopic sense, amounts to a 
mathematical singularity. Processes of this type repre- 
sent a significant numerical challenge in computational 
modeling. 

In many cases bubbles occur in groups, or clouds, 
which, for certain purposes, can be considered as turn- 
ing the liquid into a homogeneous medium with effec- 
tive properties different from those of the pure liquid. 
Examples are the bubble clouds produced by a break- 
ing wave, which endow the liquid mass in which they 
reside with a compressibility much larger than that of 
the surrounding bubble-free liquid. As a consequence, 
these clouds can execute volume pulsations that result 
in a significant low-frequency (from about 50 Hz to over 
1000 Hz) ambient noise in the ocean. Clouds of bub- 
bles deliberately injected at the bottom of ponds or in 
some industrial processes (e.g., glass making, chemi- 
cal industry) are used to destratify and mix the liquid, 
or to greatly increase the gas-liquid contact area to 
facilitate the occurrence of chemical reactions. There 
is a significant body of literature on the application of 
homogenization and various statistical methods to the 
derivation of effective properties for such systems. 

Some types of flow cavitation are characterized by 
the formation of large vapor cavities that are periodi- 
cally shed and break up into bubble clouds when they 
are transported into regions of higher pressure. The 
collapse of these clouds proceeds by a cascade pro- 
cess that greatly enhances the destructive action of 
cavitation phenomena. 

The massive formation of vapor bubble clouds that 
can occur, for example, when a pressurized liquid- 
filled container is opened, when liquefied natural gas 
comes into contact with water (“cold explosion”), or 
when a liquid that is highly supersaturated with a gas 
is exposed to lower pressure (e.g., the CO2 eruption of
Lake Nyos, Cameroon, in 1986) can be a violent and 
devastating phenomenon. The mathematical modeling 
of these processes offers numerous challenges, many 
of which are still far from having been satisfactorily 
met. 

Further Reading 

Brenner, M. P., S. Hilgenfeldt, and D. Lohse. 2002. Single-
bubble sonoluminescence. Reviews of Modern Physics 74:425-84.

Leighton, T. G. 1994. The Acoustic Bubble. New York: Aca- 
demic Press. 


Plesset, M. S., and A. Prosperetti. 1977. Bubble dynamics and 
cavitation. Annual Review of Fluid Mechanics 9:145-85.

VI.3 Foams

Denis Weaire and Stefan Hutzler 


1 Introduction 

Ever since J. A. F. Plateau laid down the foundations 
of much of the theory of liquid foam structures in 
the middle of the nineteenth century, we have under- 
stood the problem in terms of the minimization of 
the total areas of soap films, under appropriate con- 
straints. The subject therefore relates to the theory of 
minimal surfaces, which poses important pure mathe- 
matical problems, one of which is an existence theorem 
for a single film, named in honor of Plateau himself. 
More practical problems deal with the complex disor- 
dered arrangements of bubbles, as in figure 1. Theory 
is often confined to static equilibrium or slowly varying 
(quasistatic) cases. 

Figure 1 is an idealized representation of a dry foam,
that is, one of very low liquid content, so that it con- 
sists entirely of thin films, represented by surfaces that 
meet symmetrically at 120°, three at a time, in lines 
(Plateau borders), as required for stable equilibrium. 
Furthermore, the lines themselves meet symmetrically, 
four at a time, at tetrahedral angles of 109.47°, another 
necessary condition for equilibrium and stability. 

A foam may also be wet and have a finite liquid fraction
$\phi$. Generalizing the idealized model, all the liquid is
contained in the Plateau borders, whose cross section swells as
$\phi$ increases. At a maximum value of $\phi_c \approx 0.36$,
corresponding to the porosity of a random packing of spheres,
the bubbles become spherical (the wet limit), as in figure 2.

The main complication in determining or describing these
structures in detail is the awkward geometrical form of their
constituent surfaces. Each has constant total curvature (by the
law of Laplace and Young, $\Delta P = 2\gamma(C_1 + C_2)$, where
$\Delta P$ is the pressure difference between neighboring
bubbles, and $\gamma$ denotes the surface tension), but the two
principal curvatures $C_1$ and $C_2$ can vary. Only in very
special cases can the form of such a surface be captured by
explicit mathematical expressions. Computer simulation therefore
plays a strong role in the subject.

Figure 1 A computer simulation of a dry foam using Ken
Brakke’s Surface Evolver program. (This simulation was carried
out by Andy Kraynik and is reproduced with his kind
permission.)



Figure 2 Wet foams resemble sphere packings. (The exam- 
ple shown was computed by Andy Kraynik (Kraynik 2006) 
and is reproduced with kind permission of Wiley-VCH 
Verlag GmbH & Co. KGaA.) 



Figure 3 Surface Evolver represents the surfaces of a bub- 
ble as a triangular mesh that can be refined to a high degree. 
The example shown is the Weaire-Phelan structure for dry 
foam: a space-filling arrangement of equal-volume bubbles 
and minimal surface area. 

2 Simulation 

For more than two decades, Ken Brakke’s Surface 
Evolver has been the favorite method of simulation. 1 
It represents the surfaces as tessellations (figure 3) 
and can deal with a variety of constraints and refine- 
ments. Samples of thousands of bubbles have been sim- 
ulated (figure 1), and many physical properties have 
been determined. These include the statistical proper- 
ties of the structure, elastic moduli, yield stress, and 
coarsening (the evolution that results from diffusion 
of gas between bubbles). 

3 Heuristic Models 

Simpler, more approximate models are often invoked to avoid the
heavy computational demands of the accurate Surface Evolver
simulations. Bubbles are often represented by overlapping
spheres (or circles, in two dimensions), with a repulsive force
associated with the overlap. Figure 4 shows the variation of
shear modulus against liquid fraction for a typical simulation.
Similar models are used in the theory of granular matter, and
the two subjects find common ground when foams are studied close
to the wet limit, when the bubbles may (paradoxically) be
described as hard spheres. A complex of fascinating questions
arises in that limit, generally gathered under the heading of
“jamming.”

1. Surface Evolver can be downloaded for free at
www.susqu.edu/brakke/evolver/evolver.html.

Figure 4 The variation of shear modulus with liquid fraction
(horizontal axis, running from dry to wet) for two-dimensional
random foam. The data points are results from bubble model
simulations for a number of samples; the solid line is a
least-squares fit. As the shear modulus approaches zero (at a
value of $\phi_c \approx 0.16$ in two dimensions), the foam loses
its rigidity and is better described as a bubbly liquid.

4 Continuum Models 

In the context of engineering applications, foams are 
often described by semi-empirical continuum repre- 
sentations, which may be justified to some extent by 
appeal to the detailed microscopic models that we have 
described. 

In rheology, the foam may be taken to be a continuous medium
with the characteristics of the Bingham or Herschel-Bulkley
models, which describe the foam as an elastic solid for stress
$S$ below a threshold (the yield stress $S_Y$). When subjected
to stress above this threshold, the foam flows according to

    $S = S_Y + c\,\dot{\gamma}^a$,   (1)

where $\dot{\gamma}$ is the strain rate, $c$ is called the
consistency, and the exponent $a$ may be determined from
simulations or experiments. In recent years such models of
non-Newtonian fluids have been the basis of debates on shear
localization. In a nonlinear system such as this, the response
to an imposed shear stress may be flow that is confined within a
shear band.
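The constitutive law (1) is simple to evaluate in either direction. The sketch below, with hypothetical parameter values ($S_Y = 10$ Pa, $c = 2$ Pa·s$^a$, $a = 0.5$) that are not taken from the text, returns the stress for a given strain rate and inverts the relation to give the strain rate for an imposed stress, with no flow below the yield stress.

```python
def flow_stress(strain_rate: float, yield_stress: float,
                consistency: float, exponent: float) -> float:
    """Stress in the flowing state: S = S_Y + c * strain_rate**a."""
    if strain_rate < 0.0:
        raise ValueError("strain_rate must be non-negative")
    return yield_stress + consistency * strain_rate**exponent


def strain_rate_for(stress: float, yield_stress: float,
                    consistency: float, exponent: float) -> float:
    """Strain rate produced by an imposed stress; zero below the yield stress,
    where the model treats the foam as an elastic solid."""
    if stress <= yield_stress:
        return 0.0
    return ((stress - yield_stress) / consistency) ** (1.0 / exponent)


if __name__ == "__main__":
    S_Y, c, a = 10.0, 2.0, 0.5  # hypothetical yield stress, consistency, exponent
    for S in (5.0, 10.0, 12.0, 20.0):
        print(f"S = {S:5.1f} Pa  ->  strain rate = {strain_rate_for(S, S_Y, c, a):7.3f} 1/s")
```

Setting $a = 1$ recovers the Bingham model as the special case of a linear dependence on strain rate above the yield stress.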

Figure 5 The merging of two solitary waves with respective
velocities $v_a$ and $v_b$ (shown here as a numerical solution of
the foam drainage equation) has also been seen in foam drainage
experiments.

For foam drainage (the passage of liquid through a foam under
gravity and pressure gradients), a partial differential equation
may be developed for the evolution of liquid fraction
$\phi(x, t)$ as a function of vertical position and time. In its
most elementary form, it has been called the foam drainage
equation, and it is given by

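    $\frac{\partial\phi}{\partial t} = \frac{\partial}{\partial x}\left(C_2 \sqrt{\phi}\,\frac{\partial\phi}{\partial x} - C_1 \phi^2\right)$,   (2)
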
where Ci and C 2 are parameters containing values for 
the viscosity, density, and surface tension of the liquid, 
the mean bubble diameter, and geometrical constants 
related to foam structure. 

Various elaborations have been advanced, e.g., taking into
account different boundary conditions at the gas-liquid
interfaces. In simple circumstances (one-dimensional flow), the
foam drainage equation has interesting analytic solutions, such
as a solitary wave propagating with velocity $v$.
The merging of two such waves is shown in figure 5.


5 More General Computational Models 

More recently, attention has turned to phenomena 
that require models and methods that go beyond the 
quasistatic regime. 

In 2013 Saye and Sethian developed a formalism to accurately
describe local fluid motion within bubbles, soap films, and
their junctions, and they applied this to the evolution of a
bubble cluster. Three phases of evolution were identified and
separated for the purposes of simulation. These are the approach
to equilibrium, involving rearrangements of bubbles, followed by
liquid drainage through the films and Plateau borders, and
finally film rupture caused by thinning. This last event throws
the system far out of equilibrium so that we may return to the
first phase, and so on.

Further progress in both theory and computation is 
required to model the types of foams that feature in 
many applications and that are often wet and laden 
with particles. 


6 The Kelvin Problem 

The Kelvin problem asks the following question. For 
a dry foam of equal-sized bubbles, what arrangement 
has lowest energy (i.e., total surface area)? As a prob- 
lem in discrete geometry, this seems rather intractable. 
Kelvin himself offered an inspired conjecture in 1887, 
in which the bubbles (or cells) were arranged in the 
body-centered cubic crystal structure. This remained 
the best candidate until 1994, when Weaire and Phe- 
lan identified a structure of lower energy, using Surface 
Evolver (figure 3). 

In terms of rigorous proof, the Kelvin problem 
remains open, but few doubt that the Weaire-Phelan 
structure will prevail. It comes closest to satisfying 
some criteria for average numbers of cell faces (and 
other properties) that may be loosely argued as follows. 
One may conceive an ideal polyhedron for area mini- 
mization that has flat faces and the tetrahedral angle 
at the vertices, but this cannot be realized since it is eas- 
ily shown that it has noninteger numbers of faces and 
edges. Average values for the Weaire-Phelan structure 
come close to this ideal. 
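The noninteger counts can be obtained with a short calculation. The argument sketched here is one standard route and is an assumption about what “easily shown” refers to: if every face is a flat regular polygon whose interior angle is the tetrahedral angle $\arccos(-1/3)$, the number of edges per face is $n = 2\pi/(\pi - \arccos(-1/3))$; combining Euler's formula $V - E + F = 2$ with three edges at each vertex ($3V = 2E$) and $nF = 2E$ then gives $F = 12/(6 - n)$.

```python
import math


def ideal_cell_counts() -> tuple[float, float]:
    """Edges per face and faces per cell for the hypothetical 'ideal'
    polyhedron with flat faces and tetrahedral angles at every vertex."""
    tetrahedral_angle = math.acos(-1.0 / 3.0)            # about 109.47 degrees
    # A regular n-gon has exterior angle 2*pi/n = pi - interior angle.
    edges_per_face = 2.0 * math.pi / (math.pi - tetrahedral_angle)
    # Euler (V - E + F = 2) with 3V = 2E and n*F = 2E gives F = 12 / (6 - n).
    faces = 12.0 / (6.0 - edges_per_face)
    return edges_per_face, faces


if __name__ == "__main__":
    n, F = ideal_cell_counts()
    print(f"edges per face: {n:.4f}")
    print(f"faces per cell: {F:.4f}")
```

The result is roughly 5.10 edges per face and 13.4 faces per cell. For comparison, the Kelvin cell has 14 faces, while the Weaire-Phelan structure, with six 14-faced and two 12-faced cells per repeat unit, averages 13.5.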

7 Two Dimensions 

Many of the basic questions of foam physics may be 
pursued in two dimensions, which may be realized 
experimentally as a sandwich of bubbles between two 
plates (figure 6). 

In two dimensions much of the mathematics is radi- 
cally simplified: ideally, the bubbles occupy polygonal 
cells with circular sides. Much more can be adduced 
by way of exact results; for example, the implication of 
Euler’s theorem that the average number of sides of a 
cell is six. 
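The step from Euler's theorem to the value six can be written out as follows. This is a sketch under the usual idealizations (three edges meeting at every vertex, each edge shared by two cells, and boundary effects neglected); the symbols $V$, $E$, $F$, and $\chi$ are introduced here and do not appear in the text.

```latex
% V vertices, E edges, F cells; \chi is a constant of order one (Euler characteristic).
\begin{align*}
  V - E + F &= \chi, \\
  3V = 2E \quad &\text{(three edges meet at each vertex)}, \\
  \langle n \rangle F = 2E \quad &\text{(each edge borders two cells)}, \\
  \Rightarrow\quad F - \tfrac{1}{3}E &= \chi
  \quad\Rightarrow\quad
  \langle n \rangle = \frac{2E}{F} = 6\left(1 - \frac{\chi}{F}\right) \to 6
  \quad \text{as } F \to \infty .
\end{align*}
```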

Von Neumann pointed out that if gas diffusion 
between cells is proportional to pressure difference, 
each cell grows (or shrinks) at a rate determined only 
by its number of sides n, apart from a constant of 
proportionality: 

\[ \frac{dA_n}{dt} \propto (n - 6), \tag{3} \]



Figure 6 The simplified geometry of a two-dimensional 
foam, realized as bubbles squeezed between two plates, as 
in this photograph, makes the computation of its proper- 
ties more tractable. Many of the results are also relevant 
for three dimensions. 


where $A_n$ is the area of an $n$-sided cell. The effect is
a gradual coarsening of the structure, as cells progressively
vanish.
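
As a small illustration of von Neumann's law, the sketch below evaluates the growth rate of each cell in a hypothetical list of side counts. The constant `k` and the sample cells are assumptions made for the example, not values from the text.

```python
def von_neumann_rates(side_counts, k=1.0):
    """Growth rate dA_n/dt = k * (n - 6) for each cell; k is an assumed constant."""
    return [k * (n - 6) for n in side_counts]

cells = [4, 5, 6, 7, 8]            # hypothetical side counts with mean 6
rates = von_neumann_rates(cells)
# Cells with fewer than six sides shrink, those with more than six grow,
# and six-sided cells keep their area.
print(rates, sum(rates))
```

Because the average number of sides is six, the rates in such a sample sum to zero: the total area is conserved while gas passes from the few-sided cells to the many-sided ones.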

Another example of the tractability of problems in 
two dimensions is the counterpart of the Kelvin prob- 
lem: Thomas Hales has produced a fairly elementary 
proof that the honeycomb pattern has the minimum 
line length. 

Empirical correlations are also found, such as that 
between the number of sides n of a cell and m, the 
average number of sides of its neighbors (this is the 
Aboav-Weaire law): 


\[ m = 6 - a + \frac{6a + \mu_2}{n}. \tag{4} \]


Here, $\mu_2$ is the second moment of the cell side distribution
about the mean, and $a \approx 1.2$ is an empirical
parameter.
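
A quick numerical sketch of the Aboav-Weaire law follows. The sample of side counts is hypothetical, chosen only so that its mean is six; the function name is likewise illustrative.

```python
def aboav_weaire_m(n, mu2, a=1.2):
    """Average number of sides of the neighbors of an n-sided cell."""
    return 6.0 - a + (6.0 * a + mu2) / n

sides = [4, 5, 6, 7, 8]                                   # hypothetical sample, mean 6
mean = sum(sides) / len(sides)
mu2 = sum((n - mean) ** 2 for n in sides) / len(sides)    # second moment about the mean

for n in sides:
    print(n, aboav_weaire_m(n, mu2))

# Consistency check: the average of n * m(n) equals 36 + mu2 whenever the mean is 6.
avg_nm = sum(n * aboav_weaire_m(n, mu2) for n in sides) / len(sides)
print(avg_nm, 36.0 + mu2)
```

The law says that few-sided cells tend to be surrounded by many-sided ones, and the final check shows that the form of the law is consistent with the mean of six sides for any value of the parameter $a$.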

Such relations all have their counterparts in a three- 
dimensional foam, but only as debatable approxima- 
tions. 
VI.4 Inverted Pendulums

David Acheson 


In 1908 a mathematician at Manchester University 
called Andrew Stephenson discovered that a rigid pen- 
dulum can be stabilized upside down if its pivot is 
vibrated up and down at high frequency. 

Suppose, then, that we let $a$ denote the amplitude of
the pivot motion and $\omega$ the pivot frequency, so that the
height of the pivot is $a \sin \omega t$ at time $t$.

The simplest case is for a single light rod of length $l$
with a point mass at one end. Stephenson assumed in
his mathematical analysis that $a \ll l$ and showed that
the inverted state will then be stable if $a\omega > \sqrt{2gl}$.

He confirmed his results experimentally, and with $l = 10$ cm
and $a = 1$ cm, say, the critical value of $\omega/2\pi$
turns out to be about 22 Hz.
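
The quoted figure can be reproduced directly from Stephenson's criterion. The sketch below assumes SI units and g = 9.81 m/s^2; the function name is illustrative.

```python
import math

def critical_frequency_hz(a, l, g=9.81):
    """Drive frequency (Hz) at the stability boundary a * omega = sqrt(2 * g * l)."""
    omega_crit = math.sqrt(2.0 * g * l) / a    # rad/s
    return omega_crit / (2.0 * math.pi)

# Rod length 10 cm, pivot amplitude 1 cm (values from the text)
print(critical_frequency_hz(a=0.01, l=0.10))
```

Halving the amplitude doubles the required frequency, since only the product $a\omega$ enters the criterion.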

This is all very different, incidentally, from balancing 
an upturned pole on the palm of one hand, for there 
is no “feedback” in Stephenson’s experiment, and the 
pivot vibrations are completely regular and strictly up 
and down. 

The phenomenon was rediscovered and brought to 
wider attention by P. L. Kapitza in the 1950s, and it is 
now a well-known curiosity of classical mechanics. 

1 An Inverted Pendulums Theorem (1993) 

It is less well known, perhaps, that the same “trick” 
can be performed with any finite number of linked 
pendulums, all balanced on top of one another. 

The theorem in question works by relating the sta- 
bility of the inverted state to just two simple proper- 
ties of the pendulum system in its downward-hanging, 
unvibrated state. 



Suppose, then, that we have N pendulums hang- 
ing down, one from another, with the uppermost one 
attached to a (fixed) pivot. There will be N modes 
of small oscillation of this system about the down- 
ward state, each with its own natural frequency. In the 
lowest-frequency mode, for example, all the pendulums 
swing in the same direction at any given moment, while 
in the highest-frequency mode adjacent pendulums 
swing in opposite directions. 

Let $\omega_{\min}$ and $\omega_{\max}$ denote the lowest and highest
natural frequencies, and suppose, too, that $\omega_{\max}^2 \gg \omega_{\min}^2$,
which is usually the case when $N \geq 2$.

The whole system can then be stabilized in its upside-down
state by vibrating the pivot up and down with
amplitude $a$ and frequency $\omega$ such that


\[ a < \frac{0.450\,g}{\omega_{\max}^2} \tag{1} \]

and

\[ a\omega > \frac{\sqrt{2}\,g}{\omega_{\min}}. \tag{2} \]

The stable region in the $(a, \omega)$-plane therefore has the
characteristic shape indicated in figure 1, where the
straight line BC corresponds to (1) and the curve AB
to (2).

The sketches indicate what happens if we gradually
leave the stable region. If the frequency $\omega$ is reduced
far enough that (2) is violated, the pendulums collapse
by slowly wobbling down on one side of the vertical.
If, instead, the drive amplitude $a$ is increased enough
for (1) to be violated, the pendulums first lose their
stability to an upside-down buckling oscillation at
frequency $\tfrac{1}{2}\omega$. Further increases in $a$ then cause a rather
more dramatic collapse.

Figure 2 An inverted triple pendulum.

The numerical values of $\omega_{\min}$ and $\omega_{\max}$ depend, of
course, on the number of pendulums involved, and
their various shapes, sizes, and mass distributions. If
many pendulums are involved, $\omega_{\max}$ is typically quite
large, so the pivot amplitude $a$ has to be small to
satisfy (1). This in turn means that the pivot frequency
$\omega$ needs to be very large in order to satisfy (2), and in
this way it becomes comparatively difficult to stabilize
a long chain of pendulums in its inverted state, as one
would expect.

2 Experiments 

These strange theoretical predictions were verified
experimentally (for two-, three-, and four-pendulum
systems) by Tom Mullin in the early 1990s.


Figure 2 shows a three-pendulum system, stabilized 
upside down, recovering from a substantial distur- 
bance and gradually wobbling back to the upward 
vertical. 

In this particular case the rods all have length l = 
19 cm, and they are joined by two low-friction bearings, 
each of which weighs nearly twice as much as one of the 
rods. As a result, 

\[ \omega_{\min} = 0.729\,(g/l)^{1/2} \quad\text{and}\quad \omega_{\max} = 2.174\,(g/l)^{1/2}, \]

so the theorem predicts that this particular system will 
be stable in the inverted state if 

\[ \frac{a}{l} < 0.095 \quad\text{and}\quad \frac{a}{l}\,\frac{\omega}{(g/l)^{1/2}} > 1.94. \]

These predictions were confirmed by the experimental
data, and in figure 2 $a$ is 1.4 cm and $\omega/2\pi = 35$ Hz.
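
The two conditions of the theorem are easy to evaluate for the triple pendulum described above. The sketch below assumes SI units and g = 9.81 m/s^2, and uses the frequency coefficients 0.729 and 2.174 quoted in the text; the function name is illustrative.

```python
import math

def inverted_state_is_stable(a, omega, omega_min, omega_max, g=9.81):
    """Check conditions (1) and (2) of the inverted pendulums theorem."""
    condition_1 = a < 0.450 * g / omega_max ** 2              # amplitude small enough
    condition_2 = a * omega > math.sqrt(2.0) * g / omega_min  # drive fast enough
    return condition_1 and condition_2

g, l = 9.81, 0.19                       # rod length 19 cm
scale = math.sqrt(g / l)
omega_min = 0.729 * scale
omega_max = 2.174 * scale

a = 0.014                               # pivot amplitude 1.4 cm
omega = 2.0 * math.pi * 35.0            # drive frequency 35 Hz
print(inverted_state_is_stable(a, omega, omega_min, omega_max))
```

With these numbers $a/l$ is about 0.074, below the limit of 0.095, and the second ratio is about 2.3, above the limit of 1.94, so both conditions hold.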

3 Nonlinear Dynamics 

We were particularly surprised, in fact, by just how sta- 
ble the inverted multiple-pendulum system could be 
in the actual experiment. The theorem itself is based 
on linear stability theory and therefore guarantees sta- 
bility only with respect to infinitesimally small distur- 
bances. Yet in the experiments we found that we could 
gently push the pendulum column over by as much as 
45° or so and, provided we kept it reasonably straight in
the process, it would then recover and gradually settle 
again on the upward vertical. 

This kind of robust behavior had been evident, how- 
ever, in our computer simulations of one-, two- and 
three-pendulum systems based on the full, nonlin- 
ear differential equations of motion (including a small 
amount of friction). 

And these numerical simulations revealed another 
surprising feature: in some regions of parameter space 
for which the inverted state is stable, there is a sec- 
ond, entirely different, way in which the pendulums 
can avoid falling over. For a certain range of initial con- 
ditions they will settle instead into curious “multiple- 
nodding” oscillations about the upward vertical. 

The simplest of these is a double-nodding mode, at
frequency $\tfrac{1}{4}\omega$, with adjacent pendulums on opposite
sides of the upward vertical at any given moment, and
each pendulum nodding twice on one side before flipping
over to the other. For reasonably large values of
$\omega$ this mode exists when $a$ is between about 70% and
90% of the maximum value allowed by (1).
There is also a triple-nodding mode, at frequency
$\tfrac{1}{6}\omega$, with each pendulum nodding three times in
succession on each side, for $a$ between about 60% and
70% of its maximum value. Again, the system will only
gradually settle into such an oscillation if given a suf- 
ficient nudge toward it; if the inverted, upright state 
is disturbed sufficiently slightly, the pendulums just 
gradually settle back again on the upward vertical. 

In principle, there exist even more exotic oscillations 
of this general kind. With an inverted double pendulum, 
for instance, it is possible to get an asymmetric oscil- 
lation of double-nodding type in which both upside- 
down pendulums stay on one side of the upward ver- 
tical throughout. This was discovered by first setting 
up a standard, symmetrical double-nodding oscillation 
and then moving in the general direction of the point B 
in the stability diagram by gradually changing both a 
and $\omega$. At present, however, all of these strange nonlinear
oscillations are just theoretical predictions, and
they have yet to receive proper experimental study. 

4 Not Quite the Indian Rope Trick 

The principal results of the theorem above have had a 
fair amount of media attention over the years, largely 
because of a (loose!) connection with the legendary 
Indian rope trick. 

Yet one clear prediction from the theory is that our 
particular gravity-defying “trick” cannot be done with a 
length of rope that is perfectly flexible, with no bending 
stiffness. This is because if we model such a rope as $N$
freely linked pendulums of fixed total length, and then
let $N \to \infty$, we find that $\omega_{\min}$ tends to a finite limit but
$\omega_{\max} \to \infty$. The theorem then requires that $a \to 0$ in
order to satisfy (1) and, therefore, that $\omega \to \infty$ in order
to satisfy (2).

Even without this difficulty, however, we would be 
a long way from the genuine Indian rope trick, as 
described down the generations, for this would involve 
making a small boy climb up the vibrating apparatus 
before disappearing at the top. 
VI.5 Insect Flight

Z. Jane Wang 


1 How Do Insects Fly? 

All things fall. Apples and leaves fall to the ground, 
the moon falls toward the Earth, and the Earth toward 
the sun. We would fall too if we did not will ourselves 
to stand upright. To walk, we lift one leg and let our- 
selves fall again. Why any organism should decide to 
fight against the inevitable fate of falling is a mystery 
of evolution. But there they are, insects and birds, flying 
in the face of gravity. 

What insects and birds have discovered are ways to 
push the air around them; and the air, in turn, pushes 
the Earth away. How do insects flap their wings so as 
to create the necessary aerodynamic forces to hover? 
How do they adjust their wing motion to dart forward 
or to turn? Have they found efficient wing strokes? Can 
we emulate them? 

Insects, millions of species in all, are small creatures, 
but they span a wide range of sizes. One of the largest, 
a hawkmoth, has a wing span of 5 cm and weighs 
1.5 g. One of the smallest, a chalcid wasp, measures 
less than a millimeter and weighs 0.02 mg. The smaller 
they are, the faster they flap their wings. A hawkmoth 
flaps its wings at about 20 Hz and a chalcid wasp at 
about 400 Hz. This inverse scaling of the wing beat
frequency with the wing length implies a relatively
constant wing tip speed, about 1 m s$^{-1}$, which holds over
three orders of magnitude in length scale across all the
different species. Curiously, 1 m s$^{-1}$ is a common speed
for natural locomotion: we walk and swim at a similar
pace.

Each flapping wing creates a whirlwind around itself, 
governed by the Navier-Stokes equations. Insects have 
stumbled upon families of solutions to the Navier- 
Stokes equations that provide their wings with the 
necessary thrust (plate 10). Given the wing size and 
the wing speed, the flow generated by each wing is 
characterized by a Reynolds number (Re) in the range 
10 to 10 000, which is neither small enough to be in the
Stokesian regime, where the viscous force dominates, 
nor large enough to be in the inviscid regime, where 
the viscous force is negligible. The interplay of the vis- 
cous and inertial effects, especially near the wing, often 
leads to unexpected behavior and eludes simple theories.
Another important aspect of the flow is that it is
intrinsically unsteady, and this implies that the timing 
is critical. Part of an insect’s skill is to tame the unruly 
flow and to coordinate the timing of its own movement 
with the timing of the flow. 

For instance, dragonflies can adjust the timing of 
their forewings and hindwings during different maneu- 
vers. When hovering, their forewings and hindwings 
tend to beat out of phase to gain stability and save 
power. When taking off, the beats of the two sets of 
wings will be closer to being in phase, and the interac- 
tion of the flow leads to a higher thrust. To stay aloft, 
each wing flaps up and down along an inclined stroke 
plane as if rowing in air. The downstroke has an angle 
of attack of about 60°, while the upstroke has a smaller 
angle of attack, close to 10°. Such an asymmetry results 
in an upward aerodynamic drag that supports much 
of its weight. Fruit flies, on the other hand, use only 
two wings to fly. Their halteres, the much-reduced hind- 
wings, have evolved into a gyroscopic sensor that mea- 
sures the rotational velocity of the body. The wings of 
fruit flies flap back and forth with an angle of attack of 
about 40°, and they support their weight with aerody- 
namic lift, much like a helicopter. The angles of attack 
used by insects, much greater than the 10-15° used by 
an airfoil in steady flight, are determined by the weight 
balance, given the limited wing speed that the insects 
can generate. Associated with the large angle of attack 
is the flow separation. The wings must then take advantage
of dynamic stall, which provides a high transient
force during the wing beat. In such a flow regime, the
two flight strategies, asymmetric rowing and symmetric
flapping, can be similarly effective.

Can a flapping wing be more efficient than a steady 
translating wing? A generic flapping wing motion is 
almost always less efficient than the optimal steady 
flight. This is not to say that all flapping wing motions
are less efficient. Computational optimization of
Navier-Stokes solutions finds some solutions, rare as 
they are, that are more efficient than the optimal steady 
flight at insect scales. One trick that works to the advan- 
tage of the flapping wing is that it can catch its own 
wake as it reverses, allowing the wing to gain an added 
lift with almost no energy cost. 

2 Solving Navier-Stokes Equations 
Coupled to Flapping Wings 

To understand the nature of unsteady flows and to 
mathematically quantify the aerodynamic forces, flight 


efficiency, and the timing of the wing strokes, it is nec- 
essary to solve the governing Navier-Stokes equations 
coupled to the dynamics of a flapping wing. 

The wing drives the flow, and the flow modulates the 
wing motion. The dynamics of the wing is governed 
by Newton’s equation. The fluid velocity u(x,t) and 
pressure p(x,t) are governed by the conservation of 
momentum and mass of the fluid, the Navier-Stokes
equations [III.23]:


\[ \frac{\partial u}{\partial t} + (u \cdot \nabla) u = -\frac{\nabla p}{\rho} + \nu \nabla^2 u, \qquad \nabla \cdot u = 0, \]


where $\rho$ is the density of the fluid and $\nu$ is the kinematic
viscosity. By choosing a length scale, $L$, and a
velocity scale, $U$, the equation can be expressed in a
nondimensional form containing the Reynolds number,
$\mathrm{Re} = UL/\nu$. The flow velocity is far smaller than the
speed of sound, and the flow is therefore nearly incompressible.
The coupling between the wing and the fluid
lies in the no-slip boundary condition at the wing surface,
$u_{\mathrm{bd}} = u_{\mathrm{s}}$, which states that the flow velocity at the
wing surface is the same as the wing velocity.
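
As a sketch of that nondimensionalization, assume the conventional scalings (not spelled out in the text) $x = L x'$, $u = U u'$, $t = (L/U)\,t'$, and $p = \rho U^2 p'$. Each term of the momentum equation then carries a factor $U^2/L$, except the viscous term, which carries $\nu U/L^2$. Dividing through by $U^2/L$ gives:

```latex
\[
\frac{\partial u'}{\partial t'} + (u' \cdot \nabla') u'
  = -\nabla' p' + \frac{1}{\mathrm{Re}} \nabla'^2 u',
\qquad \nabla' \cdot u' = 0,
\qquad \mathrm{Re} = \frac{UL}{\nu}.
\]
```

In this form the Reynolds number is the only parameter left in the fluid equations, which is why flows with the same Re and the same wing kinematics are dynamically similar.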

What kinds of solutions do we expect from these 
coupled partial differential equations? Imagine that a 
wing is detached from the insect and falls with a weight 
attached to it. The dynamics of this passive falling wing 
is governed by the same set of equations described 
above, and the way it falls would reveal a specific solu- 
tion to the governing equations. To view some of these 
solutions, we can drop a piece of paper. It may tum- 
ble and flutter erratically. The falling style of a piece 
of paper depends on its geometry and density. If we 
drop a business card, we would find it tumbling about 
its span axis while drifting away. That is, a card driven 
by its own weight while interacting with air can lead 
to a periodic motion (figure 1). To reason this back- 
ward, we deduce that the observed periodic movement 
of the paper generates a thrust that on average balances 
the weight of the paper. If we tilt the falling trajecto- 
ries so that they move along in a horizontal direction, 
they begin to resemble a forward-flapping flight. These 
periodic motions, though not identical to insect wing 
motions, are in essence similar to the ones that nature 
has found. 

To quantify the flow dynamics and to tease out the 
key elements that are responsible for the thrust, we 
turn to computers and experiments. Much of the inter- 
esting behavior of the flow originates near the sharp 
tips of the wing, yet computational schemes often 

Figure 1 Periodic motion of free-falling paper. From U. Pesavento
and Z. J. Wang (2004), Falling paper: Navier-Stokes
solutions, model of fluid forces, and center of mass elevation,
Physical Review Letters 93(14):144501.

encounter great difficulty in resolving moving sharp 
interfaces. This is a known difficulty in nearly all fluid- 
structure simulations. To resolve the flows, there is no 
one-size-fits-all scheme. If we are interested in a sin- 
gle rigid wing flapping in a two-dimensional fluid, we 
can take advantage of the two-dimensional aspect of 
the problem. We can use a conformal mapping tech- 
nique to generate a naturally adaptive grid around the 
wing such that the grids are exponentially refined at 
the tip. We can also solve the equation in a frame that 
is comoving with the wing to avoid grid regeneration. 
The movement of the wing then gets translated into far-field
boundary conditions on the flow field. The conformal
map allows us to simulate the flow in a large computational
domain, where the vorticity is small at the
far field. The solution near the far field can be approximated
analytically, and it can be implemented in the
far-field boundary conditions. These treatments lead to
high-order numerical computations to resolve the flow
efficiently.
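
To illustrate the idea of a conformally mapped grid, the sketch below uses elliptic coordinates, $z = c \cosh(\mu + i\theta)$, which map a uniform rectangular grid in $(\mu, \theta)$ onto confocal ellipses around a thin elliptical wing section. This particular map, the function names, and the parameter values are assumptions made for the example; the text does not say which map is used in practice.

```python
import cmath
import math

def elliptic_grid(mu_min, mu_max, n_mu, n_theta, c=1.0):
    """Grid points z = c*cosh(mu + i*theta) on a uniform (mu, theta) lattice.

    The curve mu = mu_min is a thin ellipse standing in for the wing section,
    with its tips at theta = 0 and theta = pi.
    """
    grid = []
    for i in range(n_mu):
        mu = mu_min + (mu_max - mu_min) * i / (n_mu - 1)
        ring = []
        for j in range(n_theta):
            theta = 2.0 * math.pi * j / n_theta
            ring.append(c * cmath.cosh(complex(mu, theta)))
        grid.append(ring)
    return grid

def cell_scale(mu, theta, c=1.0):
    """Local stretching |dz/dw| = c*|sinh(mu + i*theta)| of the map."""
    return abs(c * cmath.sinh(complex(mu, theta)))

grid = elliptic_grid(mu_min=0.1, mu_max=3.0, n_mu=8, n_theta=16)
print(cell_scale(0.1, 0.0))              # at the wing tip: smallest cells
print(cell_scale(0.1, math.pi / 2.0))    # at mid-chord: larger cells
print(cell_scale(3.0, 0.0))              # far field: cells grow like exp(mu)
```

Because the map is conformal, grid cells stay locally square; they shrink toward the wing tips, where the flow needs most resolution, and grow roughly exponentially with $\mu$ toward the far field, so a large domain is covered with few points.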

A three-dimensional flexible wing needs a different 
treatment. Computational methods may also need to 
handle multiple moving objects. One class of tech- 
niques is based on the idea of the immersed interface 
method and its predecessor, the immersed boundary 
method. In these Cartesian-grid methods, the interfaces 
cut through the grids. The problem of treating the mov- 
ing interfaces therefore becomes the problem of han- 
dling singular forces along the interfaces or the discon- 
tinuities in the fluid variables across each interface. One 
of the challenges with these methods is to go beyond 


first-order accuracy at the interfaces where singulari- 
ties reside. Recent work has shown that with careful 
treatment of the discontinuities across moving inter- 
faces, it is possible to obtain second-order accuracy. 
This makes a difference when resolving the critical part 
of the flow. 

3 How Do Insects Turn? 

Unsteady aerodynamics is one of the many puzzles 
when it comes to understanding insect flight. Flap- 
ping flight, like fixed-wing flight, is intrinsically unsta- 
ble. Without proper circuitry for sensing and control, 
an insect would fall. The same control circuitry that 
insects use for stabilization may also be used for acro- 
batic maneuvers. So how do insects modulate their 
wings to turn? 

We end the article with the example of how a fruit fly 
makes a sharp yaw turn, or a saccade (plate 11). A fruit 
fly beats its wings about 250 times per second, and it 
can make a saccade in about 20 wing beats, or about 
80 ms. The wing beat frequency is too fast for direct 
beat-to-beat control by neurons. Instead, the insects 
have learned to shift their “gear” and wait a few wing 
beats before shifting the gear back. We now have some 
idea of how the “gear shift” for the yaw turn works. The 
wing hinge acts as if it is a torsional spring. To adjust 
its wing motion, the wing hinge shifts the equilibrium 
position of the effective torsional spring, and this leads 
to a slight shift of the angle of attack of that wing. The 
asymmetry of the left and right wings creates a drag 
imbalance that causes the insect to turn. To turn 120°,
the asymmetry in the wing angle of attack is
only about 5°. 

The torques acting at the wing base have been com- 
puted using aerodynamic models applied to real-time 
tracking of the three-dimensional wing and body kinematics
during free flight, which were filmed with multiple
high-speed cameras. The wing and body orientations
were extracted using computer vision algorithms.
In the case of fruit flies, this amounts to reconstruct- 
ing three rigid bodies based on the three sets of sil- 
houettes recorded by the cameras. For larger insects, 
one can also add markers to help with tracking. Recent 
progress in tracking algorithms allows for semiauto- 
matic processing of vast amounts of data. Although 
this is still time-consuming, it is a welcome depar- 
ture from the earlier days of working with miles of 
cine films, a technique invented in the late nineteenth 
century. 
4 From Flight Dynamics
to Control Algorithms 

In a natural environment, insects are constantly being 
knocked about by wind or visual and mechanical per- 
turbations. And yet they appear to be unperturbed and 
are able to correct their course with ease. The halteres, 
mentioned earlier, provide a fast gyroscopic sensor that 
enables a fruit fly to keep track of its angular rotational 
rate. Recent work has found that when a fruit fly’s body 
orientation is perturbed with a torque impulse, it auto- 
matically adjusts its wing motion to create a corrective 
torque. If the perturbation is small, the correction is 
almost perfect. 

Exactly how their brains orchestrate this is a question 
for neural science as well as for mathematical modeling 
of the whole organism. By examining how insects turn 
and respond to external perturbations, we can begin to 
learn about their thoughts. 
VI.6 The Flight of a Golf Ball

Douglas N. Arnold 


A skilled golfer hitting a drive can accelerate his club 
head from zero to 120 miles per hour in the quarter 
of a second before making contact with the ball. As a 
result, the ball leaves the tee with a typical speed of 
175 miles per hour and at an angle of 11° to the ground.
From that moment the golfer no longer exercises con- 
trol. The trajectory of the ball is determined by the laws 
of physics. 



Figure 1 The actual trajectory of a 
golf ball is far from parabolic. 


In elementary calculus we learn to model the trajec- 
tory of an object under the influence of gravity. The hor- 
izontal component of its velocity is constant, while it 
experiences a vertical acceleration down toward Earth 
at 32.2 feet per second per second. This results in a 
parabolic trajectory that can be described exactly. Over 
a flat course, a ball traveling with the initial speed and 
launch angle mentioned above would return to Earth at 
a point 256 yards from the tee. In fact, observation of 
golf ball trajectories reveals that their shape is far from 
parabolic, as illustrated in figure 1, and that golfers 
often drive the ball significantly higher and farther than 
the simple formulas from calculus predict, even on a 
windless day. The discrepancy can be attributed to the 
fact that these formulas assume that gravity is the only 
force acting on the ball during its flight. They neglect 
the forces that the atmosphere exerts on the ball pass- 
ing through it. Surprisingly, this air resistance can help 
to increase the range of the ball. 
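
The drag-free range quoted above follows from the standard projectile formula, range = v^2 sin(2θ)/g. The sketch below uses the launch speed, launch angle, and gravitational acceleration given in the text; the function name and the unit-conversion constant are illustrative.

```python
import math

MPH_TO_FT_PER_S = 5280.0 / 3600.0

def vacuum_range_yards(speed_mph, angle_deg, g=32.2):
    """Range over flat ground with gravity as the only force, in yards."""
    v = speed_mph * MPH_TO_FT_PER_S            # launch speed in ft/s
    theta = math.radians(angle_deg)
    range_ft = v * v * math.sin(2.0 * theta) / g
    return range_ft / 3.0

print(vacuum_range_yards(175.0, 11.0))
```

The result lies within a yard of the 256 yards quoted; the small difference depends on how the launch speed is rounded when converted to feet per second.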

1 Drag and Lift 

Instead of decomposing the air resistance force vec- 
tor into its horizontal and vertical components, it is 
more convenient to make a different choice of coordi- 
nate directions: namely, the direction opposite to the 
motion of the ball, and the direction orthogonal to that 
and directed skyward (see figure 2). The correspond- 
ing components of the force of air resistance are then 
called the drag and the lift, respectively. Drag is the 
same force you feel pushing on your arm if you stick it 

Figure 2 Air resistance is decomposed into drag and lift.


out of the window of a moving car. Golfers want to min- 
imize it, so their ball will travel farther. Lift is largely a 
consequence of the back spin of the ball, which speeds 
the air passing over the top of the ball and slows the air 
passing under it. By Bernoulli’s principle, the result is 
lower pressure above and therefore an upward force on 
the ball. Lift is advantageous to golfers, since it keeps 
the ball aloft far longer than would otherwise be the 
case, allowing it to achieve more distance. 

Drag and lift are very much affected by how the air 
interacts with the surface of the ball. In the middle of 
the nineteenth century, when rubber golf balls were 
introduced, golfers noticed that old scuffed golf balls 
traveled farther than new smooth balls, although no 
one could explain this unintuitive behavior. This even- 
tually gave rise to the modern dimpled golf ball. Along 
the way a great deal was learned about aerodynam- 
ics and its mathematical modeling. Hundreds of dif- 
ferent dimple patterns have been devised, marketed, 
and patented. However, even today the optimal dim- 
ple pattern lies beyond our reach, and its discovery 
remains a tough challenge for applied mathematics and 
computational science. 

2 Reynolds Number 

Drag and lift (which are also essential to the design of
aircraft and ships, the swimming of fish and the flight
of birds, the circulation of blood cells, and many other
systems) are not easy to model mathematically. In this
article, we shall concentrate on drag. It is caused by 
two main sources: the friction between the ball’s sur- 
face and the air, and the difference in pressure ahead 


of and behind the ball. The size and relative importance
of these contributions depend greatly on the
flow regime. In the second half of the nineteenth century,
George Stokes and Osborne Reynolds realized that
a single number could be assigned to a flow that cap- 
tured a great deal about its qualitative behavior. Low 
Reynolds number flows are slow, orderly, and laminar. 
Flows with high Reynolds number are fast, turbulent, 
and mixing. 

The Reynolds number has a simple formula in terms
of four fundamental characteristics of the flow: (1) the
diameter of the key features (e.g., of the golf ball),
(2) the flow speed, (3) the fluid density, and (4) the fluid
viscosity. The Reynolds number is simply the product
of the first three of these divided by the fourth. This
results in a dimensionless quantity: it does not matter
what units you use to compute the four fundamental
characteristics as long as they are used consistently.
The viscosity, which enters the Reynolds
number, measures how thick the fluid is: water, for
example, is a moderately thin fluid and has viscosity
$5 \times 10^{-4}$ lb/ft s, while honey, which is much thicker,
has a viscosity of 5 in the same units, and pitch, which
is practically solid, has a viscosity of about 200 000 000.

Using the diameter of a golf ball (0.14 feet), its speed
(257 feet per second), and the density (0.074 pounds per
cubic foot) and viscosity (0.000012 lb/ft s) of air, we
compute the Reynolds number for a professionally hit
golf ball in flight as about 220 000, much more than a
butterfly flying (4000) or a minnow swimming (1), but
much less than a Boeing 747 (2 000 000 000).
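
The calculation just described can be written out in a few lines. In this sketch the air density is taken as 0.074 lb/ft^3, the value that reproduces the quoted result of about 220 000; the function name is illustrative.

```python
def reynolds_number(diameter, speed, density, viscosity):
    """Reynolds number: diameter * speed * density / viscosity, in consistent units."""
    return diameter * speed * density / viscosity

# Golf ball in flight, in feet, seconds, and pounds
re_golf = reynolds_number(diameter=0.14, speed=257.0, density=0.074, viscosity=0.000012)
print(round(re_golf))
```

Because the result is dimensionless, repeating the calculation in SI units with the same physical quantities would give the same number.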

3 The Mysterious Drag Crisis 

At the very beginning of the twentieth century, as 
the Wright brothers made the first successful air- 
plane flight, aerodynamics was a subject of intense 
interest. The French engineer Alexandre Gustave Eif- 
fel, renowned for his famous tower, dedicated his later 
life to the study of aerodynamics. He built a laboratory 
in the Eiffel tower and a wind tunnel on its grounds 
and measured the drag on various objects at various 
Reynolds numbers. In 1912 Eiffel made a shocking dis- 
covery: the drag crisis. Although one would expect that 
drag increases with increasing speed, Eiffel found that 
for flow around a smooth sphere, there is a paradoxical 
drop in drag as the flow speed increases past Reynolds 
number 200 000. This is illustrated in figure 3. Of great 
importance in some aerodynamical regimes, the drag 
crisis begged for an explanation. 
[Plot omitted; horizontal axis: Reynolds number.]

Figure 3 A smooth sphere moving through a fluid exhibits
the drag crisis: between Reynolds numbers of approximately
200 000 and 300 000, the drag decreases as the
speed increases.

4 The Drag Crisis Resolved 

The person who was eventually to explain the drag cri- 
sis was Ludwig Prandtl. Eight years before Eiffel dis- 
covered the crisis, Prandtl had presented one of the 
most important papers in the field of fluid dynamics 
at the International Congress of Mathematicians. In his 
paper he showed how to mathematically model flow in 
the boundary layer. As a ball flies through the air, a 
very accurate mathematical model of the flow is given 
by the system of partial differential equations known 
as the Navier-Stokes equations [III.23]. If we could
solve these equations, we could compute the drag and
thereby elucidate the drag crisis. But the solution of the
Navier-Stokes equations is too difficult. Prandtl showed
how parts of the equations could be safely ignored 
in certain parts of the flow: namely, in the extremely 
thin layer where the air comes into contact with the 
ball. His equations demonstrated how the air speed 
increased rapidly from zero (relative to the ball) at the 
surface of the ball to the ball speed outside a thin layer 
around the ball surface. Prandtl also described very 
accurately the phenomenon of boundary-layer separa- 
tion, by which higher pressure behind the ball (the pres- 
sure being lower on the top and bottom of the ball, by 
Bernoulli’s principle) forces the boundary layer off the 
ball and leads to a low-pressure trailing wake behind 
the ball, much like the wake left behind by a ship. 
This low-pressure trailing wake is a major source of 
drag. 



Figure 4 Flow past a smooth sphere, clearly exhibiting 
boundary-layer separation and the resulting trailing wake. 
A tripwire has been added to the lower sphere. The result- 
ing turbulence in the boundary layer delays separation and 
so leads to a smaller trailing wake. (Photos from An Album 
of Fluid Motion, Milton Van Dyke.) 

In 1914 Prandtl used these tools to give the following 
explanation of the drag crisis. 

(1) At high speed, the boundary layer becomes turbulent.
For a smooth sphere, this happens at a
Reynolds number of about 250 000. 

(2) The turbulence mixes fast-moving air outside the 
boundary layer into the slow air of the boundary 
layer, thereby speeding it up. 

(3) The air in the boundary layer can therefore resist 
the high-pressure air from behind the ball for 
longer, and boundary-layer separation occurs far- 
ther downwind. 

(4) The low-pressure trailing wake is therefore nar- 
rower, reducing drag. 

Prandtl validated this subtle line of reasoning experi- 
mentally by measuring the drag on a sphere in an air 
stream and then adding a small tripwire to the sphere 
to induce turbulence. As you can see in a reproduc- 
tion of this experiment shown in figure 4, the result 
is indeed a much smaller trailing wake. 
Figure 5 Simulation of flow over dimples.


5 The Role of Dimples 

The drag crisis means that when a smooth sphere 
reaches a Reynolds number of 250 000 or so, it expe- 
riences a large decrease in drag and can travel farther. 
This would be a great boon to golfers were it not for one 
fact: a golf ball-size sphere would need to travel over 
200 miles per hour to achieve that Reynolds number, 
a speed that is not attained in golf. So why is the drag 
crisis relevant to golfers? The answer lies in the dim- 
ples. Just as a tripwire can be added to a smooth sphere 
to induce turbulence and precipitate the drag crisis, 
so can other perturbations of the surface. By suitably 
roughening the surface of a golf ball, e.g., by adding 
dimples, the Reynolds number at which the drag cri- 
sis occurs can be lowered to about 50 000, well within 
the range of any golfer. The resulting drag reduction 
doubles the distance flown by the ball over what can be 
achieved with a smooth ball. 
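
The speeds quoted above can be checked roughly from the usual definition of the Reynolds number, Re = vD/ν. The sketch below is only an order-of-magnitude check: the ball diameter and the kinematic viscosity of air are assumed values that do not appear in the text, and the function name is illustrative.

```python
# Rough check of the speeds implied by the two Reynolds numbers in the text,
# using Re = v * D / nu.  D and NU are assumptions, not values from the text.
D = 0.0427            # golf ball diameter in meters (assumed, about 1.68 in)
NU = 1.5e-5           # kinematic viscosity of air in m^2/s (assumed, near 20 C)
MPH_PER_MS = 2.23694  # miles per hour in one meter per second


def speed_for_reynolds(re, diameter=D, nu=NU):
    """Return the speed in m/s at which a sphere reaches Reynolds number re."""
    return re * nu / diameter


for re in (250_000, 50_000):
    v = speed_for_reynolds(re)
    print(f"Re = {re}: {v:.0f} m/s, {v * MPH_PER_MS:.0f} mph")
```

With these assumed values the smooth-sphere threshold comes out at roughly 200 miles per hour, and the dimpled-ball threshold at a speed about five times lower, which is consistent with the argument of this section.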

6 Stalking the Optimal Golf Ball 

As we have seen, dimples dramatically affect the flight 
of a golf ball, so a natural question is how to design 
an optimally dimpled ball. How many dimples should 
there be and in what pattern should they be arranged? 
What shape of dimple is best: round, hexagonal, tri- 
angular, ..., some combination? What size should they 
be? How deep and with what profile? There are count- 
less possibilities, and the thousands of dimple patterns 
that have been tested, patented, and marketed encom- 
pass only a small portion of the relevant design space. 
Modern computational science offers the promise that 
this space can be explored in depth with computa- 
tional simulation, and indeed great progress has been 
made. For example, in 2010 a detailed simulation of 
flow over a golf ball with about 300 spherical dimples at 
a Reynolds number of 110 000 was carried out by Smith 


et al. (2012). The computation was based on a finite- 
difference discretization of the Navier-Stokes equa- 
tions using about a billion unknowns and it required 
hundreds of hours on a massive computing cluster to 
solve. It furnished fascinating insights into the role of 
the dimples in boundary-layer detachment and reat- 
tachment, hinted at in figure 5. But even such an 
impressive computation neglects some important and 
difficult aspects, such as the spin of the golf ball, and 
once those issues have been addressed the coupling 
of the simulation to effective optimization procedures 
will be no small task. The understanding of the flight of 
a golf ball has challenged applied mathematicians for 
over a century, and the end is not yet in sight. 

Further Reading 

Smith, C. E., N. Beratlis, E. Balaras, K. Squires, and M. 
Tsunoda. 2012. Numerical investigation of the flow over 
a golf ball in the subcritical and supercritical regimes. 
International Journal of Heat and Fluid Flow 31:262-73. 


VI. 7 Automatic Differentiation 

Andreas Griewank 


1 From Analysis to Algebra 


In school, many people have suffered the pain of having 
to find derivatives of algebraic formulas. As in some 
other domains of human endeavor, everything begins 
with just a few simple rules: 

(u + cv)' = u' + cv', (uv)' = u'v + uv'. (1) 


With a constant factor c, the first identity means that 
differentiation is a linear process; the second identity 
is known as the product rule. Here we have assumed 
that u and v are smooth functions of some variable x, 
and differentiation with respect to x is denoted by a 
prime. Alternatively, one writes u' = u' (x) = du/dx 
and also calls the derivative a differential quotient. To 
differentiate composite functions, suppose the independent 
variable x is first mapped into an intermediate 
variable z = f(x) by the function f, and then z is 
mapped by some function g into the dependent variable 
y. One then obtains, for the composite function 
y = h(x) = g(f(x)), 

$$h'(x) = g'(f(x))\,f'(x) = \frac{dy}{dz}\,\frac{dz}{dx}. \qquad (2)$$

This expression for h'(x) as the product of the derivatives 
g' and f' evaluated at z = f(x) and x, respectively, 
is known as the chain rule. One also needs to 
know the derivatives $\varphi'(x)$ of all elemental functions 
$\varphi(x)$, such as sin'(x) = cos(x). 

Once these elemental derivatives are known, we can 
combine them using the three rules mentioned above 
to differentiate any expression built up from elemental 
functions, however complicated. In other words, we are 
then doing algebra on a finite set of symbols represent- 
ing elemental functions in one variable and additions 
or multiplications combining two variables. In principle 
this is simple, but as one knows from school, the size of 
the resulting algebraic expressions can quickly become 
unmanageable for humans. The painstaking execution 
of this differentiation by algebra task is therefore best 
left to computers. 

2 From Formulas to Programs 

Even the set of undifferentiated formulas describing 
moderately complicated systems like a robot arm, a 
network of pipes, or a grinding process for mechanical 
components may run over several pages. People there- 
fore stop thinking of them as formulas and instead call 
them programs, coded in languages such as Fortran and 
C, or in systems such as Mathematica and Maple. 

This is particularly true when we have many, say 
$n \gg 1$, independent (input) variables $x_j$ and $m \gg 1$ 
dependent (output) variables $y_i$. For the sake of notational 
simplicity, we collect them into vectors $x = (x_j) \in \mathbb{R}^n$ 
and $y = (y_i) \in \mathbb{R}^m$. The computer program 
will typically generate a large number of quantities $v_k$, 
$k = 1, \ldots, \ell$, which we may interpret and thus differentiate 
as functions of the input vector x. We can include 
all the quantities of interest by defining $v_k = x_{k+n}$ for 
$k = 1 - n, \ldots, 0$ and $v_{\ell-m+k} = y_k$ for $k = 1, \ldots, m$. 

Assuming that there are no branches, we find that 
the $v_i$ with i > 0 are computed through a fixed, finite 
sequence of $\ell$ assignments: 

$$v_i = v_j \circ v_k \quad\text{or}\quad v_i = \varphi_i(v_j), \qquad i = 1, \ldots, \ell. \qquad (3)$$

Each operation $\circ$ is an addition, +, or a multiplication, 
*, and 

$$\varphi_i \in \{c, \mathrm{rec}, \mathrm{sqrt}, \sin, \cos, \exp, \log, \ldots\} \qquad (4)$$

is an elemental function from a given library, including 
a constant setting v = c, the reciprocal v = rec(u) = 
1/u, and the square root $v = \mathrm{sqrt}(u) = \sqrt{u}$. Note that 
a subtraction u - w = u + (-1) * w can be performed 
as an addition after multiplication of the second argument 
by -1, and a division u/w = u * rec(w) can 
be performed as a multiplication after computing the 
reciprocal of the second argument. 


Assuming that there are no cyclic dependencies, we 
may order the variables such that it always holds that 
j < i and k < i in (3). Hence, with respect to data dependency 
the $v_i$ form the nodes of a directed acyclic 
graph [II.16], an observation that is usually credited to 
the Russian mathematician Kantorovich. All $v_i = v_i(x)$ 
for $i = 1, \ldots, \ell$ will therefore be uniquely defined as 
functions of the independent variable vector x. This 
applies in particular to the dependent variables $y_i = v_{\ell-m+i}$, 
so that the program loop actually evaluates 
some vector function y = F(x). 

For example, we may first specify a vector function 
$y = F(x) : \mathbb{R}^3 \to \mathbb{R}^2$ by a formula 

$$F(x) = \begin{bmatrix} x_1^2\,(x_2 + x_3)/\sin(x_1) \\ \cos(x_1)/\sin(x_1) \end{bmatrix}.$$

Then we may decompose the formula into the following 
sequence of elemental operations: 


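$$\begin{array}{ll}
v_{-2} = x_1,\quad v_{-1} = x_2,\quad v_0 = x_3 & \\
v_1 = v_{-2} * v_{-2} & \varphi_1 = * \\
v_2 = \sin(v_{-2}) & \varphi_2 = \sin \\
v_3 = v_{-1} + v_0 & \varphi_3 = + \\
v_4 = v_1 * v_3 & \varphi_4 = * \\
v_5 = \cos(v_{-2}) & \varphi_5 = \cos \\
v_6 = 1/v_2 & \varphi_6 = \mathrm{rec} \\
v_7 = v_4 * v_6 & \varphi_7 = * \\
v_8 = v_5 * v_6 & \varphi_8 = * \\
y_1 = v_7,\quad y_2 = v_8 &
\end{array}$$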


Of course, there may be several such evaluation pro- 
grams yielding the same mathematical mapping F, and 
they may differ significantly with respect to efficiency 
and accuracy. While computer algebra packages may 
attempt to simplify or optimize the code, in the auto- 
matic differentiation approach described below, one 
generally unquestioningly accepts the given evaluation 
program as the proper problem specification. 
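
The elemental sequence listed above can be transcribed almost line by line into a program. The following sketch does so in Python; the function names are illustrative, and the closed-form comparison in `evaluate_formula` is an assumption read off from the elemental sequence rather than a formula quoted from the text.

```python
import math


def evaluate_trace(x1, x2, x3):
    """Evaluate F: R^3 -> R^2 as a fixed sequence of elemental operations."""
    v = {-2: x1, -1: x2, 0: x3}   # independents v_{-2}, v_{-1}, v_0
    v[1] = v[-2] * v[-2]          # multiplication
    v[2] = math.sin(v[-2])        # elemental function sin
    v[3] = v[-1] + v[0]           # addition
    v[4] = v[1] * v[3]            # multiplication
    v[5] = math.cos(v[-2])        # elemental function cos
    v[6] = 1.0 / v[2]             # reciprocal rec
    v[7] = v[4] * v[6]            # multiplication
    v[8] = v[5] * v[6]            # multiplication
    return v[7], v[8]             # dependents y_1, y_2


def evaluate_formula(x1, x2, x3):
    """Closed form implied by the trace (an assumption, used only as a check)."""
    s = math.sin(x1)
    return x1 * x1 * (x2 + x3) / s, math.cos(x1) / s


print(evaluate_trace(1.0, 2.0, 3.0))
print(evaluate_formula(1.0, 2.0, 3.0))
```

Note that each intermediate depends only on quantities with smaller index, which is exactly the ordering property that makes the data dependencies acyclic.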

3 Propagating Partial Derivatives 

So far we have tacitly assumed that one wishes to compute 
the derivatives of all dependents $y_i$ with respect to 
all $x_j$. These form a rectangular $m \times n$ matrix, the so-called 
Jacobian F'(x), which can be quite costly to evaluate. 
Each column of the Jacobian is the partial derivative 
of F(x) with just one component $x_j$ of x considered 
to be variable. Similarly, each row of the Jacobian is 
the gradient of just one component $y_i = F_i$ with respect 
to the complete vector x. Therefore, we will consider 
the univariate case n = 1 first, and in the next section 
we begin with the scalar-valued case m = 1. 

In celestial mechanics or for a complicated system 
like a mechanical clock, the only really independent 
variable may be the time $x_1 = t$, so that n = 1 arises 
naturally. Then the derivatives $y_i'$ of the dependent 
variables $y_i$, and also those $v_k'$ of the intermediates $v_k$, 
are in fact velocities. In the general case n = 1, we can 
interpret them as rates of change. 

So how can we compute them? Very simply, by applying 
the differentiation rules (1), (2) for addition, multiplication, 
and elemental functions, we obtain for 
$i = 1, \ldots, \ell$ one of the three instructions 

$$v_i' = v_j' + v_k', \qquad v_i' = v_j' v_k + v_j v_k', \qquad v_i' = \varphi_i'(v_j)\,v_j'. \qquad (5)$$

As is the case for $\varphi(v_j) = \sin(v_j)$, evaluating the elemental 
derivatives $\varphi'(v_j)$ usually comes at about the 
same cost as the elementals $\varphi(v_j)$ themselves. Therefore, 
compared with (3) the propagation of the partial 
derivatives $v_i'$ costs just a few additional arithmetic 
operations, so that overall 

$$\mathrm{OPS}\{y'\} \le 3\,\mathrm{OPS}\{F(x)\}. \qquad (6)$$

Here, y' = F'(x) and OPS counts arithmetic operations 
and memory accesses. When x is in fact a vector x, we 
have to propagate partial derivatives with respect to 
each one of n components to obtain the full Jacobian, 
at a cost of 

$$\mathrm{OPS}\{F'(x)\} \le 3n\,\mathrm{OPS}\{F(x)\}. \qquad (7)$$
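
One way to realize the three instructions in (5) in software is to pair every value with its derivative and let the arithmetic operators carry both along. The sketch below is a minimal illustration under that assumption, not a description of any particular tool: the class `Dual` and the helpers `sin`, `cos`, `rec`, and `jacobian` are hypothetical names, and the test function is the three-variable example whose first output is x1^2 (x2 + x3)/sin(x1) and whose second is cos(x1)/sin(x1).

```python
import math


class Dual:
    """A value v together with its derivative dv, propagated by the rules (5)."""

    def __init__(self, v, dv=0.0):
        self.v, self.dv = v, dv

    def __add__(self, other):
        # addition: the derivatives add
        return Dual(self.v + other.v, self.dv + other.dv)

    def __mul__(self, other):
        # multiplication: product rule
        return Dual(self.v * other.v, self.dv * other.v + self.v * other.dv)


def sin(u):
    return Dual(math.sin(u.v), math.cos(u.v) * u.dv)


def cos(u):
    return Dual(math.cos(u.v), -math.sin(u.v) * u.dv)


def rec(u):
    return Dual(1.0 / u.v, -u.dv / (u.v * u.v))


def F(x1, x2, x3):
    v1 = x1 * x1
    v2 = sin(x1)
    v3 = x2 + x3
    v4 = v1 * v3
    v5 = cos(x1)
    v6 = rec(v2)
    return v4 * v6, v5 * v6


def jacobian(x):
    """Build the m x n Jacobian column by column, one forward sweep per input."""
    n = len(x)
    cols = []
    for j in range(n):
        # seed the j-th independent with derivative 1, all others with 0
        args = [Dual(x[i], 1.0 if i == j else 0.0) for i in range(n)]
        cols.append([y.dv for y in F(*args)])
    m = len(cols[0])
    return [[cols[j][i] for j in range(n)] for i in range(m)]


print(jacobian([1.0, 2.0, 3.0]))
```

Each call to `F` costs a small constant multiple of one function evaluation, and `jacobian` makes n such calls, which mirrors the factor n in the bound (7).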

4 Comparison with Other Approaches 

A very similar operation count to the bound (7) is 
obtained if one approximates Jacobians by one-sided or 
central finite differences [II.11]. However, this classical 
technique cannot deliver fully accurate approximations, 
that is, with an accuracy equal to the precision 
of the underlying floating-point arithmetic. In contrast, 
applying the differentiation rules in the way described 
here imposes no inherent limitation on the accuracy. 
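
The accuracy limitation of differencing is easy to observe numerically. The sketch below, with illustrative names, approximates the derivative of sin at x = 1 by a one-sided difference for several step sizes h and compares it with the exact derivative cos(1): the error first shrinks as h decreases (truncation error) and then grows again (rounding error), so it never reaches the level of the working precision.

```python
import math


def forward_difference(f, x, h):
    """One-sided finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h


x = 1.0
exact = math.cos(x)  # derivative of sin, known in closed form

# error of the difference quotient for a range of step sizes
errors = {h: abs(forward_difference(math.sin, x, h) - exact)
          for h in (1e-2, 1e-5, 1e-8, 1e-11, 1e-14)}

for h, err in errors.items():
    print(f"h = {h:.0e}   error = {err:.1e}")
```

In double precision the best one-sided result is typically obtained for h near 1e-8, that is, with only about half of the available digits correct.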

Sometimes, automatic differentiation is portrayed as 
a halfway house between numerical differentiation by 
differencing and the kind of symbolic differentiation 
performed by computer algebra systems like Maple and 
Mathematica. In fact, though, automatic differentiation 
is much closer to the latter. The main distinction is 
that our derivatives $v_i'$ are propagated as floating-point 
numbers at the current point $x \in \mathbb{R}^n$, whereas fully 
symbolic differentiation propagates algebraic expressions 
in terms of the independent variables $x_j$ for 
$j = 1, \ldots, n$. 


By first transforming the original code (3) to (5) 
according to the differentiation rules, we actually per- 
form symbolic differentiation in some way. At the soft- 
ware level, this transition is usually implemented by 
operator overloading or source transformation. Both 
the transformation and the subsequent evaluation of 
derivatives in floating-point arithmetic have a complex- 
ity that is proportional to the complexity of the code for 
evaluating F itself. This is also true for the following 
reverse mode of automatic differentiation. 

5 Propagating Partial Adjoints 

Data fitting and other large-scale computational tasks 
require the unconstrained optimization of a scalar 
function y = F(x), where m = 1 and $n \gg 1$. One 
then needs the gradient vector $\nabla F(x) = F'(x) \in \mathbb{R}^n$ 
to guarantee iterative descent. According to (7), its cost 
relative to that of F(x) would grow like the domain 
dimension n. 

Fortunately, just as one can propagate forward the 
scalar derivatives $v_i'$ of intermediates $v_i$ with respect 
to a single independent x, it is possible to propagate 
backward the so-called adjoint derivatives $\bar v_i$ of a single 
dependent y with respect to the intermediates $v_i$ 
through the evaluation loop. More specifically, starting 
from $\bar y = \bar v_\ell = 1$ and $\bar v_i = 0$ for $1 - n \le i < \ell$ initially, 
we need to execute for $i = \ell, \ldots, 1$ the incremental 
operations 

$$\bar v_j := \bar v_j + \bar v_i, \qquad \bar v_k := \bar v_k + \bar v_i,$$
$$\bar v_j := \bar v_j + \bar v_i v_k, \qquad \bar v_k := \bar v_k + \bar v_i v_j,$$
$$\bar v_j := \bar v_j + \bar v_i\,\varphi_i'(v_j)$$

in the case of additions, multiplications, and elemental 
functions, respectively. In all three cases the adjoints 
of the arguments $v_j$, and possibly $v_k$, are updated by a 
contribution proportional to the adjoint $\bar v_i$ of the value 
$v_i$. Provided these composite operations are executed 
in reverse order to the original function evaluation, at 
the end one obtains the desired gradient: 

$$\nabla F(x) = \bar x = (\bar v_{j-n})_{j=1,\ldots,n}.$$

Moreover, just like in the forward mode one has to do 
just a few additional operations, so that 

$$\mathrm{OPS}\{\bar x\} \le 4\,\mathrm{OPS}\{F(x)\}.$$

This remarkable result is called the cheap gradient principle. 
The same scheme can be applied componentwise 
to vector-valued F with m > 1. The resulting Jacobian 
cost is 

$$\mathrm{OPS}\{F'(x)\} \le 4m\,\mathrm{OPS}\{F(x)\},$$

which is more advantageous than (7) if there are significantly 
fewer dependents than independents. 
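
The reverse sweep can be sketched with an explicit record of the operations, often called a tape. The code below is a minimal illustration, assuming a scalar function y = x1^2 (x2 + x3)/sin(x1) chosen for the purpose; the names `ELEMENTALS`, `TAPE`, and `gradient` and the tape layout are hypothetical and are not taken from any existing tool.

```python
import math

# Elemental library: name -> (function, derivative)
ELEMENTALS = {
    "sin": (math.sin, math.cos),
    "rec": (lambda u: 1.0 / u, lambda u: -1.0 / (u * u)),
}

# Hypothetical tape for y = x1^2 * (x2 + x3) / sin(x1); entries are (i, op, j, k).
TAPE = [
    (1, "mul", -2, -2),
    (2, "sin", -2, None),
    (3, "add", -1, 0),
    (4, "mul", 1, 3),
    (5, "rec", 2, None),
    (6, "mul", 4, 5),
]


def gradient(x, tape=TAPE):
    """Return (y, gradient of y) using one forward and one reverse sweep."""
    n = len(x)
    # Forward sweep: v[1-n], ..., v[0] hold the inputs, v[1], ... the intermediates.
    v = {j + 1 - n: x[j] for j in range(n)}
    for i, op, j, k in tape:
        if op == "add":
            v[i] = v[j] + v[k]
        elif op == "mul":
            v[i] = v[j] * v[k]
        else:
            v[i] = ELEMENTALS[op][0](v[j])
    # Reverse sweep: the adjoint of the output is 1, all others start at 0.
    last = tape[-1][0]
    bar = {i: 0.0 for i in v}
    bar[last] = 1.0
    for i, op, j, k in reversed(tape):
        if op == "add":
            bar[j] += bar[i]
            bar[k] += bar[i]
        elif op == "mul":
            bar[j] += bar[i] * v[k]
            bar[k] += bar[i] * v[j]
        else:
            bar[j] += bar[i] * ELEMENTALS[op][1](v[j])
    return v[last], [bar[j + 1 - n] for j in range(n)]


print(gradient([1.0, 2.0, 3.0]))
```

The whole gradient is obtained from a single reverse sweep whose length matches that of the forward sweep, independently of the number n of inputs; the price is that the intermediate values must be kept until the reverse sweep has used them.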

6 Generalizations and Extensions 

We have considered here only the propagation of first 
derivatives in the forward or reverse mode. Naturally, 
there are hybrid procedures combining elements of for- 
ward and backward propagation of derivatives. Some of 
them exploit sparsity in Jacobians and also Hessians, 
i.e., symmetric matrices of second-order partial deriva- 
tives. Automatic differentiation techniques can be used 
to evaluate univariate Taylor polynomials of any degree 
d, with the effort being quadratic in d. These Taylor 
coefficients can then be used to determine derivative 
tensors of any degree and any order. 
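
The quadratic growth of the effort with the degree d can be seen already in the multiplication rule for truncated Taylor polynomials, where each coefficient of the product is a convolution sum. The sketch below illustrates only this one rule; the function name is an assumption made for the example.

```python
def taylor_mul(u, v):
    """Coefficients of the product of two Taylor polynomials, truncated at degree d.

    u and v are lists [c_0, c_1, ..., c_d] of equal length.  Coefficient k of
    the product is a sum of k + 1 terms, so the total work is of order d^2.
    """
    d = len(u) - 1
    return [sum(u[j] * v[k - j] for j in range(k + 1)) for k in range(d + 1)]


# (1 + t)^2 = 1 + 2t + t^2, truncated at degree 2
print(taylor_mul([1.0, 1.0, 0.0], [1.0, 1.0, 0.0]))
```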

Further Reading 

The basic theory of automatic differentiation is laid out 
by Griewank and Walther (2008), and a more computer 
science-oriented introduction is given by Naumann 
(2012). For references to current activities in automatic 
differentiation, check www.autodiff.org. Diverse topics 
including software tool development and specific appli- 
cations are covered in a series of proceedings volumes 
of which Forth et al. (2012) is the most recent. 

Forth, S., P. Hovland, E. Phipps, J. Utke, and A. Walther. 
2012. Recent Advances in Automatic Differentiation. Lec- 
ture Notes in Computational Science and Engineering, 
volume 67. Berlin: Springer. 

Griewank, A., and A. Walther. 2008. Evaluating Derivatives: 
Principles and Techniques of Algorithmic Differentiation, 
2nd edn. Philadelphia, PA: SIAM. 

Naumann, U. 2012. The Art of Differentiating Computer 
Programs: An Introduction to Algorithmic Differentiation. 
Philadelphia, PA: SIAM. 


VI.8 Knotting and Linking of 
Macromolecules 

Dorothy Buck 

Knot theory has its roots in the natural sciences: Lord 
Kelvin conjectured that elements were composed of 
knotted vortices in the ether. This motivated Peter 
Guthrie Tait to begin classifying knots. Since then, knot 
theory has grown into a rich and fundamental area of 
mathematics with surprising connections to algebra, 
geometry, and mathematical physics. 

Since the 1980s, when knotted and linked deoxyri- 
bonucleic acid (DNA) molecules were discovered, knot 




Figure 1 (a) An electron microscopy image of a knotted 
DNA trefoil, (b) A trefoil knot. (Images courtesy of Shailja 
Pathania.) 

theorists have played a central role in exploring the 
biological ramifications of the topological aspects of 
DNA. This has been a successful symbiotic dialogue: 
problems in biology have suggested novel, interesting 
questions to knot theorists. In turn, new topological 
ideas arose that allowed wider biological problems to 
be considered. 

1 Knots and Links 

A knot is a curve in 3-space whose ends meet and that 
does not self-intersect, i.e., a simple closed curve. A 
planar circle is an example of the trivial knot; see fig- 
ure 1(b) for an example of the simplest nontrivial knot, 
the trefoil, also known as the (2, 3) -torus knot. Several, 
say n, knots can be intertwined together (again with- 
out self-intersections) to form an n-component link. 
Two knots or links are equivalent if it is possible to 
smoothly deform, without cutting and resealing, one 
into the other (more technically, if there is an ambient 
isotopy of 3-space that takes one to the other). 

The fundamental question in knot theory is: when are 
two different-looking knots equivalent? In particular, 
when is a complicated-looking knot actually unknot- 
ted? And when knotted, what is the most efficient 
way to unknot it? Much of 3-manifold topology has 
developed in order to provide ways of telling knots 
apart. 

2 DNA Molecules Can Knot and Link 

Anyone who has untangled their headphone cords after 
retrieving them from their backpack or handbag intu- 
itively understands how a long flexible object con- 
fined to a small volume can become entangled. DNA 
molecules in the cell are the molecular analogue of this 
situation. 

Figure 2 A DNA trefoil knot. 
(Figure courtesy of Massa Shoura.) 


DNA is composed of two intertwined strands, each 
composed of many repeated units ( nucleotides ), each 
in turn composed of a phosphate group, a sugar, and 
one of four bases (adenine, cytosine, thymine, and gua- 
nine). These two intertwined strands, like a spiral stair- 
case, form the famous double helix of DNA, wrapping 
around an imaginary central axis. Unlike a staircase, 
though, DNA molecules are flexible, and in living cells 
the central axis bends and contorts. In some cases— 
such as with bacterial DNA, chloroplast DNA, and our 
own mitochondrial DNA— the top and bottom of this 
twisted ladder are bonded together, so that the cen- 
tral axis is actually circular or even knotted (figures 1(a) 
and 2). 

When this circular DNA is copied before cell divi- 
sion, the resulting two DNA molecules are linked as 
a (2, n)-torus two-component link, where n is propor- 
tional to the number of nucleotides of the original 
DNA molecule. These knotted and linked molecules 
can be simple or complex: for example, mitochondrial 
DNA from the parasite that causes sleeping sickness 
forms an intricate network of thousands of small linked 
circles that resemble chain mail. 

In the cell, this (circular or linear) DNA is confined to a 
roughly spherical volume whose diameter is many orders of 
magnitude smaller than the length of the molecule itself. 
(For example, in humans, a meter of DNA resides in a 
nucleus whose diameter is $10^{-6}$ meters.) This extreme 
confinement means that the DNA axis can writhe (forming 
supercoils). As discussed in more detail below, when 
certain enzymes act on circular DNA, they can knot 
or link (or indeed unknot or unlink) DNA molecules, 
converting these supercoils into knot or link nodes (or 
vice versa). 

Linear DNA, such as human chromosomal DNA, is 
often “stapled down” at regular intervals via a pro- 
tein scaffolding. These bulky proteins ensure that if a 
DNA segment in between becomes knotted, it cannot be 
untied. Even linear DNA can therefore exhibit nontrivial 
knotting. 

2.1 DNA Forms Only Certain Knots and Links 

Experimentally, DNA knots and links can be resolved 
in two ways: via electron microscopy or via elec- 
trophoretic migration. DNA molecules have been visu- 
alized by electron microscopy (figure 1(a)). However, 
this process can be laborious and difficult: even deci- 
phering the sign of crossings on an electron micro- 
graph is nontrivial. A much more widespread technique 
to partially separate DNA knot and link types is that 
of agarose gel electrophoresis. Gel electrophoresis is 
straightforward and requires relatively small amounts 
of DNA. The distance a DNA knot or link migrates 
through the gel is proportional to the minimal crossing 
number (MCN), the smallest number of crossings with 
which a knot can be drawn. Under standard conditions, 
knots of greater MCN migrate more rapidly than those 
with lesser MCN. However, there are 1 701 936 knots 
with MCN $\le 16$, so both mathematicians and experimentalists 
have been actively developing new methods 
for determining (or predicting) the precise DNA knot 
or link type. They have demonstrated that a very small 
subfamily of knots and links (including the torus knots 
and links above) appear in DNA. 

3 DNA Knotting and Linking 
Is Biologically Important 

The helical nature of DNA leads to a fundamental topo- 
logical problem: the two strands of DNA are wrapped 
around each other once for every 10.5 base pairs, or 
0.6 billion times in every human cell, and they must 
be unlinked so that the DNA can be copied at every cell 
division. Additionally, in the cell, DNA knots can inhibit 
important cellular functions (such as transcription) and 
are therefore lethal. 
Unsurprisingly, then, there are a host of proteins, in 
every organism, that carefully regulate this knotting 
and linking. 

4 Unknotting and Unlinking DNA 

In organisms as varied as bacteria, mice, humans, 
and plants, there are proteins (type II topoisomerases) 
whose sole function is to unlink and unknot entan- 
gled DNA. These act by performing crossing changes to 
convert nontrivial knots and links into planar circles. 
There has been significant interdisciplinary research 
to understand how these proteins that act locally at 
DNA crossings perform a global reaction (unknotting/ 
unlinking). 

These topoisomerases are major antibiotic and che- 
motherapeutic drug targets. With topoisomerases inac- 
tivated (or stranded in an intermediate state of the reac- 
tion), the resulting trapped linked DNA molecules pre- 
vent proliferation of the bacteria (or cancerous cells) as 
well as being lethal to the original cell. 

5 Knotting and Linking DNA 

There are also other important proteins, site-specific 
recombinases, that affect the knot/link type of DNA. 
These site-specific recombinases are standard tools for 
genetically modifying organisms and in synthetic biol- 
ogy. The proteins alter the order of the sequence of the 
DNA base pairs by deletion, insertion, or inversion of 
a DNA segment. While changing DNA knotting or link- 
ing is not the primary function of these site-specific 
recombinases, it is a by-product of the reaction when 
the original circular DNA is supercoiled. 

Using tools from knot theory, mathematicians have 
been able to help biologists better understand the ways 
in which these proteins interact with DNA. For exam- 
ple, mathematicians have developed models of how 
the recombinase proteins reshuffle the DNA sequence. 
These models can then predict various new features 
of these interactions, such as the particular geomet- 
ric configuration the DNA takes when the protein is 
attached or the biochemical pathway the reaction pro- 
ceeds through. 

6 Mathematical Methods of Study 

As discussed above, topology has been used to under- 
stand topoisomerases and recombinases and their 
interactions with DNA. The rough character of these 



Figure 3 Two segments (lighter and darker) of DNA in a 
rational tangle. (Figure courtesy of Kenneth L. Baker.) 

arguments is to begin by modeling the circular DNA 
molecule in terms of tangles (a tangle is a 3-ball with 
two properly embedded arcs (figure 3)). The action of 
the protein on the DNA, which converts it from one 
knot type to another, is then represented as tangle 
surgery: replacing one tangle by another. If the tangles 
involved are rational (a tangle obtained from two verti- 
cal arcs by alternately twisting the two arcs horizontally 
then vertically around one another, such as in figure 3), 
then this tangle surgery corresponds to so-called Dehn 
surgery in the corresponding double branch cover. 
Questions about which tangles are involved then turn 
into questions about Dehn surgery, often yielding lens 
spaces. These latter questions are central to 3-manifold 
topology, and a rich array of methods have been devel- 
oped to answer them, including, most recently, knot 
homologies. 

7 Knotted Proteins 

We conclude by discussing one other family of macro- 
molecules that can become entangled. While almost 
all proteins are linear biopolymers, the rigid geom- 
etry of a folded protein means that entanglement can 
become trapped. These entanglements can be knots, 
slipknots, or ravels. Originally thought to be experi- 
mental artifacts, many “knotted” proteins have recently 
been discovered in the Protein DataBase of the National 
Center for Biotechnology Information. The function of 
these knots is still being fully explored, but they may 
contribute to thermal and/or mechanical stability. 

If the ends of the linear entangled protein chain are 
at the surface of the folded protein, then joining them 
together to characterize the underlying knot is fairly 
straightforward. However, if these ends are buried, 
then depending on the closure one can achieve a vari- 
ety of knots. Characterizing this entanglement has thus 
become a new focus for interdisciplinary researchers, 
including knot theorists. 

Further Reading 

Buck, D. 2009. DNA topology. In Applications of Knot 
Theory, edited by D. Buck and E. Flapan, pp. 47-79. 
Providence, RI: American Mathematical Society. 

Buck, D., and K. Valencia. 2011. Characterization of knots 
and links arising from site-specific recombination on 
twist knots. Journal of Physics A 44:045002. 

Hoste, J., M. Thistlethwaite, and J. Weeks. 1998. The first 
1,701,936 knots. Mathematical Intelligencer 20:33-48. 

Maxwell, A., and A. Bates. 2005. DNA Topology. Oxford: 
Oxford University Press. 

Millett, K., E. Rawdon, A. Stasiak, and J. Sulkowska. 2013. 
Identifying knots in proteins. Biochemical Society Trans- 
actions 41:533-37. 


VI.9 Ranking Web Pages 

David F. Gleich and Paul G. Constantine 

Google’s search engine enables Internet users around 
the world to find web pages that are relevant to 
their queries. Early on, Google distinguished its search 
algorithm from competing methods by combining a 
measure of a page’s textual relevance with a mea- 
sure of the page’s global importance; this latter mea- 
sure was dubbed PageRank. Google’s PageRank scores 
help distinguish important pages like Purdue Univer- 
sity’s home page, www.purdue.edu, from its array of 
subpages. 

If we view the web as a huge directed graph, then 
PageRank scores are the stationary distribution of a 
particular Markov chain [II.25] on the graph. In the 
web graph, each web page is a node and there is a 
directed edge from node i to node j if web page i has a 
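hypertext reference, or link, to web page j. A small sample 
of the web graph from Wikipedia pages is shown in 
figure 1. The adjacency matrix [II.16] for the graph 
from the figure is 

$$\left[\begin{array}{cccccccccc}
0&1&1&1&1&0&1&0&0&0\\
1&0&0&0&0&0&0&0&0&0\\
0&0&0&0&1&1&1&0&0&0\\
1&1&0&0&1&1&0&1&0&0\\
1&1&1&1&0&0&0&0&1&0\\
0&0&1&0&0&0&1&0&1&1\\
0&0&0&0&0&1&0&0&0&1\\
0&0&0&0&0&0&0&0&1&0\\
0&0&0&0&0&0&0&1&0&0\\
0&0&0&0&0&0&0&0&0&0
\end{array}\right]
\begin{array}{l}
\text{PageRank}\\ \text{Google}\\ \text{Adjacency matrix}\\ \text{Markov chain}\\ \text{Eigenvector}\\ \text{Directed graph}\\ \text{Graph}\\ \text{Linear system}\\ \text{Vector space}\\ \text{Multiset}
\end{array}$$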

Google’s founders Sergey Brin and Larry Page imag- 
ined an idealized web surfer with the following behav- 
ior. At a given page, the surfer flips a coin with probability 
$\alpha$ of heads and probability $1 - \alpha$ of tails. On 
heads, the surfer clicks a link chosen uniformly at ran- 
dom from all the links on the page. If the page has no 
links, then we call it a dangling node. On tails, and at 
dangling nodes as well, the surfer jumps to a page cho- 
sen uniformly at random from the whole graph. (There 
are alternative models for handling dangling nodes.) 
This simple model of a random surfer creates a Markov 
chain on the web graph; transitions depend only on the 
current page and not the web surfer’s browsing history. 
The PageRank vector is the stationary distribution of 
this Markov chain, which depends on the value of $\alpha$. 
For the graph in figure 1, the transition matrix P for 
the PageRank Markov chain with $\alpha = 0.85$ is (to two 
decimal places) 

$$P = \left[\begin{array}{cccccccccc}
0.02&0.19&0.19&0.19&0.19&0.02&0.19&0.02&0.02&0.02\\
0.86&0.02&0.02&0.02&0.02&0.02&0.02&0.02&0.02&0.02\\
0.02&0.02&0.02&0.02&0.30&0.30&0.30&0.02&0.02&0.02\\
0.19&0.19&0.02&0.02&0.19&0.19&0.02&0.19&0.02&0.02\\
0.19&0.19&0.19&0.19&0.02&0.02&0.02&0.02&0.19&0.02\\
0.02&0.02&0.23&0.02&0.02&0.02&0.23&0.02&0.23&0.23\\
0.02&0.02&0.02&0.02&0.02&0.44&0.02&0.02&0.02&0.44\\
0.02&0.02&0.02&0.02&0.02&0.02&0.02&0.02&0.86&0.02\\
0.02&0.02&0.02&0.02&0.02&0.02&0.02&0.86&0.02&0.02\\
0.10&0.10&0.10&0.10&0.10&0.10&0.10&0.10&0.10&0.10
\end{array}\right]
\begin{array}{l}
\text{PR}\\ \text{Go}\\ \text{AM}\\ \text{MC}\\ \text{Ei}\\ \text{DG}\\ \text{Gr}\\ \text{LS}\\ \text{VS}\\ \text{MS}
\end{array}$$

The PageRank vector x is 

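$$x = \begin{bmatrix}
0.08\\ 0.05\\ 0.06\\ 0.04\\ 0.06\\ 0.07\\ 0.07\\ 0.24\\ 0.25\\ 0.06
\end{bmatrix}
\begin{array}{l}
\text{PR}\\ \text{Go}\\ \text{AM}\\ \text{MC}\\ \text{Ei}\\ \text{DG}\\ \text{Gr}\\ \text{LS}\\ \text{VS}\\ \text{MS}
\end{array}$$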

Figure 1 A subgraph of the web from Wikipedia. 


To define the PageRank Markov chain generally: 

• let G = (V,E) be the web graph; 

• let A be the adjacency matrix; 

• let n = |V| be the number of nodes; 

• let D be a diagonal matrix where $D_{ii}$ is 1 divided by 
the number of outlinks for page i, or 0 if page i has 
no outlinks; 

• let d be an n-vector where $d_i = 1$ if page i is 
dangling with no outlinks, and 0 otherwise; and 

• let e be the n-vector of all ones. 

Then 

$$P = \alpha D A + \frac{\alpha}{n}\, d e^T + \frac{1 - \alpha}{n}\, e e^T$$

is the PageRank Markov chain’s transition matrix. The 
PageRank vector x is the eigenvector of $P^T$ with eigenvalue 1: 

$$P^T x = x.$$
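
The definition of P can be checked against the small Wikipedia example. The sketch below builds the transition matrix row by row from the adjacency matrix of figure 1; the function name `transition_matrix` is illustrative, and plain Python lists are used only to keep the example self-contained.

```python
# Adjacency matrix of the ten-page example (rows: PR, Go, AM, MC, Ei, DG, Gr, LS, VS, MS)
A = [
    [0, 1, 1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def transition_matrix(A, alpha=0.85):
    """Row-stochastic PageRank matrix: follow a link with probability alpha,
    otherwise (and always at dangling nodes) jump to a uniformly random page."""
    n = len(A)
    P = []
    for row in A:
        out = sum(row)
        if out == 0:
            # dangling node: alpha/n + (1 - alpha)/n = 1/n in every column
            P.append([1.0 / n] * n)
        else:
            P.append([alpha * a / out + (1 - alpha) / n for a in row])
    return P


P = transition_matrix(A)
for row in P:
    print(" ".join(f"{p:.2f}" for p in row))
```

Every row of the resulting matrix sums to 1, as it must for the transition matrix of a Markov chain.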

This eigenvector always exists, is nonnegative, and is 
unique up to scaling because the matrix P is irreducible. 
By convention, the vector x is normalized to be a probability 
distribution, $e^T x = 1$. Therefore, x is unique for 
a given web graph and $0 < \alpha < 1$. The vector x also 
satisfies the nonsingular linear system 

$$\left(I - \alpha\left(A^T D + \tfrac{1}{n}\, e d^T\right)\right) x = \frac{1 - \alpha}{n}\, e.$$

Thus, PageRank is both an eigenvector of the Markov 
chain transition matrix and the solution of a nonsingular 
linear system. This duality gives rise to a variety of 
efficient algorithms to compute x for a graph as large 
as the web. As the value of $\alpha$ tends to 1, the matrix 
$I - \alpha(A^T D + \tfrac{1}{n}\, e d^T)$ becomes more ill-conditioned, 
and computing PageRank becomes more difficult. However, 
when $\alpha$ is too close to 1, the quality of the ranking 


degrades. For the graph in figure 1, as $\alpha$ tends to 1, 
the PageRank vector concentrates all its mass on the 
pair “Linear system” and “Vector space.” This happens 
because this pair is a terminal strong component of 
the graph. The same behavior occurs in the web graph; 
consequently, the PageRank scores of important pages 
such as www.purdue.edu are extinguished as $\alpha \to 1$. 
We recommend $0.5 \le \alpha \le 0.99$ and note that $\alpha = 0.85$ 
is a standard choice. 
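
For a graph as small as the example, the linear-system formulation can be solved directly, which also makes the effect of the damping parameter visible. The sketch below assembles the matrix of the linear system for the ten-page example and solves it by Gaussian elimination with partial pivoting; the helper names `solve` and `pagerank_linear` are illustrative, and a dense direct solve is an assumption suitable only for tiny graphs, not for the web.

```python
# Adjacency matrix of the ten-page example (rows: PR, Go, AM, MC, Ei, DG, Gr, LS, VS, MS)
A = [
    [0, 1, 1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def pagerank_linear(A, alpha):
    """PageRank from (I - alpha (A^T D + (1/n) e d^T)) x = (1 - alpha)/n e."""
    n = len(A)
    out = [sum(row) for row in A]
    # Column j of A^T D + (1/n) e d^T says where page j sends its weight.
    M = [[(1.0 if i == j else 0.0)
          - alpha * (A[j][i] / out[j] if out[j] else 1.0 / n)
          for j in range(n)] for i in range(n)]
    return solve(M, [(1 - alpha) / n] * n)


for alpha in (0.85, 0.999):
    x = pagerank_linear(A, alpha)
    print(alpha, [round(xi, 2) for xi in x])
```

For alpha = 0.85 the solution should agree with the PageRank vector quoted for the example, while for alpha = 0.999 nearly all of the mass sits on the eighth and ninth entries, the pair forming the terminal strong component.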

The canonical algorithm for computing PageRank 
scores is the power method [IV.10 §5.5] applied to the 
eigenvector equation $P^T x = x$ with the normalization 
$e^T x = 1$. If we start with $x^{(0)} = e/n$ and iterate, 

$$x^{(k+1)} = \alpha A^T D\, x^{(k)} + \frac{\alpha\,(d^T x^{(k)})}{n}\, e + \frac{1 - \alpha}{n}\, e,$$

then after k steps of this method, $\|x^{(k)} - x\|_1 \le 2\alpha^k$. 
This iteration converges quickly when $\alpha \le 0.99$. With 
$\alpha = 0.85$, it gives a useful approximation with 10-15 
iterations. 
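
The iteration needs only the links of each page and never forms P explicitly, which is what makes it practical for very large graphs. The sketch below applies it to the ten-page example; the function name `pagerank_power` and the use of plain Python lists are choices made for illustration.

```python
# Adjacency matrix of the ten-page example (rows: PR, Go, AM, MC, Ei, DG, Gr, LS, VS, MS)
A = [
    [0, 1, 1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]


def pagerank_power(A, alpha=0.85, iters=15):
    """Power method for PageRank, starting from the uniform vector e/n."""
    n = len(A)
    out = [sum(row) for row in A]
    x = [1.0 / n] * n
    for _ in range(iters):
        dangling = sum(x[i] for i in range(n) if out[i] == 0)  # d^T x
        base = (alpha * dangling + (1 - alpha)) / n            # uniform part
        new = [base] * n
        for i in range(n):
            if out[i]:
                share = alpha * x[i] / out[i]                  # weight per outlink
                for j in range(n):
                    if A[i][j]:
                        new[j] += share
        x = new
    return x


x15 = pagerank_power(A, 0.85, 15)
print([round(xi, 2) for xi in x15])
```

Because every step preserves the sum of the entries, the iterates remain probability distributions and no renormalization is needed.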

PageRank was not the first web ranking to use 
the structure of the web graph. Shortly before Brin 
and Page proposed PageRank, Jon Kleinberg proposed 
hyperlink-induced topic search [I.1 §3.1] (HITS) 
scores to estimate the importance of pages in a query- 
dependent subset of the web. HITS scores are com- 
puted only on a subgraph of the web graph, with the top 
1000 textually relevant pages and all inlink and outlink 
neighbors within distance 2. The left and right dominant
singular vectors [II.32] of the adjacency matrix
are hub and authority scores for each page, respec- 
tively. An authority is a page with many hubs pointing 
to it, and a hub is a page that points to many authori- 
ties. The now-defunct search engine Teoma used scores 
related to HITS. 
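
The hub and authority scores can be approximated without a full singular value decomposition by alternating two updates, as in the sketch below. The function name `hits`, the edge-list input, and the iteration count are assumptions for illustration, and the sketch ignores the construction of the query-dependent subgraph.

```python
import math


def _normalize(v):
    """Scale v to unit Euclidean length (left unchanged if it is zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v] if norm > 0 else v


def hits(edges, n, iters=50):
    """Approximate hub and authority scores by alternating power iteration.

    edges is a list of (i, j) pairs meaning page i links to page j.
    """
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority update a = A^T h: sum the hub scores of pages linking in.
        auth = [0.0] * n
        for i, j in edges:
            auth[j] += hub[i]
        auth = _normalize(auth)
        # Hub update h = A a: sum the authority scores of pages linked to.
        hub = [0.0] * n
        for i, j in edges:
            hub[i] += auth[j]
        hub = _normalize(hub)
    return hub, auth
```

For a hypothetical four-page graph with links 0 to 2, 1 to 2, and 0 to 3, page 2 receives the largest authority score and page 0 the largest hub score.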

Modern search engines use complex algorithms to 
produce a ranked list of web pages in response to a 
query; scores such as PageRank and HITS may be one 
component of much larger systems. To judge the qual- 
ity of a ranking, search engine architects must compare 
algorithmic results with human judgments of the rele- 
vance of pages to a particular query. There are a few 
common measures for such comparisons. Precision is 
the percentage of the search engine’s results that are 
relevant to the human. Recall is the percentage of all 
relevant results identified by the search engine. Recall 
is not often used for web search as there are often many 
more relevant results than would fit on a top 10, or 
even a top 1000, list. Normalized discounted cumulative 
gain is a weighted score that rewards a search engine 
for placing more highly relevant documents before less 
relevant documents. Architects study these measures
over a variety of queries to optimize a search engine 
and choose components of their final ranking proce- 
dure. Two active research areas on ranking algorithms 
include new types of regression problems to automat- 
ically optimize ranked lists of results and multiarmed 
bandit problems to generate personalized rankings for 
both web search engines and content recommendation 
services like Netflix. 
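
The three measures can be made concrete with a short sketch. The function names are hypothetical, and the logarithmic discount used for normalized discounted cumulative gain is one common convention that is assumed here, not a formula taken from the text.

```python
import math


def precision(results, relevant):
    """Fraction of returned results that are relevant."""
    return sum(1 for r in results if r in relevant) / len(results)


def recall(results, relevant):
    """Fraction of all relevant items that were returned."""
    return sum(1 for r in results if r in relevant) / len(relevant)


def ndcg(gains):
    """Normalized discounted cumulative gain.

    gains lists graded relevance scores in ranked order. The result at
    position p (counting from 1) is discounted by log2(p + 1), which is
    an assumed, commonly used choice.
    """
    def dcg(g):
        return sum(v / math.log2(pos + 2) for pos, v in enumerate(g))

    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A ranking that already lists documents in decreasing order of relevance scores exactly 1 under `ndcg`; any other ordering of unequal gains scores less.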

Further Reading 

Langville and Meyer’s book contains a complete treat- 
ment of mathematical web ranking metrics that were 
publicly available in 2005. Manning et al. give a modern 
treatment of search algorithms including web search. 
While PageRank’s influence in Google’s ranking may 
have decreased over time, its importance as a tool 
for finding important nodes in a graph has grown 
tremendously. PageRank vectors for graphs from biol- 
ogy (Singh et al.), chemistry (Mooney et al.), and ecol- 
ogy (Allesina and Pascual) have given domain scientists 
important new insights. 

Allesina, S., and M. Pascual. 2009. Googling food webs: can
an eigenvector measure species’ importance for coextinctions?
PLoS Computational Biology 5:e1000494.

Langville, A. N., and C. D. Meyer. 2006. Google's PageRank
and Beyond. Princeton, NJ: Princeton University Press.

Manning, C. D., P. Raghavan, and H. Schütze. 2008. An
Introduction to Information Retrieval. Cambridge: Cambridge
University Press.

Mooney, B. L., L. R. Corrales, and A. E. Clark. 2012.
MoleculaRnetworks: an integrated graph theoretic and data
mining tool to explore solvent organization in molecular
simulation. Journal of Computational Chemistry 33:853-60.

Singh, R., J. Xu, and B. Berger. 2008. Global alignment
of multiple protein interaction networks with application
to functional orthology detection. Proceedings of the
National Academy of Sciences of the USA 105:12763-68.


VI.10 Searching a Graph

Timothy A. Davis 


When you ask your smartphone or GPS to find the best 
route from Seattle to New York, how does it find a 
good route? If you ask it to help you drive from Seattle 
to Hawaii, how does it know you cannot do that? If 
you post something on a social network for only your
friends and friends of friends, how many people can
see it? These questions can all be posed in terms of
searching a graph (also called graph traversal).

For an unweighted graph $G = (V, E)$ of $n$ nodes, storing a binary
adjacency matrix $A$ in a conventional array using $O(n^2)$ memory
works well in a graph traversal algorithm if the graph is small or if
there are edges between many of the nodes. In this matrix, $a_{ij} = 1$
if $(i, j)$ is an edge, and 0 otherwise. With edge weights, the value
of $a_{ij}$ in a real matrix can represent the weight of the edge
$(i, j)$, although this assumes that in the problem at hand an edge
with zero weight is the same as no edge at all.

A road network can be represented as a weighted 
directed graph, with each node an intersection and 
each edge a road between them. The graph is directed 
because some roads are one-way, and it is unconnected 
because you cannot drive from Seattle to Hawaii. The 
edge weight $a_{ij}$ represents the distance along the road
from i to j; all edge weights must therefore be greater 
than zero. 

An adjacency matrix works well for a single small 
town, but it takes too much memory for the millions 
of intersections in the North American road network. 
Fortunately, only a handful of roads intersect at any 
one node, and thus the adjacency matrix is mostly zero, 
or sparse. A matrix is sparse if it pays to exploit its 
many zero entries. This kind of matrix or graph is best 
represented with adjacency lists, where each node i has 
a list of nodes j that it is adjacent to, along with the 
weights of the corresponding edges. 

Searching a graph from a single source node, s, discovers
all nodes reachable from s via path(s) in the
graph. Nodes are marked when visited, so to search the 
entire graph, a single-source algorithm can be repeated 
by starting a new search from each node in the graph, 
ignoring nodes that have already been visited. The 
graph traversal then performs actions on each node vis- 
ited, and also records the order in which the graph was 
traversed. Often, only a small part of the graph needs 
to be searched. This can greatly speed up the solution 
of problems such as route planning. 

Breadth-first search (BFS) is the simplest way to 
search a graph. It ignores edge weights, so it is suited 
only for unweighted graphs. It starts by placing the
source node s at distance $d(s) = 0$; the distance of all
other nodes starts as $d(i) = \infty$. At the kth step (starting
at $k = 0$), all nodes i at distance $d(i) = k$ are examined,
and any neighbors j with $d(j) = \infty$ have their distance
$d(j)$ set to $k + 1$. The process halts when step k finds no
such neighbors; $d(j)$ is then the length of the shortest
path from s to j, or $d(j) = \infty$ if there is no such path.
In a social network, your friends are at level one and
your friends of friends are at level two in a BFS starting
at your node. The BFS algorithm can also be expressed
in terms of the binary adjacency matrix: $[(A^T + I)^k]_{is}$ is
nonzero if and only if there is a path of length k or less
from s to i.
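
A minimal Python sketch of BFS on adjacency lists follows. The function name `bfs_levels`, the dictionary-based graph, and the toy social network are assumptions for illustration. Instead of storing an infinite distance, the sketch simply leaves unreachable nodes out of the result.

```python
from collections import deque


def bfs_levels(adj, s):
    """Return a dict mapping each node reachable from s to its BFS level.

    adj maps a node to the list of its neighbors. Nodes missing from the
    returned dict are unreachable (their distance would be infinite).
    """
    dist = {s: 0}
    queue = deque([s])
    while queue:
        i = queue.popleft()
        for j in adj.get(i, []):
            if j not in dist:  # first visit gives the shortest distance
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist


if __name__ == "__main__":
    # Hypothetical friendship graph; "eve" is not connected to anyone.
    friends = {
        "you": ["ann", "bob"],
        "ann": ["you", "cat"],
        "bob": ["you"],
        "cat": ["ann", "dan"],
        "dan": ["cat"],
        "eve": [],
    }
    levels = bfs_levels(friends, "you")
    audience = [p for p, d in levels.items() if 1 <= d <= 2]
    print(sorted(audience))  # friends and friends of friends
```

The queue processes nodes level by level, so each node and each edge is handled once.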

Depth-first search (DFS) visits the same nodes as BFS 
but in a different order. If it sees an unvisited node j 
while examining node i, it fully discovers all unvisited 
nodes reachable from j and then backtracks to node i 
to consider the remainder of the nodes adjacent to i. It 
is best described recursively, as below. All nodes start 
out unvisited. 

DFS(i):

    mark i as visited
    for all nodes j adjacent to i do:
        if node j is not visited
            DFS(j)

Both BFS and DFS describe a tree; i is the parent of 
j if the unvisited node j is discovered while examin- 
ing node i. The DFS tree has a rich set of mathematical 
properties. For example, if “(i” is printed at the start
of DFS(i) and “i)” when it finishes (after traversing all 
its neighbors j), then the result is an expression with 
properly nested and matching parentheses. The paren- 
theses of two nodes i and j are either nested one within 
the other, or they are disjoint. Many problems beyond 
the scope of this article rely on properties of the DFS 
tree such as this. 
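
The nesting property can be checked with a short recursive sketch. The function name `dfs_parens` is hypothetical, and the six-node graph (edges 1-2, 2-3, 3-4, 4-5, 4-6) is an assumed example chosen so that the output reproduces the string quoted in the caption of figure 1; the edge set of the actual figure may differ.

```python
def dfs_parens(adj, i, visited=None, out=None):
    """Recursive DFS that records "(i" on entry to node i and "i)" on exit.

    adj maps each node to a list of its neighbors, which are examined in
    ascending order of node number.
    """
    if visited is None:
        visited, out = set(), []
    visited.add(i)
    out.append(f"({i}")
    for j in sorted(adj[i]):
        if j not in visited:
            dfs_parens(adj, j, visited, out)
    out.append(f"{i})")
    return out


if __name__ == "__main__":
    # Hypothetical undirected graph stored as symmetric adjacency lists.
    graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5, 6], 5: [4], 6: [4]}
    print(" ".join(dfs_parens(graph, 1)))
    # prints: (1 (2 (3 (4 (5 5) (6 6) 4) 3) 2) 1)
```

For very deep graphs a recursive formulation can exceed Python's recursion limit, in which case an explicit stack is the usual substitute.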

If the graph is stored in adjacency list form, both BFS 
and DFS take an amount of time that is linear in the 
size of the graph: $O(|V| + |E|)$, where $|V|$ and $|E|$ are
the number of nodes and edges, respectively. 

Examples of a BFS traversal and a DFS traversal of 
an undirected graph are shown in figure 1. The search 
starts at node 1. The levels of the BFS search and the 
parentheses of the DFS are shown. The DFS of fig- 
ure 1 assumes that neighbors are searched in ascending 
order of node number (the order makes no difference 
in the BFS). Both the BFS and DFS trees are shown via 
arrows, which point from the parent to the child in the 
tree; they do not denote edge directions since the graph 
is undirected. 

Figure 1 (a) The BFS and (b) the DFS of an undirected graph.
Arrows denote tree edges, not edge directions. Part (a) is
labeled with the BFS level of each node, starting at level
zero, where the BFS started, and ending at the last node
discovered (node 5) at level 4. In (b), if “(i” is printed when
the DFS starts at node i, and “i)” is printed when node i
finishes, then the result when starting a DFS at node 1 is
(1 (2 (3 (4 (5 5) (6 6) 4) 3) 2) 1), which describes how the
DFS traversed the graph.

Dijkstra’s algorithm finds the shortest path from the
source node s to all other nodes in a weighted graph. It
is roughly similar to BFS, except that it keeps track of a
distance d(j) (the shortest path known so far to each
node) for each node j. Instead of examining all nodes in
the next level, it prioritizes them by the distance d and
picks just one unvisited node i with the smallest d(i),
whose distance d(i) is now final, and updates the tentative
distance d for all its neighbors.

The algorithm uses a heap to keep track of its unvisited
nodes j, each with a metric d(j). Removing the
item with smallest metric takes $O(\log n)$ time if the
heap contains n items. If an item's metric changes but
it remains in the heap, it takes $O(\log n)$ time to adjust
its position in the heap. Initializing a heap of n items
takes $O(n)$ time.

Nodes no longer in the heap have been visited, and
their d(j) is the length of the shortest path from s to j.
The shortest path can be traced backward from j to s by
walking the tree from j to its parent p(j), then to p(p(j)),
and so on until reaching s.

Nodes in the heap have not been visited, and their 
d(j) is tentative. They split into two kinds of nodes: 
those with finite d (in the frontier), and those with infi- 
nite d. Each node in the frontier is incident on at least 
one edge from the visited nodes. The node in the fron- 
tier with the smallest d(j) has a very useful property on 
which the algorithm relies: its d(j) is the true shortest 
distance of the path from s to j. The algorithm selects 
this node and then updates its neighbors as j moves 
from the frontier into the set of visited nodes. 

The algorithm finds the shortest path from s to all
other nodes in $O((|V| + |E|) \log |V|)$ time; this asymptotic
time can be reduced with a Fibonacci heap, but
in practice a conventional heap is faster. More impor- 
tantly, the search can halt early if a particular target 
node t is sought, although in the worst case the entire 
graph must still be searched. 
Dijkstra(G, s):

    set d(j) = ∞ and p(j) = 0 for all nodes j
    d(s) = 0
    initialize a heap for all nodes with metric d
    while the heap is not empty do:
        remove i from heap with smallest d(i)
        for all nodes j adjacent to i do:
            if d(j) > d(i) + a_ij
                d(j) = d(i) + a_ij
                adjust j in the heap
                p(j) = i
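
A runnable counterpart in Python might look as follows. It is a sketch under stated assumptions: the names `dijkstra` and `path_to` and the adjacency-list format are hypothetical. Python's `heapq` module has no operation for adjusting an item's position, so the sketch substitutes a common workaround: it pushes a fresh entry whenever a distance improves and skips stale entries when they are popped. The heap also starts with only the source rather than all nodes.

```python
import heapq
import math


def dijkstra(adj, s):
    """Shortest distances d and parent pointers p from source s.

    adj maps every node to a list of (neighbor, weight) pairs with
    positive weights.
    """
    d = {i: math.inf for i in adj}
    p = {i: None for i in adj}
    d[s] = 0.0
    heap = [(0.0, s)]
    visited = set()
    while heap:
        di, i = heapq.heappop(heap)
        if i in visited:
            continue  # stale entry left over from an earlier, larger d(i)
        visited.add(i)  # d(i) is now final
        for j, w in adj[i]:
            if d[j] > di + w:
                d[j] = di + w
                p[j] = i
                heapq.heappush(heap, (d[j], j))  # stands in for "adjust j"
    return d, p


def path_to(p, s, j):
    """Trace parent pointers back from a reachable node j to the source s."""
    path = [j]
    while path[-1] != s:
        path.append(p[path[-1]])
    return path[::-1]
```

With lazy deletion the heap can hold up to one entry per edge, which does not change the asymptotic running time.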

For a route from Seattle to New York, Dijkstra’s 
algorithm acts like BFS, searching for New York in 
ever-widening circles centered at Seattle. It considers 
a route through Alaska before finding New York, which 
is clearly nonoptimal. A* search (pronounced A-star) is 
one method that can exploit the known locations of the 
nodes of a road network to reduce the search. It modifies
the heap metric via a heuristic that uses a lower bound on
the shortest distance from j to the target node t. To find
the shortest path from the source node s to all other 
nodes, both algorithms find the same result in the same 
time, but if they halt early when finding the target, A* is 
typically faster than Dijkstra’s algorithm. Your smart- 
phone uses additional methods to speed up the search, 
but they are based on the ideas described here. 
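
As a hedged illustration of A* search, the sketch below uses the straight-line distance to the target as the heuristic. The names `a_star` and `coords`, the input format, and the toy map are assumptions for the example. The sketch also assumes that every edge weight is at least the straight-line distance between its endpoints, which is what makes the heuristic a valid lower bound.

```python
import heapq
import math


def a_star(adj, coords, s, t):
    """Return (shortest distance from s to t, set of nodes expanded).

    adj maps every node to a list of (neighbor, weight) pairs and coords
    maps every node to an (x, y) position.
    """
    def h(j):
        # Straight-line distance from j to the target t.
        return math.hypot(coords[j][0] - coords[t][0],
                          coords[j][1] - coords[t][1])

    d = {s: 0.0}
    heap = [(h(s), s)]  # heap metric is d(j) + h(j)
    expanded = set()
    while heap:
        _, i = heapq.heappop(heap)
        if i == t:
            return d[i], expanded  # halt early once the target is reached
        if i in expanded:
            continue  # stale heap entry
        expanded.add(i)
        for j, w in adj[i]:
            if d[i] + w < d.get(j, math.inf):
                d[j] = d[i] + w
                heapq.heappush(heap, (d[j] + h(j), j))
    return math.inf, expanded
```

Setting the heuristic to zero everywhere turns this sketch back into Dijkstra's algorithm with early termination, which makes the two easy to compare on the same graph.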

Further Reading 

Kepner, J., and J. Gilbert. 2011. Graph Algorithms in the 
Language of Linear Algebra. Philadelphia, PA: SIAM. 
Rosen, K. H. 2012. Discrete Mathematics and Its Applica- 
tions, 7th edn. Columbus, OH: McGraw-Hill. 


VI.11 Evaluating Elementary Functions

Florent de Dinechin and Jean-Michel Muller


1 Introduction 

Elementary (transcendental) functions, such as expo- 
nentials, logarithms, sines, etc., play a central role in 
computing. It is therefore important to evaluate them 
quickly and accurately. 

The basic operations that are easily and efficiently
implemented on integrated circuits are addition, subtraction,
and comparison (at the same cost), and multiplication
(at a larger cost). The only functions of one variable that
can be implemented using these primitive operations are
piecewise polynomials. Section 2 therefore studies the
approximation of a function by a polynomial. Such
approximations are better on smaller intervals, so section 3
introduces techniques that reduce the interval size.
Section 4 discusses the evaluation of a polynomial by a
machine.

Figure 1 Taylor approximation is local: it becomes very
inaccurate for values distant from 0 (solid line, Taylor
approximation error).

If we also consider division to be a primitive opera- 
tion, we may compute rational functions, to which most 
of these techniques can be generalized. Finally, the last 
important parameter is memory, which may be used 
to store tables of precomputed values to speed up the 
evaluation. 

For manipulating multiple-precision numbers (i.e., 
numbers with thousands of digits), special algorithms 
(based on arithmetic-geometric iteration, for instance) 
are used. 

2 Polynomial Approximation 

Consider as an example the approximation of the exponential
function $f(x) = e^x$ on the interval $[0, 1]$ by
a polynomial of degree 5. We might first try using a
Taylor approximation. The Taylor formula at 0 gives

$$f(x) \approx p(x) = 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \frac{x^5}{120}.$$

Figure 1 plots the approximation error, i.e., the difference
$f(x) - p(x)$, on $[0, 1]$.
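
The size of this error is easy to estimate numerically. The following sketch samples the error of the degree-5 Taylor polynomial on a uniform grid; the function names and the grid size are assumptions made for the example.

```python
import math


def taylor_exp5(x):
    """Degree-5 Taylor polynomial of e^x at 0."""
    return 1 + x * (1 + x * (1 / 2 + x * (1 / 6 + x * (1 / 24 + x / 120))))


def max_error(p, lo=0.0, hi=1.0, samples=1001):
    """Largest sampled value of |e^x - p(x)| on a uniform grid over [lo, hi]."""
    worst = 0.0
    for k in range(samples):
        x = lo + (hi - lo) * k / (samples - 1)
        worst = max(worst, abs(math.exp(x) - p(x)))
    return worst


if __name__ == "__main__":
    print(max_error(taylor_exp5))            # error on [0, 1]
    print(max_error(taylor_exp5, hi=0.125))  # much smaller on [0, 1/8]
```

On [0, 1] the largest error occurs at x = 1 and is about 1.6 × 10^-3, since all omitted terms of the series are positive and grow with x.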

Taylor approximation is focused on a point. The
Remez algorithm [IV.9 §3.5] converges to a polynomial
that minimizes the maximum error over the whole
interval. As figure 2 shows, this is achieved by ensuring
that the error $f - p$ oscillates in the interval, which
means that the error achieves its maximal absolute
value at several points on the interval, with the sign
of the error oscillating from one point to the next. The
Remez approximation is much more accurate than the
Taylor approximation. Unfortunately, the Remez algorithm
computes real-valued coefficients, and rounding
these coefficients to machine-representable numbers
degrades the approximation quality. (This is illustrated
by figure 2 in the case where the coefficients are
rounded to 16-bit numbers, corresponding to about
five significant decimal digits.) A recent variation on
the Remez algorithm that is implemented in the open-source
Sollya tool computes the best approximation
with 16-bit coefficients (see figure 3).

Figure 2 The Remez error oscillates around zero (solid line),
but this is lost when its coefficients are rounded to 16-bit
numbers (dashed line).

Figure 3 The best polynomial with 16-bit coefficients (solid
line, Remez approximation error; dashed line, error of best
16-bit approximation).

Table 1 $\|e^x - p(x)\|_\infty$ versus degree and interval size.

Interval    Degree 2                Degree 3                Degree 4                Degree 5
[0,1]       $1.41 \times 10^{-2}$   $8.76 \times 10^{-4}$   $5.07 \times 10^{-5}$   $4.86 \times 10^{-6}$
[0,i]       $1.08 \times 10^{-3}$   $3.40 \times 10^{-5}$   $8.33 \times 10^{-7}$   $5.27 \times 10^{-8}$
[0,i]       $1.07 \times 10^{-4}$   $1.90 \times 10^{-6}$   $4.57 \times 10^{-8}$   $1.94 \times 10^{-9}$

Whatever the approximation scheme, the approxi- 
mation error decreases with the degree and increases 
with the interval size, as illustrated by table 1. Eval- 
uating polynomials of high degree is time-consuming. 
Also, the evaluation error may increase with the degree. 
This suggests that we investigate range reduction tech- 
niques to reduce the interval size. 


3 Range Reduction 

Range reduction exploits identities that are satisfied 
by the function to reduce the evaluation on an interval
to an evaluation (of f, or of another function) on
a smaller interval. Techniques depend on the function 
itself but also on the number format used for inputs. 
The literature therefore provides many range reduction 
techniques for each function. 

Let us take the example of the exponential on $[0, 1]$,
for which we will use the identity $e^{a+b} = e^a \times e^b$.
An input $x$ with $n$-bit binary representation
$x = \sum_{i=1}^{n} x_i 2^{-i}$ can be decomposed into $x = a + b$ with
$a = \sum_{i=1}^{k} x_i 2^{-i}$ and $b = \sum_{i=k+1}^{n} x_i 2^{-i}$ for some $k$. This
decomposition is free in terms of hardware and costs
very little (just a few shifts and logical operations)
in terms of software. We may now tabulate $e^a$. The
table will have $2^k$ entries, and $k$ will be chosen such
that the table size is acceptable: typically, $k \approx 8$.
Furthermore, we observe that $b \in [0, 2^{-k}]$, and therefore
$e^b$ can be evaluated using a polynomial of much lower
degree than would be required in the full range $[0, 1]$.
The reconstruction $e^x = e^a \times e^b$ will cost only one
multiplication.
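
The table-based scheme can be imitated in floating-point arithmetic, as in the sketch below. It is only an illustration: the names `K`, `TABLE`, and `exp_reduced` are hypothetical, the table is filled using the library exponential, and a degree-2 Taylor polynomial stands in for whatever low-degree polynomial a real implementation would use.

```python
import math

K = 8  # number of leading bits used to index the table (assumed, k = 8)
TABLE = [math.exp(i / 2 ** K) for i in range(2 ** K)]  # e^a for each a


def exp_reduced(x):
    """Approximate e^x for x in [0, 1] by a table lookup and a small polynomial."""
    # a = idx / 2^K holds the leading K bits of x; clamping keeps x = 1 in range.
    idx = min(int(x * 2 ** K), 2 ** K - 1)
    b = x - idx / 2 ** K  # remaining low-order part, 0 <= b <= 2^-K
    poly = 1.0 + b * (1.0 + b * 0.5)  # degree-2 Taylor polynomial for e^b
    return TABLE[idx] * poly  # reconstruction e^x = e^a * e^b


if __name__ == "__main__":
    worst = max(abs(exp_reduced(i / 1000) - math.exp(i / 1000))
                for i in range(1001))
    print(worst)  # stays below 1e-7 on this grid
```

Because b is at most 2^-8, even a degree-2 polynomial is accurate to roughly b^3/6, which is why the table lets the polynomial degree drop so sharply.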

With the same input decomposition, another more
general technique is to tabulate one polynomial $p_a$ for
each value of $a$, with all the $p_a$ having the same degree.
In this case the reconstruction costs us nothing, as
$f(x) \approx p_a(b)$. However, each table entry is larger.

Since cos and sin are periodic functions, range reduc- 
tion for these functions consists of deducing, from the 
input variable x, an integer n and a number y such that 

$$y \approx x - n\pi. \qquad (1)$$

This range reduction step should not be overlooked: a 
naive implementation of (1) may lead to very inaccu- 
rate results for very large arguments. In such cases, 
a sophisticated range reduction technique has been 
suggested by Payne and Hanek. 

4 Evaluation 

Once a suitable approximation polynomial has been 
found, there are several ways of evaluating it. For 
instance, a degree-5 polynomial may be evaluated in 
Horner form: 

$$p(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + x(a_4 + x a_5)))).$$

This form minimizes the number of multiplications
in general, but it is sequential. On modern machines,
one may improve the performance by exploiting parallelism.
We can write

$$p(x) = a_0 + a_2 x^2 + a_4 x^4 + x(a_1 + a_3 x^2 + a_5 x^4) = p_e(x^2) + x\, p_o(x^2).$$

This costs one extra multiplication, but $p_e(x^2)$ and
$p_o(x^2)$ can be evaluated in parallel. This idea can be
exploited recursively to express more parallelism if 
needed. Function-specific techniques can be used here, 
too: exploiting the fact that the Taylor formula for sine 
has only odd coefficients, for example. 
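
Both evaluation schemes are easy to express in code. The sketch below (with hypothetical function names) shows only the structure of the computation: plain Python runs the two halves one after the other, so it does not demonstrate the parallel speedup that hardware or vectorized code could obtain.

```python
def horner(coeffs, x):
    """Evaluate a_0 + a_1 x + ... + a_n x^n, with coeffs = [a_0, ..., a_n]."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result


def even_odd(coeffs, x):
    """Evaluate the same polynomial as p_e(x^2) + x * p_o(x^2).

    The two inner Horner evaluations do not depend on each other, so a
    machine with parallel units could compute them at the same time.
    """
    x2 = x * x
    p_even = horner(coeffs[0::2], x2)  # a_0 + a_2 x^2 + a_4 x^4 + ...
    p_odd = horner(coeffs[1::2], x2)   # a_1 + a_3 x^2 + a_5 x^4 + ...
    return p_even + x * p_odd


if __name__ == "__main__":
    coeffs = [1, 2, 3, 4, 5, 6]
    print(horner(coeffs, 2.0), even_odd(coeffs, 2.0))  # prints: 321.0 321.0
```

In floating-point arithmetic the two schemes can differ in the last few bits, because they round intermediate results in a different order.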

Further Reading 

Brent, R. P., and P. Zimmermann. 2010. Modern Computer 
Arithmetic. Cambridge: Cambridge University Press. 
Ercegovac, M. D., and T. Lang. 2004. Digital Arithmetic. San 
Francisco, CA: Morgan Kaufmann. 

Higham, N. J. 2002. Accuracy and Stability of Numerical 
Algorithms, 2nd edn. Philadelphia, PA: SIAM. 

Knuth, D. E. 1998. The Art of Computer Programming, 3rd 
edn, volume 2. Reading, MA: Addison-Wesley. 

Muller, J.-M. 2006. Elementary Functions, Algorithms and
Implementation, 2nd edn. Boston, MA: Birkhäuser.

Payne, M., and R. Hanek. 1983. Radian reduction for trigono- 
metric functions. SIGNUM Newsletter 18:19-24. 


VI.12 Random Number Generation

Harald Niederreiter 


The notion of randomness is frequently encountered 
in everyday life. For instance, we view the outcome of 
a fair coin toss as random and unpredictable, and the 
same holds for a throw of dice, a perfect shuffle of a 
card deck, and the drawing of numbers in a lottery. 
Such examples may lead to an intuitive understand- 
ing of random objects (heads and tails, integers from 1 
to 6, permutations of a finite set, bits 0 and 1, or real 
numbers). Grasping randomness in scientific terms is 
much harder and may even prove elusive, depending 
on viewpoints that verge on the philosophical. 

Some theoretical framework for randomness is need- 
ed, however, since there are important methods in com- 
putational and applied mathematics that rely on ran- 
dom objects. A prime example is offered by Monte Carlo 
methods, which can be described as numerical meth- 
ods based on random sampling. The random samples 
usually consist of real numbers or points in a Euclid- 
ean space. In its standard form, a Monte Carlo method 
approximates the expected value of a random vari- 
able (or, in other words, the integral of an integrable 
function) by a sample mean (or, in other words, the
average value of the integrand taken over a random
sample). Random objects are also required as inputs in 
probabilistic algorithms that are used in mathematics 
and computer science. Further applications appear in 
computational statistics, operations research, and com- 
puter games, among many other areas. Random bits are 
employed specifically in cryptography in the context of 
fast and powerful schemes for data encryption. 
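
The standard form of a Monte Carlo method fits in a few lines. In the sketch below the function name `monte_carlo_mean`, the integrand, the sample size, and the seed are all assumptions chosen for illustration; the integrand t^2 is used because its exact integral over [0, 1] is 1/3.

```python
import random


def monte_carlo_mean(f, n, rng):
    """Sample mean of f over n uniform random points in [0, 1)."""
    return sum(f(rng.random()) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so that the run is reproducible
    estimate = monte_carlo_mean(lambda t: t * t, 100_000, rng)
    print(estimate)  # should be close to the exact integral 1/3
```

The error of such an estimate typically shrinks like one over the square root of the sample size, so each extra digit of accuracy costs about a hundred times more samples.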

The problem of generating random objects (be they 
bits, integers from a given interval, or points in a Euclid- 
ean space) can, in most cases, be reduced to that of 
generating random (real) numbers. The task of ran- 
dom number generation presents itself in the following 
form: given a target distribution function F on the real 
line R, generate a sequence of real numbers that sim- 
ulates a sequence of independent and identically dis- 
tributed random variables with distribution function F. 
This task can be simplified by concentrating on an easy 
standardized distribution function: the uniform distribution
function $U$ on R given by $U(t) = 0$ for $t < 0$,
$U(t) = t$ for $0 \le t \le 1$, and $U(t) = 1$ for $t > 1$. Random
numbers for which the target distribution function is U 
are called uniform random numbers. Since the proba- 
bility measure corresponding to U is supported on the 
interval [0,1], uniform random numbers can be taken 
from the same interval and they satisfy the property 
that the probability of a uniform random number from 
[0, 1] falling into the subinterval [0, t] is equal to t. 

Random numbers for which the target distribution 
function F is different from U are obtained from uni- 
form random numbers by transformation methods. 
For instance, if $F$ is strictly increasing and continuous
on R and $x_0, x_1, \ldots$ is a sequence of uniform
random numbers from the open interval $(0, 1)$, then
$F^{-1}(x_0), F^{-1}(x_1), \ldots$ can be taken as a sequence of random
numbers with target distribution function $F$. For
obvious reasons, this transformation method is called 
the inversion method. Many other transformation meth- 
ods have been developed, such as the rejection method, 
the composition method, and the ratio-of-uniforms 
method. 
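
As a small illustration of the inversion method, the sketch below samples from the standard logistic distribution, whose distribution function F(t) = 1/(1 + e^(-t)) is strictly increasing and continuous on the whole real line and has the closed-form inverse ln(u/(1 - u)). The choice of this distribution and all function names are assumptions made for the example.

```python
import math
import random


def logistic_cdf(t):
    """Distribution function F(t) = 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def logistic_inverse(u):
    """Inverse F^(-1)(u) = ln(u / (1 - u)) for u in the open interval (0, 1)."""
    return math.log(u / (1.0 - u))


def sample_logistic(n, rng):
    """Draw n logistic random numbers by applying F^(-1) to uniform ones."""
    out = []
    while len(out) < n:
        u = rng.random()  # uniform on [0, 1)
        if u > 0.0:       # discard 0 so that u lies in the open interval
            out.append(logistic_inverse(u))
    return out


if __name__ == "__main__":
    samples = sample_logistic(100_000, random.Random(7))
    print(sum(samples) / len(samples))  # sample mean; the true mean is 0
```

The same pattern works for any target distribution whose inverse distribution function can be evaluated cheaply; when it cannot, the other transformation methods become attractive.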

From a practical point of view, truly random numbers 
obtained, for example, by physical means (such as toss- 
ing coins, throwing dice, or spinning a roulette wheel) 
are too cumbersome because of the need for storage 
and extensive statistical testing. Therefore, users have 
resorted to pseudorandom numbers that are gener- 
ated in a computer by deterministic algorithms with 
relatively few input parameters. In this way, problems 
of storage and reproducibility of the numbers do not 
arise. Furthermore, well-chosen algorithms for pseudorandom
number generation can be subjected to rigorous
theoretical analysis that may reduce the need for
extensive statistical testing of randomness properties. 

In principle, it should be clear that a deterministic 
sequence of numbers cannot pass all possible tests for 
randomness. When using pseudorandom numbers in 
a numerical method such as a Monte Carlo method, 
you should therefore be aware of the specific desir- 
able statistical properties of the random numbers in the 
computational problem at hand and choose pseudo- 
random numbers that are known to pass the corre- 
sponding statistical tests. For instance, if all that is 
needed is the statistical independence of pairs of two 
successive uniform random numbers, then goodness- 
of-fit tests relative to a two-dimensional distribution 
(in this case, the two-dimensional uniform distribution 
supported on the unit square) are quite sufficient for 
this particular purpose. 

For the same reasons as above, we can focus on uni- 
form pseudorandom numbers, that is, pseudorandom 
numbers with the target distribution function U, when 
facing the problem of generating pseudorandom num- 
bers. The history of uniform pseudorandom number 
generation is nearly as old as that of the Monte Carlo 
method and goes back to the 1940s. One of the earliest 
recorded methods is the middle-square method of John 
von Neumann. The algorithm here works with a fixed 
finite precision and an iterative procedure: it takes the 
square of the previous uniform pseudorandom num- 
ber (which can be taken to belong to the interval [0, 1]) 
and extracts the middle digits to form a new uniform 
pseudorandom number in [0,1]. Theoretical analysis 
of this method has shown, however, that the algorithm 
is actually quite deficient, since the generated numbers 
tend to run into a cycle with a short period length. If 
a finite precision is fixed (which is standard practice 
when working with a computer), then periodic patterns 
seem unavoidable when using an iterative algorithm 
for uniform pseudorandom number generation. It is a 
primary requirement of a good algorithm for uniform 
pseudorandom number generation that the period of 
its generated sequences is very long. 
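
The short cycles of the middle-square method are easy to observe experimentally. The sketch below works with four-digit integers y, so that the pseudorandom number would be y/10^4; the function names and the digit count are assumptions for illustration.

```python
def middle_square(y, digits=4):
    """One middle-square step on a number with the given (even) digit count.

    The square is treated as a 2 * digits digit number, padded with
    leading zeros, and its middle digits are returned.
    """
    square = y * y
    return (square // 10 ** (digits // 2)) % 10 ** digits


def steps_until_repeat(seed, digits=4):
    """Number of distinct values generated before the sequence repeats."""
    seen = set()
    y = seed
    while y not in seen:
        seen.add(y)
        y = middle_square(y, digits)
    return len(seen)


if __name__ == "__main__":
    print(middle_square(1234))       # 1234^2 = 01522756, prints: 5227
    print(middle_square(2500))       # 2500^2 = 06250000, prints: 2500
    print(steps_until_repeat(1234))  # far fewer than 10^4 in practice
```

The value 2500 is a fixed point: once the sequence reaches it, every later number is the same.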

Another early method for generating uniform pseudorandom
numbers that performs much better with
regard to periodicity properties is the linear congruential
method, which was introduced by the number theorist
Derrick Henry Lehmer. Here we choose a large integer
$M$ (usually a prime or a power of 2), an integer $g$
relatively prime to $M$ and of high multiplicative order
modulo $M$, and, as an initial value, an integer $y_0$ with
$1 \le y_0 \le M - 1$ that is relatively prime to $M$. A sequence
$y_0, y_1, \ldots$ of integers from the set $\{1, 2, \ldots, M - 1\}$ is
then generated recursively by $y_{n+1} \equiv g y_n \pmod{M}$
for $n = 0, 1, \ldots$. The numbers $x_n = y_n / M$ for
$n = 0, 1, \ldots$ belong to the interval $[0, 1]$ and are called linear
congruential pseudorandom numbers. The sequence
$x_0, x_1, \ldots$ is periodic with period length equal to the
multiplicative order of $g$ modulo $M$. This periodicity
property suggests that $M$ should be of size at least $2^{30}$
in practice. Given the modulus M, it is important that 
the parameter g be chosen judiciously. Bad choices of g 
lead to sequences of linear congruential pseudorandom 
numbers that fail simple statistical tests for random- 
ness, and there are in fact infamous cases of published 
generators that used bad values of g. The linear con- 
gruential method is still popular and many recommen- 
dations for good choices of parameters are available in 
the literature. For instance, the GNU Scientific Library 
recommends the CRAY pseudorandom number gener- 
ator, which uses the linear congruential method with 
parameters $M = 2^{48}$ and $g = 44\,485\,709\,377\,909$.
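
The recursion is short enough to sketch directly. The function names below are hypothetical. The demonstration uses two parameter sets that are not taken from the text: a toy modulus M = 31 with g = 3, which makes the period visible, and the widely used pair M = 2^31 - 1, g = 16807.

```python
from itertools import islice


def lcg_integers(y0, g, M):
    """Yield y_1, y_2, ... from the recursion y_{n+1} = g * y_n mod M."""
    y = y0
    while True:
        y = (g * y) % M
        yield y


def take(generator, n):
    """Return the first n values of a generator as a list."""
    return list(islice(generator, n))


def period(y0, g, M):
    """Steps until y0 recurs; terminates when g and y0 are coprime to M."""
    for count, y in enumerate(lcg_integers(y0, g, M), start=1):
        if y == y0:
            return count


if __name__ == "__main__":
    print(period(1, 3, 31))  # 3 has multiplicative order 30 modulo 31
    M = 2 ** 31 - 1
    ys = take(lcg_integers(1, 16807, M), 3)
    print([y / M for y in ys])  # the corresponding numbers x_n = y_n / M
```

Python integers have arbitrary precision, so the product g * y never overflows; in a fixed-width language the multiplication needs a wider type or a decomposition trick.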

Most of the currently employed methods for uni- 
form pseudorandom number generation use number- 
theoretic or algebraic techniques. A simple extension of 
the linear congruential method, the multiple-recursive 
method, replaces the first-order linear recursion in 
the linear congruential method by linear recursions 
of higher order. This leads to larger period lengths 
for the same value of the modulus M. Another fam- 
ily of methods is formed by shift-register methods, 
where uniform pseudorandom numbers are generated 
by means of linear recurring sequences modulo 2. A 
related family of methods uses vector recursions mod- 
ulo 2 of higher order, and this family includes the 
very popular Mersenne twister MT19937. The MT19937 
produces sequences of uniform pseudorandom numbers
with period length $2^{19937} - 1$ that possess 623-dimensional
equidistribution up to 32 bits accuracy.
Furthermore, the generated sequences pass numerous 
statistical tests for randomness. There are also meth- 
ods based on nonlinear recursions, but these methods 
are harder to analyze. 

Further Reading 

Devroye, L. 1986. Non-Uniform Random Variate Generation.
New York: Springer.
VI.13 Optimal Sensor Location in the Control of Energy-Efficient Buildings

Jeffrey T. Borggaard, John A. Burns, and Eugene M. Cliff


1 Motivation and Introduction 

Buildings are responsible for more than a third of all 
global greenhouse gas emissions and consume approx- 
imately 40% of global energy. To fully appreciate the 
scale of this problem, consider the fact that a 10% 
reduction in buildings’ energy usage is equivalent to 
all renewable energy generated in the United States 
each year. Moreover, a 70% reduction would be the 
equivalent of eliminating the energy consumption of 
the entire U.S. transportation sector. Building systems 
are highly complex and uncertain dynamical systems. 
Optimization and control of these systems offer unique 
modeling, mathematical, and computational challenges 
if we are to develop new tools for the design, con- 
struction, and operation of energy-efficient buildings. 
In particular, research into new mathematical meth- 
ods and computational algorithms is needed to control, 
estimate, and optimize systems governed by partial 
differential equations (PDEs); research into the model- 
reduction techniques that are required to efficiently 
simulate fully coupled dynamic building phenomena is 
also needed. 

These PDE control and estimation problems must 
be approximated by a “control-appropriate” numerical 
scheme. It is well known that unless these approximate 
models preserve certain system properties, they may 
not be suitable for control or optimization. 

In this article we consider a single problem from this 
large research area: an optimal sensor location prob- 
lem. We show how rather abstract mathematical theory 
can be employed to solve the problem and illustrate 
where approximation theory plays an important role in 
developing numerical methods. We present an example 
motivated by the design and control of hospitals. A U.S. 
Energy Information Administration report has shown 
that large hospitals accounted for less than 1% of all 
commercial buildings in the United States in 2007 but 
consumed 5.5% of the total delivered energy used by 
the commercial sector. Hospitals alone therefore repre- 
sent a significant consumer of energy, and the design, 
optimization, and control of energy use at the scale of
individual rooms has been the subject of several recent
studies. 

2 Mathematical Formulation 

Consider the problem of locating a sensor somewhere 
on a wall in the main zone of the hospital suite in fig- 
ure 1 such that the sensor provides the best estimate 
of the temperature field in the entire suite. We assume 
that the room is configured so that the control inlet 
diffusers and outlet vents are fixed and the flow field is 
in steady state. The goal is to find a single thermostat 
location that provides a sensed output that can be used 
by the Kalman filter to optimally estimate the state of 
the thermal conditions inside the room. 

Let $\Omega \subset \mathbb{R}^3$ be an open bounded domain with
boundary $\partial\Omega$ of Lipschitz class. Consider an advection-diffusion
process in the region $\Omega$ with boundary $\partial\Omega$
described by the PDE

$$T_t(t,x) + [v(x) \cdot \nabla T(t,x)] = \kappa \nabla^2 T(t,x) + w(t,x), \qquad (1)$$

with boundary conditions

$$T(t,x)|_{\Gamma_I} = b_\Gamma(x) u, \qquad n(x) \cdot [\kappa \nabla T(t,x)]|_{\Gamma_O} = 0.$$

Here, $T$ is the state, $u$ is a fixed constant temperature
at the inflow, $b_\Gamma(\cdot)$ is a function describing the
inflow shape profile, $n(x)$ is the unit outer normal, $\Gamma_I$
is the part of the boundary where the inflow vents are
located, and $\Gamma_O = \partial\Omega - \Gamma_I$ is the remaining part of the
boundary. The function $v(x)$ is a given velocity field,
and $w(t,x)$ is a spatial disturbance. We assume that
the disturbance $w(t,x)$ is a spatially averaged random
field such that

$$w(t,x) = \iiint_\Omega g(x,y)\, \eta(t,y)\, dy,$$

where $\eta(t,\cdot) \in L^2(\Omega)$ for all $t \ge 0$, and $g(\cdot,\cdot) \in L^2(\Omega \times \Omega)$
is given; this is a very reasonable assumption for
problems of this type.
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We assume that there is a single sensor (a thermostat) 
located at a point q e c'O that produces a local spatial 
average of the state T(t,x). In particular, 

  $y(t) = \iiint_{B_\delta(q) \cap \Omega} h(x)\,T(t,x)\,dx + \omega(t)$,   (2)

where $h(\cdot) \in L^2(\Omega)$ is a given weighting function,
$B_\delta(q)$ is a ball of radius $\delta$ about $q$, and $\omega(t)$
is the sensor noise. Define the output map
$C(q): L^2(\Omega) \to \mathbb{R}^1$ by

  $C(q)\varphi(\cdot) = \iiint_{B_\delta(q) \cap \Omega} h(x)\,\varphi(x)\,dx$,

in terms of which the measured output defined by (2)
has the abstract form

  $y(t) = C(q)T(t,\cdot) + \omega(t)$.   (3)

The standard formulation of the distributed parameter
model for the convection-diffusion system (1) with
output (3) leads to an infinite-dimensional system in
the Hilbert space $\mathcal{H} = L^2(\Omega)$ given by

  $\dot{z}(t) = Az(t) + G\eta(t) \in \mathcal{H}$,

with output

  $y(t) = C(q)z(t) + \omega(t)$,

where the state of the distributed parameter system is
$z(t)(\cdot) = T(t,\cdot) \in \mathcal{H} = L^2(\Omega)$, $A$ is the
usual convection-diffusion operator, and
$G: L^2(\Omega) \to L^2(\Omega)$ is defined by

  $[G\eta(\cdot)](x) = \iiint_{\Omega} g(x,y)\,\eta(y)\,dy$.

Typically, the measurement y(t) produced by the 
thermostat on the wall is the only information available 
to feed into a controller (the heating, ventilation, and air 
conditioning system) to adjust the room temperature. 
However, this information can be used to estimate the
temperature in the entire room by using a mathematical
model. The most well-known method for constructing
this approximation is the so-called Kalman filter. In
this PDE setting, the optimal Kalman filter produces an
estimate $z_e(t)$ of $z(t)$ by solving the system

  $\dot{z}_e(t) = Az_e(t) + F(q)[y(t) - C(q)z_e(t)]$,

where the observer gain operator $F(q)$ is given by

  $F(q) = Z[C(q)]^*$

and the operator $Z = Z(q)$ satisfies the infinite-dimensional
RICCATI OPERATOR EQUATION [III.25]

  $AZ + ZA^* - Z[C(q)]^*C(q)Z + GG^* = 0$.   (4)

The solution $Z(q)$ is the state estimation covariance
operator, and the estimation error is given by

  $E\left(\int \|z_e(s,q) - z(s)\|^2_{L^2(\Omega)}\,ds\right) = \operatorname{trace}(Z(q))$,   (5)

where $E(y)$ denotes the expected value of the random
variable $y$ and $\operatorname{trace}(A)$ denotes the trace of the
operator $A$.

We note that in finite dimensions the trace, $\operatorname{trace}(Z)$,
of an operator $Z$ is the usual trace of the matrix. However,
in infinite dimensions not all operators have finite
trace, and hence one needs to establish that $Z(q)$ is
of trace class (i.e., has finite trace) in order for (5)
to be finite. One can show that for the problem here,
$Z = Z(q)$ is of trace class.

The optimal sensor management problem becomes a 
distributed parameter optimization problem with the 
“state” Z defined by the Riccati system (4) and the cost 
functional defined in terms of the trace of Z by 

  $J(Z,q) = \operatorname{trace}(Z) + R(q)$,

where $Z$ satisfies (4) and the function $R(q)$ could
be selected to impose penalties on “bad” regions to
be avoided. This cost function is selected because
$\operatorname{trace}(Z(q))$ is the estimation error when the Kalman
filter is used as a state estimator. We assume that for all
$q \in \partial\Omega$ the Riccati equation (4) has a unique
positive-definite solution $Z = Z(q)$. This assumption allows
us to introduce the reduced functional $J(q) = J(Z(q),q)$, so
the optimal sensor placement problem can be stated as the
following optimal control problem.

The optimal sensor location problem. Find $q^{\mathrm{opt}}$ such
that

  $J(q) = J(Z(q),q) = \operatorname{trace}(Z(q)) + R(q)$   (6)

is minimized, where $Z = Z(q)$ is a solution of the
system (4).

As observed above, one can show that Z(q) is of trace 
class and that the optimization problem is well defined. 
This formulation of the problem allows us to directly 
apply standard optimization algorithms. 
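
As a concrete illustration of (4)-(6), the following is a minimal sketch of a scalar analogue, not the PDE problem itself. It assumes a single-state model with hypothetical coefficients `a` and `g`, and a hypothetical sensor gain `sensor_gain(q) = sin(pi q)` standing in for the output map C(q) along a wall of unit length; none of these values come from the article. In the scalar case the Riccati equation reduces to 2aZ - c^2 Z^2 + g^2 = 0, which has a closed-form positive root, and trace(Z) is simply Z.

```python
import math

def riccati_scalar(a, c, g):
    """Positive root of the scalar filter Riccati equation
    2*a*Z - c**2 * Z**2 + g**2 = 0 (requires c != 0)."""
    return (a + math.sqrt(a * a + c * c * g * g)) / (c * c)

def sensor_gain(q):
    """Hypothetical output gain c(q) for a sensor at wall coordinate q in (0, 1)."""
    return math.sin(math.pi * q)

def cost(q, a=-1.0, g=1.0, penalty=lambda q: 0.0):
    """Reduced functional J(q) = trace(Z(q)) + R(q) for the scalar model."""
    return riccati_scalar(a, sensor_gain(q), g) + penalty(q)

def best_location(n=99):
    """Brute-force search over the interior grid points q = k / (n + 1)."""
    grid = [k / (n + 1) for k in range(1, n + 1)]
    return min(grid, key=cost)

if __name__ == "__main__":
    q_opt = best_location()
    print(f"q_opt = {q_opt:.2f}, J(q_opt) = {cost(q_opt):.4f}")
```

In this toy model the estimation error falls as the sensor gain grows, so the search selects the midpoint of the wall, where the assumed gain is largest. The `penalty` argument plays the role of R(q) and could be used to exclude locations.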

3 Approximation and Convergence 

One of the main issues that needs to be addressed in 
order to develop practical numerical schemes for solv- 
ing the optimal sensor location problem is the approx- 
imation of the Riccati operator equation (4), which is 
required to compute $Z = Z(q)$. Here, $C(\cdot)$ varies
continuously with respect to the Hilbert-Schmidt norm on
the set of trace class operators. The practical implication
of this fact is that $\operatorname{trace}(Z(\cdot))$ also varies
continuously with respect to $q$. Thus, we have the basic
foundations that allow the application of numerical
optimization methods. In particular, we make use of
known results on existence and approximations dealing
with convergence in the space of Hilbert-Schmidt
operators on $\mathcal{H}$.

The key observation is that numerical approaches to
solve the Riccati equation (4) that produce approximations
$Z^N(q)$ of $Z(q)$ must have specific properties if
one expects to obtain convergence of $\operatorname{trace}(Z^N(q))$ to
$\operatorname{trace}(Z(q))$.

Writing the Riccati equation (4) as

  $AZ + ZA^* - ZD(q)Z + F = 0$,

where $D(q) = C(q)^*C(q)$ and $F = GG^*$, one clearly
sees that one must construct convergent approximations
$A^N$, $C^N(q)$, and $G^N$ of the operators $A$, $C(q)$,
and $G$, respectively. What is sometimes overlooked is
that the dual operators also need to be approximated.
Thus, one also needs to construct convergent approximations
$[A^*]^N$, $[C^*(q)]^N$, and $[G^*]^N$ of the operators
$A^*$, $C^*(q)$, and $G^*$, respectively. It is important
to note that, in general, $[A^N]^*$, $[C^N(q)]^*$, and $[G^N]^*$
may not converge to $A^*$, $C^*(q)$, and $G^*$ in a suitable
topology, and hence convergence of $\operatorname{trace}(Z^N(q))$ to
$\operatorname{trace}(Z(q))$ can fail. In addition to dual convergence,
the numerical scheme must preserve some stabilizability
and detectability conditions. Again, it is important
to note that even standard numerical schemes may not
preserve these important control-system properties.

Consider a sequence of approximating problems
defined by $(\mathcal{H}^N, A^N, C^N(q), G^N)$, where
$\mathcal{H}^N \subset \mathcal{H}$ is a sequence of finite-dimensional
subspaces of $\mathcal{H}$, and $A^N \in \mathcal{L}(\mathcal{H}^N, \mathcal{H}^N)$,
$C^N(q) \in \mathcal{L}(\mathcal{H}^N, \mathbb{R}^1)$, and
$G^N \in \mathcal{L}(\mathcal{H}^N, \mathcal{H}^N)$ are bounded linear operators.
Here, $\mathcal{L}(X,Y)$ denotes the usual space of bounded linear
operators from $X$ to $Y$. In this article we use standard
finite-element methods so that $\mathcal{H}^N$ is a finite-element
space of piecewise polynomial functions. Let
$P^N: \mathcal{H} \to \mathcal{H}^N$ denote the orthogonal projection of
$\mathcal{H}$ onto $\mathcal{H}^N$ satisfying $\|P^N\| \leq 1$ and such that
$\|P^N z - z\| \to 0$ as $N \to \infty$ for all $z \in \mathcal{H}$.
For each $N = 1, 2, 3, \ldots$ consider the finite-dimensional
approximations of (4) given by

  $A^N Z^N + Z^N [A^*]^N - Z^N D^N(q) Z^N + F^N = 0$,   (7)

and assume that (7) has solutions $Z^N$. The approximate
optimal sensor location problem is now to minimize

  $J^N(q) = \operatorname{trace}(Z^N(q)) + R(q)$,

subject to the constraint (7).

In order to discuss convergence of the finite-dimensional
approximating Riccati operators, we need to
assume that the numerical scheme preserves the basic
stabilizability and detectability conditions needed to
guarantee that the Riccati equation (7) has a unique
positive-definite solution.

The required assumptions, which we will not state
here, break into four distinct conditions concerning
the convergence of the operators, convergence of the
adjoint operators, preservation of uniform
stabilizability/detectability under the approximation, and
compactness requirements on the input operator. Under
these assumptions one can show that the Riccati equation
(7) has a unique nonnegative solution $Z^N(q)$ for all
sufficiently large $N$ and that

  $\lim_{N \to +\infty} \operatorname{trace}(Z^N(q)) = \operatorname{trace}(Z(q))$.

One can also prove that

  $\lim_{N \to +\infty} \operatorname{trace}(Z^N(q)P^N - P^N Z(q)) = 0$,

which states that the operators $Z^N(q)$ converge to $Z(q)$
in the trace norm. For the specific case treated here, one
can establish that these assumptions hold for standard
finite-element schemes.

4 Numerical Results for the Hospital Suite 

For simplicity, we consider the two-dimensional ver- 
sion of the hospital suite shown in figure 2. The “bed 
area” is the large zone on the right, and the remain- 
ing zones are bath and dressing areas. There are four 
inlet vents, and the outflow is into a hall through the 
door on the left. The optimal sensor location problem 
is not very sensitive to the outflow location since the 
flow is dominated by the outflow from the bed area to 
the other two zones. 

For the numerical runs, we assumed a nearly uniform
disturbance, such that

  $w(t,x) = [G_\varepsilon \eta(t,\cdot)](x) = \iiint_{\Omega} \delta_\varepsilon(x - y)\,\eta(t,y)\,dy$,

where $\delta_\varepsilon(y)$ is a smooth approximation of the delta
function. We also set $R(q) = 0$.

In figure 2 we show the inflow, the outflow, and the
flow through the suite. The asterisks on the north and
south walls are the points $q_1$ and $q_2$ where the cost
function $J(q) = J(Z(q),q) = \operatorname{trace}(Z(q))$ has local
minima. Figure 3 provides a plot of $J(q) = \operatorname{trace}(Z(q))$ as $q$
moves from the northwest corner along the north wall,
down the east wall, across the south wall, and then up
the west side of the bed area. Observe that the optimal
sensor location problem has two solutions: one on the
north wall between the two inlets and one on the south
wall between the two inlets there. There is also a local
minimum located in the opening from the bed area to
the bath area where there is no wall. Although there is
also a local minimum on the east wall, this minimum is
nearly the same as the maximum value.

Figure 2 A two-dimensional hospital suite with four inlets.

Figure 3 The value of $\operatorname{trace}(Z(q))$ along the walls.

This example illustrates that the optimal sensor 
placement problem is reasonably well behaved. Even 
though there are multiple local minima, most numer- 
ical optimization algorithms can handle this type of 
problem. The “roughness” near $q_1$ and $q_2$ is caused
primarily by the coarse grid used in the flow solver.
However, even with the coarse grid used for these
calculations, it is clear that the model is accurate enough
to roughly identify the neighborhoods where the global
minimizers are located. This trend has been noticed
in several other problems of this type and suggests
that a multigrid-like [IV.13 §3] approach to the
optimization problem might improve the overall algorithm.
In particular, we first solve the optimization problem 
on a coarse grid to identify a rough neighborhood of 
the minimizer and then conduct a fine-grid optimiza- 
tion starting in this neighborhood. This idea has been 
successfully applied to similar problems. 
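
The coarse-then-fine strategy described above can be sketched as follows. This is only an illustration under stated assumptions: the function `cost` is a hypothetical one-dimensional stand-in for trace(Z(q)) along the wall with several local minima, and golden-section search is used for the refinement step as a convenient derivative-free choice; the article does not say which optimizer was used.

```python
import math

def cost(q):
    """Hypothetical cost along a wall coordinate q in [0, 4] with several
    local minima; it stands in for trace(Z(q))."""
    return 2.0 + math.cos(3.0 * q) + 0.1 * (q - 1.0) ** 2

def coarse_scan(f, lo, hi, n):
    """Evaluate f on n + 1 equally spaced points; return the best point and the spacing."""
    h = (hi - lo) / n
    points = [lo + k * h for k in range(n + 1)]
    return min(points, key=f), h

def golden_section(f, lo, hi, tol=1e-8):
    """Golden-section search for a minimizer of f on [lo, hi],
    assuming f is unimodal on that interval."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

def coarse_to_fine(f, lo, hi, n_coarse=16):
    """Locate a rough neighborhood on a coarse grid, then refine inside it."""
    q0, h = coarse_scan(f, lo, hi, n_coarse)
    return golden_section(f, max(lo, q0 - h), min(hi, q0 + h))

if __name__ == "__main__":
    q_star = coarse_to_fine(cost, 0.0, 4.0)
    print(f"refined minimizer: {q_star:.5f}, cost: {cost(q_star):.6f}")
```

The coarse scan only needs to be accurate enough to land in the basin of the global minimizer; the expensive fine-level evaluations are then confined to one short interval.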

5 Closing Comments 

One goal of this article was to illustrate how a rather
abstract mathematical framework can be used to formulate
and solve some very practical (optimal) design
problems that arise in the energy sector. A second takeaway
is that one must be careful when introducing
approximations to numerically solve the resulting
optimization problem. In particular, additional appropriate
assumptions are needed to guarantee convergence of
the operators.

Another issue concerns the finite-dimensional
optimization problem itself. Although we do not have the
space to fully address this issue, it is clear that applying
modern gradient-based optimization to the approximate
optimal sensor location problem requires that
the mapping $q \mapsto J^N(q) = \operatorname{trace}(Z^N(q))$ be at least
$C^1$. This smoothness requirement also places additional
restrictions on the types of approximations that
can be used to discretize the infinite-dimensional system.
There are several approaches one might consider
to approximate the gradient $\nabla_q[J^N(q)]$. These range
from direct finite-difference methods to more advanced
continuous sensitivity and adjoint methods. In theory
there is no difference between these methods as long as
they produce consistent gradients. However, in practice
the choice can greatly influence the speed and accuracy
of the optimization algorithm. We are therefore
reminded of the following quote attributed to Manfred
Eigen:

    In theory, there is no difference between theory and
    practice. But, in practice, there is.


1 Robot Kinematics

Robotics is an interdisciplinary field involving mechanical
and electrical engineering, computer science, and
applied mathematics that also draws inspirations from
biology, psychology, and cognitive science. From robots
in a factory to interplanetary rovers, one of the fundamental
capabilities a robot must have is knowledge of
where it is and how to get to where it needs to go. This
includes knowing where its appendages are and how to
move them effectively. While in technical jargon these
appendages may be called “manipulators” with
“end-effectors” that directly interact with workpieces, they
are more commonly referred to as robot “arms” and
“hands,” especially when the robot links are connected
end to end in serial fashion. Locating a hand involves
finding both its position and its orientation. For example,
to pick an object out of an open jar, it is important
for the hand to not only reach the position of the
object but also to be oriented through the opening of
the jar. Orientation is fully specified by a 3 × 3 orthogonal
matrix, sometimes called a rotation matrix or a
direction cosine matrix.

It turns out that the geometric motion characteris- 
tics, that is, the kinematics, of most robots (and most 
mechanical systems in general) are well modeled by 
systems of polynomial equations. In particular, this can 
be seen in the most common element used in mecha- 
nism work: the rotational joint. When a series of links 
are connected end-to-end by rotational joints, each link 
turns circles with respect to its neighbors, and circles 
can be described by polynomial equations. The polyno- 
mial nature of the models allows one to apply algebraic 
geometry to solving kinematics problems. 

Consider a serial-link robot arm with rotational joints
canted at various angles so that the hand maneuvers
in three-dimensional space, not just in one plane. To
be more precise, consider first just a single joint and
let $u \in \mathbb{R}^3$ be a unit vector ($u \cdot u = 1$) along its joint
axis, assumed to be passing through the origin. As illustrated
in figure 1, the rotation of an arbitrary vector
$p \in \mathbb{R}^3$ through an angle of $\theta$ around unit vector $u$ is

  $R(u,\theta)p = (u \cdot p)u + [p - (u \cdot p)u]\cos\theta + u \times p\,\sin\theta$,

where $R(u,\theta)$ is a 3 × 3 matrix expression formed from
the matrix interpretation of the vectorial operations on
the right-hand side.

The trigonometric expression for the rotation can be
converted to an algebraic one by replacing $(\cos\theta, \sin\theta)$
by $(c,s)$ subject to the unit-circle condition $c^2 + s^2 = 1$.
Abusing notation, we call the reformulated rotation
matrix $R(u,c,s)$, which in matrix form becomes

  $R(u,c,s) = uu^T + (I - uu^T)c + \Lambda(u)s$,

where $\Lambda(u)$ is the 3 × 3 skew-symmetric matrix that
satisfies $\Lambda(u)p = u \times p$.

Figure 2 Serial-link robot schematic.

For example, if $u = [1\ 0\ 0]^T$, then $R(u,c,s)$ is a rotation
in the $yz$-plane. We note that $R(u,c,s)$ is linear in
$(c,s)$. As the joint turns, the vector $u$ defining it stays
constant while $c$ and $s$ vary, but $R(u,c,s)$ is always
orthogonal, and its determinant is 1.
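
The algebraic form of the rotation can be checked numerically. The sketch below builds R(u, c, s) term by term from the three parts of the formula, with the cross-product matrix written out explicitly, and tests orthogonality and the determinant for one sample axis and angle. The helper names (`cross_matrix`, `gram`, `det3`) and the sample values are illustrative choices, not notation from the article.

```python
import math

def cross_matrix(u):
    """Skew-symmetric matrix Lambda(u) with Lambda(u) p = u x p."""
    ux, uy, uz = u
    return [[0.0, -uz, uy],
            [uz, 0.0, -ux],
            [-uy, ux, 0.0]]

def rotation(u, c, s):
    """R(u, c, s) = u u^T + (I - u u^T) c + Lambda(u) s for a unit vector u."""
    lam = cross_matrix(u)
    return [[u[i] * u[j]
             + ((1.0 if i == j else 0.0) - u[i] * u[j]) * c
             + lam[i][j] * s
             for j in range(3)]
            for i in range(3)]

def matvec(m, p):
    """Product of a 3 x 3 matrix (list of rows) with a 3-vector."""
    return [sum(m[i][j] * p[j] for j in range(3)) for i in range(3)]

def gram(m):
    """Return m^T m, which equals the identity when m is orthogonal."""
    return [[sum(m[k][i] * m[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

if __name__ == "__main__":
    theta = 0.7
    u = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]  # a unit vector
    R = rotation(u, math.cos(theta), math.sin(theta))
    print(gram(R))   # close to the identity matrix
    print(det3(R))   # close to 1
```

Because `rotation` is linear in `(c, s)`, the same function can be called with any pair satisfying c^2 + s^2 = 1 without evaluating trigonometric functions.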

This generalizes to a multilink arm. Consider links
numbered 0 to $N$ from the robot’s base to its hand,
connected in series by rotational joints. To formulate
the position and orientation of the hand with respect
to the base, we need geometric information about the
links and their joints. As illustrated in figure 2, mark
a point $P_1$ on the joint axis between links 0 and 1, and
similarly mark points $P_2, \ldots, P_N$ on the succeeding axes.
Also, mark a reference point in the hand as $P_{N+1}$. Next,
freeze the arm in some initial pose, and at this configuration
let $u_i$ be a unit vector along the axis of the
joint between link $i-1$ and link $i$. Finally, in the initial
pose, let the vector from $P_i$ to $P_{i+1}$ be $p_i$ and let the
initial orientation of the hand be $Q_0 \in SO(3)$. With these
definitions and the shorthand $R_i := R(u_i, c_i, s_i)$, we
may write the orientation, $Q = [Q_x\ Q_y\ Q_z] \in SO(3)$,
of the hand with respect to the base and the position
vector, $q$, from $P_1$ to $P_{N+1}$ as

  $Q = R_1 R_2 \cdots R_{N-1} R_N Q_0$,   (1)

  $q = R_1(p_1 + R_2(p_2 + \cdots + R_N(p_N) \cdots))$.   (2)

Given the joint rotations $(c_i, s_i)$, $i = 1, \ldots, N$, one may
evaluate these expressions to obtain $Q$ and $q$. This
solves the forward kinematics problem for serial-link
arms.
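
Equations (1) and (2) can be evaluated from the innermost joint outward. The following is a minimal sketch of that evaluation; the function `forward_kinematics` and the two-joint planar example (both axes along z, both link offsets of unit length along x) are hypothetical illustrations rather than data from the article.

```python
import math

def rotation(u, c, s):
    """R(u, c, s) = u u^T + (I - u u^T) c + Lambda(u) s for a unit vector u."""
    ux, uy, uz = u
    lam = [[0.0, -uz, uy], [uz, 0.0, -ux], [-uy, ux, 0.0]]
    return [[u[i] * u[j]
             + ((1.0 if i == j else 0.0) - u[i] * u[j]) * c
             + lam[i][j] * s
             for j in range(3)]
            for i in range(3)]

def matvec(m, p):
    """Product of a 3 x 3 matrix with a 3-vector."""
    return [sum(m[i][j] * p[j] for j in range(3)) for i in range(3)]

def matmul(a, b):
    """Product of two 3 x 3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def forward_kinematics(axes, offsets, joints, Q0):
    """Evaluate Q = R_1 ... R_N Q_0 and q = R_1(p_1 + R_2(p_2 + ... + R_N p_N)).

    axes[i] is u_i, offsets[i] is p_i, joints[i] is the pair (c_i, s_i)."""
    Q = [row[:] for row in Q0]
    q = [0.0, 0.0, 0.0]
    # Work from joint N back to joint 1, matching the nesting in (2).
    for u, p, (c, s) in zip(reversed(axes), reversed(offsets), reversed(joints)):
        R = rotation(u, c, s)
        q = matvec(R, [p[k] + q[k] for k in range(3)])
        Q = matmul(R, Q)
    return Q, q

if __name__ == "__main__":
    z = [0.0, 0.0, 1.0]
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    angles = [math.pi / 2, -math.pi / 2]
    joints = [(math.cos(t), math.sin(t)) for t in angles]
    Q, q = forward_kinematics([z, z], [[1.0, 0.0, 0.0]] * 2, joints, identity)
    print(q)  # approximately [1, 1, 0]
```

In the example the two joint angles cancel, so the hand keeps its initial orientation while the first link is turned a quarter turn, which places the reference point at approximately (1, 1, 0).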

2 Inverse Kinematics 

Evaluation of the forward kinematics formulas tells us 
where a serial-link robot’s hand is relative to its base. 
More challenging is to reverse this by answering the 
inverse kinematics problem: what joint rotations $(c_i, s_i)$,
$i = 1, \ldots, N$, will cause the hand to attain a desired
location $(Q, q)$? The space of all rigid-body motions
is SE(3), a six-dimensional space parameterizable by
three rotations and three translations. Thus, we need
$N \geq 6$ joints to place the hand in any arbitrary position
and orientation within the working volume of the arm. 

Let us consider the important case of $N = 6$, where
we expect to have a finite number of solutions to the
equations (1), (2) along with

  $c_i^2 + s_i^2 = 1$,   $i = 1, \ldots, 6$.   (3)

Although (1) has nine entries, only three are independent
since $Q^T Q = I$. The isolated solutions of the
system are preserved if one takes three random linear
combinations of these nine, which with the rest of
the equations makes a system of twelve polynomials
in twelve unknowns. By Bézout’s theorem, the number
of isolated solutions of a system of $N$ polynomials in
$N$ unknowns cannot exceed the total degree, defined
as the product of the degrees of the equations. For the
system at hand, this comes to $6^6 2^6 = 2\,985\,984$. It turns
out that this upper bound is rather loose.

The root count can be reduced by algebraically
manipulating the equations. First, using the fact that
$R_i^{-1} = R_i^T$, one may rewrite (1), (2) as

  $R_3^T R_2^T R_1^T Q = R_4 R_5 R_6 Q_0$,   (4)

  $R_3^T(R_2^T(R_1^T q - p_1) - p_2) = p_3 + R_4(p_4 + R_5(p_5 + R_6 p_6))$.   (5)

These equations are now cubic, reducing the total
degree to $3^6 2^6 = 46\,656$. But the equations are
far from being general cubics because the $(c_i, s_i)$
pairs each appear linearly. Grouping the unknowns
into three groups as $\{c_1, s_1, c_4, s_4\}$, $\{c_2, s_2, c_5, s_5\}$, and
$\{c_3, s_3, c_6, s_6\}$, equations (4), (5) are recognized as being
trilinear. Based on this grouping of the variables, a variant
of Bézout’s theorem known as the “multihomogeneous
Bézout number” bounds the maximum possible
number of isolated roots by the coefficient of $\alpha^4\beta^4\gamma^4$
in $(\alpha + \beta + \gamma)^6 (2\alpha)^2 (2\beta)^2 (2\gamma)^2$, i.e., 5760.

This is just the beginning of the algebraic manipulations
that one can perform on the way to showing that
the “six-revolute” (or 6R) inverse kinematic problem
has at most sixteen isolated roots.
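
The three root-count bounds quoted above can be reproduced with a few lines of integer arithmetic. The sketch below computes a multihomogeneous Bézout number as the coefficient of a monomial in a product of linear forms, one form per equation; the function names and the dictionary representation of polynomials are illustrative choices. The ordinary total degree is recovered as the special case of a single variable group.

```python
from collections import defaultdict

def poly_mul(p, q):
    """Multiply polynomials stored as {exponent tuple: integer coefficient}."""
    out = defaultdict(int)
    for ea, ca in p.items():
        for eb, cb in q.items():
            out[tuple(x + y for x, y in zip(ea, eb))] += ca * cb
    return dict(out)

def multihomogeneous_bezout(degree_rows, group_sizes):
    """Coefficient of prod_j alpha_j**group_sizes[j] in
    prod_i (sum_j degree_rows[i][j] * alpha_j).

    degree_rows[i][j] is the degree of equation i in variable group j and
    group_sizes[j] is the number of unknowns in group j."""
    n_groups = len(group_sizes)
    product = {(0,) * n_groups: 1}
    for row in degree_rows:
        form = {tuple(1 if k == j else 0 for k in range(n_groups)): d
                for j, d in enumerate(row) if d != 0}
        product = poly_mul(product, form)
    return product.get(tuple(group_sizes), 0)

# One variable group of twelve unknowns: the ordinary total degree.
original = multihomogeneous_bezout([(6,)] * 6 + [(2,)] * 6, (12,))
cubic = multihomogeneous_bezout([(3,)] * 6 + [(2,)] * 6, (12,))

# Three groups of four unknowns: six trilinear equations from (4), (5) and
# six unit-circle conditions (3), two of which fall in each group.
six_r_rows = [(1, 1, 1)] * 6 + [(2, 0, 0)] * 2 + [(0, 2, 0)] * 2 + [(0, 0, 2)] * 2
three_group = multihomogeneous_bezout(six_r_rows, (4, 4, 4))

if __name__ == "__main__":
    print(original, cubic, three_group)  # 2985984 46656 5760
```

The last value is also the number of paths that a homotopy built on this grouping has to track.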

From an early statement of the problem by Pieper 
in 1968 to a numerical solution by Tsai and Morgan 
in 1985 using continuation through to the first alge- 
braic derivation of an eliminant equation of degree 16 
by Lee and Liang in 1988, this problem was one of the 
top conundrums for kinematicians for twenty years. 
Now, though, powerful computer algorithms can be 
applied to solve the problem in minutes with either 
symbolic computer algebra, based on variants of Buch- 
berger’s algorithm, or numerical algebraic geometry, 
based on continuation methods. In the latter approach, 
no further manipulation of the equations is required, as 
one can set up a homotopy that continuously deforms 
an appropriate start system into a target 6R example. 
Using the variable groups mentioned above, this homo- 
topy has 5760 paths, which can be tracked in parallel 
on multiple processors. The endpoints of these paths 
include the sixteen isolated solutions (real and com- 
plex) of the example 6R problem. Once solved, the gen- 
eral target example can serve as the start system for 
a sixteen-path parameter homotopy to solve any other 
6R inverse kinematic problem. 

3 Generalizations 

In addition to serial-link arms, robots and similar mech- 
anisms can have a variety of topologies, composed 
of serial chains connected together to form closed- 
chain loops. Both forward and inverse kinematics prob- 
lems become challenging, but the kinematics remain 
algebraic and modern algorithms derived from alge- 
braic geometry apply. These mathematical methods for 
kinematic chains also find application in biomechani- 
cal models of humans and animals and in studies of 
protein folding. 

VI.15 Slipping, Sliding, Rattling, and
Impact: Nonsmooth Dynamics
and Its Applications

Chris Budd 


1 Overview 

We all live in a nonsmooth world. Events stop and start, 
by accident, design, or control. Once started, mechan- 
ical parts in bearings, bells, and other machines can 
come into contact and impact with each other. Once 
in contact, such parts can slide against each other, 
intermittently sticking and slipping. If these parts are 
rocks, then the result is an earthquake, a system that 
is notoriously hard to predict. While on a small scale 
these systems might be smooth, on a macroscopic 
scale they lose smoothness in one or more derivatives, 
and their motion is best approximated, and analyzed, 
by assuming that they are nonsmooth. To study such
problems (which may be as simple as a ball bouncing
on a table or as complex as a collapsing building, or
even the climate) we need to extend classical dynamical
systems theory to allow for nonsmooth effects.
Typically, the problems considered, including systems 
with impact, sticking, and sliding, are piecewise smooth 
(PWS) and comprise smooth trajectories interrupted 
by instantaneous nonsmooth events. Such dynamical 
systems are tractable to analysis, which reveals a rich 
variety of behavior, including chaos, new routes to 
chaos, and bifurcations that are not observed in smooth 
dynamical systems. 

2 What Is a Piecewise-Smooth 

Dynamical System? 

Piecewise-smooth systems comprise flows, maps, and a
hybrid mixture of the two. To define them, let the whole
system be given by $x(t) \in \Omega \subset \mathbb{R}^n$, with $\Omega$ divided into
subsets $S_i$, $i = 1, \ldots, N$, with the boundary between $S_i$
and $S_j$ being the surface $\Sigma_{ij}$ of codimension $p \geq 1$. A
PWS flow, often called a Filippov system, is given by

  $\frac{dx}{dt} = F_i(x)$ if $x \in S_i$,   (1)

where $F_i$ is smooth in the set $S_i$. As $x$ crosses the set
$\Sigma_{ij}$, the right-hand side of (1) will typically lose
smoothness. An example of a Filippov system is a room heater
controlled by a thermostat that switches a heating element
on when the temperature of the room $T(t)$ falls
below a specified temperature $T_0$. In this case we have
two regions in which the operation of the heating system
is smooth, and the set $\Sigma_{12}$ is the surface $T - T_0 = 0$.
The trajectories of a Filippov system are themselves
PWS, losing regularity when they intersect $\Sigma_{ij}$.

Figure 1 (a) Piecewise-smooth flow,
(b) sliding flow, and (c) hybrid flow.
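
The thermostat example can be simulated directly. The sketch below is an illustration under stated assumptions: it uses a hypothetical Newton-cooling model dT/dt = -k (T - T_out) + heat when T < T0 and dT/dt = -k (T - T_out) otherwise, integrates it with the forward Euler method, and uses made-up parameter values. None of these modeling choices come from the article.

```python
def simulate_thermostat(T_init=15.0, T0=20.0, T_out=5.0, k=0.1,
                        heat=3.0, dt=0.01, steps=20000):
    """Forward-Euler simulation of a thermostat-controlled room.

    The right-hand side switches on the surface T - T0 = 0:
    the heater is on when T < T0 and off otherwise."""
    T = T_init
    history = [T]
    switches = 0
    was_on = T < T0
    for _ in range(steps):
        on = T < T0
        if on != was_on:          # the trajectory has crossed the switching surface
            switches += 1
            was_on = on
        rate = -k * (T - T_out) + (heat if on else 0.0)
        T += dt * rate
        history.append(T)
    return history, switches

if __name__ == "__main__":
    history, switches = simulate_thermostat()
    print(f"final temperature: {history[-1]:.3f}, switches: {switches}")
```

With these parameters the heater-on field pushes the temperature up toward the switching surface and the heater-off field pushes it back down, so the discrete trajectory reaches T0 and then chatters in a band whose width shrinks with the step size `dt`.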

Particularly interesting behavior (illustrated in figure 1)
occurs on those subsets of $\Sigma_{ij}$ for which both
vector fields $F_i$ and $F_j$ point toward $\Sigma_{ij}$. In this case
we observe sliding motion in which the dynamical system
has solutions that slide along $\Sigma_{ij}$. On the sliding
surface the system obeys a different set of equations
obtained by certain averages of the vector field to
give a regularized sliding vector field. Sliding problems
arise naturally in studying friction and in many types
of control problems and they are typically studied
using the theory of differential inclusions, which studies
differential equations with set-valued right-hand sides.

A PWS map is defined by

  $x_{n+1} = F_i(x_n)$ if $x_n \in S_i$,   (2)

with a piecewise-linear map given by the special case of
taking

  $F_i(x_n) = A_i x_n + b_i$,   $\Sigma = \{x : c \cdot x = 0\}$,   (3)

where the $A_i$ are matrices and the $b_i$ are vectors. Maps
such as (3), often called Feigin maps if they are continuous
or maps with a gap otherwise, arise naturally
in their own right, in control problems and in circle
maps, but they also appear in neuroscience, where the
condition $c \cdot x_n \geq 0$ is a threshold for neurons firing.
They also occur in the context of Poincaré maps for systems
of the form (1). Despite their apparent simplicity,
the dynamics of systems of the form (3) is especially
rich, particularly if the $A_i$ have complex eigenvalues.
Another such map is the remarkable Nordmark map,
which arises very often in the context of the grazing
bifurcations described in the next section. This takes
the form

  $x_{n+1} = \begin{cases} A_1 x_n + b_1 \sqrt{c \cdot x_n}, & c \cdot x_n > 0, \\ A_2 x_n + b_2, & \text{otherwise}. \end{cases}$   (4)

The square-root term means that close to the line
$\Sigma = \{x : c \cdot x = 0\}$ the map has unbounded derivatives and
can introduce infinite stretching in the phase plane.
It is therefore no surprise that such naturally occurring
maps lead to chaotic behavior over a wide range
of parameter values.

Finally, we have hybrid systems [II.18], which are a
combination of PWS flows in the regions $S_i$ and maps
$R: \Sigma \to \Sigma$ that act when the solution trajectories
intersect the set $\Sigma$ (see figure 1). Hybrid systems arise
naturally in control problems, and are also a natural
description of the many types of phenomena that involve
instantaneous impacts, such as bouncing balls, rattling
gears, or constrained mechanical motion such as that
of bearings. The dynamics of hybrid systems can also
be described by using measure differential inclusions.


3 What Behavior Is Observed? 


Many of the phenomena associated with smooth dynamical
systems can be found in PWS ones. For example,
the systems (1) and (2) can have states such as
fixed points, periodic trajectories, and chaotic solutions.
If the fixed points are distant from the discontinuity
surfaces $\Sigma_{ij}$ or if any trajectories intersect such
surfaces transversally, then we can apply the implicit
function theorem to study them, and they have bifurcations
(saddle-node, period-doubling, Hopf) in the same
manner as smooth systems. The main differences arise
when, as a parameter varies, these states intersect $\Sigma_{ij}$
for the first time, so that a fixed point evolves to lie on
$\Sigma_{ij}$, or a trajectory evolves so that it has a tangential,
or grazing, intersection with $\Sigma_{ij}$. These intersections
typically lead to dramatic changes in the behavior of
the solution in a discontinuity-induced bifurcation. Such
bifurcations lead to new routes to chaos and many new
phenomena. As an example consider the system (2) in
the simplest case when $x_n$ is a scalar and for some
$0 < \lambda < 1$ we have

  $x_{n+1} = \begin{cases} \lambda x_n + \mu - 1, & x_n \geq 0, \\ \lambda x_n + \mu, & x_n < 0. \end{cases}$

If $\mu > 1$ then the map defined above has a fixed point
at $x^* = (\mu - 1)/(1 - \lambda)$. Similarly, if $\mu < 0$ the map
has a fixed point at $x^* = \mu/(1 - \lambda)$. At $\mu = 1$ and at
$\mu = 0$, respectively, one of these two fixed points lies
on the discontinuity set $\Sigma = \{x : x = 0\}$. If $0 < \mu < 1$
we see complex behavior, and a bifurcation diagram
for the case of $\lambda = 0.7$ is given in figure 2. This has
the remarkable structure of a period-adding cascade in
which we see a period-$(n + m)$ orbit existing for values
of the parameter $\mu$ that lie between the values of $\mu$
for which we see period-$n$ and period-$m$ orbits. Period
adding (and the closely related phenomenon of period
incrementing) is a characteristic feature of the bifurcations
observed in PWS systems. More generally, very
similar phenomena are observed when periodic orbits
of PWS flows graze with $\Sigma_{ij}$, leading to grazing bifurcations,
which can be studied using the Nordmark map
(4) and observed experimentally. Even more complex
behavior is observed in the bifurcations associated with
the onset of sliding motion.

Figure 2 A discontinuity-induced bifurcation leading to
a period-adding cascade in a piecewise-linear map.
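
The scalar map above is easy to explore numerically. The following minimal sketch iterates the map, discards a transient, and reports the period of the orbit it settles on; the function names, the transient length, and the tolerance are illustrative assumptions. It takes the contraction factor as `lam` and the bifurcation parameter as `mu`.

```python
def gap_map(x, lam, mu):
    """One step of the piecewise-linear map with a discontinuity at x = 0."""
    return lam * x + mu - 1.0 if x >= 0.0 else lam * x + mu

def attractor_period(lam, mu, transient=5000, max_period=200, tol=1e-9):
    """Iterate past a transient, then return the smallest k such that the
    orbit returns to within tol of its starting value after k steps
    (None if no such k <= max_period is found)."""
    x = 0.0
    for _ in range(transient):
        x = gap_map(x, lam, mu)
    x0 = x
    for k in range(1, max_period + 1):
        x = gap_map(x, lam, mu)
        if abs(x - x0) < tol:
            return k
    return None

if __name__ == "__main__":
    lam = 0.7
    for i in range(1, 20):
        mu = i / 20.0
        print(f"mu = {mu:.2f}  period = {attractor_period(lam, mu)}")
```

Outside the interval 0 < mu < 1 the routine reports period 1, corresponding to the two fixed points given in the text, and at mu = 0.5 it finds the symmetric period-2 orbit that alternates between the two branches. Scanning mu more finely gives a rough picture of how the periods are arranged.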

Overall, the study of bifurcations in PWS systems is 
still in a relatively early stage of development. Much 
needs to be done both to classify them and understand 
them mathematically, and to study their rich behavior 
in the many applications in which they arise. 



VI.16 From the N-Body Problem to
Astronomy and Dark Matter

Donald G. Saari 


1 A Dark Mystery from Astronomy 

Puzzles coming from astronomy have proved to be
an intellectual temptation for mathematicians, which
is why for centuries these two areas have enjoyed a
symbiotic relationship. The challenge of describing the
motion of the planets, for instance, led to calculus and
Newton’s equations of motion, which in turn revolutionized
astronomy and physics. But so many mathematical
results have been found about the Newtonian
$N$-body problem that only a flavor can be offered here.
To unify the description, topics are selected to indicate
how they shed light on the intriguing astronomical
mystery of dark matter.

The dark matter enigma can be seen, for example, in 
the fact that the predicted mass level needed to keep 
galaxies from dissipating vastly exceeds what is known 
to exist. This huge difference between predicted and 
known mass amounts is believed to consist of unob- 
served mass called dark matter. While no current evi- 
dence exists about the nature of this missing matter, 
whatever it may be, it appears to dominate the mass of 
galaxies and the universe. 

What is not appreciated is that mathematics is a
major player in this mystery story. It must be; the predicted
mass is a result of a mathematical analysis of the
Newtonian $N$-body problem. A review of the mathematics
is accompanied with descriptions of other $N$-body
puzzles.


1.1 The Mass of Our Sun 


Start with something simpler; how can the mass of our
sun be determined? The answer involves Newton’s laws
and planetary rotational velocities. Let $\mathbf{r}(t)$ be the vector
position of a planet (with mass $m$) relative to the
sun (with mass $M$). If $r = |\mathbf{r}|$, and $G$ is the gravitational
constant, Newton’s inverse-square force law requires
the acceleration $\mathbf{r}''$ to satisfy

  $m\mathbf{r}'' = -\frac{GMm}{r^2}\left(\frac{\mathbf{r}}{r}\right) = -\frac{GMm\,\mathbf{r}}{r^3}$.   (1)

If $r$ is the scalar length of $\mathbf{r}$, then its value is given by
the dot product $r^2 = \mathbf{r} \cdot \mathbf{r} = (\mathbf{r})^2$. By differentiating this
expression twice and using
$[\mathbf{r} \cdot \mathbf{v}]^2 + [\mathbf{r} \times \mathbf{v}]^2 = r^2 v^2$,
where $\mathbf{v} = \mathbf{r}'$ is the velocity, it follows by substitution
into (1) that the scalar acceleration satisfies

  $m r'' = -\frac{GMm}{r^2} + \frac{m[\mathbf{r} \times \mathbf{v}]^2}{r^3} = -\frac{GMm}{r^2} + \frac{m v_{\mathrm{rot}}^2}{r}$,   (2)

where $v_{\mathrm{rot}}$ is the planet’s rotational velocity (the $\mathbf{v}$
component that is orthogonal to $\mathbf{r}$).

Our Earth has an essentially circular orbit, so $r'' \approx 0$;
from (2) we can then deduce that

$$M \approx \frac{r v_{\mathrm{rot}}^2}{G}. \tag{3}$$

Using Earth’s distance from the sun (the r value) and
the time it takes to orbit the sun (a year), the rotational 
velocity and M, the sun’s mass, can be computed. 
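
The computation just described can be sketched numerically. The snippet below is only an illustration of the relation $M \approx r v_{\mathrm{rot}}^2/G$ for a circular orbit; the rounded constants (`G`, `AU`, `YEAR`) and the function names are assumptions introduced here, not values taken from the text.

```python
import math

# Rounded physical constants (assumed values, SI units).
G = 6.674e-11             # gravitational constant, m^3 kg^-1 s^-2
AU = 1.496e11             # mean Earth-sun distance r, in metres
YEAR = 365.25 * 86400.0   # Earth's orbital period, in seconds

def rotational_velocity(r, period):
    """Speed on a circular orbit of radius r that is completed once per period."""
    return 2.0 * math.pi * r / period

def central_mass(r, v_rot):
    """Mass needed to sustain a circular orbit: M ~ r * v_rot**2 / G."""
    return r * v_rot ** 2 / G

v_earth = rotational_velocity(AU, YEAR)
sun_mass = central_mass(AU, v_earth)
print(f"v_rot ~ {v_earth:.0f} m/s, M ~ {sun_mass:.3e} kg")
```

With these inputs the estimate lands close to the commonly quoted solar mass of about $2 \times 10^{30}$ kg.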

Conversely, (3) asserts that the sun’s specified mass 
M is needed to sustain Earth’s rotational velocity with 
its nearly circular orbit. According to (2), a $v_{\mathrm{rot}}$ larger
than

$$v_{\mathrm{rot}} \approx [GM/r]^{1/2} \tag{4}$$

would force a larger $r'' > 0$ and introduce the possibility
that Earth might escape from the solar system.
Luckily for us, this does not happen. 

But there is a potential problem: the planet masses 
are negligible compared with that of the sun, so what 
would it mean if information about a farther out 
planet (Neptune, say) predicted a significantly larger
mass for the sun? Three natural options are as follows. 

(1) Newton’s law is wrong. 

(2) The larger mass prediction means that Neptune’s 
velocity is too large for the planet to be kept in 
the solar system ((2), (4)). Our system may therefore 
dissipate. 

(3) The predicted mass is there. Rather than reflecting 
the sun’s mass, it is unobserved and hiding between 
the orbits of Earth and Neptune. This is a “solar 
system dark matter” concern. 

Fortunately, we escape the onerous task of selecting 
among these undesired alternatives because Kepler’s 
third law requires the numerator in (3), $r v_{\mathrm{rot}}^2$, to essentially
be a constant, i.e., the rotational velocities of
planets decrease to zero as r increases according to
$v_{\mathrm{rot}}^2 = C/r$. Kepler’s law, then, offers us reassurance
that Newton’s law is correct: we will not lose a planet 
(at least in the near future), and there is no solar system 
dark matter concern. 
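
As a rough numerical check of the claim that $r v_{\mathrm{rot}}^2$ is essentially constant across the planets, the sketch below evaluates it for three planets. The orbital radii and periods are rounded reference values supplied here as assumptions, the orbits are treated as circles, and the names `PLANETS` and `r_vrot_squared` are hypothetical.

```python
import math

# Approximate orbital radius (AU) and period (years); rounded values, assumed.
PLANETS = {
    "Earth": (1.000, 1.000),
    "Jupiter": (5.204, 11.862),
    "Neptune": (30.07, 164.8),
}

def r_vrot_squared(r_au, period_yr):
    """r * v_rot**2 for a circular orbit, in units of AU^3 / yr^2."""
    v_rot = 2.0 * math.pi * r_au / period_yr
    return r_au * v_rot ** 2

values = {name: r_vrot_squared(r, T) for name, (r, T) in PLANETS.items()}
for name, value in values.items():
    print(f"{name:8s} r * v_rot^2 = {value:.2f}")
```

The three values agree to within a fraction of a percent, so each planet predicts essentially the same mass for the sun.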

The intent of this hypothetical scenario is to demon- 
strate the close relationship that connects Newton’s law 
of attraction, mass values, and limits on the rotational 
velocities needed to sustain a stable system. But while 
Kepler’s third law ensures that problems do not arise in 
our solar system, what about galaxies? Could the rota- 
tional velocities of stars be too large to be sustained by 
the amount of known mass? This is discussed next. 

1.2 The Mass of a Galaxy 

The minuscule planetary mass values allow our solar 
system to be modeled as a collection of two-body prob- 
lems; this is what allows us to use (2) to determine 
the sun’s mass. But “two-body” approximations are
not realistic for a galaxy, with its billions of stars. To
determine mass values from galactic dynamics would
require a deep understanding of the behavior of N-body
systems in which N is in the billions. Unfortunately, a 
complete mathematical solution is known only for the 
two-body problem, so something else is needed. 

Astrophysicists address this concern in several clever 
ways. Pictures of galaxies, for instance, convey a 
sense of a thick “star soup.” This appearance suggests 
approximating a system of N discrete bodies with a 
continuum model in which the force on a particle is 
determined by Newton’s first and second laws. 

As developed in calculus classes, if a body in a con- 
tinuum symmetric setting is inside a spherical shell, it 
experiences no net gravitational force from the shell. 
If the body is outside a spherical ball, the symmetry 
causes the gravitational force to behave as though the 
ball’s mass is concentrated at its center. With these 
assumptions, an equation can be derived to determine 
M(r), which is a galaxy’s total mass up to distance r 
from the center of mass. This equation, which closely 
resembles (3), is 

$$M(r) \approx \frac{r v_{\mathrm{rot}}^2}{G}. \tag{5}$$

There are other ways to predict galactic masses, but 
(5) is often used to justify the amount of mass that is 
needed to keep stars in circular orbits. 

Herein lies the problem: rather than behaving in a 
way that is consistent with Kepler’s third law, where 
$v_{\mathrm{rot}}$ values decrease as r increases, observations prove
that the $v_{\mathrm{rot}}$ values start with a sharp, almost linear
increase before tapering off to an essentially constant
or increasing value! According to (5), a constant
$v_{\mathrm{rot}}$ means that the predicted mass must grow linearly
with the distance from the center of the galaxy. Obser- 
vations, however, cannot account for anywhere near 
this much mass! As with the solar system dark mat- 
ter story, this vast difference between predictions and 
observations puts forth uncomfortable options. 

(1) Newton’s laws are incorrect, at least at the large 
light-year distances in a galaxy. 

(2) The velocities are too large for the mass values; 
expect the galaxy to fly apart. 

(3) The difference between the predicted $M(r) \approx Cr$
and the known mass is there but cannot be seen; it 
is due to unobserved “dark matter.” 

Most, but not all, astronomers and astrophysicists 
find the first two options to be unpalatable, so they 
concentrate on the third choice. 
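
The contrast between the solar system and a galaxy can be written out from the relation $M(r) \approx r v_{\mathrm{rot}}^2/G$. In the sketch below, $C$ is the Kepler constant of section 1.1 and $v_0$ is a hypothetical constant rotational velocity introduced only for illustration.

```latex
\begin{align*}
% Keplerian velocities (solar system): v_rot^2 = C/r
M(r) &\approx \frac{r\,v_{\mathrm{rot}}^2}{G}
      = \frac{r}{G}\cdot\frac{C}{r}
      = \frac{C}{G}
      && \text{(independent of } r\text{)},\\
% Flat rotation curve (galaxy): v_rot = v_0
M(r) &\approx \frac{r\,v_0^2}{G}
      = \frac{v_0^2}{G}\,r
      && \text{(linear in } r\text{)}.
\end{align*}
```

In the first case every orbit predicts the same central mass; in the second, doubling the distance doubles the predicted mass.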
Figure 1 Discrete-body interactions: (a) interactions,
(b) central configuration, and (c) spider web. 


2 A Mathematical Analysis 

While astrophysicists attack this concern with observa- 
tions and imaginative experiments, an applied mathe- 
matician would examine the mathematics. Is (5) being 
used properly? Namely, could it be that dark matter is 
a mathematical error in the predicted mass values? The 
following outline of how to analyze these questions is 
intended to encourage others to explore these issues. 

As (5) relies on continuum models to approximate 
Newtonian systems with N discrete bodies, it is natu- 
ral to examine this assumption. Clearly, the behavior of 
“star soup” continuum models need not agree with that 
of systems of N discrete bodies. After all, if two discrete 
bodies on circular orbits come close to each other (fig- 
ure 1(a)), their mutual attraction can dwarf the effects 
of other particles, so neither acceleration is directed 
toward the center of mass. Further violating what con- 
tinuum models require, the faster body can drag along 
the slower one, as suggested by both figure 1(a) and pictures
of arms of galaxies with the appearance of stars
being pulled along by others. 

For discrete systems, then, actual rotational veloci- 
ties involve combinations of the total mass M(r) and 
a tugging, rather than just M(r) as required by (5). 
This added pulling requires the stars to have veloci- 
ties larger than those strictly permitted by M(r) : larger 
velocities force exaggerated predicted mass values. The 
mathematical issue is whether this reality of exagger- 
ated mass values invalidates the use of (5). A way to 
analyze this concern is to use analytic N-body solutions
in which the precisely known mass value is compared
with the prediction of (5). First, though, other properties of
N-body systems are described.


2.1 Central Configurations 

Interestingly, certain N-body solutions can permanently
retain the same geometric shape. These shapes
are central configurations; they occur when each body’s
vector position (relative to the center of mass), $\mathbf{r}_j$, lines
up with its acceleration (or force) so that with the same
negative constant $\lambda$,

$$\lambda\mathbf{r}_j = \mathbf{r}_j'', \quad j = 1, \ldots, N. \tag{6}$$

This balancing scenario (6) requires carefully posi- 
tioned bodies. For example, in 1769 Euler proved that 
there is a unique three-body positioning on a line that 
defines such a configuration. This separation of parti- 
cles depends on the mass values. For intuition, if the 
middle one is too close to one of the end ones, it would 
be pulled toward that end. As this is also true for each 
of the end bodies, the intermediate-value theorem sug- 
gests that there is a balancing position for the middle 
body where it defines a central configuration. Then, in 
1772, Lagrange discovered that the only noncollinear 
three-body central configuration is an equilateral tri- 
angle (figure 1(b)); this is true independent of mass 
values! 

The mutual tugging (6) among the bodies allows each 
body to behave as if it were in a two-body system ((1)). 
Indeed, with appropriate initial conditions, a planar 
central configuration will rotate forever in a circular 
or elliptic manner, while retaining its shape. This hap- 
pens in our solar system; in 1906 the astronomer Max 
Wolf discovered that the sun, Jupiter, and some aster- 
oids, which he named the Trojans, form the rotating 
equilateral triangle configuration seen in figure 1(b)! 
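
A short calculation indicates why the equilateral triangle works for any masses. The sketch assumes three bodies with masses $m_k$ at the vertices of a triangle of side $s$, positions $\mathbf{r}_k$ measured from the center of mass (so $\sum_k m_k\mathbf{r}_k = \mathbf{0}$), and writes $M_{\mathrm{tot}} = m_1 + m_2 + m_3$; the symbols $s$ and $M_{\mathrm{tot}}$ are introduced here for illustration.

```latex
\begin{align*}
\mathbf{r}_j''
  &= \sum_{k \ne j} \frac{G m_k (\mathbf{r}_k - \mathbf{r}_j)}{|\mathbf{r}_k - \mathbf{r}_j|^3}
   = \frac{G}{s^3} \sum_{k} m_k (\mathbf{r}_k - \mathbf{r}_j)
   && \text{(all mutual distances equal } s\text{)}\\
  &= \frac{G}{s^3}\Bigl(\sum_{k} m_k \mathbf{r}_k - M_{\mathrm{tot}}\,\mathbf{r}_j\Bigr)
   = -\frac{G M_{\mathrm{tot}}}{s^3}\,\mathbf{r}_j .
\end{align*}
```

Each body therefore satisfies the central configuration condition with the same negative constant $\lambda = -GM_{\mathrm{tot}}/s^3$, whatever the individual masses are.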

2.2 Rings of Saturn and Spiderwebs 

In the nineteenth century, the mathematician James 
Clerk Maxwell used central configurations to study 
properties of the rings of Saturn. To indicate what he 
did, take a circle and any number, k, of evenly spaced 
lines passing through the circle’s center, i.e., each adja- 
cent pair is separated by the same angle. Place a body 
with mass m at every point where a line passes through 
the circle; these 2k bodies model the particles in a ring. 
By symmetry, the force acting on each body is the same, 
so the system satisfies the central configuration condition
(6). To include Saturn, place a body with any mass
value, $m^*$, at the circle’s center $\mathbf{r} = \mathbf{0}$. By symmetry, the
forces of the 2k bodies cancel, so $\mathbf{r}'' = \mathbf{0}$. As $\lambda\mathbf{0} = \mathbf{0}$,
this (N = 2k + 1)-body system satisfies (6). By placing
this system in a circular motion, Maxwell created a
dynamical model for a ring of Saturn. 
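
The symmetry argument can be checked numerically. The following sketch places the 2k ring bodies and the central body in the complex plane and tests whether every acceleration is the same real multiple of the position. The units (G = 1), the sample masses, and the function names are assumptions made for illustration.

```python
import cmath
import math

def ring_with_center(k, m, m_star, radius=1.0):
    """2k equal masses m evenly spaced on a circle, plus m_star at the center."""
    n = 2 * k
    ring = [radius * cmath.exp(2j * math.pi * i / n) for i in range(n)]
    return [m] * n + [m_star], ring + [0j]

def accelerations(masses, positions, G=1.0):
    """Newtonian acceleration of each body, with positions as complex numbers."""
    acc = []
    for j, zj in enumerate(positions):
        a = 0j
        for i, (mi, zi) in enumerate(zip(masses, positions)):
            if i != j:
                a += G * mi * (zi - zj) / abs(zi - zj) ** 3
        acc.append(a)
    return acc

masses, positions = ring_with_center(k=4, m=0.01, m_star=1.0)
acc = accelerations(masses, positions)

# For the ring bodies, acceleration / position should be one common
# negative real number (the constant in the central configuration condition).
ratios = [a / z for a, z in zip(acc[:-1], positions[:-1])]
print("central body acceleration:", abs(acc[-1]))
print("ratio for first ring body:", ratios[0])
```

By symmetry the center of mass sits at the origin, so the positions used here are already measured relative to it.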
As represented in figure 1(c), Maxwell's approach can
be extended by using any number of circles, say n of
them. Wherever a line passes through the jth circle,
place a mass $m_j$, $j = 1, \ldots, n$. As each line passes
through a circle twice, this creates the (N = 2nk)-body
problem. Whatever the mass choices, by using
an argument similar to the one used for the collinear 
three-body configuration, which involves adjusting the 
radii of the circles, it follows that a balancing point can 
be found where (6) is satisfied. This spiderweb-style 
configuration can be placed in a circular orbit. 

2.3 Unsolved Problems 

Beyond the central configurations described above and 
many others that are known, it remains interesting to 
discover new classes and kinds of central configura- 
tions. This means that we need to know when all of 
them have been found; could there be an infinite number
of them? Not for three- and four-body problems; the
Euler and Lagrange solutions are the only three-body
choices, and Hampton and Moeckel have proved that
there are a finite number of four-body central configurations.
A fundamental unsolved problem for $N \ge 5$ is
whether, for given mass values, there are only a finite 
number of central configurations. 

3 Mass Values 

The main objective is to determine whether (5) can 
be trusted to always provide an accurate, or at least 
reasonably accurate, mass prediction when applied to 
systems of N discrete bodies. For instance, it may be 
acceptable for the predicted value to be twice that of the 
actual value. A way to analyze this question is to select 
mass values for the spiderweb configuration. Remem- 
ber, no matter what mass values are selected, adjusting 
the distances between circles ensures that it defines 
a central configuration. In fact, because a rotation or 
scale change of a central configuration is again that 
same central configuration, let the minimum spacing 
between circles be one unit. 

As closely as it is possible to do with N discrete bod- 
ies, this configuration resembles the symmetric star 
soup continuum setting. By choosing the number of 
lines to be sufficiently large, say k = 10 000 000, the
mass distribution is very symmetric. By selecting a
large n value, say n = 10 000, this (N = 2nk)-body
problem involves billions of bodies. 


Whatever the choice of mass values $m_j$, the spiderweb
rotates like a rigid body. This means that there is a
constant $D > 0$ such that the common rotational velocity
of the bodies at distance $r$ is $v_{\mathrm{rot}} = Dr$. The (5) mass
prediction for constant $E > 0$ is, therefore,

$$M(r) \approx Er^3. \tag{7}$$

Equation (7) holds independent of the choice of the 
masses, which allows an infinite number of examples 
to be created. For instance, if the masses on the jth
circle are $m_j = 1/(2kj)$, $j = 1, \ldots, n$, then the ring’s
total mass is $1/j$. A computation shows that the precise
mass value out to the sth ring is

$$\sum_{j=1}^{s} \frac{1}{j} \approx \ln(s).$$

Remember, the minimum spacing between circles is
one unit, so to reach the sth ring, it must be that $r \ge s$.
Thus, to have the $M(s) = 20$ mass value, it must be that
$\ln(r) \ge \ln(s) = 20$, or $r \ge e^{20}$. This forces the (7)
predicted value to exceed $E[e^{20}]^3$, which means that the (5)
prediction exponentially exaggerates the precise value 
by a multiple in the trillions. Clearly, this is unaccept- 
able. Indeed, any consistent multiple larger than 100 
would seriously undermine claims about the existence 
and amount of dark matter. 
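
The gap between the logarithmic growth of the actual mass and the cubic growth of the prediction can be seen with much smaller numbers. The sketch below compares the two for a few ring counts s, using the smallest admissible radius r = s. Setting the constant E to 1 is an arbitrary assumption for illustration, and the function names are hypothetical.

```python
import math

def actual_mass(s):
    """Total mass out to ring s when ring j carries total mass 1/j."""
    return math.fsum(1.0 / j for j in range(1, s + 1))

def predicted_mass(r, E=1.0):
    """Cubic prediction M(r) ~ E * r**3 for the rigidly rotating spiderweb.
    E = 1.0 is an illustrative choice, not a value from the text."""
    return E * r ** 3

for s in (10, 1_000, 100_000):
    exact = actual_mass(s)
    ratio = predicted_mass(s) / exact   # prediction at the smallest radius r = s
    print(f"s={s:>7}  actual={exact:7.3f}  ln(s)={math.log(s):7.3f}  ratio={ratio:.2e}")
```

Even at a thousand rings the prediction overshoots by many orders of magnitude, and the ratio keeps growing with s.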

As this simple analysis proves that (5) cannot be reli- 
ably used in all situations, the wonderful mathematical 
challenge is to determine when, where, and whether it 
does provide reasonably valid predictions. Of course, 
it follows from Newtonian dynamics, which requires 
a dragging effect, that (5) must exaggerate predicted 
mass values. So the issue is to determine by how much 
in general. An even more interesting mathematical chal- 
lenge is to discover equations that will predict mass 
values for discrete systems that can be trusted. 
VI.17 The N-Body Problem and the Fast
Multipole Method 

Lexing Ying 

1 Introduction 

The N-body problem in astrophysics studies the motion
of a large group of celestial objects that interact
with each other. An essential step in the simulation
of an N-body problem is the following evaluation of
the gravitational potentials that involve pairwise
interaction among these objects: given a set $P \subset \mathbb{R}^3$ of $N$
stars and the masses $\{f(y) : y \in P\}$ of the stars, the
goal is to compute for each star $x \in P$ the gravitational
potential $u(x)$ defined by

$$u(x) = \sum_{y \in P,\, y \ne x} G(x, y) f(y),$$

where $G(x, y) = 1/|x - y|$ is the Newtonian gravitational
kernel. Similar computational tasks appear in
many other areas of physics. For example, in electromagnetics,
the evaluation of the electrostatic potentials
takes exactly the same form, with the kernel
$G(x, y) = 1/|x - y|$ in three dimensions or
$G(x, y) = \ln(1/|x - y|)$ in two dimensions; evaluation of magnetic
induction with the Biot-Savart law also takes a similar form.
In acoustic scattering, a similar computation shows up
with more oscillatory kernel functions $G(x, y)$. Finally,
problems related to heat diffusion often require this
computation with Gaussian-type kernel $G(x, y)$.

A direct computation of $u(x)$ for all $x \in P$ clearly
takes $O(N^2)$ steps. For many of these applications, the
number of objects N can be in the millions, if not more,
so a computation of order $O(N^2)$ can be very time-consuming.
The fast multipole method, developed by
Greengard and Rokhlin, computes an accurate approx- 
imate solution in about O(N) steps. The word mul- 
tipole refers to a series expansion that is used fre- 
quently in electromagnetics for describing a field in a 
region that is well separated from the source charges. 
Each term of this multipole expansion is a product of 
an inverse power in the radial variable and a spheri- 
cal harmonic function in the angular variables. Since 
the multipole expansions often converge rapidly, they 
can be truncated after just a few terms, a fact that
plays a key role in the efficiency of the fast multipole 
method. 
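
For reference, the direct evaluation that the fast multipole method is designed to avoid can be sketched in a few lines. This is a plain double loop over all pairs with the Newtonian kernel; the function name and the three sample stars are assumptions made for illustration.

```python
import math

def direct_potentials(points, masses):
    """u(x) = sum over y != x of f(y) / |x - y|, for every x in P.
    Uses two nested loops, hence on the order of N**2 kernel evaluations."""
    u = []
    for i, x in enumerate(points):
        total = 0.0
        for j, y in enumerate(points):
            if j != i:
                total += masses[j] / math.dist(x, y)
        u.append(total)
    return u

# Three hypothetical stars in R^3 with masses 1, 2, and 4.
stars = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
potentials = direct_potentials(stars, [1.0, 2.0, 4.0])
print(potentials)
```

Doubling the number of stars quadruples the number of kernel evaluations, which is what makes this approach impractical for millions of objects.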



Figure 1 N points quasiuniformly
distributed in a unit box $[0, 1]^2$.


2 Algorithm Description 

To illustrate the algorithm, we consider the two-dimensional
case and assume that the point set P is distributed
quasiuniformly inside the unit box $\Omega = [0, 1]^2$
(see figure 1). The algorithms for three-dimensional and 
nonuniform distributions are more involved, but the 
main ideas remain the same. 

We start with a slightly simpler problem, where B and
A are two disjoint cubes of the same size, each containing
O(n) points. Consider the evaluation at points in A
of the potentials induced by the points in B, i.e., for
each $x \in A \cap P$, compute

$$u(x) = \sum_{y \in B \cap P} G(x, y) f(y).$$

Though direct computation of $u(x)$ takes $O(n^2)$ steps,
there is a much faster way to approximate the calculation
if A and B are well separated. Let us imagine
A and B as two galaxies. When A and B are far away
from each other, instead of considering all pairwise
interactions, one can sum up the mass in B to obtain
$f_B = \sum_{y \in B \cap P} f(y)$ and place it at the center $c_B$ of B,
evaluate the potential $u_A = G(c_A, c_B) f_B$ at the center
$c_A$ of A as if all the mass is located at $c_B$, and finally use
$u_A$ as the approximation of the potential at each point
x in A. A graphical description of this three-step procedure
is given in figure 2 and it takes only O(n) steps
instead of $O(n^2)$ steps. The procedure works well when
A and B are sufficiently far away from each other, but it
gives poor approximation when A and B are close. For
the time being, however, let us assume that the procedure
provides a valid approximation whenever the
distance between A and B is greater than or equal to
the width of A and B.

Figure 2 A three-step procedure that efficiently approximates
the potential in A induced by the sources in B. The
computational cost is reduced from $O(n^2)$ to O(n).

Figure 3 The domain is partitioned with an octree structure
until the number of points in each leaf node is bounded by
a small constant.

From an algebraic point of view, the three-step procedure
is a rank-1 approximation of the interaction
between A and B:

$$G(A, B) \approx \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix} G(c_A, c_B) \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix},$$

where $G(A, B)$ is the matrix with entries given by
$(G(x, y))_{x \in A \cap P,\, y \in B \cap P}$. $f_B$ and $u_A$ are now the
intermediate results of applying this rank-1 approximation.
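
The three-step procedure can be compared with direct summation on two well-separated boxes. The sketch below is an illustration under stated assumptions: it uses the kernel 1/|x - y| on points in the plane for simplicity, random points generated with a fixed seed, and two hypothetical unit boxes whose centers are ten widths apart; the function names are not from the text.

```python
import math
import random

def kernel(x, y):
    """Illustrative kernel 1/|x - y| (an assumption; any smooth kernel works)."""
    return 1.0 / math.dist(x, y)

def direct(targets, sources, masses):
    """Exact potentials in A induced by the sources in B: O(n^2) work."""
    return [sum(f * kernel(x, y) for y, f in zip(sources, masses)) for x in targets]

def three_step(targets, masses, c_A, c_B):
    """Rank-1 approximation: O(n) work."""
    f_B = sum(masses)                  # step 1: collect the mass of B at c_B
    u_A = kernel(c_A, c_B) * f_B       # step 2: evaluate once at the center c_A
    return [u_A for _ in targets]      # step 3: use u_A at every point of A

rng = random.Random(0)
A = [(rng.random(), rng.random()) for _ in range(200)]          # box [0,1] x [0,1]
B = [(10.0 + rng.random(), rng.random()) for _ in range(200)]   # box [10,11] x [0,1]
f = [rng.random() for _ in B]

exact = direct(A, B, f)
approx = three_step(A, f, c_A=(0.5, 0.5), c_B=(10.5, 0.5))
worst = max(abs(a - e) / e for a, e in zip(approx, exact))
print(f"worst relative error: {worst:.3f}")
```

Moving the boxes closer together makes the worst-case error grow, which is the accuracy issue taken up in section 3.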

This simple problem considers the potential in A only 
from points in B. However, we are interested in the 
interaction between all points in P. To get around this, 
we partition the domain hierarchically with an octree 
structure until the number of points in each leaf box is 
less than a prescribed O(1) constant (see figure 3). The
whole octree then has $O(\log N)$ levels, and we define
the top level to be level 0. At level $\ell$, there are $4^\ell$
cubes and each cube has $O(N/4^\ell)$ points due to the
quasiuniform point distribution.

Figure 4 The algorithm at different levels. B stands for a
source box. A is a target box for which the interaction with B
is processed at the current level. Dark gray denotes boxes
for which the interaction has already been considered by
the previous level. Light gray denotes boxes for which the
interaction is being considered at the current level. For the
first three plots, a three-step procedure is used to accelerate
the interaction between well-separated boxes. For the last
plot, the nearby interaction is handled directly at the leaf
level.

The algorithm starts from level 2. Let B be one of the
boxes on level 2 (see figure 4(a)). The near field $N(B)$
of B is the union of B and its neighboring boxes, and
the far field $F(B)$ of B is the complement of the near
field $N(B)$. For a box A in B’s far field, the computation
of the potentials at points in A induced by the points
in B can be accelerated with the three-step procedure.
There are $4^2$ possibilities for B, and for each B there are
O(1) choices for A (see figure 4(a)). Since both A and
B contain $O(N/4^2)$ points due to the quasiuniformity
assumption, the cost for all the interaction that can be
taken care of on this level is

$$4^2 \cdot O(1) \cdot O(N/4^2) = O(N).$$

We cannot process the interaction between B and its 
near field on this level, so we go down one level in the 
octree. 

We again use B to denote a cube at level 3 (see fig- 
ure 4(b)). We do not need to consider B’s interaction 
with the far field of B’s parent since it has already been 
taken care of at the previous level. Only the interac- 
tion between B and its parent’s near field needs to be 
considered. At this level, there are at most $6^2$ boxes in
its parent’s near field. Out of these boxes, 27 of them
are, typically, well separated from B and the interaction
between B and these boxes can be accelerated using the
three-step procedure. The set of these boxes is called
the interaction list of B. Since each box on this level
contains $O(N/4^3)$ points and there are $4^3$ possibilities
for B, the total cost for the three-step procedures
performed on this level is again

$$4^3 \cdot O(1) \cdot O(N/4^3) = O(N).$$

However, for the interaction of B and the boxes in its 
near field, we need to go down again. 

For a general level $\ell$, there are $4^\ell$ choices for B (see
figure 4(c)). For each B, there are at most 27 possibilities
for A. Since each box on this level has $O(N/4^\ell)$ points,
the cost of far-field computation is

$$4^\ell \cdot O(1) \cdot O(N/4^\ell) = O(N).$$

Once we reach the leaf level, there is still the interaction
between a leaf box and its neighbors to consider
(see figure 4(d)). For that, we just use direct computation.
Since there are O(N) leaf boxes, each containing
O(1) points and having O(1) neighbors, the total cost
of direct computation is

$$O(N) \cdot O(1) \cdot O(1) = O(N).$$

For the three-step procedure between a pair of well- 
separated boxes A and B, it is clear that the first step 
depends only on B and the last step depends only on A. 
Therefore, there is an opportunity for reusing compu- 
tation. Taking this observation into consideration, we 
can write the algorithm as follows. 

(1) For each level $\ell$ and each box A on level $\ell$, set $u_A$
to be zero.

(2) For each level $\ell$ and each box B on level $\ell$, compute
$f_B = \sum_{y \in B \cap P} f(y)$.

(3) For each level $\ell$ and each box B on level $\ell$, and for
each box A in B’s interaction list, update
$u_A = u_A + G(c_A, c_B) f_B$.

(4) For each level $\ell$ and each box A on this level, update
$u(x) = u(x) + u_A$ for each $x \in A \cap P$.

(5) For each box A at the leaf level, update
$u(x) = u(x) + \sum_{y \in N(A) \cap P} G(x, y) f(y)$ for each $x \in A \cap P$.

Since the cost at each level is O(N) and there are
$O(\log N)$ levels, the whole cost of the algorithm is
$O(N \log N)$.

The question now is whether we can do it in fewer 
steps. The answer is that we can, and it is based on the 
following simple observation. 

Let $B_1, \ldots, B_4$ be B’s children. Since $B = B_1 \cup \cdots \cup B_4$
and all the $B_i$ are disjoint, we can conclude that (see
figure 5(a))

$$f_B = f_{B_1} + f_{B_2} + f_{B_3} + f_{B_4}.$$

Figure 5 A basic observation that speeds up the computation
of $f_B$ and $u_A$. (a) $f_B$ can be computed directly from
$f_{B_i}$, where the $B_i$ are B’s children. (b) Instead of adding $u_A$
directly to all the points in A, we only add to $u_{A_i}$, where the
$A_i$ are A’s children.

Therefore, assuming that the $f_{B_i}$ are ready, using the
previous line to compute $f_B$ is much more efficient than
summing over all $f(y)$ in B for large B.

Similarly, for each A, we update $u(x) := u(x) + u_A$
for each $x \in A \cap P$. Assume that $A_1, \ldots, A_4$ are the
children boxes of A. Then, since we perform the same
step for each $A_i$ and each x belongs to one such $A_i$, we
can simply update $u_{A_i} := u_{A_i} + u_A$ instead, which is
much more efficient (see figure 5(b)).

Notice that in order to carry out these two improvements,
we make the assumption that for $f_B$ we visit
the parent after the children, while for $u_A$ we visit the
children after the parent. This requires us to traverse 
the octree with different orders at different stages of 
the algorithm. Putting the pieces together, we reach the 
following algorithm. 

(1) For each level $\ell$ and each box A on level $\ell$, set $u_A$
to zero.

(2) Traverse the tree from level L - 1 up to level 0,
and for each box B, if B is a leaf box, set
$f_B = \sum_{y \in B \cap P} f(y)$. If B is not a leaf box, set
$f_B = f_{B_1} + \cdots + f_{B_4}$.

(3) For each level $\ell$, for each B, and for each A in B’s
interaction list, update $u_A = u_A + G(c_A, c_B) f_B$.

(4) Traverse the tree from level 0 down to level L - 1,
and for each box A, if A is not a leaf box, update
$u_{A_i} = u_{A_i} + u_A$ for each child $A_i$ of A. If A is a leaf
box, update $u(x) = u(x) + u_A$.

(5) For each box A at the leaf level, update
$u(x) = u(x) + \sum_{y \in N(A) \cap P} G(x, y) f(y)$ for each $x \in A \cap P$.

Steps (1), (3), and (5) are the same as in the previous
algorithm, and their cost is O(N) each. For steps (2)
and (4), since there are at most O(N) boxes in the tree
and the algorithm spends O(1) steps per box, the cost
is again O(N). As a result, the total cost is, as promised,
O(N). This is essentially the fast multipole method
algorithm proposed by Greengard and Rokhlin. 
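
The five steps can be put together in a compact sketch. The code below is only an illustration under simplifying assumptions: the points lie in the unit square, the tree is a uniform hierarchy of $2^\ell \times 2^\ell$ boxes down to a chosen `leaf_level` (rather than an adaptively refined tree), and the crude rank-1 (center-to-center) approximation of section 2 is used, so accuracy is low. All function names are hypothetical.

```python
import random
from collections import defaultdict

def center(level, box):
    """Center of box (a, b) on a level with 2**level boxes per side."""
    w = 1.0 / 2 ** level
    return ((box[0] + 0.5) * w, (box[1] + 0.5) * w)

def interaction_list(level, box):
    """Boxes in the parent's near field that are not adjacent to `box`."""
    n = 2 ** level
    pa, pb = box[0] // 2, box[1] // 2
    return [(a, b)
            for a in range(2 * pa - 2, 2 * pa + 4)
            for b in range(2 * pb - 2, 2 * pb + 4)
            if 0 <= a < n and 0 <= b < n
            and max(abs(a - box[0]), abs(b - box[1])) > 1]

def rank1_fmm(points, masses, leaf_level, kernel):
    n = 2 ** leaf_level
    leaf = [(min(int(x * n), n - 1), min(int(y * n), n - 1)) for x, y in points]
    members = defaultdict(list)
    for i, box in enumerate(leaf):
        members[box].append(i)

    # Step (2): upward pass, children before parents.
    f = [defaultdict(float) for _ in range(leaf_level + 1)]
    for box, idx in members.items():
        f[leaf_level][box] = sum(masses[i] for i in idx)
    for level in range(leaf_level, 0, -1):
        for (a, b), value in f[level].items():
            f[level - 1][(a // 2, b // 2)] += value

    # Steps (1) and (3): u_A starts at zero; add far-field contributions.
    u = [defaultdict(float) for _ in range(leaf_level + 1)]
    for level in range(2, leaf_level + 1):
        for B, f_B in f[level].items():
            c_B = center(level, B)
            for A in interaction_list(level, B):
                u[level][A] += kernel(center(level, A), c_B) * f_B

    # Step (4): downward pass, parents before children, then leaves to points.
    for level in range(2, leaf_level):
        for (a, b), u_A in u[level].items():
            for child in ((2 * a, 2 * b), (2 * a + 1, 2 * b),
                          (2 * a, 2 * b + 1), (2 * a + 1, 2 * b + 1)):
                u[level + 1][child] += u_A
    result = [u[leaf_level].get(box, 0.0) for box in leaf]

    # Step (5): direct computation with the leaf box's near field.
    for i, (a, b) in enumerate(leaf):
        for na in (a - 1, a, a + 1):
            for nb in (b - 1, b, b + 1):
                for k in members.get((na, nb), ()):
                    if k != i:
                        result[i] += kernel(points[i], points[k]) * masses[k]
    return result

rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(300)]
ms = [rng.random() for _ in pts]
# With a constant kernel the rank-1 step is exact, so this checks that
# every pair of points is accounted for exactly once.
check = rank1_fmm(pts, ms, 4, lambda x, y: 1.0)
```

The constant-kernel run is a bookkeeping test: each entry of `check` should equal the total mass minus the point's own mass. For a real kernel the same structure applies, with the accuracy governed by the quality of the low-rank approximation.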

3 Algorithmic Details 

We have so far been ignoring the issue of accuracy. In 
fact, if we use only $f_B$ and $u_A$ as described, the accuracy
is low, since A and B can be only one box away
from each other. Recall that the three-step procedure
that we have used so far is a poor rank-1 approximation
of the interaction between A and B. The low-rank
approximation used in the fast multipole method by
Greengard and Rokhlin is based on the truncated multipole
and local expansions. The resulting $f_B$ and $u_A$
are the coefficients of the truncated multipole and local
expansions, respectively. In addition, there are natural
multipole-to-multipole operators $T_{B,B_i}$ that transform
$f_{B_i}$ to $f_B$ ($f_B = \sum_i T_{B,B_i} f_{B_i}$), local-to-local operators
$T_{A_i,A}$ that take $u_A$ to $u_{A_i}$ ($u_{A_i} = u_{A_i} + T_{A_i,A} u_A$),
and multipole-to-local operators $T_{A,B}$ that take $f_B$ to
$u_A$ ($u_A = u_A + T_{A,B} f_B$). We will not go into the details
of these representations here, except for two essential
points.

• For any fixed accuracy, both $f_B$ from the multipole
expansions and $u_A$ from the local expansions
contain only O(1) numbers.

• The translation operators are maps from O(1)
numbers to O(1) numbers, and applying them
takes O(1) steps. A lot of effort has been devoted
to further optimizing the implementation of these
operators.

From these two points it is clear that the overall O(N)
complexity of the fast multipole method remains the
same when more accurate low-rank approximations are
used. There are many other ways to implement the
low-rank approximations and the translation operators
between them. Some examples include the $\mathcal{H}^2$ matrices
of Hackbusch et al., the fast multipole method
without multipole expansion of Anderson, and the 
kernel-independent fast multipole method. 

Acknowledgments. Part of the text and all of the figures 
are reproduced from Ying (2012) with the permission of the 
publisher. 
VI.18 The Traveling Salesman Problem

William Cook 


The traveling salesman problem, or TSP for short, is a 
prominent model in discrete optimization [IV.38]. In 
its general form, we are given a set of cities and the cost 
to travel between each pair of them. The problem is to 
find the cheapest route to visit each city and to return 
to the starting point. The TSP owes its fame to its suc- 
cess as a benchmark for developing and testing algo- 
rithms for computationally difficult problems in dis- 
crete mathematics, operations research, and computer 
science. The decision version of the problem is a member
of the NP-hard [I.4 §4.1] complexity class. Applications
of the TSP arise in logistics, machine scheduling,
computer-chip design, genome mapping, data analysis, 
guidance of telescopes, circuit-board drilling, machine- 
translation systems, and in many other areas. 

The origin of the TSP’s catchy name is somewhat of a 
mystery. It first appears in print in a 1949 RAND Corpo- 
ration research report by Julia Robinson, but she uses 
the term in an offhand way, suggesting it was a famil- 
iar concept at the time. The origin of the problem itself 
goes back to the early 1930s, when it was proposed 
by Karl Menger at a mathematics colloquium in Vienna 
and by Hassler Whitney in a seminar at Princeton Uni- 
versity. The problem also has roots in graph theory and 
in the study of Hamiltonian circuits. 

1 Exact Algorithms 

A route that visits all cities and returns to the starting 
point is called a tour; finding an optimal tour (that is,
one that is cheapest) is the goal of the TSP. This can be 
accomplished by simply enumerating all permutations 
of the cities and evaluating the costs of the correspond- 
ing tours. For an n-city TSP, this approach requires time 
proportional to n factorial. In 1962, Bellman and the 
team of Held and Karp each showed that any instance 
of the problem can be solved in time proportional to $n^2 2^n$.
This is significantly faster than brute-force enumera- 
tion, but the algorithm takes both exponential time and 
exponential space. 
The heart of the Bellman-Held-Karp method can be
described by a simple recursive equation. To set this up,
denote the cities by the labels $1, 2, \ldots, n$ and, for any
pair of cities $(i, j)$, let $c_{ij}$ denote the cost of traveling
from city i to city j. We may set city 1 as the fixed
starting point for the tours; let $N = \{2, \ldots, n\}$ denote
the remaining cities. For any $S \subseteq N$ and for any $j \in S$,
let $\mathrm{opt}(S, j)$ denote the minimum cost of a path starting
at 1, running through all points in S, and ending at j.
We have

$$\mathrm{opt}(S, j) = \min(\mathrm{opt}(S \setminus \{j\}, i) + c_{ij} : i \in S \setminus \{j\}). \tag{1}$$

Moreover, the optimal value of the TSP is

$$v^* = \min(\mathrm{opt}(N, j) + c_{j1} : j \in N). \tag{2}$$

Observe that for all $j \in N$ we have $\mathrm{opt}(\{j\}, j) = c_{1j}$.
Starting with these values, the recursive equation (1) is
used to build the values $\mathrm{opt}(S, j)$ for all $S \subseteq N$ and
$j \in S$, working our way through sets with two elements,
then sets with three elements, and step by step
up to the full set N. Once we have the values $\mathrm{opt}(N, j)$
for all $j \in N$, we use (2) to find $v^*$. Now, in a second
pass, the optimal tour is computed by first identifying
a city $v_{n-1}$ such that $\mathrm{opt}(N, v_{n-1}) + c_{v_{n-1}1} = v^*$,
then identifying a city $v_{n-2} \in N \setminus \{v_{n-1}\}$ such
that $\mathrm{opt}(N \setminus \{v_{n-1}\}, v_{n-2}) + c_{v_{n-2}v_{n-1}} = \mathrm{opt}(N, v_{n-1})$,
and so on until we have $v_1$. The optimal tour is
$(1, v_1, \ldots, v_{n-1})$. This second pass is to permit the
algorithm to store only the values $\mathrm{opt}(S, j)$ and not the
actual paths that determine these values. 
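
The recursion and the second pass can be sketched as follows. This is an illustrative implementation under stated assumptions: cities are numbered from 0 (so city 0 plays the role of city 1 in the text), subsets S are encoded as bitmasks, and the function name and sample cost matrix are hypothetical.

```python
from itertools import combinations

def held_karp(c):
    """Dynamic program over subsets; c[i][j] is the cost from city i to city j.
    Returns (optimal cost, optimal tour starting at city 0). Assumes n >= 2."""
    n = len(c)
    opt = {}
    # Base case: a path from the start directly to j.
    for j in range(1, n):
        opt[(1 << j, j)] = c[0][j]
    # Build up through sets with two elements, three elements, and so on.
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = sum(1 << j for j in subset)
            for j in subset:
                prev = S & ~(1 << j)
                opt[(S, j)] = min(opt[(prev, i)] + c[i][j]
                                  for i in subset if i != j)
    full = (1 << n) - 2                      # every city except the start
    best = min(opt[(full, j)] + c[j][0] for j in range(1, n))

    # Second pass: recover the tour from the stored values only.
    reversed_tour, S, target, nxt = [], full, best, 0
    while S:
        j = next(j for j in range(1, n)
                 if S & (1 << j) and opt[(S, j)] + c[j][nxt] == target)
        reversed_tour.append(j)
        target, nxt = opt[(S, j)], j
        S &= ~(1 << j)
    return best, [0] + reversed_tour[::-1]

# A small hypothetical instance with asymmetric integer costs.
costs = [[0, 2, 9, 10],
         [1, 0, 6, 4],
         [15, 7, 0, 8],
         [6, 3, 12, 0]]
best_cost, best_tour = held_karp(costs)
print(best_cost, best_tour)   # 21 [0, 2, 3, 1]
```

The dictionary `opt` holds one value per pair (S, j), which is the exponential storage requirement of the method.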

The running-time bound arises from the fact that in
an n-city problem there are $2^{n-1}$ subsets S that do not
contain the starting point. For each of these we consider
at most n choices for the end city j, and the computation
of the $\mathrm{opt}(S, j)$ value involves fewer than n additions
and n comparisons. Multiplying $2^{n-1}$ by n by 2n,
we have that the total number of steps is no more than
$n^2 2^n$.

Despite the attention that the TSP receives, no algo- 
rithm with a better worst-case running time has been 
discovered for general instances of the problem in the 
fifty years plus since the work of Bellman, Held, and 
Karp. It is a major open challenge in TSP research to 
improve upon the $n^2 2^n$ bound.

2 Approximation Algorithms 

The problem of determining whether or not a given 
graph has a Hamiltonian circuit, that is, a circuit that 
visits every vertex of the graph, is also a member of the 
NP-hard complexity class. This problem can be encoded
as a TSP by letting the cities correspond to the vertices
of the graph, assigning cost 0 to all pairs of cities cor- 
responding to edges in the graph and assigning cost 
1 to all other pairs of cities. A Hamiltonian circuit has 
cost 0, and any nonop timal tour has cost at least 1. A 
method that returns a solution within any fixed per- 
centage of optimality v\ill therefore provide an answer 
to the yes/no question. Thus, unless NP = P, there can 
be no a-approximation algorithm for the TSP that runs 
in polynomial time, that is, no algorithm that is guar- 
anteed to produce a tour of cost no more than « times 
the cost of an optimal tour. 

An important case of the TSP is when the travel costs 
are symmetric, that is, when the cost of traveling from 
city A to city B is the same as the cost of traveling from B 
to A. A further natural restriction, known as the triangle 
inequality, states that for any three cities A, B, and C, the 
cost of traveling from A to B plus the cost of traveling 
from B to C must not be less than the cost of traveling 
from A directly to C. Under these two restrictions, there 
is a polynomial-time algorithm due to Christofides that 
finds a tour the cost of which is guaranteed to be no 
more than $\frac{3}{2}$ times the cost of an optimal tour.

A symmetric instance of the TSP can be described
by a complete graph $G = (V, E)$ having vertex set
$V = \{1, \ldots, n\}$ and edge set $E$ consisting of all pairs
of vertices, together with travel costs $(c_e : e \in E)$.
Christofides’s algorithm begins by finding a minimum-cost
spanning tree $T$ in $G$; the cost of $T$ is no more
than the cost of an optimal TSP tour. The algorithm
then computes a minimum-cost perfect matching $M$ in
the subgraph of $G$ spanned by the vertices that meet
an odd number of edges in the tree $T$; the cost of $M$ is
no more than $\frac{1}{2}$ times the cost of an optimal TSP tour,
since shortcutting an optimal tour to these vertices does
not increase its cost (by the triangle inequality) and
any tour on an even number of vertices is the union of
two perfect matchings.
The union of the edges in $T$ and the edges in $M$ forms a
graph $H$ on the vertices $V$ such that every vertex meets
an even number of edges. Thus $H$ has an Eulerian cycle,
that is, a route that starts and ends at the same vertex
and traverses every edge of $H$ exactly once. The Eulerian cycle can be
transformed into a TSP tour by traversing the cycle and
skipping over any vertices that have already been visited.
The cost of the resulting tour is no more than the
cost of the cycle due to the triangle inequality. Thus,
the cost of the tour is no more than $\frac{3}{2}$ times the cost of
the optimal tour.
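The five steps just described (spanning tree, odd-degree vertices, matching, Eulerian cycle, shortcutting) can be sketched for points in the plane as below. The names `christofides` and `tour_length` are hypothetical, Euclidean distances are assumed, and the matching step is an exhaustive bitmask recursion used purely as a stand-in: it is exponential in the number of odd-degree vertices, whereas the polynomial-time bound requires a proper minimum-cost perfect matching algorithm.

```python
import math
from functools import lru_cache

def christofides(pts):
    """Return a tour (list of point indices) for points in the plane."""
    n = len(pts)
    if n == 0:
        return []

    def d(a, b):
        return math.dist(pts[a], pts[b])

    # Step 1: minimum-cost spanning tree T (Prim's algorithm).
    adj = [[] for _ in range(n)]                    # multigraph H, built in place
    link = {v: (d(0, v), 0) for v in range(1, n)}   # cheapest link into the tree
    while link:
        v = min(link, key=lambda u: link[u][0])
        _, parent = link.pop(v)
        adj[v].append(parent)
        adj[parent].append(v)
        for u in link:
            if d(v, u) < link[u][0]:
                link[u] = (d(v, u), v)

    # Step 2: vertices meeting an odd number of tree edges.
    odd = [v for v in range(n) if len(adj[v]) % 2 == 1]

    # Step 3: minimum-cost perfect matching M on the odd vertices.
    # Stand-in only: exhaustive recursion, exponential in len(odd).
    @lru_cache(maxsize=None)
    def match(mask):
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1         # lowest unmatched index
        rest = mask & ~(1 << i)
        best = (math.inf, ())
        for j in range(i + 1, len(odd)):
            if (rest >> j) & 1:
                cost, pairs = match(rest & ~(1 << j))
                cost += d(odd[i], odd[j])
                if cost < best[0]:
                    best = (cost, pairs + ((odd[i], odd[j]),))
        return best

    for a, b in match((1 << len(odd)) - 1)[1]:
        adj[a].append(b)
        adj[b].append(a)

    # Step 4: Eulerian cycle in H = T + M (Hierholzer's algorithm).
    stack, cycle = [0], []
    while stack:
        v = stack[-1]
        if adj[v]:
            u = adj[v].pop()
            adj[u].remove(v)
            stack.append(u)
        else:
            cycle.append(stack.pop())

    # Step 5: shortcut repeated vertices to obtain a tour.
    seen, tour = set(), []
    for v in cycle:
        if v not in seen:
            seen.add(v)
            tour.append(v)
    return tour

def tour_length(pts, tour):
    """Total Euclidean length of the closed tour."""
    return sum(math.dist(pts[tour[k]], pts[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))
```

On the four corners of a unit square, where the optimal tour has length 4, the guarantee says the returned tour can be no longer than 6.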

Christofides proved his $\frac{3}{2}$-approximation result in
1976. Again, despite considerable attention from the
research community, the factor of $\frac{3}{2}$ has not been
improved upon for the case of general symmetric costs
satisfying the triangle inequality. This is in contrast to
the Euclidean TSP, where cities correspond to points in
the plane and the travel cost between two cities is the
Euclidean distance between the points. For the Euclidean
TSP, it is known that for any $\alpha$ greater than 1
there exists a polynomial-time $\alpha$-approximation algorithm.
Such a result is not possible in the general
triangle-inequality case unless NP = P.

The $\frac{3}{2}$ result of Christofides applies only to symmetric
instances of the TSP. For asymmetric instances satisfying
the triangle inequality, there exists a randomized
polynomial-time algorithm that with high probability
produces a tour of cost no more than a factor proportional
to $\log n / \log\log n$ times the cost of an optimal
tour.

3 Exact Computations 

In her 1949 paper Robinson formulates the TSP as find- 
ing “the shortest route for a salesman starting from 
Washington, visiting all the state capitals and then 
returning to Washington.” The fact that the TSP is in 
the NP-hard class does not imply that exactly solving a 
particular instance of the problem, such as Robinson’s 
example, is an insurmountable task. Indeed, five years 
after Robinson’s paper, Dantzig, Fulkerson, and John- 
son initiated a computational approach to the problem 
by building a tour through 49 cities in the United States, 
together with a proof that their solution was the short- 
est possible. This line of work has continued over the 
years, including a 120-city solution through Germany in
1977, a 532-city U.S. tour in 1987, and a 24 978-city 
tour through Sweden in 1998. Each of these studies 
works with symmetric integer-valued travel costs that 
approximate either the road distances or the Euclidean 
distances between city locations. 

The dominant approach to the exact solution of
large-scale symmetric instances of the TSP is the
linear-programming [IV.11 §3.1] (LP) technique developed
by Dantzig et al. in their original 1954 paper.
For a TSP specified by a complete graph $G = (V, E)$,
an LP relaxation of the TSP instance can be formulated
using variables $(x_e : e \in E)$. A tour corresponds to the
LP solution obtained by setting $x_e = 1$ if the edge $e$ is
included in the tour, and setting $x_e = 0$ otherwise. The
Dantzig et al. subtour relaxation of the symmetric TSP
is the LP model

$$\text{minimize} \quad \sum (c_e x_e : e \in E), \tag{3}$$
$$\text{subject to} \quad \sum (x_e : v \text{ is an end of } e) = 2 \quad \forall v \in V, \tag{4}$$
$$\sum (x_e : e \text{ has exactly one end in } S) \ge 2 \quad \forall\, \emptyset \ne S \subsetneq V, \tag{5}$$
$$0 \le x_e \le 1 \quad \forall e \in E. \tag{6}$$

The equations (4) are called the degree equations and
the inequalities (5) are called the subtour-elimination
constraints. For any tour, the corresponding 0/1-vector
$x$ satisfies all the degree equations and subtour-elimination
constraints. Thus, the optimal LP value is
a lower bound on the cost of an optimal solution to the
TSP. If the travel costs satisfy the triangle inequality,
then the optimal TSP value is at most $\frac{3}{2}$ times the optimal
subtour value. The $\frac{4}{3}$ conjecture states that this
$\frac{3}{2}$ factor can be improved to $\frac{4}{3}$; its study is a focal
point of efforts to improve Christofides’s approximation
algorithm.

In computational studies, the subtour relaxation is 
solved by an iterative process called the cutting-plane 
method. To begin, an optimal LP solution vector x* is 
computed for the much smaller model having only the 
degree equations as constraints. The method iteratively 
improves the relaxation by adding to the model individ- 
ual subtour-elimination constraints that are violated by 
x* and then computing the optimal LP solution again. 
When x* satisfies all subtour-elimination constraints, 
then it is an optimal solution to the subtour relaxation. 
The success of the method relies in practice on the fact 
that typically only a very small number of the exponen- 
tially many subtour-elimination constraints need to be 
represented explicitly in the LP relaxation. 
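The step of finding violated subtour-elimination constraints can be illustrated in its simplest form. The sketch below is an assumption-laden simplification rather than the procedure used in the computational studies: it only detects the case in which the support graph of $x^*$ (the edges with positive LP value) is disconnected, in which case every component $S$ has zero weight crossing it and so violates (5). Finding violated constraints in a connected support graph requires a minimum-cut computation, which is not shown. The function names and the dictionary representation of $x^*$ are hypothetical.

```python
def disconnected_subtours(n, x, eps=1e-9):
    """Return the vertex sets of the components of the support graph of x
    when there is more than one component; otherwise return [].

    x maps an edge (u, v) to its LP value. Each returned set S has no
    x-weight crossing it, so constraint (5) is violated for S (0 < 2).
    """
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    for (u, v), value in x.items():
        if value > eps:
            parent[find(u)] = find(v)       # merge the two components

    components = {}
    for v in range(n):
        components.setdefault(find(v), set()).add(v)
    return [] if len(components) == 1 else list(components.values())

def crossing_weight(x, S):
    """Left-hand side of (5): total x-value on edges with exactly one end in S."""
    return sum(value for (u, v), value in x.items() if (u in S) != (v in S))
```

In a cutting-plane loop, each returned set would contribute one inequality of type (5) to the LP model before the model is solved again.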

To improve the lower bound provided by the subtour 
relaxation, the exact-solution process continues by con- 
sidering further classes of linear inequalities that are 
satisfied by all tours, again adding individual inequali- 
ties to the LP model in an iterative fashion. If the lower 
bound provided by the LP model is sufficiently strong 
for a test instance, then the TSP can be solved via a 
branch-and-bound search using the LP model as the 
bounding mechanism. The overall process is called the 
branch-and-cut method; it was developed for the TSP,
and it has been successfully applied to many other 
problem classes in discrete optimization. 

4 Heuristic Solution Methods 

In many applied settings, a near-optimal solution to 
the TSP is satisfactory, that is, a tour that has cost 
close to that of an optimal tour but is not necessar- 
ily the cheapest possible route. The search for heuris- 
tic methods that perform well for such applications is 
the most actively studied topic in TSP research, and
this activity has helped to create and improve many of 
the best-known schemes for general heuristic search, 
such as simulated annealing, genetic algorithms, and 
local-improvement methods. This line of research dif- 
fers from that of approximation algorithms in that 
the methods do not come with any strong worst-case 
guarantee on the cost of the discovered tour, but the 
performance of typical examples can be much better 
than results produced by the known $\alpha$-approximation
techniques. 

Heuristic methods range from extremely fast algo- 
rithms based on space-filling curves through sophisti- 
cated combinations of local-improvement and genetic- 
algorithm techniques. The choice of method in a practi- 
cal computation involves a trade-off between hoped-for 
tour quality and anticipated running time. The fastest 
techniques construct a tour step by step, typically in a 
greedy fashion, such as the nearest-neighbor method, 
where at each step the next city to be added to the 
route is the one that can be reached by the least travel 
cost from the current city. The higher-quality heuristic 
algorithms create a sequence of tours, seeking modifications
that can lower an existing tour’s total cost. The
simplest of these tour-improvement methods, for symmetric
costs, is the 2-opt algorithm, which repeatedly
searches for a pair of links in the tour that can be
replaced by a cheaper pair reconnecting the resulting
paths. The 2-opt algorithm serves as a building block
for many of the best-performing heuristic methods,
such as the Lin-Kernighan $k$-opt algorithm.
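The nearest-neighbor construction and the 2-opt improvement step can be sketched as follows for points in the plane. The function names are hypothetical and Euclidean distances are assumed; the 2-opt loop shown is the plain version that rescans all pairs of links until no improving exchange remains, without the neighbor lists and other refinements that practical codes rely on.

```python
import math

def tour_length(pts, tour):
    """Total Euclidean length of the closed tour."""
    return sum(math.dist(pts[tour[k]], pts[tour[(k + 1) % len(tour)]])
               for k in range(len(tour)))

def nearest_neighbor(pts, start=0):
    """Greedy construction: always move to the closest unvisited city."""
    unvisited = set(range(len(pts))) - {start}
    tour = [start]
    while unvisited:
        current = tour[-1]
        nxt = min(unvisited, key=lambda v: math.dist(pts[current], pts[v]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

def two_opt(pts, tour):
    """Replace pairs of links by cheaper pairs until none can be found."""
    tour = tour[:]
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue                      # these two links share a city
                a, b = tour[i], tour[i + 1]
                c, e = tour[j], tour[(j + 1) % n]
                delta = (math.dist(pts[a], pts[c]) + math.dist(pts[b], pts[e])
                         - math.dist(pts[a], pts[b]) - math.dist(pts[c], pts[e]))
                if delta < -1e-12:
                    # Reconnect by reversing the path between the two links.
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour
```

A typical use is `two_opt(pts, nearest_neighbor(pts))`, which trades extra running time for a shorter tour than the greedy construction alone.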

In a genetic algorithm for the TSP, an initial collec- 
tion of tours is generated, say, by repeatedly applying 
the nearest-neighbor algorithm with random starting 
cities. In each iteration, some pairs of members of the 
collection are chosen to mate and produce child tours. 
A new collection of tours is then selected from the old 
population and the children. The process is repeated a 
large number of times and the best tour in the collec- 
tion is chosen as the winner. In the mating step, sub- 
paths from the parent tours are combined in some fash- 
ion to produce the child tours. In the best-performing 
variants, tour-improvement methods are used to lower 
the costs of the child tours that are produced by the 
mating process. 

Further Reading 

Applegate, D., R. Bixby, V. Chvátal, and W. Cook. 2006.
The Traveling Salesman Problem: A Computational Study. 
Princeton, NJ: Princeton University Press. 

Cook, W. 2012. In Pursuit of the Traveling Salesman: Mathe- 
matics at the Limits of Computation. Princeton, NJ: Prince- 
ton University Press. 





1 Introduction 

Aircraft noise is a subject of intense public interest, and 
it is rarely out of the news for long. It became especially 
important early in the 1950s with the growth in the use 
of jet engines and again in the 1960s with the develop- 
ment of the Anglo-French supersonic aircraft Concorde 
(now retired). Aircraft noise is currently a major part of 
the public debate about where to build new airports and 
whether they should be built at all. 

The general public might be surprised to know that 
our understanding of the principles of aircraft noise, 
and of the main methods that are available to control 
it, has come largely from mathematicians. In particular,
Sir James Lighthill, one of the great mathematical
scientists of the twentieth century, created a new scien- 
tific discipline, called aerodynamic sound generation, 
or aeroacoustics, when he published the first account of 
how a jet generates sound in 1952. Prior to this theory, 
which he created by mathematical analysis of the equa- 
tions of fluid motion, no means were available to esti- 
mate the acoustic power from a jet even to within a 
factor of a million. 

Moreover, Lighthill’s theory had an immediate prac- 
tical consequence. The resulting eighth-power scaling 
law for sound generation as a function of Mach num- 
ber (flow speed divided by sound speed) meant that 
the amount of thrust taken from the jet would have to 
be strictly controlled; this was achieved by the devel- 
opment of high bypass-ratio turbofan aeroengines, in 
which a high proportion of the thrust is generated by 
the large fan at the front of the engine. Such a fan 
produces a relatively small increase in the velocity of 
a large volume of air, i.e., in the bypass air flow sur- 
rounding the jet, in contrast to the jet itself, which pro- 
duces a large increase in the velocity of a small volume 
of air. The engineering consequence of a mathematical
theory is therefore plain to see whenever a passenger 
ascends the steps to board a large aircraft and admires 
the elegant multibladed fan so clearly on display. 


2 Lighthill’s Acoustic Analogy 


The idea behind Lighthill’s theory is that the equations
of fluid dynamics may be written exactly like the wave 
equation of acoustics, with certain terms (invariably 
written on the right-hand side) regarded as acoustic 
sources. Solution of this equation gives the sound field 
generated by the fluid flow. A key feature of the method 
is that the source terms may be estimated using the 
simpler aerodynamic theory of flow in which acoustic 
waves have been filtered out. The simpler theory, which 
includes the theory of incompressible flow as a limit- 
ing case, had become very highly developed during the 
course of the two world wars and in the period there- 
after and was therefore available for immediate use in 
Lighthill’s theory of sound generation.

This innocuous-sounding description of the method 
hides many subtleties, which are in no way alleviated 
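by the fact that Lighthill’s equation is exact. But let us
first give the equation in its standard form and define
its terms. The equation is

$$\frac{\partial^2 \rho'}{\partial t^2} - c_0^2 \nabla^2 \rho' = \frac{\partial^2 T_{ij}}{\partial x_i \partial x_j}.$$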

Here, $t$ is time and $\nabla^2 = \partial^2/\partial x_1^2 + \partial^2/\partial x_2^2 + \partial^2/\partial x_3^2$ is the
Laplacian operator, where $(x_1, x_2, x_3)$ is the position
vector in Cartesian coordinates. Away from the source
region, the fluid (taken to be air) is assumed to have
uniform properties, notably sound speed $c_0$ and density
$\rho_0$. At arbitrary position and time, the fluid density
is $\rho$, and the deviation from the uniform surrounding
value is $\rho' = \rho - \rho_0$, the density perturbation. The left-hand
side of Lighthill’s equation is therefore simply the
wave operator, for a uniform sound speed $c_0$, acting on
the density perturbation produced throughout the fluid
by a localized jet.

The right-hand side of the equation employs the
summation convention for indices (here over $i, j = 1, 2, 3$)
and so is the sum of nine terms in the second
derivatives of $T_{11}, T_{12}, \ldots, T_{33}$. Collectively, these nine
quantities $T_{ij}$ form the Lighthill stress tensor, defined
by

$$T_{ij} = \rho u_i u_j + (p' - c_0^2 \rho')\delta_{ij} - \tau_{ij}.$$

Here, $(u_1, u_2, u_3)$ is the velocity vector, $p' = p - p_0$ is
the pressure perturbation, defined analogously to $\rho'$,
and $\tau_{ij}$ is the viscous stress tensor, representing the
forces due to fluid friction. Because of the symmetry
of $T_{ij}$ with respect to $i$ and $j$, there are only six independent
components, of which the three with $i = j$ are
longitudinal and the three with $i \ne j$ are lateral.

The solution of Lighthill’s equation, together with the
acoustic relation $p' = c_0^2 \rho'$ that applies in the radiated
sound field (though not in the source region), gives the
outgoing sound field $p'(x, t)$ in the form

$$p'(x, t) = \frac{1}{4\pi} \frac{\partial^2}{\partial x_i \partial x_j} \int_V \frac{T_{ij}(y, t - |x - y|/c_0)}{|x - y|}\, \mathrm{d}V.$$

Here, $x$ is the observer position, $y = (y_1, y_2, y_3)$ is an
arbitrary source position, and $|x - y|$ is the distance
between them. The integration is over the source region
$V$, and the volume element is $\mathrm{d}V = \mathrm{d}y_1\, \mathrm{d}y_2\, \mathrm{d}y_3$. The
retarded time $t - |x - y|/c_0$ allows for the time it takes
a signal traveling at speed $c_0$ to traverse the distance
from $y$ to $x$.

Lighthill’s equation is referred to as an acoustic anal- 
ogy because the uniform “acoustic fluid,” in which 
acoustic perturbations are assumed to be propagating, 
is entirely fictional in the source region! For example, 
the speed of sound there is not uniform but varies 
strongly with position in the steep temperature gra- 
dients of an aircraft jet; and the fluid velocity itself, 
which contributes to the propagation velocity of any 
physical quantity such as a density perturbation, is 
conspicuously absent from a wave operator that contains
only $c_0$. Nevertheless, Lighthill’s power as a mathematical
modeler enabled him to see that the acoustic
analogy, with $T_{ij}$ estimated from nonacoustic aerodynamic
theory, would provide accurate predictions of jet
noise in many important operating conditions, e.g., in 
subsonic jets at Mach numbers that are not too high. 

The increasing difficulty in applying Lighthill's theory 
at higher Mach numbers derives from the fact that an 
aircraft jet is turbulent [V.21] and contains swirling 
eddies of all sizes that move irregularly and unpre- 
dictably throughout the flow. The higher the Mach 
number, the more that must be known about the
individual eddies or their statistical properties in order
to take account of the delicate cancelation of the 
sound produced by neighboring eddies. As the Mach 
number increases, small differences in retarded times 
become increasingly important in determining the pre- 
cise amount of this cancelation. Lighthill's equation 
therefore provides an impetus to develop ever more 
sophisticated theories of fluid turbulence for modeling 
the stress tensor $T_{ij}$ to greater accuracy.

2.1 Quadrupoles 

A noteworthy feature of Lighthill’s equation is that 
the source terms occur as second spatial derivatives 
of $T_{ij}$. This implies that the sound field produced by
an individual source has a four-lobed cloverleaf pat- 
tern, referred to as a quadrupole directivity. A crucial 
aspect of Lighthill's theory of aerodynamic sound gen- 
eration is, therefore, that the turbulent fluid motion 
creating the sound field is regarded as a continuous 
superposition of quadrupole sources of sound. Phys- 
ically, this is the correct point of view because away 
from boundaries the fluid has “nothing to push against 
except itself”; that is, since the pressure inside a fluid 
produces equal and opposite forces on neighboring ele- 
ments of fluid, the total dipole strength, arising from 
the sum of the acoustic effects of these internal forces, 
must be identically zero. 

Mathematically, the four-lobed directivity pattern
arises from the expansion of $p'$ in powers of $1/|x|$ for
large $|x|$. The dominant term, proportional to $1/|x|$,
gives the radiated sound field; its coefficient contains
angular factors $x_i x_j/|x|^2$ corresponding to the derivatives
$\partial^2/\partial x_i \partial x_j$ acting on the integral containing $T_{ij}$.
For example, in spherical polar coordinates $(r, \theta, \phi)$
with $r = |x|$ and

$$(x_1, x_2, x_3) = (r \sin\theta \cos\phi,\ r \sin\theta \sin\phi,\ r \cos\theta),$$

the lateral $(x_1, x_2)$ quadrupole has the directivity pattern

$$\frac{x_1 x_2}{|x|^2} = \tfrac{1}{2} \sin^2\theta \sin 2\phi.$$

This pattern has lobes centered on the meridional half-planes
$\phi = \pi/4$, $3\pi/4$, $5\pi/4$, and $7\pi/4$, separated by
the half-planes $\phi = 0$, $\pi/2$, $\pi$, and $3\pi/2$, in which there
is no sound radiation.
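The lateral-quadrupole directivity can be checked numerically. The short sketch below, with hypothetical function names, evaluates the angular factor $x_1 x_2/|x|^2$ directly from the spherical polar coordinates on the unit sphere and alongside it the closed form $\tfrac{1}{2}\sin^2\theta \sin 2\phi$, so that the lobes and the silent half-planes can be inspected.

```python
import math

def lateral_quadrupole_directivity(theta, phi):
    """Angular factor x1*x2/|x|^2, evaluated on the unit sphere (r = 1)."""
    x1 = math.sin(theta) * math.cos(phi)
    x2 = math.sin(theta) * math.sin(phi)
    return x1 * x2

def closed_form(theta, phi):
    """The same factor written as (1/2) sin^2(theta) sin(2 phi)."""
    return 0.5 * math.sin(theta) ** 2 * math.sin(2.0 * phi)
```

In the plane $\theta = \pi/2$, the factor vanishes at $\phi = 0$ and $\phi = \pi/2$ and alternates in sign between adjacent lobes, which is the four-lobed cloverleaf pattern.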

2.2 The Eighth-Power Law 

Lighthill’s theory predicts that the power generated 
by a subsonic jet is proportional to the eighth power 
of the Mach number. This follows from a beautifully
simple scaling argument based on only the equations 
given above, and we will now present this argument in 
full, at the same time giving the scaling laws for the 
important physical quantities in the subsonic jet noise 
problem. 

Assume that a jet emerges with speed $U$ from a nozzle
of diameter $a$ into a fluid in which the ambient density
is $\rho_0$ and the speed of sound is $c_0$. The Mach number
of the jet is therefore $M = U/c_0$, and the timescale
of turbulent fluctuations in the jet is $a/U$. The frequency
therefore scales with $U/a$, and the wavelength
$\lambda$ of the radiated sound scales with $c_0 a/U$, i.e., $a/M$. We
are considering the subsonic regime $M < 1$, so $a < \lambda$,
and the source region is compact.

The Lighthill stress tensor $T_{ij}$ is dominated by the
terms $\rho u_i u_j$, which scale with $\rho U^2$ (with $\rho$ comparable
to $\rho_0$), and the integration
over the source region gives a factor $a^3$. In the
far field, the term $|x - y|$ appearing in the denominator
of the integrand may be approximated by $|x|$,
and the spatial derivatives applied to the integral introduce
a factor $1/\lambda^2$. The far-field acoustic quantities
are the perturbation pressure, density, and velocity,
i.e., $p'$, $\rho'$, and $u'$, which satisfy the acoustic relations
$p' = c_0^2 \rho' = \rho_0 c_0 u'$. Putting all this together gives the
basic far-field scaling law

$$\frac{p'}{\rho_0 c_0^2} = \frac{\rho'}{\rho_0} = \frac{u'}{c_0} \sim M^4 \frac{a}{|x|}.$$

In a sound wave, the rate of energy flow, measured
per unit area per unit time, is proportional to $p'u'$.
Since a sphere of radius $|x|$ has area proportional to
$|x|^2$, it follows that the acoustic power $W$ of the jet,
i.e., the total acoustic energy that it radiates per unit
time in all directions, obeys the scaling law

$$\frac{W}{\rho_0 c_0^3 a^2} \sim M^8.$$

This is the famous eighth-power law for subsonic jet
noise. It would be hard to find a more striking example
of the elegance and usefulness of the best applied
mathematics.
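To get a feel for what an eighth power means in practice, the scaling law can be turned into a change in decibels. The sketch below assumes that only the Mach number changes, with nozzle diameter and ambient conditions held fixed, and that the law holds exactly; the function name is hypothetical and the result says nothing about absolute noise levels.

```python
import math

def acoustic_power_change_db(speed_ratio, exponent=8):
    """Change in radiated acoustic power, in decibels, when the jet Mach
    number is multiplied by speed_ratio (nozzle and ambient fluid fixed)."""
    return 10.0 * exponent * math.log10(speed_ratio)
```

Halving the jet speed divides the radiated power by $2^8 = 256$, a reduction of about 24 dB, and doubling it has the opposite effect.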

The significance of the high exponent in the scaling
law is that at very low Mach numbers a jet is ineffi- 
cient at producing sound; but this inefficiency does not 
last when the Mach number is increased. This is why 
jet noise became such a problem in the early 1950s, 
as aircraft engines for passenger flight became more 
powerful. Unfortunately, there is no way to evade a fundamental
scaling law imposed by the laws of mechanics,
and it is impossible to make a very-high-speed
jet quiet. This was the original impetus to develop
high-bypass-ratio turbofan aeroengines to replace jet
engines.

3 Further Acoustic Analogies 

The acoustic analogy has been extended in many ways.
For example, the Ffowcs Williams-Hawkings equation
accounts for boundary effects by means of a vector $J_i$
and a tensor $L_{ij}$ to represent mass and momentum flux
through a surface, which may be moving. These terms
are given by

$$J_i = \rho(u_i - v_i) + \rho_0 v_i$$

and

$$L_{ij} = \rho u_i (u_j - v_j) + (p - p_0)\delta_{ij} - \tau_{ij},$$

where $(v_1, v_2, v_3)$ is the velocity of the surface and
$\tau_{ij}$ is the viscous stress tensor. Derivatives of $J_i$ and
$L_{ij}$, with suitable delta functions included to localize
them on the surface, are then used as source terms
for the acoustic wave equation. The Ffowcs Williams-Hawkings
equation has been of enduring importance
in the study of aircraft noise and is widely used in
computational aeroacoustics.

Variations of Lighthill’s equation can be obtained 
by redistributing terms between the right-hand side, 
where they are regarded as sources, and the left-hand 
side, where they are regarded as contributing to the 
wave propagation operator; different choices of the 
variable (or the combination of variables) on which 
the wave operator acts are also possible. Although 
the equations obtained in this way are always exact, 
the question of how useful they are, and whether the 
physical interpretations they embody are “real,” has 
at times been contentious and led to interminable 
discussion! Among the widely accepted equations are 
those of Lilley and Goldstein, which place the convec- 
tive effect of the jet mean flow in the wave operator 
and modify the source terms on the right-hand side 
accordingly. Morfey’s acoustic analogy is also useful; 
this provides source terms for sound generation by 
unsteady dissipation and nonuniformities in density, 
such as occur in the turbulent mixing of hot and cold 
fluids. 

4 Howe’s Vortex Sound Equation 

Since the vorticity of a flow field is so important in 
generating sound, both in a turbulent jet and in other 
flows, it is desirable to have available an equation that 
contains it explicitly as a source, rather than implicitly,
as in the Lighthill stress tensor. The vorticity $\omega$, defined
as the curl of the velocity field, is a measure of the local
rotation rate of a flow. In vector notation, $\omega = \nabla \wedge u$,
where $\wedge$ indicates the cross product; in components,

$$\omega = \left( \frac{\partial u_3}{\partial x_2} - \frac{\partial u_2}{\partial x_3},\ \frac{\partial u_1}{\partial x_3} - \frac{\partial u_3}{\partial x_1},\ \frac{\partial u_2}{\partial x_1} - \frac{\partial u_1}{\partial x_2} \right).$$

An extremely useful equation with vorticity as a
source is Howe’s vortex sound equation, in which the
independent acoustic variable is the total enthalpy
defined by

$$B = \int \frac{\mathrm{d}p}{\rho} + \tfrac{1}{2}|u|^2.$$

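The symbol $B$ is used to stand for Bernoulli because
this variable appears in Bernoulli’s equation. Howe’s
vortex sound equation is a far-reaching generalization
of Bernoulli’s equation, and is

$$\left( \frac{\mathrm{D}}{\mathrm{D}t}\left( \frac{1}{c^2}\frac{\mathrm{D}}{\mathrm{D}t} \right) - \frac{1}{\rho}\nabla \cdot (\rho \nabla) \right) B = \frac{1}{\rho}\nabla \cdot (\rho\, \omega \wedge u),$$

where $c$ is the speed of sound and $\mathrm{D}/\mathrm{D}t = \partial/\partial t + u \cdot \nabla$
is the convective derivative. In the limit of low Mach
number, this simplifies to

$$\left( \frac{1}{c_0^2}\frac{\partial^2}{\partial t^2} - \nabla^2 \right) B = \nabla \cdot (\omega \wedge u),$$

and the pressure in the radiated sound field is $p = \rho_0 B$.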

Howe’s equation is used to calculate the sound pro- 
duced by the interaction of vorticity with any type of 
surface. Examples include aircraft fan blades interact- 
ing with turbulent inflows or with the vortices shed 
by any structure upstream of the blades. An advan- 
tage of Howe’s equation is that it provides analytical 
solutions to a number of realistic problems, and hence 
gives the dependence of the sound field on the design 
parameters of an aircraft. 

5 Current Research in Aircraft Noise 

Aircraft noise is such an important factor in limiting 
the future growth of aviation that research is carried 
out worldwide to limit its generation. Jet noise is dom- 
inant at takeoff, fan and engine noise are dominant in 
flight, and airframe noise is dominant at landing, so 
there is plenty for researchers to do. Sources of engine 
noise are combustion and turbomachinery, and sources 
of airframe noise are flaps, slats, wing tips, nose gear, 
and main landing gear. The propagation of sound once 
it has been generated is also an important area of air- 
craft noise research, especially the propagation of sonic 
boom. 


A new subject, computational aeroacoustics, was cre- 
ated thirty years ago, and this is now the dominant 
tool in aircraft noise research. The reader might won- 
der why its elder sibling, computational fluid dynamics, 
is not adequate for the task of predicting aircraft noise 
simply by “adding a little bit of compressibility” to an 
established code. The answer lies in one of the most 
pervasive ideas in mathematics, that of an invariant— 
here, the constant in Bernoulli’s theorem. In an incom- 
pressible flow, a consequence of Bernoulli’s theorem 
is that pressure fluctuations are entirely balanced by 
corresponding changes in velocity. Specialized com- 
putational techniques must therefore be developed to 
account for the minute changes in the Bernoulli “con- 
stant” produced by fluid compressibility and satisfying 
the wave equation. Computational aeroacoustics takes 
full account of such matters, which are responsible for 
the fact that only a minute proportion of near-field 
energy (the province of computational fluid dynamics) 
propagates as sound. 

Research on aircraft noise takes place in the world’s 
universities, companies, and government research es- 
tablishments, and all of it relies heavily on mathemat- 
ics. It is remarkable how many of the most practi- 
cal contributions to the subject are made by individ- 
uals who are mathematicians; no fewer than six of the 
authors in the further reading section below took their 
first degree in mathematics (many with consummate 
distinction) and have used mathematics throughout 
their careers. 

VII.2 A Hybrid Symbolic-Numeric
Approach to Geometry 
Processing and Modeling 

Thomas A. Grandine 


1 Background 

Numerical computations for scientific and engineering 
purposes were the very first applications of the elec- 
tronic computer. Originally built to calculate artillery 
firing tables, the first use of ENIAC (Electronic Numer- 
ical Integrator and Computer) was performing calcu- 
lations for the hydrogen bomb being developed at 
Los Alamos. Indeed, all of the applications of early 
computers were numerical applications. After the art 
of computer programming had matured considerably, 
Macsyma at MIT introduced the notion of symbolic 
computing to the world. Unlike numerical comput- 
ing, symbolic computing solves problems by producing 
analytic expressions in closed form. 

More recently, tools have been developed to combine 
these two approaches in powerful and intriguing ways, 
Chebfun and Sage being prime examples. At Boeing, my 
colleagues and I have also been pursuing this idea for 
some time, and we continue to discover new ways of 
leveraging this technology to solve problems. 

Geometric models are required throughout Boeing 
for many reasons: digital preassembly, engineering 
analysis, direct manufacturing, supply chain manage- 
ment, cost estimation, and myriad other applications. 
Depending on the application, models may be very simple
(see figure 1) or very complex (see figure 2). Regardless
of complexity, many computations need to be
performed with these models. Properly conveying the 
algebraic formulation of these geometry problems so 
that the computations can be performed is not always 
straightforward. 

Boeing has software that takes advantage of the com- 
bined strength of numeric and symbolic methods. It 
is named Geoduck (pronounced GOO-ee-duck) after a 
large clam indigenous to the Pacific Northwest in the 
United States. Like Sage, it is based on Python [VII.11]
programming. The tool enables construction of geom- 
etry models at arbitrary levels of detail and rapid query 
and processing of those models by most downstream 
applications. 

Figure 1 A simple two-dimensional airplane wing planform model.

Figure 2 A detailed model of the Boeing 777.

Like most commercially available geometric modeling
tools, Geoduck represents geometric entities as
mathematical maps from a rectangular domain (parameter
space) to a range (model space). For example, the
cylinder shown in figure 3 depicts the range of the map

$$S(u, v) = \begin{pmatrix} \cos 2\pi u \\ \sin 2\pi u \\ 4v \end{pmatrix}.$$


Geoduck uses tensor product splines [IV.9 §2.3] as the 
primary functional form. The most important reasons 
for this are that they have a convenient basis for com- 
putation (B-splines), they live in nestable linear function 
spaces, and they are made up of polynomial pieces that 
can be integrated and differentiated symbolically. 


2 Some Simple Examples 

Figure 3 A simple parametric surface map.

Given a planar curve represented in Geoduck as a parametric
spline map $c$ from $[0, 1]$ to two-dimensional
model space, one frequent problem is finding the closest
and farthest points from a given point $p$, specifically
to determine the extreme points of

$$\tfrac{1}{2}(c(u) - p) \cdot (c(u) - p).$$

Like Chebfun and Sage, Geoduck permits functions to
be assigned to variables, and this makes calculations
with spline functions easy to code and understand
in Geoduck. After accounting for the endpoints, the
extreme points are found by using the following Geoduck
code to differentiate the expression and set the
resulting expression to zero:

cder = c.Differentiate()
der = Dot(cder, c - p)
expts = der.Zeros()

Arithmetic operators have been overloaded in Geoduck 
to work on splines, and this is done in the assignment 
to der. Since splines are piecewise polynomial func- 
tions, any linear combination of them is also a spline 
and can be represented as a B-spline series, and that is 
how Geoduck implements overloaded operators. 

Since c is a spline function, so is its derivative, and 
the Differentiate method performs the expected 
function. Similarly, the Dot function forms the spline 
function that is the dot product of cder and c - p. 
Note that the product of two splines is also a spline. 

All of the operations up to this point have been car- 
ried out analytically. The Zeros method, however, is a 
numerical method because it is designed to work on 
piecewise polynomial functions of arbitrary degree. In 
Geoduck, the method is a straightforward implemen- 
tation of a 1989 algorithm of mine, and it finds all of 
the real zeros of the spline function. The results of this 
code are depicted in figure 4. Here, the curve c is shown 
along with line segments drawn from the point p to the 
calculated points on the curve c. The segments corre- 
sponding to local maxima are shown as dashed lines, 
while the local minima are shown as dotted lines. 
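The same recipe can be tried outside Geoduck. The following Python sketch is only an illustration under stated assumptions: the curve is a single polynomial piece c(u) = (u, u^2) on [0, 1] rather than a spline, the point is p = (0, 1), and the helper names (`poly_eval`, `differentiate`, `dot`, `zeros`) are hypothetical stand-ins, not Geoduck functions. The zero finder is a plain grid scan with bisection, not the algorithm used by Geoduck, and it only finds zeros that show up as sign changes or exact grid hits.

```python
# Polynomials are coefficient lists, lowest degree first:
# [a0, a1, a2] stands for a0 + a1*u + a2*u**2.

def poly_eval(a, u):
    """Evaluate a polynomial at u by Horner's rule."""
    result = 0.0
    for coeff in reversed(a):
        result = result * u + coeff
    return result

def poly_add(a, b):
    n = max(len(a), len(b))
    return [(a[k] if k < len(a) else 0.0) + (b[k] if k < len(b) else 0.0)
            for k in range(n)]

def poly_mul(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def differentiate(curve):
    """Componentwise derivative of a vector-valued polynomial."""
    return [[k * comp[k] for k in range(1, len(comp))] or [0.0]
            for comp in curve]

def dot(f, g):
    """Dot product of two vector-valued polynomials (a scalar polynomial)."""
    total = [0.0]
    for fk, gk in zip(f, g):
        total = poly_add(total, poly_mul(fk, gk))
    return total

def zeros(a, lo=0.0, hi=1.0, samples=1000, tol=1e-12):
    """Zeros on [lo, hi] detected as exact grid hits or sign changes."""
    found = []
    grid = [lo + (hi - lo) * k / samples for k in range(samples + 1)]
    for left, right in zip(grid, grid[1:]):
        f_left, f_right = poly_eval(a, left), poly_eval(a, right)
        if f_left == 0.0:
            found.append(left)
        elif f_left * f_right < 0.0:
            while right - left > tol:          # plain bisection
                mid = 0.5 * (left + right)
                if poly_eval(a, left) * poly_eval(a, mid) <= 0.0:
                    right = mid
                else:
                    left = mid
            found.append(0.5 * (left + right))
    if poly_eval(a, hi) == 0.0:
        found.append(hi)
    return found

c = [[0.0, 1.0], [0.0, 0.0, 1.0]]        # assumed curve c(u) = (u, u**2)
p = (0.0, 1.0)                           # assumed point
c_minus_p = [poly_add(comp, [-pk]) for comp, pk in zip(c, p)]
cder = differentiate(c)
der = dot(cder, c_minus_p)               # c'(u) . (c(u) - p) = 2u**3 - u
expts = zeros(der)
```

For this choice the derivative expression is 2u^3 - u, so the candidates are u = 0 (an endpoint) and u = 1/sqrt(2), the closest point to p.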

Of course, the basic recipe of performing a derivation
for a geometric calculation of interest and then encoding
that derivation into a Geoduck script that combines
symbolic and numeric calculations can be followed
over and over again to solve a very rich collection of
interesting problems.

Figure 4 The local extrema of the
distance from a point to a curve.

As a second example, consider the problem of finding
common tangents to a pair of curves c and d. As before,
both curves will be represented as vector-valued spline
functions in two dimensions. If points on c are given
by c(u) and points on d are given by d(v), then a common
tangent will be the line segment between points
c(u) and d(v) that has the property that the vector
c(u) - d(v) is parallel to the tangents of both curves,
given by c'(u) and d'(v). The (u, v) pairs that satisfy
the following system of equations correspond to the
endpoints of the desired tangent lines:

$$\begin{aligned}
c'(u) \times (c(u) - d(v)) &= 0, \\
d'(v) \times (c(u) - d(v)) &= 0,
\end{aligned}$$

where $\times$ is the scalar-valued cross product in $\mathbb{R}^2$, i.e.,
the determinant of the $2 \times 2$ matrix whose columns are
the operands.

Consider the following Geoduck code: 

cminusd = TensorProduct(c, d)
cder = cminusd.Differentiate(0)
dder = cminusd.Differentiate(1)
f1 = Cross(cder, cminusd)
f2 = Cross(dder, cminusd)
tanpts = Zeros([f1, f2])

This example introduces the TensorProduct function,
which takes a pair of spline functions as inputs and
returns a new spline function defined over the tensor
product of the domains of the original functions. In this
case, both c and d are defined over the interval [0,1], so
the new spline function cminusd is defined over $[0,1]^2$.

Figure 5 Calculated common tangents to a pair of curves.

The Differentiate method is similar to before, 
except now it is applied to functions of more than one 
variable, so it will return the partial derivative with 
respect to the variable indicated by the index passed 
in. The Cross function here is similar to Dot in the first 
example, and it also requires the two input splines to 
be defined over the same domain, which in this case
is $[0,1]^2$.

Finally, the Zeros function solves the given spline 
system of equations. The curves c and d are shown in 
figure 5 along with the four computed tangent lines. 
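The two tangency conditions can also be explored with a small Python sketch. It is not Geoduck code and rests on several assumptions: the curves are two hypothetical parabolas, c(u) = (u, u^2) and d(v) = (v, (v - 2)^2), whose common tangent is the line y = 0, and the system is solved by a plain Newton iteration from a nearby starting guess. Unlike the Zeros function described above, Newton's method is local and returns only one solution.

```python
def cross(a, b):
    """Scalar cross product in R^2: determinant of the matrix [a b]."""
    return a[0] * b[1] - a[1] * b[0]

# Hypothetical curves and their first and second derivatives.
def c(u):   return (u, u * u)
def dc(u):  return (1.0, 2.0 * u)
def ddc(u): return (0.0, 2.0)
def d(v):   return (v, (v - 2.0) ** 2)
def dd(v):  return (1.0, 2.0 * (v - 2.0))
def ddd(v): return (0.0, 2.0)

def common_tangent(u, v, iterations=20):
    """Newton iteration on f1 = c' x (c - d) = 0, f2 = d' x (c - d) = 0."""
    for _ in range(iterations):
        diff = (c(u)[0] - d(v)[0], c(u)[1] - d(v)[1])
        f1 = cross(dc(u), diff)
        f2 = cross(dd(v), diff)
        # Jacobian entries; terms such as c' x c' vanish identically.
        j11 = cross(ddc(u), diff)       # d f1 / du
        j12 = -cross(dc(u), dd(v))      # d f1 / dv
        j21 = cross(dd(v), dc(u))       # d f2 / du
        j22 = cross(ddd(v), diff)       # d f2 / dv
        det = j11 * j22 - j12 * j21
        u -= (f1 * j22 - f2 * j12) / det
        v -= (j11 * f2 - j21 * f1) / det
    return u, v

u_star, v_star = common_tangent(0.2, 1.8)
```

Starting from (0.2, 1.8), the iteration converges to (u, v) = (0, 2), that is, to the tangent points (0, 0) and (2, 0) on the line y = 0.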

3 Industrial Examples 

Although Geoduck excels at solving simple geometric-analysis
problems as just illustrated, it also solves
real-life geometric-modeling and geometry-processing
problems faced by Boeing. Consider the wing depicted
in figure 6. Following standard practice, a surface w
is constructed that models the wing as a tensor product
spline. Usually, the points w(u, v) on the wing are
arranged so that the variable v increases with wingspan
and the functions $w(\cdot, v)$ for each fixed value of v
describe individual airfoils. Each such airfoil has the
property that w(0, v) = w(1, v) is the point on the
trailing edge, and as u increases, the point on the airfoil
traverses first the lower part of the airfoil and then
the upper part.

Figure 6 A wing model with leading edge curve shown.

One practical problem that needs to be solved as
a preprocessing step for many aerodynamic analysis
simulations is to determine the leading edge of the
wing. For each airfoil section, the point on the leading
edge is characterized by being the point that is farthest
from the trailing edge. Thus, for each v it can be
determined by maximizing
$(w(u,v) - w(0,v)) \cdot (w(u,v) - w(0,v))$.
Since v is (temporarily) fixed, this problem
can be solved by differentiating with respect to u and
setting the resulting expression equal to 0:

$$w_u(u,v) \cdot (w(u,v) - w(0,v)) = 0.$$

Since this equation can be solved for u for every value
of v, that leads to an entire one-parameter family of
solutions of the form w(u(v), v) for the implicit function
u given by the equation. The resulting leading edge
curve is shown in figure 6, and it is generated with the
following Geoduck code:

wO = w.Trim([[0.0, 0.0], [0.0, 1.0]])
zero = Line([0.0], [0.0])
wOuv = TensorProduct(zero, '+', wO)
wu = w.Differentiate(0)
f = Dot(wu, w - wOuv)
le = f.Intersect([1.0, 0.0])

The spline function zero is just the function defined 
over [0, 1] whose value is zero everywhere. It is needed 
here so that wOuv is a function of two variables. 

This is very similar to the previous examples, with
only two new wrinkles. The first is the use of the Trim
method to construct a new spline identical to the original
except that its domain of definition is restricted to a
subset of the original. Since the value of u is restricted
in this case to be 0, the function wO is a function of
only one variable. Making wO a function of two variables
necessitates the use of TensorProduct so that Dot is
once again provided with input functions with identical
domains. The last wrinkle is the use of the Intersect
method. In this case, it finds the zero set of an
underdetermined system in the domain of f. The resulting
locus of points in that domain is the preimage of the
leading edge curve we are seeking.
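The leading-edge condition can be checked on a toy surface in Python. This is a sketch under loud assumptions, not the Boeing wing and not Geoduck: the surface `w` below is a hypothetical tapered wing whose sections are ellipses, and for each fixed v the equation w_u . (w - w(0, v)) = 0 is solved for u by bisection on [0.25, 0.75] instead of by computing a whole zero set at once.

```python
import math

def w(u, v):
    """Hypothetical wing: section v is an ellipse in the (x, z) plane."""
    a = 1.0 - 0.5 * v                 # semi-chord, tapering along the span
    b = 0.1 * a                       # half thickness
    theta = 2.0 * math.pi * u
    # The minus sign makes the lower side come first as u increases.
    return (a * math.cos(theta), v, -b * math.sin(theta))

def w_u(u, v):
    """Partial derivative of w with respect to u."""
    a = 1.0 - 0.5 * v
    b = 0.1 * a
    theta = 2.0 * math.pi * u
    return (-2.0 * math.pi * a * math.sin(theta), 0.0,
            -2.0 * math.pi * b * math.cos(theta))

def f(u, v):
    """w_u(u, v) . (w(u, v) - w(0, v))."""
    p, q, t = w(u, v), w(0.0, v), w_u(u, v)
    return sum(tk * (pk - qk) for tk, pk, qk in zip(t, p, q))

def leading_edge_parameter(v, lo=0.25, hi=0.75, tol=1e-12):
    """Bisection for the interior zero of f(., v); assumes a sign change."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo, v) * f(mid, v) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Sample the leading edge curve v -> w(u(v), v) at a few span stations.
leading_edge = [(v, leading_edge_parameter(v)) for v in (0.0, 0.5, 1.0)]
```

For elliptical sections the farthest point from the trailing edge is the opposite end of the major axis, so the computed parameter is u = 1/2 for every v.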

As a final example, consider the problem of locating
a spar of maximum depth along the span of a wing,
something that is very desirable for structural reasons.
This time, imagine the wing split into two pieces $w_1$
and $w_2$. Suppose that $w_1$ is the upper part of the wing
while $w_2$ is the lower part. As before, the second
parameter increases with the span of the wing. Imagine that
both surfaces have the same parametrization, so that
the y component of the two pieces satisfies
$y_1(\cdot, v_1) = y_2(\cdot, v_2)$ whenever $v_1 = v_2$.
Thus, for each value of $v_1$
the vertical distance between the upper and lower surfaces
will be maximized whenever the corresponding
tangent vectors are parallel, i.e., when

$$\frac{\partial}{\partial u_1}
\begin{pmatrix} x_1(u_1, v_1) \\ z_1(u_1, v_1) \end{pmatrix}
\parallel
\frac{\partial}{\partial u_2}
\begin{pmatrix} x_2(u_2, v_2) \\ z_2(u_2, v_2) \end{pmatrix}.$$

With this in mind, new surfaces $s_1$ and $s_2$ can be defined
by

$$s_1(u_1, v_1) =
\begin{pmatrix}
x_1(u_1, v_1) \\
y_1(u_1, v_1) \\
\dfrac{(\partial/\partial u_1)\, z_1(u_1, v_1)}{(\partial/\partial u_1)\, x_1(u_1, v_1)}
\end{pmatrix},
\qquad
s_2(u_2, v_2) =
\begin{pmatrix}
x_2(u_2, v_2) \\
y_2(u_2, v_2) \\
\dfrac{(\partial/\partial u_2)\, z_2(u_2, v_2)}{(\partial/\partial u_2)\, x_2(u_2, v_2)}
\end{pmatrix}.$$

Note that the third component of each of these new
surfaces is the slope of the curves that are the sections
of the surfaces. The tangent vectors are parallel when
these slopes are the same. Thus, curves along each of
the two original surfaces $w_1$ and $w_2$ can be calculated
as the intersection of the new surfaces $s_1$ and $s_2$. The
resulting spar is the ruled surface between these two
curves, which is shown in figure 7.

Figure 7 The maximum-depth spar for the given wing. Only
the lower part of the wing is shown here so that the spar is
visible.
VII.3 Computer-Aided Proofs via
Interval Analysis

Warwick Tucker 

1 Introduction 

The aim of this article is to give a brief introduction 
to the field of computer-aided proofs. We will do so 
by focusing on the problem of solving nonlinear equa- 
tions and inclusions using techniques from interval 
analysis. Interval analysis is a framework designed to 
make numerical computations mathematically rigor- 
ous. Instead of computing approximations to sought 
quantities, the aim is to compute enclosures of the 
same. This requires taking both rounding and dis- 
cretization errors into account. 

1.1 What Is a Computer-Aided Proof? 

Computer-aided proofs come in several flavors. The 
types of proofs that we will address here are those 
that involve the continuum of the real line. Thus, the 
problems we are trying to solve are usually taken from 
analysis, rather than from, say, combinatorics. 

Rounding errors can be handled by performing all 
calculations with a set-valued arithmetic, such as inter- 
val arithmetic with outward rounding. The discretiza- 
tion errors, however, require some very careful analy- 
sis, and this usually constitutes the genuinely hard, 
analytical, part of the preparation of the proof. Quite 
often, the problem must be reformulated as a fixed- 
point equation in some suitable function space, and 
this can require some delicate arguments from func- 
tional analysis. Moreover, the discretization bounds 
must be made completely rigorous and explicit; in fact, 
they must be computable in finite time. 

1.2 Examples from Mathematics 

The field of interval analysis [II.20] has reached a
high level of maturity, and its techniques have been
used in solving many notoriously hard problems in
mathematics.

As a testament to the advances made in recent years, 
we mention a few results that have been obtained using 
computer-aided proofs: 

• Hass and Schlafly (2000) solved the double bubble 
conjecture (a problem stemming from the theory of 
minimal surfaces); 

• Gabai et al. (2003) showed that homotopy hyper- 
bolic 3-manifolds have noncoalescable insulator 
families; 

• the long-standing Kepler conjecture (concerning 
how to densely pack spheres) was recently settled 
by Hales (2005); and 

• the present author solved Smale’s fourteenth prob- 
lem about the existence of the Lorenz attractor in 
1999. 

2 Solving Nonlinear Equations 

In this section we describe some computer-aided tech- 
niques for solving nonlinear equations. As we shall see 
later, this is a basic (hard) ingredient in many problems 
from analysis. Some of the presented techniques can 
be extended to the infinite-dimensional setting (e.g., 
fixed-point equations in function spaces), but in order 
to keep the exposition simple and short we will cover 
only the finite-dimensional case here. 

To be precise, given a function f, together with a
search domain X, our task is to establish the existence
(and uniqueness) of all zeros of f residing inside X.
Due to the nonlinearity of f, and the global nature of
X, this is indeed not a simple task.

A basic principle of interval analysis is to extend our
original problem to a set-valued version that satisfies
the inclusion principle. In the setting we are interested
in (solving f(x) = 0), this means that we must find
an interval extension of f: that is, an interval-valued
function F satisfying

$$\operatorname{range}(f; \boldsymbol{x}) = \{f(x) : x \in \boldsymbol{x}\} \subseteq F(\boldsymbol{x}). \tag{1}$$

Here, $\boldsymbol{x}$ (bold) denotes an interval and $x$ denotes a real
number. When f is an elementary function, F is obtained
by substituting all appearing arithmetic operators and
standard functions with their interval-valued counterparts.
In higher dimensions, we consider the components
of f separately.

We will now briefly describe two techniques that we
use in our search for the zeros of f.


2.1 Interval Bisection 

Interval analysis provides simple criteria for excluding
regions in the search space where no solutions to
f(x) = 0 can reside. The entire search region X is
adaptively subdivided by set-valued bisection into
subrectangles $\boldsymbol{x}_i$, each of which must withstand the test
$0 \in F(\boldsymbol{x}_i)$. Failing to do so results in the subrectangle
being discarded from further search; by (1), the set $\boldsymbol{x}_i$
cannot contain any solutions. The bisection phase ends
when all remaining subrectangles have reached a sufficiently
small size. What we are left with is a collection
of rectangles whose union is guaranteed to contain all
zeros of f (if there are any) within the domain X.

As a simple example, consider the function
$f(x) = \sin x\,(x - \cos x)$ on the domain X = [-10, 10] (see
figure 1(a)). This function clearly has eight zeros in X:
$\{\pm 3\pi, \pm 2\pi, \pm\pi, 0, x^*\}$, where $x^*$ is the unique
(positive) zero of x - cos x = 0. Applying the interval
bisection method with the stopping tolerance 0.001
produces the nine intervals listed below. Note, however,
that intervals 4 and 5 are adjacent. This always happens
when a zero is located exactly at a bisection point
of the domain.

Domain        : [-10,10]
Tolerance     : 0.001
Function calls: 227
Solution list :
1: [-9.42505,-9.42444]    6: [+0.73853,+0.73914]
2: [-6.28357,-6.28296]    7: [+3.14148,+3.14209]
3: [-3.14209,-3.14148]    8: [+6.28296,+6.28357]
4: [-0.00061,+0.00000]    9: [+9.42444,+9.42505]
5: [+0.00000,+0.00061]

The mathematical content of this computation is that
all zeros of f, restricted to [-10, 10], are contained in
the union of the nine output intervals (see figure 1(b)).
No claims can be made about existence at this point.
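The bisection phase can be sketched in a few lines of Python. This is an illustration only, under important assumptions: intervals are plain (lo, hi) tuples, the helper names are hypothetical, and ordinary floating-point arithmetic is used without outward rounding, so the output is not a rigorous enclosure in the sense of this article. The number of function calls depends on implementation details and need not match the listing above.

```python
import math

TWO_PI = 2.0 * math.pi

def _hits(lo, hi, point):
    """True if point + 2*pi*k lies in [lo, hi] for some integer k."""
    k = math.ceil((lo - point) / TWO_PI)
    return point + k * TWO_PI <= hi

def isin(x):
    """Range of sin over the interval x (no outward rounding)."""
    lo, hi = x
    ends = (math.sin(lo), math.sin(hi))
    upper = 1.0 if _hits(lo, hi, 0.5 * math.pi) else max(ends)
    lower = -1.0 if _hits(lo, hi, -0.5 * math.pi) else min(ends)
    return (lower, upper)

def icos(x):
    """Range of cos over the interval x (no outward rounding)."""
    lo, hi = x
    ends = (math.cos(lo), math.cos(hi))
    upper = 1.0 if _hits(lo, hi, 0.0) else max(ends)
    lower = -1.0 if _hits(lo, hi, math.pi) else min(ends)
    return (lower, upper)

def isub(x, y):
    return (x[0] - y[1], x[1] - y[0])

def imul(x, y):
    products = [a * b for a in x for b in y]
    return (min(products), max(products))

def F(x):
    """Interval extension of f(x) = sin x (x - cos x)."""
    return imul(isin(x), isub(x, icos(x)))

def bisect_zeros(F, domain, tol):
    """Keep the subintervals of width < tol that pass the test 0 in F(x)."""
    work, kept, calls = [domain], [], 0
    while work:
        lo, hi = work.pop()
        calls += 1
        f_lo, f_hi = F((lo, hi))
        if f_lo > 0.0 or f_hi < 0.0:
            continue                     # 0 not in F(x): discard
        if hi - lo < tol:
            kept.append((lo, hi))
        else:
            mid = 0.5 * (lo + hi)
            work.append((mid, hi))
            work.append((lo, mid))
    return sorted(kept), calls

boxes, calls = bisect_zeros(F, (-10.0, 10.0), 0.001)
```

With the domain and tolerance of the example, the surviving intervals have width 20/2^15, and the zero at the origin is reported twice because 0 is a bisection point of [-10, 10].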

2.2 The Interval Newton Method 

Another (equally important) tool of interval analysis
is a checkable criterion for proving the existence and
(local) uniqueness of solutions to nonlinear equations.
The underlying tool is based on a set-valued extension
of Newton’s method, which incorporates the Kantorovich
condition, ensuring that the basin of attraction
has been reached. Let $\boldsymbol{x}$ denote a subrectangle that
survived the bisection process, and let $\check{x}$ be a point in
the interior of $\boldsymbol{x}$ (e.g., its midpoint). We now form the
Newton image of $\boldsymbol{x}$:

$$N(\boldsymbol{x}) = \check{x} - [DF(\boldsymbol{x})]^{-1} f(\check{x}).$$



This is a set-valued version of the standard Newton
method: note that the correction term involves solving
a linear system with interval entries. If we cannot
solve this linear system, we apply some more bisection
steps to the set $\boldsymbol{x}$. The Newton image carries some very
powerful information. If $N(\boldsymbol{x}) \cap \boldsymbol{x} = \emptyset$, then $\boldsymbol{x}$
cannot contain any solution to f(x) = 0 and is therefore
discarded. On the other hand, if $N(\boldsymbol{x}) \subseteq \boldsymbol{x}$, then $\boldsymbol{x}$
contains a unique zero of f. In the remaining case, we can
shrink $\boldsymbol{x}$ into $N(\boldsymbol{x}) \cap \boldsymbol{x}$ and redo the Newton test. If
successful, this stage will give us an exact count of the
number of zeros of f(x) within the domain X.

In figure 2 we illustrate the geometric construction
of the Newton image.

Figure 1 (a) The function $f(x) = \sin x\,(x - \cos x)$ on the
domain X = [-10, 10]. (b) Increasingly tight enclosures of
the zeros.

Figure 2 One iteration of the interval Newton method.

3 An Application to the
Restricted n-Body Problem

The Newtonian n-body problem addresses the dynamics
of n bodies (which can be assumed to be point
particles) with masses $m_i > 0$, moving according to
Newton’s laws of motion. A relative equilibrium is a
planar solution to Newton’s equations that performs a
rotation of uniform angular velocity about the system’s
center of mass. Thus, in a rotating coordinate frame,
the constellation of bodies is fixed.

A long-standing question, raised by Aurel Wintner
in 1941, concerns the finiteness of the number of
(equivalence classes of) relative equilibria. In 1998,
Fields medalist Steven Smale listed a number of challenging
problems for the twenty-first century. Problem
number 6 reads:

Is the number of relative equilibria finite, in the n-body
problem of celestial mechanics, for any choice
of positive real numbers $m_1, \ldots, m_n$ as the masses?

At the time of writing, this problem remains open for 
n > 5. 

Recently, Kulevich et al. (2009) established that the 
number of equilibria in the planar circular restricted 
four-body problem (PCR4BP) is finite for any choice of 
masses. The PCR4BP under consideration consists of 
three large bodies (primaries) with arbitrary masses at 
the vertices of an equilateral triangle rotating on circu- 
lar orbits about their common center of mass. A fourth 
infinitesimal mass, subject to the gravitational attrac- 
tion of the primaries, is inserted into their plane of 
motion and is assumed to have no effect on their circu- 
lar orbits. In an appropriately selected rotating frame, 
the primaries are fixed (see figure 3). 

Since we are concerned only with equivalence classes
of solutions, we may normalize the system's total mass
such that $\sum_i m_i = 1$. Without loss of generality, we may
also assume that the equilateral triangle of primaries
has unit sides. This makes the problem compact and
thus amenable to a global search.
Figure 3 The basic configuration with Lagrange’s
equilateral triangle for the PCR4BP.


Let $x_1, x_2, x_3, c \in \mathbb{R}^2$ denote the (fixed) positions
of the three primary bodies and their center of mass,
respectively. Let $z = (z_1, z_2)$ be the position of the
fourth (infinitesimal mass) body. Then its motion is
governed by the system of differential equations

$$\ddot{z}_1 = \frac{\partial V}{\partial z_1} + 2\dot{z}_2,
\qquad
\ddot{z}_2 = \frac{\partial V}{\partial z_2} - 2\dot{z}_1,$$

where the potential V is given by

$$V(z) = \tfrac{1}{2}\,\|z - c\|_2^2 + \sum_{k=1}^{3} \frac{m_k}{\|x_k - z\|_2}. \tag{2}$$


Thus, the relative equilibria are given by the critical 
points of V. In essence, we have reduced the problem 
to that of solving nonlinear equations. 
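As a small numerical illustration of this reduction, the Python sketch below evaluates the potential (2) and its gradient. Several things are assumptions made for the example: the primaries are placed at (0, 0), (1, 0) and (1/2, sqrt(3)/2) so that the triangle has unit sides, the masses sum to 1, and the function names are hypothetical. In the equal-mass case the gradient vanishes at the centroid of the triangle by symmetry, so the centroid is a critical point of V.

```python
import math

# Assumed placement of the primaries: an equilateral triangle with unit sides.
PRIMARIES = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3.0) / 2.0)]

def center_of_mass(masses):
    """Center of mass of the primaries (masses are assumed to sum to 1)."""
    return (sum(m * x[0] for m, x in zip(masses, PRIMARIES)),
            sum(m * x[1] for m, x in zip(masses, PRIMARIES)))

def potential(z, masses):
    """V(z) = |z - c|^2 / 2 + sum_k m_k / |x_k - z|."""
    c = center_of_mass(masses)
    value = 0.5 * ((z[0] - c[0]) ** 2 + (z[1] - c[1]) ** 2)
    for m, x in zip(masses, PRIMARIES):
        value += m / math.hypot(x[0] - z[0], x[1] - z[1])
    return value

def gradient(z, masses):
    """Gradient of V: (z - c) + sum_k m_k (x_k - z) / |x_k - z|^3."""
    c = center_of_mass(masses)
    g = [z[0] - c[0], z[1] - c[1]]
    for m, x in zip(masses, PRIMARIES):
        r = math.hypot(x[0] - z[0], x[1] - z[1])
        g[0] += m * (x[0] - z[0]) / r ** 3
        g[1] += m * (x[1] - z[1]) / r ** 3
    return (g[0], g[1])

equal_masses = (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0)
centroid = center_of_mass(equal_masses)
residual = gradient(centroid, equal_masses)   # close to (0, 0)
```

A rigorous count of all critical points needs interval extensions of this gradient and of its Jacobian; the sketch only evaluates them in floating point.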

Kulevich et al. showed not only that the number of 
relative equilibria in the PCR4BP is finite, but that there 
are at most 196 equilibrium points. This upper bound is 
believed to be a large overestimation. Numerical
explorations by Simó (1978) indicate that there are eight,
nine, or ten equilibria, depending on the values of the 
masses. 

The proof of Kulevich et al. is based on techniques 
from Bernstein-Khovanskii-Kushnirenko (BKK) theory. 
This provides checkable conditions determining if a 
system of polynomial equations has a finite number of 
solutions for which all variables are nonzero. The tech- 
niques utilized stem from algebraic geometry, such as 
the computation of Newton polytopes, and are more
general than Gröbner basis methods.
Figure 4 The PCR4BP in the equal-mass case $m_1 = m_2 =
m_3 = \tfrac{1}{3}$. The three primaries are marked by circles; the ten
equilibria are marked by disks. The rectangular grid stems
from the subdivision in the adaptive elimination process.


An interesting special case is taking all masses equal:
$m_1 = m_2 = m_3 = \tfrac{1}{3}$. The PCR4BP then has exactly ten
solutions, as was first established by Lindow in 1922.
We will illustrate how rigorous zero-finding techniques 
can readily deal with this scenario. 

By the normalization of the primaries and masses, we
can restrict our search to the square
$X = [-\tfrac{3}{2}, +\tfrac{3}{2}]^2$. Let
$f(z) = \nabla V(z)$ denote the gradient of V, and extend it
to a set-valued function satisfying the inclusion principle
(1). The problem is now reduced to finding (or simply
counting) the zeros of f restricted to X. Using a
combination of the set-valued bisection and the interval
Newton operator described above, it is straightforward
to establish that the PCR4BP in the equal-mass
case has exactly ten relative equilibria. Using a stopping
tolerance of $10^{-2}$ in the bisection stage, we arrive
at 1546 discarded subrectangles, with 163 remaining
for the next stage. The Newton stage discards another
150 subrectangles and produces ten isolating neighborhoods
for the relative equilibria (and three for the
primaries) (see figure 4).

As far as our approach is concerned, there is nothing
special about choosing the masses to be equal in (2).
Taking, for example, $m = (\tfrac{1}{10}, \tfrac{3}{10}, \tfrac{6}{10})$, we can repeat
the computations and find that there are exactly eight
relative equilibria (see figure 5).

Figure 5 The PCR4BP in the case $m = (\tfrac{1}{10}, \tfrac{3}{10}, \tfrac{6}{10})$.


The computations involved take a fraction of a sec- 
ond to perform on a standard laptop. 
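The second stage used above, the interval Newton test on boxes that survive bisection, can be sketched in one dimension on the function f(x) = sin x (x - cos x) of section 2.1. The Python code below is an illustration under assumptions: intervals are (lo, hi) tuples, the helper names are hypothetical, the three candidate intervals are chosen by hand, and the arithmetic is ordinary floating point without outward rounding, so the conclusions are indicative rather than proofs.

```python
import math

TWO_PI = 2.0 * math.pi

def _hits(lo, hi, point):
    """True if point + 2*pi*k lies in [lo, hi] for some integer k."""
    return point + math.ceil((lo - point) / TWO_PI) * TWO_PI <= hi

def isin(x):
    ends = (math.sin(x[0]), math.sin(x[1]))
    return (-1.0 if _hits(x[0], x[1], -0.5 * math.pi) else min(ends),
            1.0 if _hits(x[0], x[1], 0.5 * math.pi) else max(ends))

def icos(x):
    ends = (math.cos(x[0]), math.cos(x[1]))
    return (-1.0 if _hits(x[0], x[1], math.pi) else min(ends),
            1.0 if _hits(x[0], x[1], 0.0) else max(ends))

def iadd(x, y):
    return (x[0] + y[0], x[1] + y[1])

def isub(x, y):
    return (x[0] - y[1], x[1] - y[0])

def imul(x, y):
    products = [a * b for a in x for b in y]
    return (min(products), max(products))

def idiv(x, y):
    """x / y for an interval y that does not contain 0."""
    return imul(x, (1.0 / y[1], 1.0 / y[0]))

def F(x):
    """Interval extension of f(x) = sin x (x - cos x)."""
    return imul(isin(x), isub(x, icos(x)))

def DF(x):
    """Interval extension of f'(x) = cos x (x - cos x) + sin x (1 + sin x)."""
    s, c = isin(x), icos(x)
    return iadd(imul(c, isub(x, c)), imul(s, iadd((1.0, 1.0), s)))

def newton_test(x):
    """One interval Newton step N(x) = m - f(m) / F'(x), m the midpoint."""
    lo, hi = x
    d = DF(x)
    if d[0] <= 0.0 <= d[1]:
        return "undecided", x            # cannot divide: bisect further
    mid = 0.5 * (lo + hi)
    q = idiv(F((mid, mid)), d)
    n = (mid - q[1], mid - q[0])
    if n[1] < lo or n[0] > hi:
        return "no zero", None           # N(x) and x are disjoint
    if lo <= n[0] and n[1] <= hi:
        return "unique zero", n          # N(x) is contained in x
    return "shrunk", (max(lo, n[0]), min(hi, n[1]))

candidates = [(0.7, 0.8), (1.0, 1.1), (3.1, 3.2)]
results = [newton_test(x) for x in candidates]
```

For these candidates the test reports a unique zero in [0.7, 0.8] (the zero of x - cos x), no zero in [1.0, 1.1], and a unique zero in [3.1, 3.2] (the zero at pi), each with a Newton image much narrower than the input interval.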

4 Solving Nonlinear Inclusions 

In the set-valued framework it makes a lot of sense to
extend the concept of solving equations of the form
f(x) = y to include solving for inclusions $f(x) \in \boldsymbol{y}$,
where $\boldsymbol{y}$ is a given set. This is a very natural problem
formulation in the presence of uncertainties; think of
$\boldsymbol{y}$ as representing noisy data with error bounds.

If $\boldsymbol{y}$ is a set with nonempty interior, we should expect
the solution set S to share this property. Therefore, we
should be able to approximate the solution set both
from the inside and from the outside. In other words,
we would like to compute two sets $\underline{S}$ and $\overline{S}$ that satisfy

$$\underline{S} \subseteq S \subseteq \overline{S}.$$

$\underline{S}$ and $\overline{S}$ are called the inner and outer approximations
of S, respectively. By measuring the size of their
difference, $\overline{S} \setminus \underline{S}$, we can obtain reliable information about
how close we are to the solution set S.

Given a partition $\mathcal{P}(X)$ of the domain X, the outer
approximation $\overline{S}$ is computed precisely as in
section 2.1: it contains all partition elements whose interval
images have nonempty intersection with the range
$\boldsymbol{y}$. The inner approximation contains all partition
elements whose interval images are contained in the range
$\boldsymbol{y}$. In other words, we have

$$\underline{S} = \{\boldsymbol{x} \in \mathcal{P}(X) : F(\boldsymbol{x}) \subseteq \boldsymbol{y}\},$$

$$\overline{S} = \{\boldsymbol{x} \in \mathcal{P}(X) : F(\boldsymbol{x}) \cap \boldsymbol{y} \neq \emptyset\}.$$

Figure 6 An outer approximation $\overline{S}$ of the solution set S
(shaded). All rectangles of width greater than $10^{-2}$ belong
to the inner approximation $\underline{S}$.

As an example, consider the nonlinear function

$$f(x) = \sin x_1 + \sin x_2 + \tfrac{1}{5}(x_1^2 + x_2^2)$$

and suppose we want to find the set

$$S = \{x \in [-5, +5]^2 : f(x) \in [-0.5, 0.5]\}.$$

By a simple bisection procedure, we adaptively partition
the domain $X = [-5, 5]^2$ into subrectangles and
discard rectangles $\boldsymbol{x}$ such that $F(\boldsymbol{x}) \cap [-0.5, 0.5] = \emptyset$.
The remaining rectangles are classified according to
whether they belong to $\underline{S}$ and/or $\overline{S}$. Note that, as soon
as a subrectangle is determined to belong to $\underline{S}$, it
undergoes no further bisection. Only subrectangles whose
interval images intersect $\partial S$ are subdivided. This is
illustrated in figure 6, in which a stopping tolerance
of $10^{-2}$ was used.
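The classification into inner and outer boxes can be sketched in Python as follows. The sketch makes several assumptions that should be kept in mind: the coefficient 1/5 of the quadratic term is taken as an assumption about the example function, boxes are pairs of (lo, hi) tuples, the helper names are hypothetical, and floating-point arithmetic is used without outward rounding, so the result is illustrative rather than validated.

```python
import math

TWO_PI = 2.0 * math.pi
COEFF = 0.2   # assumed coefficient of the quadratic term

def _hits(lo, hi, point):
    """True if point + 2*pi*k lies in [lo, hi] for some integer k."""
    return point + math.ceil((lo - point) / TWO_PI) * TWO_PI <= hi

def isin(x):
    ends = (math.sin(x[0]), math.sin(x[1]))
    return (-1.0 if _hits(x[0], x[1], -0.5 * math.pi) else min(ends),
            1.0 if _hits(x[0], x[1], 0.5 * math.pi) else max(ends))

def isqr(x):
    """Exact range of t**2 over the interval x."""
    lo, hi = x
    if lo >= 0.0:
        return (lo * lo, hi * hi)
    if hi <= 0.0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))

def f(x1, x2):
    return math.sin(x1) + math.sin(x2) + COEFF * (x1 * x1 + x2 * x2)

def F(box):
    """Interval extension of f over a box (a, b) of two intervals."""
    a, b = box
    s1, s2, q1, q2 = isin(a), isin(b), isqr(a), isqr(b)
    return (s1[0] + s2[0] + COEFF * (q1[0] + q2[0]),
            s1[1] + s2[1] + COEFF * (q1[1] + q2[1]))

def solve_inclusion(F, box, y, tol):
    """Return (inner boxes, undecided boundary boxes) for f(x) in y."""
    inner, boundary, work = [], [], [box]
    while work:
        a, b = work.pop()
        lo, hi = F((a, b))
        if hi < y[0] or lo > y[1]:
            continue                          # image misses y: discard
        if y[0] <= lo and hi <= y[1]:
            inner.append((a, b))              # image inside y: inner box
        elif a[1] - a[0] < tol:
            boundary.append((a, b))           # small but undecided
        else:                                 # split into four squares
            am, bm = 0.5 * (a[0] + a[1]), 0.5 * (b[0] + b[1])
            for ai in ((a[0], am), (am, a[1])):
                for bi in ((b[0], bm), (bm, b[1])):
                    work.append((ai, bi))
    return inner, boundary

inner, boundary = solve_inclusion(
    F, ((-5.0, 5.0), (-5.0, 5.0)), (-0.5, 0.5), 1e-2)
outer = inner + boundary
```

Here `inner` plays the role of the inner approximation and `outer` (inner boxes together with the small undecided boxes) that of the outer approximation; the total area of `boundary` measures the gap between the two.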

The ability to solve nonlinear inclusions is of great
importance in parameter estimation. Given a finitely
parametrized model function $f(x; p) = y$ together
with a set of uncertain data $(x_1, \boldsymbol{y}_1), \ldots, (x_n, \boldsymbol{y}_n)$ and
a search space $\mathcal{P}$, the task is to solve for the set

$$S = \{p \in \mathcal{P} : f(x_i; p) \in \boldsymbol{y}_i,\ i = 1, \ldots, n\}.$$

This can be done by computing inner and outer approximations
of S, exactly as described above. Today there
exist very efficient techniques (e.g., constraint propagation)
for solving precisely such problems.

5 Recent Developments 

In this article we have focused exclusively on nonlinear 
equation solving, but computer-aided proofs are also 
used in many other areas. Global optimization is a field 
that is well suited to rigorous techniques, and there 
are many software suites that rely on interval analysis 
and set-valued computations. A great deal of effort has 
been expended developing methods for rigorously solv- 
ing ordinary differential equations, and several mature 
software packages that produce validated results at a 
reasonable computational cost now exist. There have 
also been many successful endeavors in the realm of 
partial differential equations, but here each problem 
requires its own set of tools and there is no natu- 
ral one-size-fits-all approach to this vast area. Another 
area of application is parameter estimation, where set- 
valued techniques are used to model the uncertain- 
ties in the estimated states. With the recent interest in 
uncertainty quantification, rigorous computations have 
a bright future. 
VII.4 Applications of Max-Plus Algebra

David S. Broomhead 


1 Basics 

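Max-plus algebra is a term loosely applied to algebraic
manipulations using the operations max and +. More
precisely, consider the set of real numbers $\mathbb{R}$ and
augment this with a smallest (with respect to the total
ordering on $\mathbb{R}$) element $\varepsilon = -\infty$. Let us say
$\overline{\mathbb{R}} = \mathbb{R} \cup \{\varepsilon\}$.
A rich arithmetic on $\overline{\mathbb{R}}$ is then obtained by defining the
binary operations $\oplus$ and $\otimes$:

$$a \oplus b = \max\{a, b\} \quad \forall a, b \in \overline{\mathbb{R}},$$

$$a \otimes b = a + b \quad \forall a, b \in \overline{\mathbb{R}}.$$

In this algebra, $\varepsilon$ acts as the “zero” element in the sense
that $a \oplus \varepsilon = a$ and $a \otimes \varepsilon = \varepsilon$ for any
$a \in \overline{\mathbb{R}}$, while 0 acts
as the unit element since $a \otimes 0 = a$ for any $a \in \overline{\mathbb{R}}$. Both
$\oplus$ and $\otimes$ are commutative and associative operations,
and $\otimes$ distributes over $\oplus$ because the identity

$$a + \max\{b, c\} = \max\{a + b, a + c\}$$

translates directly to the distributive law

$$a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c).$$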

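Max-plus powers of any $a \in \mathbb{R}$ can be defined recursively
by $a^{\otimes (k+1)} = a \otimes a^{\otimes k}$, with $a^{\otimes 0} = 0$. Note that
$a^{\otimes k} = k \times a$. This can be extended to any real-valued
power, $\alpha$, by defining $a^{\otimes \alpha} = \alpha \times a$.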

The arithmetic that is taught in childhood differs
from this max-plus system in important ways. In particular,
$a \oplus a = a$ for all $a \in \overline{\mathbb{R}}$, so $\oplus$ is idempotent. A
consequence of this is that there is no additive inverse, i.e.,
there is not a nice analogue of the negative of a number.
(The negative numbers in $\mathbb{R}$ actually correspond to
the multiplicative inverse since $x \otimes (-x) = 0$.)
Mathematically, $(\overline{\mathbb{R}}, \oplus, \otimes)$ is a semifield, albeit one that is
commutative and idempotent.
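These rules are easy to experiment with. The Python sketch below is a minimal illustration, assuming that the element epsilon is represented by the floating-point value `-inf`; the names `EPS`, `oplus`, `otimes` and `power` are hypothetical.

```python
import math

EPS = -math.inf            # the max-plus "zero" element epsilon

def oplus(a, b):
    """Max-plus addition: a (+) b = max{a, b}."""
    return max(a, b)

def otimes(a, b):
    """Max-plus multiplication: a (x) b = a + b."""
    return a + b

def power(a, k):
    """Max-plus power: k copies of a combined with (x), i.e. k * a."""
    if k == 0:
        return 0.0         # the unit element; also avoids 0 * (-inf)
    return k * a

# A few of the identities stated above, for sample values.
zero_law = (oplus(3.0, EPS), otimes(3.0, EPS))       # (3.0, -inf)
unit_law = otimes(3.0, 0.0)                          # 3.0
lhs = otimes(2.0, oplus(5.0, 7.0))                   # a (x) (b (+) c)
rhs = oplus(otimes(2.0, 5.0), otimes(2.0, 7.0))      # (a (x) b) (+) (a (x) c)
```

The last two lines evaluate both sides of the distributive law for a = 2, b = 5, c = 7; both equal 9.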

2 Tropical Mathematics 

Tropical mathematics is a recent, broader name for
algebra and geometry based on the max-plus semifield
and related semifields. In many applications, it can be
more convenient to use the min-plus semifield, which
is defined as above but using the min operation rather
than max. In this case, the binary operations are defined
over $\mathbb{R} \cup \{\infty\}$. The max-plus and min-plus semifields are
isomorphic by the map $x \mapsto -x$. A third commutative
idempotent semifield is called max-times. It is based on
the nonnegative reals with operations max and times
and is isomorphic to max-plus by the map $x \mapsto \log x$.

3 An Overview of Applications 

Tropical mathematics has been applied to such a wide 
range of problems that it is impossible to cover them 
all here. It has been used to analyze the timing of 
asynchronous events, such as timetabling for trans- 
port networks, the scheduling of complex tasks, the 
control of discrete-event systems, and the design of 
asynchronous circuits in microelectronics. There are 
deep connections with algebraic geometry, which have 
led to applications in the analysis of deoxyribonucleic 
acid (DNA) sequences and their relation to phylogenetic 
trees. There is also work on optimization that is con- 
nected with semiclassical limits of quantum mechanics, 
Hamilton-Jacobi theory, and the asymptotics of zero- 
temperature statistical mechanics. In all these cases 
the mathematical structures, built on the semifield 
structure described above, lead to simple, often lin- 
ear, formulations of potentially complicated nonlinear 
problems. 

Consider a simple example of scheduling for asynchronous
processes. At time $x_1$ an engineer begins to
make a component that takes $a_1$ hours to complete.
In the max-plus notation, the time at which the task
is completed is $a_1 \otimes x_1$. A second engineer begins to
make a different component at time $x_2$ that takes $a_2$
hours. If both components are to be combined to make
the finished product, the product cannot be completed
before $(a_1 \otimes x_1) \oplus (a_2 \otimes x_2)$, i.e., before the last
component is finished. The two operations $\otimes$ and $\oplus$ occur
naturally here, and there is also a hint of linear algebra,
since the earliest time for completion appears to be the
scalar product, in a max-plus sense, of a vector of starting
times $(x_1, x_2)$ and the vector of durations of each
task $(a_1, a_2)$.
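The max-plus scalar product in this example can be written as a one-line Python function. The numbers below are made up for illustration, and the name `maxplus_dot` is hypothetical.

```python
def maxplus_dot(a, x):
    """Max-plus scalar product: (a_1 (x) x_1) (+) ... (+) (a_n (x) x_n)."""
    return max(ai + xi for ai, xi in zip(a, x))

start_times = (9.0, 10.0)     # assumed starting times x_1, x_2 (hours)
durations = (3.0, 1.0)        # assumed durations a_1, a_2 (hours)
earliest_completion = maxplus_dot(durations, start_times)
```

With these values the first component is finished at 12 and the second at 11, so the product cannot be completed before time 12.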

4 Timing on Networks 

The timetable for a rail network has to coordinate the 
movements of many independent trains in order to pro- 
vide a safe and predictable service. A range of issues 
have to be addressed: railway stations have limited 
numbers of platforms; parts of the network may have 
single-track lines; passengers need to make connec- 
tions; etc. Max-plus algebra provides useful tools to do 
this. 

As a basic example consider a railway, from A to B, 
a section of which is single track. For safety, there is a 


token that must be given by a signalman to the driver 
about to enter the single-track section. There is only 
one token, and when the driver leaves the section he 
returns it to a second signalman, who can give it to the 
driver of a train traveling in the opposite direction. 

Let the timetable be such that trains from either
direction arrive at the single-track section every T
hours, and let $x_A(k)$ be the time at which the kth train
from A to B enters the section, and similarly with $x_B(k)$.
The time taken for a train to traverse the section is $\tau$.
Then, for $1 < k \in \mathbb{N}$,

$$x_A(k) = \tau \otimes x_B(k-1) \oplus T^{\otimes k},$$

$$x_B(k) = \tau \otimes x_A(k) \oplus T^{\otimes k},$$

with the initial condition $x_A(1) = T$. The lack of symmetry
between these expressions arises because the
first train is assumed to be traveling from A to B. In
words, these formulas mean that a train cannot enter
the section before it arrives there or before the previous
train traveling in the opposite direction has left.
By substitution, the following linear nonhomogeneous
equation for $x_A(k)$ is found:

$$x_A(k) = \tau^{\otimes 2} \otimes x_A(k-1) \oplus \tau \otimes T^{\otimes (k-1)} \oplus T^{\otimes k}.$$

This equation can be solved by introducing
$x_A(k) = \tau^{\otimes 2k} \otimes y(k)$ with $y(1) = \tau^{\otimes -2} \otimes T$.
Assuming that $T > \tau$ gives

$$y(k) = y(k-1) \oplus \Delta^{\otimes k}
\;\Longrightarrow\;
y(k) = \bigoplus_{j=1}^{k} \Delta^{\otimes j},$$

where $\Delta = \tau^{\otimes -2} \otimes T = T - 2\tau$.

There are two cases that are distinguished by the sign
of the parameter $\Delta$. If $\Delta > 0$, the sequence of $\Delta^{\otimes j}$ is
increasing, so that $y(k) = \Delta^{\otimes k}$. In this case, therefore,
$x_A(k) = T^{\otimes k}$: trains enter the single-track section from
A every T hours, i.e., at the same rate that they arrive
from A. If $\Delta < 0$, the sequence of $\Delta^{\otimes j}$ is now decreasing,
so that $y(k) = \Delta$. Hence $x_A(k) = T \otimes \tau^{\otimes 2(k-1)}$: trains
enter the single-track section from A every $2\tau$ hours,
and, since this is slower than the rate at which they arrive
from A, a queue develops.

Calculations such as this can be used to assess the
stability of timetables. Here, the parameter $\Delta$ is the
difference between the timetabled interval between trains
arriving at the single-track section and the time it takes
a token to return to the first signalman. A timetable that
sets T to be such that $\Delta$ has a small positive value is
vulnerable to unexpected delays in the single-track section.
Large-scale timetabling of rail networks has been
carried out using linear systems of max-plus equations.
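The two regimes can be reproduced by iterating the recurrences directly. The Python sketch below is an illustration with assumed numbers; the traversal time is called `tau`, the function name is hypothetical, and the entry time of the first train from B is taken from the same formula with k = 1, which is an assumption about the initial step.

```python
def entry_times(T, tau, n):
    """Entry times x_A(k), x_B(k) for k = 1..n from the max-plus recurrences."""
    x_a = [T]                               # x_A(1) = T
    x_b = [max(tau + x_a[0], T)]            # x_B(1), assumed to follow the same rule
    for k in range(2, n + 1):
        a = max(tau + x_b[-1], k * T)       # tau (x) x_B(k-1) (+) T^(x)k
        x_a.append(a)
        x_b.append(max(tau + a, k * T))     # tau (x) x_A(k) (+) T^(x)k
    return x_a, x_b

# Delta = T - 2*tau > 0: trains enter every T hours.
on_time, _ = entry_times(T=5, tau=2, n=6)
# Delta < 0 (with T > tau): trains enter every 2*tau hours and a queue builds.
queued, _ = entry_times(T=3, tau=2, n=6)
```

With T = 5 and tau = 2 the entry times from A are 5, 10, 15, and so on, matching x_A(k) = kT; with T = 3 and tau = 2 they are 3, 7, 11, and so on, matching x_A(k) = T + 2 tau (k - 1).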
5 Linear Algebra

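Matrices and vectors based on the semifield $(\overline{\mathbb{R}}, \oplus, \otimes)$
can be defined in the obvious way as
$A = (a_{ij}) \in \overline{\mathbb{R}}^{m \times n}$. In particular, when n = 1, this is an element of
$(\overline{\mathbb{R}}^m, \oplus, \otimes)$ that, although it is not quite a vector space
(the negative of a vector is not defined), is a semimodule
defined over $(\overline{\mathbb{R}}, \oplus, \otimes)$.

Matrices can be combined using a natural generalization
of the usual rules of matrix addition and
multiplication:

$$(A \oplus B)_{ij} = a_{ij} \oplus b_{ij},$$

with $A, B \in \overline{\mathbb{R}}^{m \times n}$, and

$$(A \otimes B)_{ij} = a_{i1} \otimes b_{1j} \oplus \cdots \oplus a_{il} \otimes b_{lj}
= \bigoplus_{k=1}^{l} a_{ik} \otimes b_{kj},$$

where $A \in \overline{\mathbb{R}}^{m \times l}$ and $B \in \overline{\mathbb{R}}^{l \times n}$.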

A diagonal matrix, say $D = \operatorname{diag}(d_1, \ldots, d_n)$, has all
its off-diagonal elements equal to $\varepsilon$. In particular, the
$n \times n$ identity matrix is

$$I = \operatorname{diag}(0, \ldots, 0).$$

The only max-plus matrices that are invertible are
the diagonal matrices and those nondiagonal matrices
that can be obtained by permutation of the rows
and columns of a diagonal matrix. The inverse of the
diagonal matrix D is $D^{\otimes -1} = \operatorname{diag}(-d_1, \ldots, -d_n)$.

The paucity of invertible max-plus matrices is due to 
the lack of subtraction, or additive inverses, as noted 
in section 1, but there are tools for solving systems of 
max-plus linear equations. Define conjugate operations 
on R u oo as 

a®' b = min{a, b}, 
a ®' b = a + b 

(with the convention that e ®' oo = oo and £ ® oo = e ), 
and extend these definitions to matrices and vectors as 
before. Consider the one-sided linear equation 

A® x = d, 

where x G R n is an unknown vector and d G R” 1 , A G 
i mx " are known. The product A®x is a max-plus linear 
combination of the columns of A, and so the existence 
of a solution corresponds to d being an element of the 
span of the columns of A. If a solution exists in R u oo, 
it is x = A~ ®' d, where A~ = -A T is the conjugate 
of A. Even if no solution exists, A~ ®' d is the greatest 
solution (with respect to the product order on R n ) of 
the system of inequalities A ® x ^ d. 
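
The candidate solution described above can be computed directly, since component $j$ of $A^{-} \otimes' d$ is $\min_i (d_i - a_{ij})$. The following is a minimal sketch in plain Python, not taken from the text: the function names, the use of `float("-inf")` for $\varepsilon$, and the small test matrices are all illustrative assumptions.

```python
EPS = float("-inf")  # the max-plus "zero" element, written epsilon in the text


def mp_matvec(A, x):
    """Max-plus product A (x) x: component i is max_j (a_ij + x_j)."""
    return [max(a_ij + x_j for a_ij, x_j in zip(row, x)) for row in A]


def principal_solution(A, d):
    """Candidate solution x = A^- (x)' d, i.e. x_j = min_i (d_i - a_ij).

    Assumes d is finite and every column of A has at least one finite entry.
    """
    rows, cols = len(A), len(A[0])
    return [min(d[i] - A[i][j] for i in range(rows)) for j in range(cols)]


# Hypothetical 2 x 2 example with finite entries.
A = [[0, 2],
     [1, 0]]

# Right-hand side in the max-plus column span of A: the candidate solves it.
x = principal_solution(A, [3, 2])
print(x, mp_matvec(A, x))

# Right-hand side outside the span: only A (x) y <= d can be achieved.
y = principal_solution(A, [3, 0])
print(y, mp_matvec(A, y))
```

Checking whether `mp_matvec(A, x)` reproduces `d` is then a simple test of whether the equation is solvable at all.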



Figure 1 A heaps of pieces example. The two pieces 
are shown separately in (a), and the heap following the 
sequence of pieces rbrbrb is shown in (b). 

6 An Example Application 

Imagine a set of resources $\mathcal{R} = \{1, \ldots, n\}$ and a set
of basic tasks $\mathcal{A} = \{\omega_1, \ldots, \omega_m\}$. For each task it
is known which resources will be required, in what
order, and for how long. This problem can be studied
using a heaps of pieces model, which resembles
the game Tetris. Figure 1 illustrates this for a simple
example based on two tasks $\omega_1 = r$ and $\omega_2 = b$ and
two resources labeled 1 and 2. Part (a) shows how the
pieces are employed to represent the use of resources 
on the basic tasks: for r, resource 1 is required for one 
time unit and resource 2 is not required at all; for b, 
the task takes three time units, initially resource 2 is 
required, then resources 1 and 2 work in parallel, and 
finally resource 2 completes the task. Complex schedul- 
ing of these tasks is represented by piling the pieces 
into heaps such that each piece touches but does not 
overlap the pieces below it (unlike in Tetris, no rotation 
or horizontal movement of pieces is allowed). Part (b) 
shows the heap created when the pieces are introduced 
alternately, starting with r. 

Each piece is characterized by a pair of functions
$l, u : \mathcal{R} \to \mathbb{R}$ that give, respectively, the lower and upper
contours of the piece, i.e., the earliest and latest times
that each resource is in use. Since $l$ and $u$ are functions
defined on a discrete set, $\mathcal{R}$, they will be treated as
vectors with dimension equal to the cardinality of $\mathcal{R}$. If a
piece does not use a given resource, the corresponding
components of $l$ and $u$ are set to $\varepsilon$, so in the example,
$l(r) = (0, \varepsilon)^T$, $u(r) = (1, \varepsilon)^T$, $l(b) = (1, 0)^T$, and
$u(b) = (2, 3)^T$. Associated with each piece there is a
subset $S \subset \mathcal{R}$ of resources that the corresponding task
requires; in this example, $S(r) = \{1\}$ and $S(b) = \{1, 2\}$.
Generally, for each piece, $\omega \in \mathcal{A}$, $u(\omega) \geq l(\omega)$, and
$l(\omega)$ is defined such that $\min_{k \in S(\omega)} l_k(\omega) = 0$.

These models can be used to study the efficiency of 
different scenarios, for example, introducing the tasks 
in various periodic patterns or randomly. The rate at 
which the maximum height of the heap grows indicates 
how long the total job takes as the number of repeti- 
tions of the tasks is increased. A comparison between 
the rates of growth of the maximum and minimum 
heights is a measure of how uniformly the work is 
distributed over the various resources available. 

A heap model can be expressed as a linear max-plus
system. Consider the upper contour of a heap containing
$p$ pieces, $x(p) : \mathcal{R} \to \mathbb{R}$. What is the new contour if a
new piece is added? Let the new piece be called
$\omega_{j_{p+1}} \in \mathcal{A}$. If this piece is placed at some height $h$, this can be
represented by adding $h$ to its upper and lower contour
functions or, using the max-plus notation, $h \otimes u(\omega_{j_{p+1}})$
and $h \otimes l(\omega_{j_{p+1}})$. The smallest possible value of $h$ such
that $h \otimes l(\omega_{j_{p+1}}) \geq x(p)$ must now be chosen. This is
achieved when $\max_{i \in S(\omega_{j_{p+1}})} \{x_i(p) - (h \otimes l_i(\omega_{j_{p+1}}))\} = 0$
or, in max-plus notation,

$$h = \bigoplus_{i \in S(\omega_{j_{p+1}})} l_i^{\otimes -1}(\omega_{j_{p+1}}) \otimes x_i(p) = l^{-}(\omega_{j_{p+1}}) \otimes x(p),$$

where $l^{-}(\omega_{j_{p+1}})$ is a row vector whose components are
$-l_i$ if $l_i$ is finite and $\varepsilon$ otherwise. The upper contour of
the heap is now given by

$$x(p + 1) = M(\omega_{j_{p+1}}) \otimes x(p), \tag{1}$$

where $M(\omega)$ is an $n \times n$ matrix

$$M(\omega) = I \oplus u(\omega) \otimes l^{-}(\omega).$$

This is a linear equation that relates successive upper
contours of the heap.

Returning to the example, use of this formula gives

$$M(r) = \begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 0 \end{pmatrix}, \qquad M(b) = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}.$$

The iteration process given by equation (1) should
begin with no pieces present, i.e., $x(0) = (0, \ldots, 0)^T$,
and requires a specification of the matrix at each step.
The sequence of matrices corresponds to the sequence
of pieces introduced. In figure 1, the example shows
a periodic sequence alternating pieces r and b. This
begins

$$x(1) = M(r) \otimes x(0) = \begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 0 \end{pmatrix} \otimes \begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix},$$

$$x(2) = M(b) \otimes x(1) = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} \otimes \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \end{pmatrix},$$

and so on.

The regularity of this sequence means that the matrix
describing every second iterate is given by the product
$M(br) = M(b) \otimes M(r)$:

$$M(br) = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix} \otimes \begin{pmatrix} 1 & \varepsilon \\ \varepsilon & 0 \end{pmatrix} = \begin{pmatrix} 2 & 2 \\ 3 & 3 \end{pmatrix}.$$

It is interesting to note that the vector $x(2)$ is an
eigenvector of $M(br)$:

$$M(br) \otimes \begin{pmatrix} 2 \\ 3 \end{pmatrix} = \begin{pmatrix} 5 \\ 6 \end{pmatrix} = 3 \otimes \begin{pmatrix} 2 \\ 3 \end{pmatrix},$$

where the eigenvalue is 3. This means that, if pieces
continue to be added alternately, the heap will grow in
height linearly at a rate of 3 units every cycle. More
generally, if the scheduling of tasks is periodic with
period $p$ and the sequence of tasks is given by $w \in \mathcal{A}^p$,
then the growth rate of the completion time might
be obtained by finding the eigenvalue of $M(w)$. This
leaves two questions: do iterates of the initial condition
converge to an eigenvector of $M(w)$, and does $M(w)$
generally possess an eigenvalue (and, if so, how is it
computed)?

7 Eigenvalues and Eigenvectors

In general, the eigenvalue problem for square matrices
is defined as one would expect. Given an $n \times n$ matrix
$A$, look for a $\lambda \in \mathbb{R}$ and a nontrivial vector $x \in \mathbb{R}^n$ such
that

$$A \otimes x = \lambda \otimes x.$$

A graphical interpretation of $A$ is helpful. Associate
with $A$ a weighted directed graph $G_A$ with $n$ vertices,
such that, for each $a_{ij} \neq \varepsilon$, there is an edge from
the vertex $j$ to the vertex $i$. Each edge is assigned
a weight given by the corresponding matrix element
$a_{ij}$. Conventionally, these are known as communication
graphs. The following theorem links the existence of a
(unique) eigenvalue of a matrix to the structure of the
corresponding communication graph.

Theorem 1. If $G_A$ is strongly connected, then $A$ has a
unique, nonzero (i.e., not equal to $\varepsilon$) eigenvalue equal
to the maximal average weight of circuits in $G_A$.

Figure 2 The communication graph of the matrix $M(br)$.

A circuit is a path in $G_A$ that leads from a vertex back
to itself. There are two obvious notions of the length 
of a circuit: the number of vertices or edges (generally 
known as the circuit length) and the sum of the weights 
on the edges (known as the circuit weight). The average 
weight of a circuit is found by dividing the latter by the 
former. 

Figure 2 shows the communication graph for the 
matrix $M(br)$ in the heaps of pieces example. There
are three circuits, two of which are loops containing 
one edge and one vertex each. The remaining circuit has 
two edges, and these edges connect both vertices of the 
graph. The maximum average weight of these circuits 
is 3, which is the eigenvalue found above by substitut- 
ing an eigenvector into the eigenvalue equation. 
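
As a check on the worked example, the piece matrices, the product $M(br)$, its eigenvector, and the maximal average circuit weight can all be computed in a few lines. This is a minimal sketch in plain Python, not the authors' code: the function names are illustrative, $\varepsilon$ is represented by `float("-inf")`, and the circuit mean is found by the simple (assumed, not from the text) route of taking the largest diagonal entry of the $k$th max-plus power divided by $k$, for $k = 1, \ldots, n$.

```python
EPS = float("-inf")  # max-plus zero element (epsilon)


def mp_matmul(A, B):
    """Max-plus matrix product: entry (i, j) is max_k (a_ik + b_kj)."""
    return [[max(A[i][k] + B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def mp_matvec(A, x):
    """Max-plus matrix-vector product."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


def piece_matrix(l, u):
    """M = I (+) u (x) l^-, where l^-_j = -l_j if l_j is finite, else EPS."""
    n = len(l)
    l_conj = [-lj if lj != EPS else EPS for lj in l]
    M = [[u[i] + l_conj[j] for j in range(n)] for i in range(n)]
    for i in range(n):
        M[i][i] = max(M[i][i], 0)  # the identity has 0 on its diagonal
    return M


def max_cycle_mean(A):
    """Largest average circuit weight, via diagonals of max-plus powers."""
    n = len(A)
    best, power = EPS, A
    for k in range(1, n + 1):
        best = max(best, max(power[i][i] for i in range(n)) / k)
        power = mp_matmul(power, A)
    return best


M_r = piece_matrix([0, EPS], [1, EPS])  # piece r from the example
M_b = piece_matrix([1, 0], [2, 3])      # piece b from the example
M_br = mp_matmul(M_b, M_r)

print(M_r, M_b, M_br)
print(mp_matvec(M_br, [2, 3]))  # compare with 3 (x) (2, 3)
print(max_cycle_mean(M_br))
```

For matrices of realistic size a dedicated algorithm such as Karp's would be the more usual choice; the power-based version is used here only because it is short.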

8 Cyclicity and Convergence 

An elementary circuit of the communication graph $G_A$
is a circuit that does not intersect itself. The cyclicity 
of a strongly connected graph is the greatest common 
divisor of the lengths of all its elementary circuits. If 
the graph consists of more than one strong component 
(i.e., more than one maximal strongly connected sub- 
graph), its cyclicity is the least common multiple of the 
cyclicities of its strong components. 

Cyclicity is a topological property as it depends only
on the circuits and their lengths. The matrix $A$ contains
more information than this since it associates a
weight with each edge of $G_A$. This information can be
captured by considering the critical graph associated
with $G_A$. The critical circuits of $G_A$ are those elementary
circuits with maximum mean weight (i.e., the total
weight divided by the length). The critical graph associated
with $G_A$ is the subgraph consisting of vertices and
directed edges found in the critical circuits. The cyclicity
of the matrix $A$, denoted $\sigma(A)$, is the cyclicity of the
critical graph associated with $G_A$.

As an example consider $M(br)$ and its corresponding
communication graph shown in figure 2. The graph
has two kinds of elementary circuit: the self-loops
(weights 2 and 3; length 1) and the cycle involving both
vertices (weight 5; length 2). There is one critical circuit:
the self-loop with weight 3. The critical graph therefore
consists of the vertex labeled “2” in figure 2 together
with the self-loop. The cyclicity of this, and therefore
of $M(br)$, is unity.

Integer max-plus powers of an n x n matrix, A, can be 
defined recursively as in the scalar case. The following 
theorem is about the asymptotics of max-plus powers 
of a matrix and therefore provides information about 
the dynamics if a max-plus square matrix is applied 
repeatedly to a general vector. 

Theorem 2. Let $G_A$ be strongly connected and let the
matrix $A$ have eigenvalue $\lambda$ and cyclicity $\sigma(A)$. Then
there exists a positive integer $K$ such that

$$A^{\otimes (k + \sigma(A))} = \lambda^{\otimes \sigma(A)} \otimes A^{\otimes k}$$

for all $k \geq K$.

The theorem guarantees that after sufficiently many
iterations the system in question converges to a regular
behavior dominated by the eigenvalue of the matrix. In
the example and in terms of the upper contour of the
heap, there is a $K$ such that

$$M((br)^{q+K}) \otimes x(0) = 3^{\otimes q} \otimes x(K)$$

for all positive integers $q$. The rate at which the maximum
height of the heap grows asymptotically is then

$$\lim_{q \to \infty} \frac{\|M((br)^{q+K}) \otimes x(0)\|_{\max}}{2q} = \frac{3}{2},$$

where $\|x\|_{\max}$ is the maximum component of $x$. The
same calculation using the minimum component shows
that it also grows at a rate $\frac{3}{2}$. Note that theorem 2
implies that these results are independent of the initial
condition $x(0)$. Together, theorems 1 and 2 provide the
means to calculate these measures of efficiency for any
periodic sequence of tasks.
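
The periodic regime and the growth rate of $\frac{3}{2}$ per piece can be observed by simply iterating equation (1) for the alternating sequence. The sketch below is illustrative only: the helper names are assumptions, the two matrices are the $M(r)$ and $M(b)$ of the example, and $\varepsilon$ is represented by `float("-inf")`.

```python
EPS = float("-inf")  # max-plus zero element (epsilon)

M_r = [[1, EPS], [EPS, 0]]  # M(r) from the example
M_b = [[1, 2], [2, 3]]      # M(b) from the example


def mp_matvec(A, x):
    """Max-plus matrix-vector product: component i is max_j (a_ij + x_j)."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


def run_heap(sequence, x0):
    """Return the contours x(0), x(1), ... for a string of piece names."""
    matrices = {"r": M_r, "b": M_b}
    contours = [list(x0)]
    for name in sequence:
        contours.append(mp_matvec(matrices[name], contours[-1]))
    return contours


xs = run_heap("rb" * 50, [0, 0])  # 100 pieces, alternating r and b

# After the first cycle, each further cycle (two pieces) adds 3 to both
# components of the contour.
print(xs[2], xs[4], xs[6])

# Growth of the maximum and minimum heights per piece.
print(max(xs[100]) / 100, min(xs[100]) / 100)
```

Starting from a different finite `x0` changes the early contours but not the long-run rates, in line with the remark about independence of the initial condition.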

9 Stochastic Models 

If the heap model is of a system responding to external
random influences, the iteration process given by
equation (1) will involve randomly chosen sequences of
matrices rather than the periodic sequences discussed.
What can now be said of the asymptotic behavior of
the heap? Rather than eigenvalues, the behavior of long
random sequences of matrices is determined by the
Lyapunov exponents of the system. Assume that the
matrices in the sequence are independent and identically
distributed and that they are regular with probability
one (each row contains at least one element that
is not $\varepsilon$). Then the Lyapunov exponent $\lambda_{\max}$ defined as

$$\lim_{q \to \infty} \frac{\|x(q)\|_{\max}}{q} = \lambda_{\max}$$

(and analogously for $\lambda_{\min}$) exists for almost all
sequences and is independent of the initial (finite) choice
of $x(0)$. This result shows that, in principle, it is possible
to calculate the efficiency measures discussed
above. The main difference is that, although the Lyapunov
exponents exist, there is no nice prescription for
their calculation. The development of efficient numerical
algorithms is an open problem that the interested
reader might, perhaps, wish to consider!
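
In the absence of a closed-form prescription, the simplest thing one can do is estimate the exponents by direct simulation. The sketch below is a crude Monte Carlo estimate for the two-piece example, assuming (hypothetically) that each new piece is r with probability `p_r` and b otherwise; the function name, the seed, and the sample size are illustrative choices, and no claim is made about the accuracy of the estimate.

```python
import random

EPS = float("-inf")  # max-plus zero element (epsilon)

M_r = [[1, EPS], [EPS, 0]]  # M(r) from the example
M_b = [[1, 2], [2, 3]]      # M(b) from the example


def mp_matvec(A, x):
    """Max-plus matrix-vector product: component i is max_j (a_ij + x_j)."""
    return [max(a + xj for a, xj in zip(row, x)) for row in A]


def estimate_growth(n_pieces, p_r=0.5, seed=0):
    """Estimate (lambda_max, lambda_min) from one long random sequence.

    Pieces are drawn independently: r with probability p_r, otherwise b.
    """
    rng = random.Random(seed)
    x = [0, 0]
    for _ in range(n_pieces):
        M = M_r if rng.random() < p_r else M_b
        x = mp_matvec(M, x)
    return max(x) / n_pieces, min(x) / n_pieces


lam_max, lam_min = estimate_growth(20000)
print(lam_max, lam_min)
```

Repeating the run with several seeds and increasing `n_pieces` gives a rough sense of how slowly such estimates settle down.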

VII.5 Evolving Social Networks,
Attitudes, and Beliefs— and 
Counterterrorism 

Peter Grindrod 


1 Introduction 

The central objects of interest here are 

(i) an evolving digital network of peer-to-peer commu- 
nication, 

(ii) the dynamics of information, ideas, and beliefs that 
can propagate through that digital network, and 

(iii) how such networks become important in matters 
of national security and defense. 

The applications of this theory spread far beyond these
topics though. For example, more than a quarter of all
marketing and advertising spending in the United Kingdom
and the United States is now spent online: so digital
media marketing (“buzz” marketing), though in its
infancy, requires a deeper understanding of the nature 
of social communication networks. This is an impor- 
tant area of complexity theory, since there is no closed 
theory (analogous to conservation laws for molecular 
dynamics or chemical reactions) available at the micro- 
scopic “unit” level. Instead, here we must consider irra- 
tional, inconsistent, and ever-changing people. More- 
over, while the passage of ideas is mediated by the 
networking behavior, the very existence of such ideas 
may cause communication to take place: systems can 
therefore be fully coupled. 

In observing peer-to-peer communication in mobile 
phone networks, messaging, email, and online chats, 
the size of communities is a substantial challenge. 
Equally, from a conceptual modeling perspective, it 
is clear that being able to simulate, anticipate, and 
infer behavior in real time, or on short timescales, 
may be critical in designing interventions or spotting 
sudden aberrations. This field therefore requires and 
has inspired new ideas in both applied mathematical 
models and methods. 

2 Evolving Networks in 
Continuous and Discrete Time 

Consider a population of N individuals (agents/actors) 
connected through a dynamically evolving undirected 
network representing pairwise voice calls or online 
chats. Let A(t) denote the N x N binary adjacency 
matrix for this network at time t, having a zero diago- 
nal. At future times, A(t) is a stochastic object defined 
by a probability distribution over the set of all possible 
adjacency matrices. Each edge within this network will 
be assumed to evolve independently over time, though 
it is conditionally dependent upon the current network 
(so any edges conditional on related current substructures
may well be highly correlated over time). Rather
than model a full probability distribution for future
network evolution, conditional on its current structure,
say $P_{\delta t}(A(t + \delta t) \mid A(t))$, it is enough to specify its
expected value $E(A(t + \delta t) \mid A(t))$ (a matrix containing
all edge probabilities, from which edges may be
generated independently). Their equivalence is trivial,
since

$$E(A(t + \delta t) \mid A(t)) = \sum_{B} B \, P_{\delta t}(B \mid A(t)),$$

and

$$P_{\delta t}(B \mid A(t)) = \prod_{i=1,\, j=i+1}^{N-1,\, N} \mathbb{K}_{ij}^{B_{ij}} (1 - \mathbb{K}_{ij})^{1 - B_{ij}},$$

where $\mathbb{K} = E(A(t + \delta t) \mid A(t))$. Hence we shall specify
our model for the stochastic network evolution via

$$E(A(t + \delta t) \mid A(t)) = A(t) + \delta t \, \mathcal{F}(A(t)), \tag{1}$$

valid as $\delta t \to 0$. Here the real matrix-valued function $\mathcal{F}$
is symmetric, it has a zero diagonal, and all elements
within the right-hand side will be in [0, 1]. We write

$$\mathcal{F}(A(t)) = -A(t) \circ \omega(A(t)) + (C - A(t)) \circ \alpha(A(t)).$$

Here $C$ denotes the adjacency matrix for the clique
where all $\frac{1}{2}N(N - 1)$ edges are present (all elements are
1s except for 0s on the diagonal), so $C - A(t)$ denotes
the adjacency matrix for the graph complement of $A(t)$;
$\omega(A(t))$ and $\alpha(A(t))$ are both real nonnegative symmetric
matrix functions containing conditional edge
death rates and conditional edge birth rates, respectively;
and $\circ$ denotes the Hadamard, or element-wise,
matrix product.

In many cases we can usefully consider a discrete-time
version of the above evolution. Let $\{A_k\}_{k=1}^{K}$ denote
an ordered sequence of adjacency matrices (binary,
symmetric with zero diagonals) representing a discrete-time
evolving network with value $A_k$ at time step $t_k$.
We shall then assume that edges evolve independently
from time step to time step with each new network
conditionally dependent on the previous one. A first-order
model is given by a Markov process

$$E(A_{k+1} \mid A_k) = A_k \circ (C - \tilde{\omega}(A_k)) + (C - A_k) \circ \tilde{\alpha}(A_k). \tag{2}$$

Here $\tilde{\omega}(A_k)$ is a real nonnegative symmetric matrix
function containing conditional death probabilities,
each in [0, 1], and $\tilde{\alpha}(A_k)$ is a real nonnegative symmetric
matrix function containing conditional edge birth
probabilities, each in [0, 1].

As before, the edge independence assumption implies
that $P(A_{k+1} \mid A_k)$ can be reconstructed from
$E(A_{k+1} \mid A_k)$.
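
Because edges are independent given the current network, one step of the Markov process (2) amounts to forming the matrix of edge probabilities and drawing each edge above the diagonal as a Bernoulli variable. The following is a minimal sketch in plain Python; the function names and the representation of the death and birth probabilities as full matrices `death` and `birth` are assumptions made for illustration.

```python
import random


def expected_next(A, death, birth):
    """Entrywise E(A_{k+1} | A_k) = A o (C - death) + (C - A) o birth.

    A is a binary symmetric adjacency matrix with zero diagonal; death and
    birth hold the conditional death and birth probabilities in [0, 1].
    """
    n = len(A)
    return [[0.0 if i == j
             else A[i][j] * (1 - death[i][j]) + (1 - A[i][j]) * birth[i][j]
             for j in range(n)] for i in range(n)]


def sample_next(A, death, birth, rng):
    """Draw A_{k+1}: each edge above the diagonal is sampled independently."""
    n = len(A)
    P = expected_next(A, death, birth)
    B = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            if rng.random() < P[i][j]:
                B[i][j] = B[j][i] = 1  # keep the network undirected
    return B


# Hypothetical example: 4 people, constant death and birth probabilities.
rng = random.Random(42)
A0 = [[0, 1, 0, 0],
      [1, 0, 1, 0],
      [0, 1, 0, 0],
      [0, 0, 0, 0]]
death = [[0.1] * 4 for _ in range(4)]
birth = [[0.05] * 4 for _ in range(4)]
A1 = sample_next(A0, death, birth, rng)
print(A1)
```

In a model such as triad closure, `birth` would be recomputed from the current network before every step rather than held constant.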

A generalization of Katz [IV.18 §3.4] centrality for
such discrete-time evolving networks can be obtained.
In particular, if $0 < \mu < 1/\max\{\rho(A_k)\}$, then the
communicability matrix

$$\mathcal{Q} = (I - \mu A_1)^{-1} (I - \mu A_2)^{-1} \cdots (I - \mu A_K)^{-1}$$

provides a weighted count of all possible dynamic
paths between all pairs of vertices. It is nonsymmetric
(due to time’s arrow) and its row sums represent the
abilities of the corresponding people to send messages
to others, while its column sums represent the abilities
of the corresponding people to receive messages
from others. Such performance measures are useful in
identifying influential people within evolving networks.
This idea has recently been extended so as to successively
discount the older networks in order to produce
better inferences.
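
The effect of time's arrow on this product of resolvents is easy to see on a tiny example. The sketch below is illustrative only: the three-person, two-step network, the value of the downweighting parameter `mu`, and all function names are assumptions, and the matrix inverse is a plain Gauss-Jordan routine written out so that the example needs no external library.

```python
def mat_mul(A, B):
    """Ordinary matrix product of two dense matrices given as lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def inverse(M):
    """Invert a small nonsingular matrix by Gauss-Jordan elimination."""
    n = len(M)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def communicability(adjacency_seq, mu):
    """Product of resolvents (I - mu*A_k)^{-1}, taken in time order."""
    n = len(adjacency_seq[0])
    Q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for A in adjacency_seq:
        resolvent = inverse([[(1.0 if i == j else 0.0) - mu * A[i][j]
                              for j in range(n)] for i in range(n)])
        Q = mat_mul(Q, resolvent)
    return Q


# Hypothetical data: persons 0 and 1 talk at step 1, persons 1 and 2 at step 2.
A1 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
A2 = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
Q = communicability([A1, A2], mu=0.5)  # both spectral radii are 1

broadcast = [sum(row) for row in Q]                      # row sums
receive = [sum(Q[i][j] for i in range(3)) for j in range(3)]  # column sums
print(Q[0][2], Q[2][0])
print(broadcast, receive)
```

A message can pass from person 0 to person 2 by following the edges in time order, but not from 2 to 0, which is why the matrix is nonsymmetric.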

3 Nonlinear Effects: Seen and Unseen 

In the sociology literature the simplest form of nonlinearity
occurs when people introduce their friends to
each other. So, in (2), if two nonadjacent people are
connected to a common friend at step $k$, then it is
more likely that those two people will be directly connected
at step $k + 1$. To model this triad closure dynamic
we may use $\tilde{\omega}(A_k) = \gamma C$, so all edges have the same
step-to-step death probability, $\gamma \in [0, 1]$, and

$$\tilde{\alpha}(A_k) = \delta C + \varepsilon A_k^2.$$

Here $\delta$ and $\varepsilon$ are positive and such that $\delta + \varepsilon(N - 2) < 1$.
The element $(A_k^2)_{ij}$ counts the number of mutual connections
that person $i$ and person $j$ have at step $k$. This
equation is ergodic and yet it is destined to spend most
of its time close to states where the density of edges
means that there is a balance between edge births and
deaths. A mean-field approach can be applied, approximating
$A_k$ with its expectation, which may be assumed
to be of the form $p_k C$ (an Erdős–Rényi random graph
[IV.18 §4.1] with edge density $p_k$). In the mean-field
dynamic one obtains

$$p_{k+1} = p_k(1 - \gamma) + (1 - p_k)(\delta + (N - 2)\varepsilon p_k^2). \tag{3}$$

If $\delta$ is small and $\gamma < \frac{1}{4}\varepsilon(N - 2)$, then this nonlinear
iteration has three fixed points: two stable ones,
at $\delta/\gamma + O(\delta^2)$ and $\frac{1}{2} + (\frac{1}{4} - \gamma/(\varepsilon(N - 2)))^{1/2} + O(\delta)$,
and one unstable one in the middle. Thus the extracted
mean-field behavior is bistable. In practice, one might
observe the edge density of such a network approaching
one or other stable mean-field equilibrium and jiggling
around it for a very long time, without any awareness
that another type of orbit or pseudostable edge
density could exist. Direct comparisons of transient
orbits from (2), incorporating triad closure, with their
mean-field approximations in (3) are very good over
short to medium timescales. Yet though we have captured
the nonlinear effects well in (3), the stochastic
nature of (2) must eventually cause orbits to diverge
from the deterministic stability seen in (3).
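
The bistability of the mean-field iteration can be seen by running it from two different starting densities. The sketch below is illustrative: it takes the birth term to be quadratic in the edge density, as implied by the $A_k^2$ term in the triad closure model, and the parameter values ($N = 100$, $\gamma = 0.05$, $\delta = 0.0004$, $\varepsilon = 0.005$) are hypothetical choices that satisfy the stated conditions.

```python
import math


def mean_field_step(p, N, gamma, delta, eps):
    """One step of the mean-field edge density iteration."""
    return p * (1 - gamma) + (1 - p) * (delta + (N - 2) * eps * p * p)


def iterate(p0, steps, **params):
    """Apply mean_field_step repeatedly, starting from density p0."""
    p = p0
    for _ in range(steps):
        p = mean_field_step(p, **params)
    return p


# Hypothetical parameters: delta + eps*(N-2) < 1 and gamma < eps*(N-2)/4.
params = dict(N=100, gamma=0.05, delta=0.0004, eps=0.005)

low = iterate(0.05, 5000, **params)   # starts below the unstable fixed point
high = iterate(0.30, 5000, **params)  # starts above it

# Leading-order predictions for the two stable fixed points.
predicted_low = params["delta"] / params["gamma"]
predicted_high = 0.5 + math.sqrt(
    0.25 - params["gamma"] / (params["eps"] * (params["N"] - 2)))

print(low, predicted_low)
print(high, predicted_high)
```

Which equilibrium is reached depends only on which side of the unstable fixed point the initial density lies.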

The phenomenon seen here explains why the events of
a new undergraduate’s first week at university are so
important in forming high-density connected social
networks among student year groups. If we do not perturb
them with a mix of opportunities to meet, they
may be condemned to remain close to the low-density
(few-friends) state for a very long time.

4 Fully Coupled Systems 

There is a large literature within psychology that is 
based on individuals’ attitudes and behaviors being in 
a tensioned equilibrium between excitatory (activating)
processes and inhibiting processes. Typically, the state 
of an individual is represented by a set of state vari- 
ables, some measuring activating elements and some 
measuring the inhibiting elements. 

Activator-inhibitor systems have had an impact 
within mathematical models where a uniformity equi- 
librium across a population of individual systems 
becomes destabilized by the very act of simple “pas- 
sive” coupling between them. Such Turing instabilities 
can sometimes seem counterintuitive. 

Homophily is a term that describes how associations 
are more likely to occur between people who have 
similar attitudes and views. Here we show how indi- 
viduals’ activator-inhibitor dynamics coupled through 
a homophilic evolving network produce systems that 
have pseudoperiodic consensus and fractionation. 

Consider a population of $N$ identical individuals,
each described by a set of $m$ state variables that are
continuous functions of time $t$. Let $x_i(t) \in \mathbb{R}^m$ denote
the $i$th individual’s attitudinal state. Let $A(t)$ denote
the adjacency matrix for the communication network,
as it does in (1). Then consider

$$\dot{x}_i = f(x_i) + D \sum_{j=1}^{N} A_{ij}(x_j - x_i), \quad i = 1, \ldots, N. \tag{4}$$

Here $f$ is a given smooth field over $\mathbb{R}^m$, drawn from a
class of activator-inhibitor systems, and is such that
$f(x^*) = 0$ for some $x^*$, and the Jacobian there,
$\mathrm{d}f(x^*)$, is a stability matrix (that is, all its eigenvalues
have negative real parts). $D$ is a real diagonal nonnegative
matrix containing the maximal transmission coefficients
(diffusion rates) for the corresponding attitudinal
variables between adjacent neighbors. Let $X(t)$
denote the $m \times N$ matrix with $i$th column given by $x_i(t)$,
and let $F(X)$ be the $m \times N$ matrix with $i$th column given
by $f(x_i(t))$. Then (4) may be written as

$$\dot{X} = F(X) - D X \Delta. \tag{5}$$

Here $\Delta(t)$ denotes the graph Laplacian for $A(t)$, given
by $\Delta(t) = \Gamma(t) - A(t)$, where $\Gamma(t)$ is the diagonal matrix
containing the degrees of the vertices. This system has
an equilibrium at $X = X^*$, say, where the $i$th column of
$X^*$ is given by $x^*$ for all $i = 1, \ldots, N$.

Now consider an evolution equation for $A(t)$, in the
form of (1), coupled to the states $X$:

$$E(A(t + \delta t) \mid A(t)) = A(t) + \delta t \, \bigl(-A(t) \circ (C - \Phi(X(t)))\gamma + (C - A(t)) \circ \Phi(X(t))\delta\bigr). \tag{6}$$

Here $\delta$ and $\gamma$ are positive constants representing the
maximum birth rate and the maximum death rate,
respectively; and the homophily effects are governed by
the pairwise similarity matrix, $\Phi(X(t))$, such that each
term $\Phi(X(t))_{ij} \in [0, 1]$ is a monotonically decreasing
function of a suitable seminorm $\|x_j(t) - x_i(t)\|$. We
shall assume that $\Phi(X(t))_{ij} \sim 1$ for $\|x_j(t) - x_i(t)\| < \varepsilon$,
and $\Phi(X(t))_{ij} = 0$ otherwise, for some suitably chosen
$\varepsilon > 0$.

There are equilibria at $X = X^*$ with either $A = 0$ or
$A = C$ (the full clique). To understand their stability,
let us assume that $\delta$ and $\gamma \to 0$. Then $A(t)$ evolves very
slowly via (6). Let $0 = \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_N$ be the
eigenvalues of $\Delta$. Then it can be shown that $X^*$ is
asymptotically stable only if all $N$ matrices, $\mathrm{d}f(x^*) - D\lambda_i$, are
simultaneously stability matrices; and conversely, it is
unstable in the $i$th mode of $\Delta$ if $\mathrm{d}f(x^*) - D\lambda_i$ has an
eigenvalue with positive real part.

Now one can see the possible tension between 
homophily and the attitude dynamics. 

Consider the spectrum of $\mathrm{d}f(x^*) - D\lambda$ as a function
of $\lambda$. If $\lambda$ is small then this is dominated by the stability
of the uncoupled system, $\mathrm{d}f(x^*)$. If $\lambda$ is large, then this
is again a stability matrix, since $D$ is positive-definite.
The situation, dependent on some collusion between
choices of $D$ and $\mathrm{d}f(x^*)$, where there is a window of
instability for an intermediate range of $\lambda$, is known as
a Turing instability. Note that, as $A(t) \to C$, we have
$\lambda_i \to N$, for $i > 1$. So if $N$ lies within the window of
instability, we are assured that the systems can never reach a
stable consensual fully connected equilibrium. Instead,
Turing instabilities can drive the breakup (weakening)
of the network into relatively well-connected subnetworks.
These in turn may restabilize the equilibrium
dynamics (as the eigenvalues leave the window of
instability), and then the whole process can begin again as
homophily causes any absent edges to reappear. Thus
we expect a pseudocyclic emergence and diminution of
patterns, representing transient variations in attitudes.
In simulations, by projecting the network $A(t)$ onto two
dimensions using the Frobenius matrix inner product,
one may observe directly the cyclic nature of consensus
and division.
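
The window of instability can be located numerically for any given Jacobian and diffusion matrix by scanning the Laplacian eigenvalue and recording where the largest real part of the spectrum becomes positive. The sketch below does this for the case $m = 2$, using the closed-form eigenvalues of a $2 \times 2$ matrix. The Jacobian `J` and diffusion rates `D` are hypothetical values chosen only so that a window exists; they are not taken from the text.

```python
import cmath


def max_real_eig_2x2(M):
    """Largest real part among the eigenvalues of a 2 x 2 matrix."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return max(((tr + disc) / 2).real, ((tr - disc) / 2).real)


def growth_rate(J, D, lam):
    """Largest real part of the spectrum of J - lam * diag(D)."""
    M = [[J[0][0] - lam * D[0], J[0][1]],
         [J[1][0], J[1][1] - lam * D[1]]]
    return max_real_eig_2x2(M)


def instability_window(J, D, lam_max=2.0, step=0.001):
    """Scan lam on a grid; return the first and last unstable grid values."""
    unstable = [k * step for k in range(int(lam_max / step) + 1)
                if growth_rate(J, D, k * step) > 0]
    return (unstable[0], unstable[-1]) if unstable else None


# Hypothetical activator-inhibitor Jacobian (stable on its own) and
# diffusion rates with the inhibitor spreading much faster.
J = [[1.0, -1.0],
     [2.0, -1.5]]
D = (1.0, 20.0)

print(growth_rate(J, D, 0.0))   # uncoupled system
print(growth_rate(J, D, 0.5))   # inside the window
print(growth_rate(J, D, 5.0))   # strong coupling
print(instability_window(J, D))
```

In the network setting the relevant values of `lam` are the Laplacian eigenvalues, which approach $N$ as the network approaches the full clique, so rescaling `D` moves the window relative to $N$.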

Even if the stochastic dynamics in (6) are replaced by 
deterministic dynamics for a weighted communication 
adjacency matrix, one obtains a system that exhibits 
aperiodic, wandering, and also sensitive dependence. In 
such cases the orbits are chaotic: we know that they will 
oscillate, but we cannot predict whether any specific 
individuals will become relatively inhibited or relatively 
activated within future cycles. This phenomenon even 
occurs when N = 2. 

These models show that, when individuals, who are 
each in a dynamic equilibrium between their acti- 
vational and inhibitory tendencies, are coupled in a 
homophilic way, we should expect a relative lack of 
global social convergence to be the norm. Radical and 
conservative behaviors can coexist across a population 
and are in a constant state of flux. While the macro- 
scopic situation is predictable, the journeys for indi- 
viduals are not, within both deterministic and stochas- 
tic versions of the model. There are some commen- 
tators in socioeconomic fields who assert that diver- 
gent attitudes, beliefs, and social norms require lead- 
ers and are imposed on populations; or else they are 
driven by partial experiences and events. But here we 
can see that the transient existence of locally clustered 
subgroups, holding diverse views, can be an emergent 
behavior within fully coupled systems. This can be the 
normal state of affairs within societies, even without 
externalities and forcing terms. 

Sociology studies have in the past focused on rather 
small groups of subjects under experimental condi- 
tions. Digital platforms and modern applied mathe- 
matics will transform this situation: computation and 
social science can use vast data sets from very large 
numbers of users of online platforms (Twitter, Face- 
book, blogs, group discussions, multiplayer online 
games) to analyze how norms, opinions, emotions, 
and collective action emerge and spread through local 
interactions. 

5 Networks on Security and Defense 

“It takes a network to defeat a network” is the man- 
tra expressed by the most senior U.S. command in 
Afghanistan and Iraq. This might equally be said of the 
threats posed by terrorists, or in post-conflict peace- 
keeping (theaters of asymmetric warfare), and even by 
the recent summer riots and looting within U.K. cities. 
But what type of networks must be defeated, and what 
type of network thinking will be required? 


So far we have discussed peer-to-peer networks in 
general terms. But we are faced with some specific 
challenges that stress the importance of social and 
communications networks in enabling terrorist threats: 

• the analysis of very large communications net- 
works, in real time; 

• the identification of influential individuals; 

• inferring how such networks should evolve in the 
future (and thus spotting aberrant behavior); and 

• recognizing that fully coupled systems may natu- 
rally lead to diverse views, and pattern formation. 

All of these things become ever more essential. Popula- 
tion-wide data from digital platforms requires efficient 
and effective applicable mathematics. 

Modern adversaries may be most likely to be 

• organized through an actor network of transient 
affiliations appropriate for (i) time-limited oppor- 
tunities and trophy or inspired goals; (ii) procure- 
ment, intelligence, reconnaissance and planning; 
and (iii) empowering individuals and encouraging 
both innovation and replication through competi- 
tion; 

• employing an operational digital communication 
network that enables and empowers action while 
maximizing agility (self-adaptation and reducing 
the time to act) through the flow of information, 
ideas, and innovations; and 

• reliant upon a third-party dissemination network 
within the public and media space (social media, 
broadcast media, and so forth) so as to maximize 
the impact of their actions. 

There are thus at least three independent networks 
operating on the side of those who would threaten 
security. 

Here we have set out a framework for analyzing the 
form and dynamics of large evolving peer-to-peer com- 
munications networks. It seems likely that the chal- 
lenge of modeling their behavior may lead us to develop 
new models and methods in the future. 

VII.6 Chip Design

Stephan Held, Stefan Hougardy, and 
Jens Vygen 


1 Introduction 

An integrated circuit or chip contains a collection of 
electronic circuits — composed of transistors — that are 
connected by wires to fulfill some desired functional- 
ity. The first integrated circuit was built in 1958 by 
Jack Kilby. It contained a single transistor. As predicted 
by Gordon Moore in 1965, the number of transistors 
per chip doubles roughly every two years. The pro- 
cess of creating chips soon became known as very- 
large-scale integration (VLSI). In 2014 the most complex 
chips contain billions of transistors on a few square 
centimeters. 

In this article we concentrate on the design of dig- 
ital logic chips. Analog integrated circuits have many 
fewer transistors and more complex design rules and
are therefore still largely designed manually. In a mem- 
ory chip, the transistors are packed in a very regular 
structure, which makes their design rather easy. In con- 
trast, the design of VLSI digital logic chips is impossible 
without advanced mathematics. 

New technological challenges, exponentially increas- 
ing transistor counts, and shifting objectives like 
decreased power consumption or increased yield con- 
stantly create new and challenging mathematical prob- 
lems. This has made chip design one of the most inter- 
esting application areas for mathematics during the 
last forty years, and we expect this to continue to be 
the case for at least the next two decades, although 
technology scaling might slow down at some point. 


1.1 Hierarchical Chip Design 

Due to its enormous complexity, the design of VLSI 
chips is usually done hierarchically. A hierarchical 
design makes it possible to distribute the design task 
to different teams. Moreover, it can reduce the over- 
all effort, and it makes the design process more pre- 
dictable and more manageable. 

For hierarchical design, a chip is subdivided into log- 
ical units, each of which may be subdivided into sev- 
eral levels of smaller units. An obvious advantage of 
hierarchical design is that components that are used 
multiple times need to be designed only once. In par- 
ticular, almost all chips are designed based on a library 
of so-called books, predesigned integrated circuits that 
realize simple logical functions such as AND or NOT or 
a simple memory element. A chip often contains many 
instances of the same book; these instances are often 
called circuits. 

The books are composed of relatively few transistors 
and are predesigned at an early stage. For their design 
one needs to work at the transistor level and hence fol- 
low more complicated rules. Once a book (or any hier- 
archical unit) is designed, the properties it has that are 
relevant for the design of the next higher level (e.g., 
minimum-distance constraints, timing behavior, power 
consumption) are computed and stored. Most books 
are designed so that they have a rectangular shape and 
the same height, making it easier to place them in rows 
or columns. 

1.2 The Chip-Design Process 

The first step in chip design is the specification of 
the desired functionality and the technology that will
be used. In logic design, this functionality is made 
precise using some hardware description language. 
This hardware description is converted into a netlist 
that specifies which circuits have to be used and how 
they have to be connected to achieve the required 
functionality. 

The physical-design step takes this netlist as input 
and outputs the physical location of each circuit and 
each wire on the chip. It will also change the netlist 
(in a logically equivalent way) in order to meet timing 
constraints. 

Before fabricating the chip (or fixing a hierarchical 
unit for later use on the next level up), one conducts 
physical verification to confirm that the physical lay- 
out meets all constraints and implements the desired 
functionality, and timing analysis checks that all signals
arrive in time. Further testing will be done with
the hardware once a chip is manufactured. 

From a mathematical point of view, physical design 
is the most interesting part of chip design as it requires 
the solution of several different challenging mathemat- 
ical problems. We will therefore describe this in more 
detail below. 

1.3 Physical Design 

Two inputs of the physical-design stage are a netlist 
and a chip area. The netlist contains a set of circuits. 
Each circuit is an instance of a book and has some pins 
that must be connected to some other pins. Moreover, 
the netlist includes pins on the chip area that are called 
input/output-ports and connect the chip to the outside. 
The set of all pins is partitioned into nets. All pins that 
belong to the same net have to be connected to each 
other by wires. 

The task of the physical-design step is to assign a 
location to each circuit on the chip area (placement) and
to specify locations for all the wires that are needed to
realize the netlist (routing). Placement and routing are
also called layout (see figure 1). 

A layout has to satisfy many constraints. For exam- 
ple, design rules specify the minimum width of a wire, 
the minimum distance between two different wires, or 
legal positions for the circuits. 

Moreover, a chip works correctly (at the desired clock 
frequency) only if all signals arrive in time (neither too 
early nor too late). This is described by timing con- 
straints. It is usually impossible to meet all timing con- 
straints without changing the netlist. This is called tim- 
ing optimization. Of course, any changes must ensure 
that the netlist remains logically equivalent. 

Due to the complexity of the physical-design prob- 
lem, placement, routing, and timing optimization are 
treated largely as independent subproblems, but they 
are of course not independent. Placement must ensure 
that a feasible routing can be found and that timing 
constraints can be met. Changes in timing optimization 
must be reflected by placement and routing. Finally, 
routing must also consider timing constraints. 

We describe some of the mathematical aspects of 
these three subproblems in what follows. 

2 Placement 

All circuits must be placed within the chip area so
that no two circuits overlap. This is called a feasible
placement. In addition we want to minimize a given
objective function. Normally, the chip area and all
circuits have a rectangular shape. Thus, finding a feasible
placement is equivalent to placing a set of small
rectangles disjointly within some larger rectangle. This is
known as the rectangle packing problem.

Figure 1 (a) Placement and (b) routing of a chip with
4 496 492 circuits and 5 091 819 nets, with 762 meters of
wires. Large rectangles are predesigned units, e.g., memory
arrays or microprocessors. A Steiner tree connecting the
five pins of one net is highlighted in (b).

No efficient algorithm is known that is guaranteed 
to solve the rectangle packing problem for all possi- 
ble instances. However, finding an arbitrary feasible 
placement is usually easy in practice. 
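
Checking whether a given placement is feasible is straightforward. The following sketch is illustrative only: the function name `is_feasible_placement` and the `(x, y, width, height)` encoding of a rectangle are assumptions made here, and the pairwise test is quadratic in the number of circuits, so it would be far too slow for instances with millions of circuits.

```python
def is_feasible_placement(chip_width, chip_height, rects):
    """Return True if all rectangles lie inside the chip area and no two overlap.

    rects is a list of (x, y, width, height) tuples, where (x, y) is the
    lower-left corner. Rectangles that merely touch along an edge are
    treated as disjoint.
    """
    # Every rectangle must lie within the chip area.
    for x, y, w, h in rects:
        if x < 0 or y < 0 or x + w > chip_width or y + h > chip_height:
            return False
    # No two rectangles may share interior points.
    for i in range(len(rects)):
        x1, y1, w1, h1 = rects[i]
        for j in range(i + 1, len(rects)):
            x2, y2, w2, h2 = rects[j]
            overlap_x = x1 < x2 + w2 and x2 < x1 + w1
            overlap_y = y1 < y2 + h2 and y2 < y1 + h1
            if overlap_x and overlap_y:
                return False
    return True
```

Note that this only verifies a placement; it does not construct one.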

2.1 Netlength Minimization

The location of the circuits on the chip area is primarily 
responsible for the total wire length that is needed for 
wiring all the nets in the netlist. If the total wire length 
is too long, a chip will not be routable. Moreover, the 
length of the wires greatly impacts the signal delays 
and the power consumption of a chip. Thus, a reason- 
able objective function of the placement problem is to 
minimize the total wire length. 

As this cannot be computed efficiently, estimates are 
used. Most notably, the bounding box length of a net 
is obtained by taking half the perimeter of a smallest 
axis-parallel rectangle that contains all its pins. A com- 
monly used quadratic netlength estimate is obtained by 
summing up the squared Euclidean distances between 
each pair of pins in the net and dividing this value by 
one less than the number of pins. 
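
Both estimates are easy to compute for a single net. The sketch below assumes that a net is given simply as a list of (x, y) pin coordinates; the function names are chosen here for illustration.

```python
from itertools import combinations


def bounding_box_netlength(pins):
    """Half the perimeter of the smallest axis-parallel rectangle containing all pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def quadratic_netlength(pins):
    """Sum of squared Euclidean distances over all pin pairs, divided by (number of pins - 1)."""
    if len(pins) < 2:
        return 0.0
    total = sum((x1 - x2) ** 2 + (y1 - y2) ** 2
                for (x1, y1), (x2, y2) in combinations(pins, 2))
    return total / (len(pins) - 1)


# A three-pin net: bounding box 4 + 3 = 7, quadratic (16 + 25 + 9) / 2 = 25.
example_net = [(0, 0), (4, 0), (4, 3)]
```

The total netlength of a placement is then the sum of the chosen estimate over all nets.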

No efficient algorithms are known for finding a placement
that minimizes the total netlength with respect
to any such estimate, even if we ask only for a solution
that is worse than an optimum solution by an arbitrarily
large constant factor. Under additional assumptions,
such as that all circuits must have exactly the
same size, one can find a placement in polynomial
time whose total netlength is at most a factor O(log n)
worse than that of an optimum placement, where n
denotes the number of circuits.

2.2 Placement in Practice 

As mentioned above, finding an arbitrary feasible place- 
ment is usually easy. Moreover, one can define local 
changes to a feasible placement that result in another
feasible placement. General local search-based heuris- 
tics (such as simulated annealing) can therefore be 
applied. However, such methods are prohibitively slow 
for today’s instances, with several million circuits. 

Another paradigm, motivated by some theoretical 
work, is called min-cut. Here, the netlist is partitioned 
into two parts, each with roughly half of the circuits, 
such that as few nets as possible cross the cut. The 
two parts will be placed on the left and right parts of 
the chip area and then partitioned further recursively. 
Unfortunately, the bipartitioning problem cannot be 
solved easily, and the overall paradigm lacks stability 
properties and the results are inferior. 

A third paradigm, analytical placement, is the one 
that is predominantly used in practice today. It begins 
by ignoring the constraint that circuits must not overlap;
minimizing netlength (bounding box or quadratic)
is then relatively easy. For several reasons (it is faster
to solve; it is more stable; it gives better spreading), 
quadratic netlength is minimized in practice. This is 
equivalent to solving a system of linear equations with 
a sparse positive-definite matrix. 
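
To see why a linear system arises, consider one coordinate only and connections between two pins each. Setting the gradient of the sum of w (x_i - x_j)^2 to zero gives A x = b, where A is a weighted graph Laplacian restricted to the movable circuits and b collects the contributions of the fixed pins. The following sketch is a toy under these assumptions (one dimension, two-pin connections, hypothetical function names, pure-Python conjugate gradients); production placers use far more elaborate net models and solvers.

```python
def build_system(num_movable, movable_edges, fixed_edges):
    """Assemble A x = b for minimizing the sum of w * (x_i - x_j)^2.

    movable_edges: (i, j, w) connects movable circuits i and j.
    fixed_edges:   (i, p, w) connects movable circuit i to a fixed pin at position p.
    A is stored sparsely as one {column: value} dict per row.
    """
    A = [dict() for _ in range(num_movable)]
    b = [0.0] * num_movable
    for i, j, w in movable_edges:
        A[i][i] = A[i].get(i, 0.0) + w
        A[j][j] = A[j].get(j, 0.0) + w
        A[i][j] = A[i].get(j, 0.0) - w
        A[j][i] = A[j].get(i, 0.0) - w
    for i, p, w in fixed_edges:
        A[i][i] = A[i].get(i, 0.0) + w
        b[i] += w * p
    return A, b


def matvec(A, x):
    """Sparse matrix-vector product."""
    return [sum(v * x[j] for j, v in row.items()) for row in A]


def conjugate_gradient(A, b, tol=1e-12, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A."""
    x = [0.0] * len(b)
    r = b[:]                      # residual b - A x for the start vector x = 0
    p = r[:]
    rr = sum(v * v for v in r)
    for _ in range(max_iter):
        if rr <= tol:
            break
        Ap = matvec(A, p)
        alpha = rr / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = sum(v * v for v in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


# Three movable circuits in a chain between fixed pins at positions 0 and 8.
A, b = build_system(3, [(0, 1, 1.0), (1, 2, 1.0)], [(0, 0.0, 1.0), (2, 8.0, 1.0)])
positions = conjugate_gradient(A, b)
```

In this example the circuits end up evenly spaced between the two fixed pins. Without fixed pins the matrix would be singular, since all circuits could collapse onto one point at zero cost.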

The placement that minimizes quadratic netlength 
typically has many overlapping circuits. Two strategies 
for working toward a feasible placement exist: either 
the objective function is modified in order to pull cir- 
cuits away from overloaded regions or a geometric par- 
titioning is done. For geometric partitioning one can 
assign the circuits efficiently to four quadrants (or more 
than four regions) such that no region contains more 
circuits than fit into it and such that the total (linear 
or quadratic) movement is minimized. The assignment
to the regions can then be translated into a modified 
quadratic optimization problem. 

Both strategies (as well as min-cut placement) are 
iterated until the placement is close to legal. This 
roughly means until there exists a legal placement in 
which all circuits are placed nearby. This ends the 
global-placement phase. 

After global placement (whether analytic or min-cut), 
the solution must be legalized. Here, given an illegal 
placement as input, we ask for a legal placement that 
differs from the input as little as possible. The common 
measure is the sum of the squared distances. Unfortu- 
nately, only special cases of this problem can be solved 
optimally in polynomial time, even when all circuits 
have the same height and are to be arranged in rows. 

3 Routing 

In routing, we must connect the set of each net’s pins 
by wires. The positions of the pins are determined by 
the placement. Wires can run on different wiring planes 
(sometimes more than ten), which are separated by 
insulating material. Wires of adjacent planes can be 
connected by so-called vias. In almost all current tech- 
nologies, all wire segments run horizontally or verti- 
cally. For efficient packing, every plane is used predom- 
inantly in one direction; horizontal and vertical planes 
alternate. 

Wires can have different widths, complicated spacing 
requirements, and other rules to obey. Although impor- 
tant, such rules do not change the overall nature of the 
problem. 

Before all nets are routed, some areas are already 
used by power supply or clock grids. These too must 
be designed, but this task is still largely a manual one. 

3.1 Steiner Trees

The minimal connections for a net can be modeled as 
Steiner trees. A Steiner tree for a given set of terminals 
(pins) is a minimal connected graph containing these 
terminals and possibly other vertices (see figure 1). If 
wiring is restricted to predefined routing tracks, the 
space available for routing a single net can be mod- 
eled as an undirected graph. Finding a shortest Steiner 
tree for a given set of terminals in a graph is NP-hard, 
and the same holds even for shortest rectilinear Steiner 
trees in the plane. Moreover, shortest is not always best 
when it comes to meeting timing constraints, and the 
routing graph is huge (it can have more than $10^{11}$
vertices). Therefore, routing algorithms mostly use fast
variants of Dijkstra's algorithm in order to find a shortest
path between two components and then compose
the Steiner trees of such paths. If done carefully, this is
at most a factor $2(t-1)/t$ worse than optimal, where
$t$ is the number of terminals (pins in the net).
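
One simple way to compose a Steiner tree from shortest paths is to grow a single component: start with one terminal and repeatedly attach the nearest unconnected terminal by a shortest path found with Dijkstra's algorithm. The sketch below is a minimal version of this idea, not the algorithm of any particular routing tool; the adjacency-dict graph format and the function name are assumptions made for illustration.

```python
import heapq


def steiner_tree_by_paths(graph, terminals):
    """Grow a tree from terminals[0], attaching the nearest remaining terminal each round.

    graph: {u: {v: length}} describing an undirected graph with nonnegative lengths.
    Returns (set of edges as frozensets, total length).
    """
    tree_vertices = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals[1:])
    total = 0
    while remaining:
        # Dijkstra's algorithm started from all tree vertices at once.
        dist = {v: 0 for v in tree_vertices}
        pred = {}
        heap = [(0, v) for v in tree_vertices]
        heapq.heapify(heap)
        target = None
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue          # outdated heap entry
            if u in remaining:
                target = u        # nearest unconnected terminal
                break
            for v, length in graph[u].items():
                nd = d + length
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    pred[v] = u
                    heapq.heappush(heap, (nd, v))
        if target is None:
            raise ValueError("terminals are not connected")
        # Add the path from the target back to the current tree.
        v = target
        while v not in tree_vertices:
            u = pred[v]
            tree_edges.add(frozenset((u, v)))
            total += graph[u][v]
            tree_vertices.add(v)
            v = u
        remaining.discard(target)
    return tree_edges, total


# Toy graph: a hub "c" with three spokes of length 1 and two longer direct edges.
toy_graph = {
    "a": {"c": 1, "b": 3},
    "b": {"c": 1, "a": 3, "d": 3},
    "c": {"a": 1, "b": 1, "d": 1},
    "d": {"c": 1, "b": 3},
}
toy_edges, toy_length = steiner_tree_by_paths(toy_graph, ["a", "b", "d"])
```

On this toy graph the heuristic uses the hub "c" as a Steiner vertex and finds the optimal tree; in general it only guarantees the approximation factor stated above.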

3.2 Packing Steiner Trees 

Since finding just one shortest Steiner tree is hard, it is 
not surprising that finding vertex-disjoint Steiner trees 
in a given graph is even harder. In fact, it is NP-hard 
even if every net has only two pins and the graph is a 
planar grid. Nevertheless, it would be possible to solve 
such problems if the instances were not too large. 

Current detailed routing algorithms route the nets 
essentially sequentially, revising earlier decisions as 
necessary (rip-up and reroute). To speed up the sequential
routing approach and to improve the quality of
results, a global routing step is performed at the begin- 
ning. Here, the routing space is modeled by a coarser 
graph, whose vertices normally correspond to rectan- 
gular areas (induced by a grid) on a certain plane. Two 
vertices are connected if they correspond to the same 
area on adjacent planes or to horizontally or vertically 
(depending on the routing direction of the plane) adja- 
cent areas on the same plane. Edges have capacities, 
depending on how many wires we can pack between 
the corresponding areas. 

Global routing then asks us to find a Steiner tree for 
each net such that the number (more generally, the 
total width) of Steiner trees using an edge does not 
exceed its capacity. This problem is still NP-hard, and 
the global routing graphs can still be large (they often 
have more than $10^7$ vertices). Nevertheless, global routing
can be solved quite well in both theory and practice;
the best approach with a theoretical guarantee is
based on first approximately solving a fractional relaxation
(called min-max resource sharing, a generalization
of multicommodity flows), then applying randomized
rounding to obtain an integral solution, and finally
correcting local violations (induced by rounding).

Global routing is also done at earlier stages of the 
design flow, e.g., during placement, in order to esti- 
mate routability and exhibit areas with possible routing 
congestion. 

4 Timing Optimization 

A chip performs its computations in cycles. In each 
cycle electrical signals start from registers or chip 
inputs, traverse some circuits and nets, and finally 
enter registers or chip outputs. 

Timing optimization has to ensure that all signals 
arrive within a given cycle time. Under this constraint, 
the power consumption shall be minimized. However, 
achieving the cycle time is a difficult problem on its 
own. 

4.1 Logic Synthesis 

The structure of a Boolean circuit has a big impact on 
the performance and power consumption of a chip. On 
the one hand, the depth, i.e., the maximum number of 
logic circuits on a combinatorial path, should be small 
so that the cycle time is met. On the other hand, the 
total number of circuits to realize a function should 
not be too big. 

Almost all Boolean functions have a minimum rep- 
resentation size that is exponential in the number of 
input variables. Hence, functions that are realized in 
hardware are quite special. 

Some very special functions (such as adders, certain
symmetric functions, or paths consisting alternately of
AND and OR circuits) can be implemented optimally
or near-optimally by divide-and-conquer or dynamic- 
programming algorithms, but general logic synthesis 
is done by (mostly local) heuristics today. 

4.2 Repeater Trees 

Another central task is to distribute a signal from a 
source to a set of sinks. As the delay along a wire grows 
almost quadratically with its length, repeaters, i.e., 
circuits implementing the identity function or inver- 
sion, have to be inserted to strengthen the signal and 
linearize the growth. 

For a given Steiner tree, repeaters can be inserted
arbitrarily close to optimally in polynomial time using 
dynamic programming. 

A more difficult problem asks for the structure of 
the Steiner tree (into which repeaters can be inserted). 
A minimum-length Steiner tree can have very long 
source-sink paths. In addition, every bifurcation from 
a path adds capacitance and delay. Trees should there- 
fore not only be short; they should also consist of short 
paths with few bifurcations. 

Combining approximation algorithms for minimum 
Steiner trees with Huffman coding, bicriteria algo- 
rithms can be derived, trading off total length and path 
delays. 

4.3 Circuit Sizing 

In circuit sizing, the channel widths of the underlying 
transistors are optimized. A wider channel charges the 
capacitance of the output net faster but increases the 
input capacitances and, thus, the delays of the prede- 
cessors. Assuming continuously scalable circuits and 
a simplified delay model, the problem of finding opti- 
mum sizes for all circuits can be transformed into a 
geometric program. This can be solved by interior-point 
methods or by the subgradient method and Lagrangian 
relaxation. 

However, rounding such a continuous solution to dis- 
crete circuit sizes can corrupt the result. Theoretical 
models for discrete timing optimization, such as the 
discrete time-cost trade-off problem, are not yet well 
understood. Local search is therefore used extensively 
for post-optimization. 

4.4 Clock-Tree Construction 

One of the few problems that can be solved efficiently 
in theory and in practice is clock skew scheduling. Each 
register triggers its stored bit once per cycle. The times 
at which the signals are released can be optimized such 
that the cycle time is minimized. To this end a register 
graph is constructed, with each register represented by 
a vertex. There is an arc if there is a signal path between 
the corresponding registers. An arc is weighted by the 
maximum delay of a path between the two registers. 
The minimum possible cycle time is now given by the 
maximum mean arc weight of a cycle in the graph. 
This reduces to the well-studied minimum mean cycle 
problem. 
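
The maximum mean cycle can be computed with Karp's dynamic-programming algorithm for the minimum mean cycle, applied with maxima in place of minima. The sketch below is a simplified illustration: registers are assumed to be numbered 0, ..., n - 1, arcs are (tail, head, delay) triples, the function name is hypothetical, and further timing requirements (for example, constraints on signals arriving too early) are ignored.

```python
def max_mean_cycle(num_vertices, arcs):
    """Return the maximum mean arc weight of a directed cycle, or None if the graph is acyclic.

    arcs is a list of (u, v, weight) with vertices numbered 0 .. num_vertices - 1.
    """
    n = num_vertices
    neg_inf = float("-inf")
    # F[k][v] = maximum weight of a walk with exactly k arcs that ends in v
    # (the walk may start at any vertex).
    F = [[neg_inf] * n for _ in range(n + 1)]
    F[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in arcs:
            if F[k - 1][u] > neg_inf and F[k - 1][u] + w > F[k][v]:
                F[k][v] = F[k - 1][u] + w
    best = None
    for v in range(n):
        if F[n][v] == neg_inf:
            continue              # no walk with n arcs ends here
        candidate = min((F[n][v] - F[k][v]) / (n - k)
                        for k in range(n) if F[k][v] > neg_inf)
        if best is None or candidate > best:
            best = candidate
    return best


# Register graph with two cycles: 0 -> 1 -> 0 (mean 4) and 1 -> 2 -> 1 (mean 6).
register_arcs = [(0, 1, 3.0), (1, 0, 5.0), (1, 2, 2.0), (2, 1, 10.0)]
min_cycle_time = max_mean_cycle(3, register_arcs)
```

The running time is proportional to the number of registers times the number of arcs, which is why this problem counts as efficiently solvable.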

The challenging problem is then to distribute a clock 
signal such that the optimal trigger times are met. 
Here, facility location algorithms for bottom-up tree 
construction are combined with dynamic programming 
for repeater insertion. 

VII.7 Color Spaces and Digital Imaging

Nicholas J. Higham 


1 Vector Space Model of Color 

The human retina contains photoreceptors called cones 
and rods that act as sensors for the human imaging 
system. The cones come in three types, with responses 
that peak at wavelengths corresponding to red, green, 
and blue light, respectively (see figure 1). Rods are of 
a single type and produce only monochromatic vision; 
they are used mainly for night vision. Because there are 
three types of cones, color theory is replete with terms 
having the prefix “tri.” In particular, trichromacy, devel- 
oped by Young, Grassmann, Maxwell, and Helmholtz, 
is the theory that shows how to match any color with 
an appropriate mixture of just three suitably chosen 
primary colors. 

We can model the responses of the three types of
cones to light by the integrals

$$c_i(f) = \int_{\lambda_{\min}}^{\lambda_{\max}} s_i(\lambda) f(\lambda)\, d\lambda, \quad i = 1\colon 3, \qquad (1)$$

where $f$ describes the spectral distribution of the light
hitting the retina, $s_i$ describes the sensitivity of the
$i$th cone to different wavelengths, and
$[\lambda_{\min}, \lambda_{\max}] \approx [400\,\mathrm{nm}, 700\,\mathrm{nm}]$
is the interval of wavelengths of
the visible spectrum. Note that this model is linear
($c_i(f + g) = c_i(f) + c_i(g)$) and it projects the spectrum
onto the space spanned by the $s_i(\lambda)$, the “human
visual subspace.”

Figure 1 Response curves for the cones and the rods
(solid gray line). S, M, and L denote the cones most sensi- 
tive to short (blue, solid black line), medium (green, dotted 
black line), and long (red, dashed black line) wavelengths, 
respectively. (After Bowmaker, J. K., and H. J. A. Dartnall, 
1980, Visual pigments of rods and cones in a human retina, 
Journal of Physiology 298:501-11.) 

For computational purposes a grid of $n$ equally
spaced points $\lambda_i$ on the interval
$[\lambda_{\min}, \lambda_{\max}]$ is introduced,
and the repeated rectangle (or midpoint) quadrature
rule is applied to (1), yielding

$$c = S^T f, \quad c \in \mathbb{R}^3, \quad S \in \mathbb{R}^{n \times 3}, \quad f \in \mathbb{R}^n,$$

where the $i$th column of the matrix $S$ has samples of
$s_i$ at the grid points, the vector $f$ contains the values
of $f(\lambda)$ at the grid points, and the vector $c$ absorbs
constants from the numerical integration. In practice,
a value of $n$ around 40 is typically used.

Let the columns of $P = [p_1\ p_2\ p_3] \in \mathbb{R}^{n \times 3}$ represent
color primaries, defined by the property that the $3 \times 3$
matrix $S^T P$ is nonsingular. For example, $p_1$, $p_2$, and $p_3$
could represent red, blue, and green, respectively. We
can write

$$S^T f = S^T P (S^T P)^{-1} S^T f = S^T P a(f), \qquad (2)$$

where $a(f) = (S^T P)^{-1} S^T f \in \mathbb{R}^3$. This equation shows
that the color of any spectrum $f$ (or more precisely the
response of the cones to that spectrum) can be matched
by a linear combination, $P a(f)$, of the primaries. A
complication is that we need all the components of $a$
to be nonnegative for this argument, as negative intensities
of primaries cannot be produced. A way around
this problem is to write $a(f) = a_1 - a_2$, where $a_1$ contains
the nonnegative components of $a(f)$ and $a_2$ has
positive components, and rewrite (2) as

$$S^T (f + P a_2) = S^T P a_1.$$

This equation says that $P a_1$ matches $f$ with appropriate
amounts of some of the primaries added. This
rearrangement is a standard trick in colorimetry, which
is the science of color measurement and description.

To summarize, the color of a visible spectrum $f$
can be matched by tristimulus values $a(f) = A^T f$,
where $A^T = (S^T P)^{-1} S^T$, because $S^T f = S^T P a(f)$.
The columns of $A \in \mathbb{R}^{n \times 3}$ are called (samples of)
color-matching functions for the given primaries.

To determine $A$, a human observer is asked to match
light of single wavelengths $\lambda_i$ by twiddling knobs to
mix light sources constituting the three primaries until
a match is obtained. Light of a single wavelength corresponds
to a vector $f = e_i$, where $e_i$ has a 1 in the
$i$th position and zeros everywhere else, and the vector
$a(f) = A^T f$ that gives the match is therefore the
$i$th column of $A^T$. In this way we can determine the
color-matching matrix $A$ corresponding to the given
primaries.

This vector space model of color is powerful. For 
example, since the $3 \times n$ matrix $S^T$ has a nontrivial null
space, it tells us that there exist spectra $f$ and $g$ with
$f \ne g$ such that $S^T f = S^T g$. Hence two colors can
look the same to a human observer but have a different 
spectral decomposition, which is the phenomenon of 
metamerism. This is a good thing in the sense that color 
output systems (such as computer monitors) exploit 
metamerism to reproduce color. There is another form 
of metamerism that is not so welcome: when two col- 
ors appear to match under one light source but do not 
match under a different light source. An example of 
this is when you put on socks in the bedroom with the 
room lights on and they appear black, but when you 
view them in daylight one sock turns out to be blue. 
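
The first kind of metamerism can be illustrated with a toy computation. In the sketch below the sensitivity matrix is invented purely for illustration (n = 4 samples and made-up entries, not measured cone data): the vector (1, -1, 1, -1) lies in the null space of S^T, so adding it to a spectrum changes the spectrum but not the cone responses.

```python
def cone_response(S_T, spectrum):
    """Compute c = S^T f for a 3 x n matrix S^T given as a list of rows."""
    return [sum(s * f for s, f in zip(row, spectrum)) for row in S_T]


# Hypothetical sensitivities of the three cone types at n = 4 wavelengths.
S_T = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

null_vector = [1.0, -1.0, 1.0, -1.0]                     # satisfies S^T d = 0
spectrum_f = [2.0, 2.0, 2.0, 2.0]
spectrum_g = [f + d for f, d in zip(spectrum_f, null_vector)]

response_f = cone_response(S_T, spectrum_f)
response_g = cone_response(S_T, spectrum_g)
```

Both spectra are nonnegative and differ at every sample, yet they produce identical responses, so under this toy model they would be perceived as the same color.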

The use of linear algebra in understanding color was 
taken further by Jozef Cohen (1921-95), whose work is 
summarized in the posthumous book Visual Color and 
Color Mixture: The Fundamental Color Space (2001). 
Cohen stresses the importance of what he calls the
“matrix R,” defined by

$$R = S (S^T S)^{-1} S^T = S S^+,$$

where $S^+$ denotes the Moore-Penrose pseudoinverse
[IV.10 §7.3] of $S$. Mathematically, $R$ is the orthogonal
projector onto range($S$). Cohen noted that $R$ is
independent of the choice of primaries used for color
matching, that is, $R$ is unchanged under transformations
$S \leftarrow SZ$ for nonsingular $Z \in \mathbb{R}^{3 \times 3}$, and so is
an invariant. He also showed how in the factorization
$S = QL$, where $Q \in \mathbb{R}^{n \times 3}$ has orthonormal columns
and $L \in \mathbb{R}^{3 \times 3}$ is lower triangular, the factor $Q$ (which
he called F) plays an important role in color theory
through the use of “tricolor coordinates” $Q^T f$.

We do not all see color in the same way: about 8% of 
males and 0.5% of females are affected by color blind- 
ness. The first investigation into color vision deficien- 
cies was by Manchester chemist John Dalton (1766- 
1844), who described his own color blindness in a lec- 
ture to the Manchester Literary and Philosophical Soci- 
ety. He thought that his vitreous humor was tinted 
blue and instructed that his eyes be dissected after 
his death. No blue coloring was found but his eyes 
were preserved. A deoxyribonucleic acid (DNA) analysis
in 1995 concluded that Dalton was a deuteranope,
meaning that he lacked cones sensitive to the medium 
wavelengths (green). The color model and analogues 
of figure 1 for different cone deficiencies help us to 
understand color blindness. 



Figure 2 CIE RGB color-matching functions from the 1931 
standard. (File adapted from an original on Wikimedia 
Commons.) 


Given the emphasis in this section on trichromacy, 
one might wonder why printing is usually done with 
a four-color CMYK model when three colors should be 
enough. CMYK stands for cyan-magenta-yellow-black, 
and cyan, magenta, and yellow are complementary col- 
ors to red, green, and blue, respectively. Trichromatic 
theory says that a CMY system is entirely adequate for 
color matching, so the K component is redundant. The 
reason for using K is pragmatic. Producing black in a 
printing process by overlaying C, M, and Y color plates 
uses a lot of ink, makes the paper very wet, and does not 
produce a true, deep black due to imperfections in the 
inks. In CMYK printing, gray component replacement 
is used to replace proportions of the CMY components 
that produce gray with corresponding amounts of K. 
(A naive algorithm to convert from CMY to CMYK is 
$K = \min(C, M, Y)$, $C \leftarrow C - K$, $M \leftarrow M - K$, $Y \leftarrow Y - K$,
though in practice slightly different amounts of C, M, 
and Y are required to produce black.) 
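
The naive conversion takes only a few lines of code. The sketch below assumes that the ink amounts are fractions between 0 and 1; the function name is chosen here for illustration, and real gray component replacement uses device-specific corrections that are not modeled.

```python
def cmy_to_cmyk(c, m, y):
    """Naive gray component replacement: move the common part of C, M, Y into K."""
    k = min(c, m, y)
    return c - k, m - k, y - k, k
```

For example, `cmy_to_cmyk(0.5, 0.25, 0.75)` replaces the common amount 0.25 of all three inks by black, leaving at least one of the C, M, and Y components equal to zero.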

2 Standardization 

The Commission Internationale de l’Eclairage (CIE) is 
responsible for standardization of color metrics and 
terminology. Figure 2 shows the standard RGB color- 
matching functions produced by the CIE in 1931 and 
1964. They are based on color-matching experiments 
and correspond to primaries at 700 nm (red), 546.1 nm 
(green), and 435.8 nm (blue). The red curve takes negative
values as shown in the figure, but nonnegative functions
were preferred for calculations in the precomputer
era as they avoided the need for subtractions. So
a CIE XYZ space was defined that has nonnegative
color-matching functions (see figure 3) and that is obtained
via the linear mapping$^1$

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.49 & 0.31 & 0.20 \\ 0.17697 & 0.81240 & 0.01063 \\ 0 & 0.01 & 0.99 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}.$$

Figure 3 CIE XYZ color-matching functions from the 1931
standard. (File adapted from an original on Wikimedia
Commons.)

Two of the choices made by the CIE that led to this 
transformation are that the Y component approxi- 
mates the perceived brightness, called the luminance, 
and that R = G = B = 1 corresponds to X = Y = Z = 1, 
which requires that the rows of the matrix sum to 1. 
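
The mapping from CIE RGB to XYZ is a single matrix-vector product. The sketch below uses the matrix entries given in the text; the constant and function names are chosen here for illustration.

```python
# Matrix of the linear mapping from CIE RGB to CIE XYZ, as given in the text.
CIE_RGB_TO_XYZ = [
    [0.49, 0.31, 0.20],
    [0.17697, 0.81240, 0.01063],
    [0.0, 0.01, 0.99],
]


def rgb_to_xyz(r, g, b):
    """Apply the linear mapping to CIE RGB tristimulus values."""
    return tuple(row[0] * r + row[1] * g + row[2] * b for row in CIE_RGB_TO_XYZ)
```

Because every row sums to 1, `rgb_to_xyz(1, 1, 1)` returns X = Y = Z = 1 up to rounding error, and the second row alone gives the luminance Y.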

1. The coefficients of the matrix are written here in a way that
indicates their known precision. Thus, for example, the (1,1) element is
known to two significant digits but the (2,1) element is known to five
significant digits.

Because the XYZ space is three dimensional it is
not easy to visualize the subset of it corresponding to
the visible spectrum. It is common practice to use a
projective transformation

$$x = \frac{X}{X + Y + Z}, \quad y = \frac{Y}{X + Y + Z} \quad (z = 1 - x - y)$$


to produce a chromaticity diagram in terms of the 
(x,y) coordinates (see plate 12). The visible spectrum 
forms a convex set in the shape of a horseshoe. The 
curved boundary of the horseshoe is generated by light 
of a single wavelength (pure color) as it varies across the 
visible spectrum, while at the bottom the “purple line” 
is generated by combinations of red and blue light. The 
diagram represents color and not luminance, which is 
why there is no brown (a dark yellow). White is at $(\tfrac{1}{3}, \tfrac{1}{3})$,
and pure colors lying at opposite ends of a line passing 
through the white point are complementary: a combi- 
nation of them produces white. Any point outside this 
region represents an “imaginary color,” a distribution 
of light that is not visible to us. 
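
The projection onto chromaticity coordinates is equally short in code. In this sketch the function name and the error handling for X + Y + Z = 0 are assumptions made for illustration.

```python
def chromaticity(X, Y, Z):
    """Project XYZ tristimulus values onto the (x, y) chromaticity coordinates."""
    total = X + Y + Z
    if total == 0:
        raise ValueError("chromaticity is undefined when X + Y + Z = 0")
    return X / total, Y / total
```

Scaling X, Y, and Z by a common positive factor leaves (x, y) unchanged, which reflects the fact that the diagram represents color but not luminance. Equal tristimulus values give the white point.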

A common use of the chromaticity diagram is in 
reviews of cameras, scanners, displays, and printers, 
where the gamut of the device (the range of producible 
colors) is overlaid on the diagram. Generally, the closer 
the gamut is to the visible spectrum the better, but 
since images are passed along a chain of devices start- 
ing with a camera or scanner, a key question is how 
the gamuts of the devices compare and whether colors 
are faithfully translated from one device to another. 
Color management deals with these issues, through 
the use of International Color Consortium (ICC) pro- 
files that describe the color attributes of each device by 
defining a mapping between the device space and the 
CIE XYZ reference space. Calibrating a device involves 
solving nonlinear equations, which is typically done by 
Newton's method [II.28].


3 Nonlinearities 

So far, basic linear algebra and a projective transforma- 
tion have been all that we need to develop color theory, 
and one might hope that by using more sophisticated 
techniques from matrix analysis one can go further. To 
some extent, this is possible; for example, the Binet- 
Cauchy theorem on determinants finds application in 
several problems in colorimetry. But nonlinearities can- 
not be avoided for long because human eyes respond 
to light nonlinearly, in contrast to a digital camera’s 
sensor, which has a linear response. The relative dif- 
ference in brightness that we see between a dark cellar 
and bright sunlight is far smaller than the relative dif- 
ference in the respective number of photons reaching 
our eyes, and this needs to be incorporated into the 
model. One way of doing this is described in the next 
section. 

4 LAB Space 

A problem with the CIE XYZ and RGB spaces is that they 
are far from being perceptually uniform, which means 
that there is not a linear relation between distances in 
the tristimulus space and perceptual differences. This 
led the CIE to search for nonlinear transformations that 
give more uniform color spaces, and in 1976 they came 
up with two standardized systems, L*u*v* and L*a*b* 
(or LAB, pronounced “ell-A-B”). In the LAB space the L 
coordinate represents lightness, the A coordinate is on 
a green-magenta axis, and the B coordinate is on a blue-yellow
axis. For a precise definition in terms of the XYZ
space, if $X_n$, $Y_n$, and $Z_n$ are the tristimuli of white then

$$\begin{aligned}
L &= 116 f(Y/Y_n) - 16, \\
A &= 500 [f(X/X_n) - f(Y/Y_n)], \\
B &= 200 [f(Y/Y_n) - f(Z/Z_n)],
\end{aligned}$$

where

$$f(x) = \begin{cases} x^{1/3}, & x \ge 0.008856, \\ 7.787x + \frac{16}{116}, & x \le 0.008856. \end{cases}$$

The cube root term tries to capture the nonlinear perceptual
response of human vision to brightness. The
two cases in the formula for $f$ bring in a different formula
for low tristimulus values, i.e., low light. The lightness
coordinate L ranges from 0 to 100. The A and
B coordinates are typically in the range -128 to 128
(though not explicitly constrained), and A = B = 0
denotes lack of color, i.e., a shade of gray from black
(L = 0) to white (L = 100). In colorimetry, color differences
are expressed as Euclidean distances between
LAB coordinates and are denoted by $\Delta E$.
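
The conversion from XYZ to LAB and the color difference can be sketched as follows. The default white point X_n = Y_n = Z_n = 1 is an assumption made here for simplicity (practical applications use the tristimulus values of a standard illuminant), and the function names are hypothetical.

```python
import math


def _f(x):
    """Cube root for larger values, linear segment for low tristimulus values."""
    if x >= 0.008856:
        return x ** (1.0 / 3.0)
    return 7.787 * x + 16.0 / 116.0


def xyz_to_lab(X, Y, Z, white=(1.0, 1.0, 1.0)):
    """Convert XYZ tristimulus values to (L, A, B) relative to the given white."""
    Xn, Yn, Zn = white
    fx, fy, fz = _f(X / Xn), _f(Y / Yn), _f(Z / Zn)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def delta_e(lab1, lab2):
    """Color difference: Euclidean distance between two LAB triples."""
    return math.dist(lab1, lab2)


white_lab = xyz_to_lab(1.0, 1.0, 1.0)
black_lab = xyz_to_lab(0.0, 0.0, 0.0)
```

White maps to L = 100 and black to L = 0, both with A = B = 0, so the color difference between them is 100 up to rounding error.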

An interesting application of LAB space is to the 
construction of color maps, which are used to map 
numbers to colors when plotting data. The most com- 
monly used color map is the rainbow color map, which 
starts at dark blue and progresses through cyan, green, 
yellow, orange, and red, through colors of increasing 
wavelength. In recent years the rainbow color map has 
been heavily criticized for a number of reasons, which 
include 

• it is not perceptually uniform, in that the colors 
appear to change at different rates in different 
regions (faster in the yellow, slower in the green); 

• it is confusing, because people do not always
remember the ordering of the colors, making inter- 
pretation of an image harder; 

• it loses information when printed on a mono- 
chrome printer, since high and low values map to 
similar shades of gray. 

These particular criticisms can be addressed by using a 
color map constructed in LAB space with colors having 
monotonically increasing L values and linearly spaced 
A and B values. Color maps based on such ideas have 
supplanted the once-ubiquitous rainbow color map 
as the default in MATLAB and in some visualization 
software. 

For image manipulation there are some obvious 
advantages to working in LAB space, as luminosity 
and color can easily be independently adjusted, which 
is not the case in RGB space. However, LAB space 
has some strange properties. For example, (L, A, B) =
(0, 128, -128) represents a brilliant magenta as black 
as a cellar! LAB space contains many such imaginary 
colors that cannot exist and are not representable in 
RGB. For many years LAB was regarded as a rather eso- 
teric color space of use only for intermediate represen- 
tations in color management and the like, though it is 
supported in high-end software such as Adobe Photo- 
shop and the MATLAB Image Processing Toolbox. How- 
ever, in recent years this view has changed, as photog- 
raphers and retouchers have realized that LAB space, 
when used correctly, is a very powerful tool for manip- 
ulating digital images. The book Photoshop LAB Color 
(2006) by Dan Margulis describes the relevant tech- 
niques, which include ways to reduce noise (blur the 
A and B channels), massively increase color contrast 
(stretch the A and B channels), and change the color 
of colorful objects in a scene while leaving the less 
colorful objects apparently unchanged (linear transfor- 
mations of the A and B channels). As an example of 
the latter technique, plate 13(a) shows an RGB image 
of a building at the University of Manchester, while 
plate 13(b) shows the result of converting the image to 
LAB, flipping the sign of the A channel, then convert- 
ing back to RGB. The effect is to change the turquoise 
paint to pink without, apparently, significantly chang- 
ing any other color in the image including the blue sky. 
In truth, all the colors have changed, but mostly by 
such a small amount that the changes are not visible, 
due to the colors having small A components in LAB 
coordinates. 


5 JPEG 

JPEG is a compression scheme for RGB images that can 
greatly reduce file size, though it is lossy (throws infor- 
mation away). The JPEG process first converts from RGB 
to the YCbCr color space, where Y represents luminance 
and Cb and Cr represent blue and red chrominances,
respectively, using the linear transformation 


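\[
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix}
=
\begin{bmatrix}
 0.299 & 0.587 & 0.114 \\
 -0.1687 & -0.3313 & 0.5 \\
 0.5 & -0.4187 & -0.0813
\end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}.
\]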


The motivation for this transformation is that human 
vision has a poor response to spatial detail in colored 
areas of the same luminance, so the Cb and Cr compo-
nents can take greater compression than the Y compo-
nent. The image is then broken up into 8 × 8 blocks and
for each block a two-dimensional discrete cosine trans- 
form is applied to each of the components, after which 
the coefficients are rounded, more aggressively for the 
Cb and Cr components. Of course, it is crucial that the
3 × 3 matrix in the above transformation is nonsingu-
lar, as the transformation needs to be inverted in order
to decode a JPEG file. 
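
As a minimal sketch of the color conversion step, the following Python applies the 3 × 3 matrix from the text to one RGB triple and then inverts it. The helper names (`det3`, `rgb_to_ycbcr`, `ycbcr_to_rgb`) are illustrative only, values are assumed to lie in [0, 1], and the offsets, subsampling, and rounding that a real JPEG codec adds are left out.

```python
# Sketch: RGB -> YCbCr with the matrix from the text, and its inverse.
M = [
    [0.299, 0.587, 0.114],
    [-0.1687, -0.3313, 0.5],
    [0.5, -0.4187, -0.0813],
]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def rgb_to_ycbcr(rgb):
    """Multiply the RGB column vector by M."""
    return [sum(M[i][k] * rgb[k] for k in range(3)) for i in range(3)]

def ycbcr_to_rgb(ycc):
    """Solve M x = ycc by Cramer's rule (adequate for a 3x3 system)."""
    d = det3(M)
    out = []
    for k in range(3):
        mk = [row[:] for row in M]
        for i in range(3):
            mk[i][k] = ycc[i]      # replace column k by the right-hand side
        out.append(det3(mk) / d)
    return out

pixel = [0.2, 0.6, 0.4]
ycc = rgb_to_ycbcr(pixel)
back = ycbcr_to_rgb(ycc)
print(det3(M))       # nonzero, so the transformation can be inverted
print(ycc, back)     # back agrees with pixel up to rounding error
```

A gray pixel (equal R, G, and B) has zero Cb and Cr in this convention, because the second and third rows of the matrix each sum to zero.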

The later JPEG2000 standard replaces the discrete 
cosine transform with a wavelet transform [I.3 §3.3]
and uses larger blocks. Despite the more sophisticated 
mathematics underlying it, JPEG2000 has not caught on 
as a general-purpose image format, but it is appropri- 
ate in special applications such as storing fingerprints, 
where it is much better than JPEG at reproducing edges. 

6 Nonlinear RGB 

The CIE LAB and XYZ color spaces are device inde- 
pendent: they are absolute color spaces defined with 
reference to a “standard human observer.” The RGB 
images one comes across in practice, such as JPEG 
images from a digital camera, are in device-dependent 
RGB spaces. These spaces are obtained from a linear 
transformation of CIE XYZ space followed by a non- 
linear transformation of each coordinate that modifies 
the gamma (brightness), analogously to the definition 
of the L channel in LAB space. They also have specifi-
cations in (x, y) chromaticity coordinates of the pri-
maries red, green, and blue, and the white point (as
there is no unique definition of white). The most com- 
mon nonlinear RGB space is sRGB, defined by Hewlett- 
Packard and Microsoft in the late 1990s for use by con- 
sumer digital devices and now the default space for 
images on the Web. The sRGB gamut is the convex hull 
of the red, green, and blue points, and it is shown on
the chromaticity diagram in plate 12. 


7 Digital Image Manipulation 


Digital images are typically stored as 8-bit RGB images, 
that is, as arrays $A \in \mathbb{R}^{m \times n \times 3}$, where $a_{ijk}$, $k = 1\colon 3$,
contain the RGB values of the $(i, j)$ pixel. In practice,
each element $a_{ijk}$ is an integer in the range 0 to 255,
but for notational simplicity we will assume the range 
is instead [0,1]. Image manipulations correspond to 
transforming the array A to another array B, where 


\[ b_{ijk} = f_{ijk}(a_{ijk}) \]


for some functions $f_{ijk}$. In practice, the $3mn$ func-
tions $f_{ijk}$ will be highly correlated. The simplest case
is where $f_{ijk} = f$ is independent of $i$, $j$, and $k$, and an
example is $f_{ijk}(a_{ijk}) = \min(a_{ijk} + 0.2, 1)$, which bright-
ens an image by increasing the RGB values of every pixel
by 0.2. Another example is 


\[
f_{ijk}(a_{ijk}) =
\begin{cases}
2a_{ijk}^2, & a_{ijk} \le 0.5, \\
1 - 2(1 - a_{ijk})^2, & a_{ijk} \ge 0.5.
\end{cases}
\tag{3}
\]


This transformation increases contrast by making RGB 
components less than 0.5 smaller (darkening the pixel) 
and those greater than 0.5 larger (lightening the pixel). 
These kinds of global manipulations are offered by all 
programs for editing digital images, but the results 
they produce are usually crude and unprofessional. 
For realistic high-quality results, the transformations 
need to be local and, from the photographic point of 
view, proportional. For example, the brightening trans-
formation above will change any RGB triplet $(r, g, b)$
with $\min(r, g, b) \ge 0.8$ to (1, 1, 1), which is pure white,
almost certainly producing an artificial result. The 
transformation (3) that increases contrast will change 
the colors, since it modifies the RGB values at each pixel 
independently. 
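
The two global transformations can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than production code: images are nested lists of channel values in [0, 1], the function names are ours, and the contrast curve is taken to be 2a² for a ≤ 0.5 and 1 − 2(1 − a)² otherwise, which is our reading of (3).

```python
def brighten(a, delta=0.2):
    """f(a) = min(a + delta, 1) for one channel value a in [0, 1]."""
    return min(a + delta, 1.0)

def contrast(a):
    """Contrast curve: values below 0.5 get smaller, values above 0.5 get larger."""
    if a <= 0.5:
        return 2.0 * a * a
    return 1.0 - 2.0 * (1.0 - a) ** 2

def apply(f, image):
    """Apply f to every channel of every pixel of a nested-list image."""
    return [[[f(v) for v in pixel] for pixel in row] for row in image]

image = [[[0.9, 0.85, 0.8], [0.2, 0.5, 0.7]]]   # one row, two pixels
print(apply(brighten, image))   # the first pixel saturates to (1, 1, 1)
print(apply(contrast, image))   # the channel ratios of the second pixel change
```

The first pixel shows the saturation problem described above, and the second shows why acting on each channel independently shifts the color.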

Advanced digital image manipulation avoids these 
problems by various techniques, a very powerful one 
being to apply a global transformation through a mask 
that selectively reduces the effect of the transforma- 
tion in certain parts of the image. The power of this 
technique lies in the fact that the image itself can be 
used to construct the mask. 
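
One simple way to realize such a mask, sketched below, is to blend the transformed value with the original: b = m f(a) + (1 − m) a, where m in [0, 1] is the mask value at that pixel. The particular mask used here (one minus a luminance estimate, so that dark pixels receive more of the effect) is a hypothetical choice made for illustration. It is not a recipe taken from the text.

```python
def luminance(pixel):
    """Luminance estimate using the Y row of the JPEG color transform."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

def brighten(a, delta=0.2):
    return min(a + delta, 1.0)

def masked_apply(f, image, mask):
    """Blend f(a) with a: mask 1 gives the full effect, mask 0 leaves a unchanged."""
    return [[[m * f(v) + (1.0 - m) * v for v in pixel]
             for pixel, m in zip(row, mrow)]
            for row, mrow in zip(image, mask)]

image = [[[0.1, 0.1, 0.2], [0.9, 0.85, 0.8]]]   # a dark pixel and a bright pixel
# Assumed mask built from the image itself: darker pixels get more of the effect.
mask = [[1.0 - luminance(p) for p in row] for row in image]
result = masked_apply(brighten, image, mask)
print(result)
```

With this mask the bright pixel is only nudged upward instead of being clipped to pure white.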

However, there are situations where elementary cal- 
culations on images are useful. A security camera might 
detect movement by computing the variance of several 
images taken a short time apart. A less trivial example 
is where one wishes to photograph a busy scene such
as an iconic building without the many people and vehi-
cles that are moving through the scene at any one time.
Here, a solution is to put the camera on a tripod and 
take a few tens of images, each a few seconds apart, and 
then take the median of the images (each pixel gets the 
median R, G, and B values of all the pixels in that loca- 
tion in the image). With luck, and assuming the lighting 
conditions remained constant through the procedure, 
the median pixel will in every case be the unobscured 
one that was wanted! 
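
The median trick is short enough to sketch directly. The example below assumes that all frames have the same size and are stored as nested lists of RGB triples; the function name `median_stack` and the toy data are ours.

```python
from statistics import median

def median_stack(images):
    """Per-pixel, per-channel median over a list of equally sized RGB images."""
    rows, cols = len(images[0]), len(images[0][0])
    return [[[median(img[i][j][k] for img in images) for k in range(3)]
             for j in range(cols)]
            for i in range(rows)]

background = [0.2, 0.4, 0.6]
passerby = [0.9, 0.1, 0.1]
# Five frames of size 1 x 2; a passerby covers the first pixel in two of them.
frames = [[[passerby if t in (1, 3) else background, background]]
          for t in range(5)]
clean = median_stack(frames)
print(clean)   # both pixels show the background color
```

The result is the unobscured scene only when each pixel is unobscured in more than half of the frames, which is why several tens of exposures are suggested.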
VII.8 Mathematical Image Processing

Guillermo Sapiro 


1 What Is Image Processing? 

In order to illustrate what image processing is, let us 
use examples from different applications and from 
some superb contributions to image (and video) pro- 
cessing research from the last few decades. 

Without doubt, the most important contribution to 
the field is JPEG, the standard for image compres- 
sion. Together with bar code readers, JPEG is the most 
widely used algorithm in the area. Virtually all the 
images on our cell phones and on popular image- 
sharing Web sites such as Facebook and Flickr are 
compressed with JPEG. The basic idea behind JPEG 
is to first transform every 8 × 8 image patch via a
discrete cosine transform (DCT; the real-valued com- 
ponent of the fast Fourier transform), attempting to 
approximate the decorrelation that could be optimally 
achieved only via the data-dependent Karhunen-Loève
transform. The DCT coefficients are then quantized, 
and while a simple uniform quantization is used per 
coefficient position, interesting mathematical theory 
lies behind this in the form of the optimal Max-Lloyd 
quantizer (and vector quantization in general). Finally, 
the quantized coefficients are encoded via Huffman 
coding, putting to work one of the most successful 
(and, along with the Lempel-Ziv universal compression 
framework, most widely used) information theory algo- 
rithms. Part of the beauty of JPEG is its simplicity: the 
basic algorithm can be implemented in a morning by 
any undergraduate student, and of course it can be 
executed very efficiently for large images using simple 
hardware. And yet despite its simplicity, a lot of theory 
stands behind JPEG. 
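
The transform-and-quantize core of this pipeline can be sketched as follows. The code builds an orthonormal 8 × 8 DCT-II matrix, transforms one block, and rounds the coefficients to multiples of a single step size. Using one step for every coefficient is a simplifying assumption (JPEG itself uses a table with a separate step per coefficient position), and the entropy-coding stage is omitted.

```python
import math

N = 8

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix C, so that C times its transpose is the identity."""
    c = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * n))
                  for j in range(n)])
    return c

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

C = dct_matrix()

def dct2(block):
    """Two-dimensional DCT of a square block: C * block * C^T."""
    return matmul(matmul(C, block), transpose(C))

def idct2(coeffs):
    """Inverse transform: C^T * coeffs * C."""
    return matmul(matmul(transpose(C), coeffs), C)

def quantize(coeffs, step):
    """Uniform quantization: round each coefficient to a multiple of step."""
    return [[step * round(v / step) for v in row] for row in coeffs]

block = [[(i + j) / 14.0 for j in range(N)] for i in range(N)]   # smooth ramp
q = quantize(dct2(block), step=0.05)
approx = idct2(q)
nonzero = sum(1 for row in q for v in row if v != 0)
err = max(abs(approx[i][j] - block[i][j]) for i in range(N) for j in range(N))
print(nonzero, err)   # few nonzero coefficients, small reconstruction error
```

For a smooth block most quantized coefficients are zero, and those runs of zeros are what the subsequent Huffman coding exploits.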

Image compression does not stop with JPEG, of 
course, and we also have lossless compression tech- 
niques like JPEG-LS. This is another technique based 
on beautiful mathematics in the context modeling and 
Golomb coding areas, and it has been used by NASA’s 
Jet Propulsion Laboratory for both Mars rover expe-
ditions; the Mars rovers also incorporate a wavelet-
based [I.3 §3.3] lossy compression algorithm (see fig-
ure 1, which was acquired on Mars and transmitted to
Earth with these compression techniques). The image 
in figure 1 is composed of multiple high-resolution 
images covering different regions of the scene (with 
some overlapping), aligned together to form a larger 
image. Such alignment is often obtained via nonlinear
optimization techniques that penalize disagreements 
between the overlapping regions. Finally, videos are 
compressed via MPEG, in which, as in the JPEG-LS tech- 
nique, predictive coding plays a fundamental role. The 
idea behind predictive coding is to use past data (e.g., 
past frames in a video) to predict future ones, encoding 
only the difference. 
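
A minimal sketch of predictive coding, assuming the simplest possible predictor (each sample is predicted by the previous one), is given below. Practical schemes such as JPEG-LS and MPEG use more elaborate predictors, so this shows only the principle; the function names are ours.

```python
def encode(samples):
    """Keep the first sample, then store only the difference from the previous one."""
    residuals = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        residuals.append(cur - prev)
    return residuals

def decode(residuals):
    """Rebuild the samples by accumulating the stored differences."""
    out = [residuals[0]]
    for r in residuals[1:]:
        out.append(out[-1] + r)
    return out

row = [100, 101, 103, 103, 104, 110, 111]   # integer pixel values along a row
res = encode(row)
print(res)                # [100, 1, 2, 0, 1, 6, 1]
print(decode(res) == row) # the scheme is lossless
```

The residuals are small numbers clustered around zero, which an entropy coder can represent with far fewer bits than the raw values.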

Consumers are familiar with image processing be- 
yond compression. We all go to the movies, and some 
often notice mistakes in the films they see: a cam- 
eraman who is accidentally included in a shot, say. 
In order to repair such mistakes, the areas of the 
corresponding objects have to be inpainted, or filled 
in, with information from the background or other 
sources (see plate 14). The mathematics behind image 
inpainting is borrowed from the calculus of variations, 



Figure 1 A beautiful image from Mars that we can enjoy 
thanks to image compression. (Image courtesy of NASA/Jet 
Propulsion Laboratory.) 


partial differential equations, transport problems, and 
the Navier-Stokes equations; think in terms of colors 
being transported rather than fluids. Some more recent 
techniques are based on exploiting the redundancy and 
self-similarity in images and automatically performing 
a cut-and-paste approach. This borrows ideas that go 
back to Shannon’s thoughts on the English language: 
by looking at the past in a given text, and construct- 
ing letter and word distributions, we can predict the 
future, or at least construct words and sentences that 
look like correct grammar. These techniques, which can 
also be formalized with tools from the calculus of vari- 
ations, are based on finding the best possible informa- 
tion, in the form of image regions, edges, or patches, to 
“copy” from, and then “pasting” such information into 
the zone to be inpainted. 

Image inpainting can be considered an extreme case 
of image denoising, where the “noise” is the region to 
be inpainted. Noisy images (in the more standard sense, 
where the noise is spread around the whole image 
(e.g., additive Gaussian or Poisson noise)) have also 
motivated considerable research in image processing, 
with one of the most famous mathematical approaches 
being total variation (TV). In TV we minimize the
$L^1$ norm of the image gradient, this being a crit-
ical regularizer for inverse problems, which are intrin-
sically ill-posed. TV also appears in compressed sens-
ing [VII.10] and in image-segmentation formulations
like that of Mumford and Shah, where the idea is 
to segment an image by fitting it with, for example, 
piecewise-constant functions. TV appears both because 
of its piecewise requirement (total variation zero for 
each piece) and the fact that the length of a curve can 
be measured via the TV of the corresponding indica- 
tor function. And yet TV is also part of the famous 



Figure 2 (a) Cryotomography of HIV viruses (one slice of 
the actual three-dimensional volume) and (b) a reconstruc- 
tion from such data of the HIV env protein via mathe- 
matical image processing. The very noisy HIV particles, as 
they appear in the raw data illustrated in (a), are classi- 
fied and aligned/registered to obtain the three-dimensional 
reconstructed HIV env protein. (Courtesy of Liu, Bartesaghi, 
Borgnia, Subramaniam, and Sapiro.) 

active contour model, where a curve is geometrically 
deformed toward object boundaries, an effect that can 
be achieved by computing geodesics (plate 15).
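
The link between total variation and curve length mentioned above can be checked on a discrete example. The sketch below uses an anisotropic discretization (the sum of absolute horizontal and vertical differences), which is an assumption made for simplicity: it reproduces the perimeter exactly for axis-aligned rectangles but not for general curves, where the continuous definition based on the Euclidean norm of the gradient is needed.

```python
def anisotropic_tv(img):
    """Sum of absolute differences between horizontally and vertically adjacent pixels."""
    rows, cols = len(img), len(img[0])
    tv = 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:
                tv += abs(img[i][j + 1] - img[i][j])
            if i + 1 < rows:
                tv += abs(img[i + 1][j] - img[i][j])
    return tv

# Indicator function of a 3 x 5 rectangle placed inside a 10 x 10 image.
h, w = 3, 5
img = [[1 if 2 <= i < 2 + h and 3 <= j < 3 + w else 0 for j in range(10)]
       for i in range(10)]
print(anisotropic_tv(img))   # 16 = 2 * (h + w), the perimeter of the rectangle

constant = [[7] * 10 for _ in range(10)]
print(anisotropic_tv(constant))   # 0: a constant piece has zero total variation
```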

Special effects are very useful in images and movies. 
We can play with them ourselves, at an amateur level, 
on our computers, but at the professional level they 
are included in virtually all current movies (virtually all 
movies are now “touched up” by mathematical imag- 
ing tools before they are released). At the core of this 
is the idea of image segmentation, where we isolate 
objects and then paste them into new backgrounds (see 
plate 16). 

While image processing is widely used in commercial 
applications such as those just mentioned, medical and 
biological imaging is another key area in which image 
processing has made an important contribution. In fig- 
ure 2 we see an example of human immunodeficiency 
virus (HIV) research. 

Finally, the analysis of images and videos is also key, 
e.g., identifying the objects in an image or the activity 
in a video (see plate 17).

To recap, with image processing we go to Hollywood, 
Mars, and the hospital, and mathematics is everywhere. 
Problems range from compression to reconstruction 
and recognition, and as we will discuss further below, 
challenging problems emerge for a number of branches 
of applied mathematics. 

1.1 Who Uses Image Processing? 

The answer to this question is “everybody,” as we have 
seen above. Consumers use image processing every
time they take a digital picture, and they experience
its benefits every time they go to the movies. Doctors 
increasingly use image processing for improving radi- 
ology pictures, as well as for performing automatic 
analysis. Surveillance applications are abundant. Indus-
trial automation is now a ubiquitous client of image
processing. Imaging-related companies and, therefore, 
image processing itself are simply everywhere. 

2 Who Works in Image Processing? 

One of the beauties of this area is that regardless of 
one’s interest in applied mathematics, there is always 
an important and challenging problem and applica- 
tion in the area of image and video processing. Let us 
present a few examples. 

Harmonic analysis. Wavelets come to mind immedi- 
ately, of course; they are the basic component behind 
the JPEG2000 image compression standard and they 
are behind numerous other image reconstruction and 
enhancement techniques. Moreover, harmonic analy- 
sis is the precursor of compressed sensing and sparse 
modeling, two of the most successful and interest-
ing ideas in image analysis, leading to the design of
new cameras and state-of-the-art algorithms in appli- 
cations ranging from image denoising to object and 
activity recognition. 

The calculus of variations and partial differential 
equations. This is the mathematics behind image 
inpainting and leading image-segmentation tech-
niques. An integral part of this is, of course, numeri-
cal analysis, from classical techniques to more mod-
ern ones such as level set methods [II.24].

Optimization. Image processing has been greatly influ- 
enced by modern optimization techniques, whether 
it be in the search for patches for image inpainting 
or in segmentation techniques based on graph parti- 
tions. Images are large, and efficient computational 
techniques are therefore critical to the success of 
image processing. 

Statistics and probability. Bayesian and non-Bayesian 
formulations appear all the time, in numerous prob- 
lems and to give different interpretations to other 
formulations, such as the standard optimization one 
for sparse modeling. 

Topology. Although not obviously related to image 
processing, topology, and in particular persistent 
homology, has made significant contributions to 
image processing. 
But as well as benefiting from tremendous mathe-
matical advances, image processing has also opened
up new questions that have led to the development of 
new mathematical results, ranging from the viscosity 
solutions of partial differential equations to fundamen- 
tal theorems in harmonic analysis and optimization. 
There is therefore a perfect synergy between image pro- 
cessing and applied mathematics, each benefiting and 
feeding the other. 

3 Concluding Thoughts and Perspectives 

Digital images are more and more becoming part of our 
daily lives, and the more images we have, the more 
challenges we face. Applied mathematics has been 
critical to image processing, and all the best image- 
processing algorithms have fundamental mathematics 
behind them. Image processing also asks new and chal- 
lenging questions of mathematics, thereby opening the 
door to new and exciting results. As one famous math- 
ematician once told me, “If it works, there should be a 
math explanation behind it.” 
VII.9 Medical Imaging

Charles L. Epstein 


1 Introduction 

Over the past fifty years the processes and techniques 
of medical imaging have undergone a veritable explo-
sion, calling into service increasingly sophisticated
mathematical tools. Mathematics provides a language 
to describe the measurement processes that lead, even- 
tually, to algorithms for turning the raw data into 
high-quality images. There are four principal modali- 
ties in wide application today: X-ray computed tomog- 
raphy (X-ray CT), ultrasound, magnetic resonance imag- 
ing (MRI), and emission tomography (positron emission 
tomography (PET) and single-photon emission com- 
puted tomography (SPECT)). Each modality uses a dif- 
ferent physical process to produce image contrast: X- 
ray CT produces a map of the X-ray attenuation coeffi- 
cient, which is strongly correlated with density; ultra- 
sound images are produced by mapping absorption 
and reflection of acoustic waves; in their simplest form, 
magnetic resonance images show the density of water 
protons, but the subtlety of the underlying physics pro- 
vides many avenues for producing clinically meaning- 
ful contrasts in this modality; PET and SPECT give spa- 
tial maps of the chemical activity of metabolites, which 
are bound to radioactive elements. It has recently been 
found useful to merge different modalities. For exam- 
ple, a fused MRI/PET image shows metabolic activity 
produced by PET, at a fairly low spatial resolution, 
against the background of a detailed anatomic image 
produced by MRI. Plate 18 shows a PET image, a PET 
image fused with a CT image, and the CT image as well. 

In this article we consider mathematical aspects 
of PET, whose underlying physics we briefly explain. 
Positron emission is a mode of radioactive decay stem- 
ming from the reaction 

proton → neutron + positron + neutrino + energy. (1)

Two isotopes, of clinical importance, that undergo this 
type of decay are $\mathrm{F}^{18}$ and $\mathrm{C}^{11}$. The positron, which
is the positively charged antiparticle of the electron, 
is typically very short-lived as it is annihilated, along 
with the first electron it encounters, producing a pair 
of 0.511 MeV photons. This usually happens within a 
millimeter or two of the site of the radioactive decay. 
Due to conservation of momentum, these two photons 
travel in nearly opposite directions along a straight line 
(see figure 1). The phenomenon of pair annihilation 
underlies the operation of a PET scanner. 

A short-lived isotope that undergoes the reaction 
in (1) is incorporated into a metabolite, e.g., fluo- 
rodeoxyglucose, which is then injected into the patient. 
This metabolite is taken up differentially by vari- 
ous structures in the body. For example, many types 
of cancerous tumors have a very rapid metabolism 
and quickly take up available fluorodeoxyglucose. The
detector in a PET scanner is a ring of scintillation crys- 
tals that surrounds some portion of the patient. The 
high-energy photon interacts with the crystal to pro- 
duce a flash of light. These flashes are fed into photo- 
multiplier tubes with electronics that localize, to some 
extent, where the flash of light occurred and measure 
the energy of the photon that produced it. Finally, dif- 
ferent arrival times are compared to determine which 
events are likely to be “coincidences,” caused by a sin- 
gle pair annihilation. Two photons detected within a 
time window of about 10 nanoseconds are assumed 
to be the result of a single annihilation event. The 
measured locations of a pair of coincident photons 
then determine a line. If the photons simply exited
the patient's body without further interactions, then 
the annihilation event must have occurred somewhere 
along this line (see figure 1). It is not difficult to imag- 
ine that sufficiently many such measurements could 
be used to reconstruct an approximation for the dis- 
tribution of sources, a goal that is facilitated by a more 
quantitative model. 


2 A Quantitative Model 


Radioactive decay is usually modeled as a Poisson ran- 
dom process. Recall that Y is a Poisson random variable 
of intensity $\lambda$ if

\[ \mathrm{Prob}(Y = k) = \frac{\lambda^k e^{-\lambda}}{k!}. \tag{2} \]

A simple calculation shows that $E[Y] = \lambda$ and $\mathrm{Var}[Y] = \lambda$
as well. Let $H$ denote the region within the scanner
that is occupied by the patient, and, for $p \in H$, let $\rho(p)$
denote the concentration of radioactive metabolite as
a function of position. If $\rho$ is measured in the correct
units, then the probability of $k$ decay events originating
from a small volume $dV$ centered at $p$, in a time interval
of unit length, is

\[ \mathrm{Prob}(k; p) = \frac{\rho(p)^k e^{-\rho(p)}}{k!}\, dV. \tag{3} \]


Decays originating at different spatial locations are 
regarded as independent events. 
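
The statement that a Poisson random variable has mean and variance both equal to its intensity can be checked numerically. The sketch below assumes the standard Poisson law Prob(Y = k) = λ^k e^{−λ}/k! and sums the series directly; the intensity 3.5 and the truncation at 60 terms are arbitrary choices for illustration.

```python
import math

def poisson_pmf(k, lam):
    """Probability that a Poisson variable of intensity lam takes the value k."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.5
ks = range(60)   # the tail beyond k = 60 is negligible for this intensity
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)
print(total, mean, var)   # close to 1, lam, and lam respectively
```

Because the standard deviation is the square root of the mean, the relative noise in a count shrinks only like one over the square root of the expected number of events, which is why low count rates matter so much later in the article.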

Assume, for the moment, that 


(i) there are many decay events, so that we are justified
in replacing this probabilistic law by its mean, $\rho(p)$,

(ii) the high-energy photons simply exit the patient 
without interaction, and 

(iii) we are equally likely to detect a given decay event 
on any line passing through the source point. 



Figure 1 A radioactive decay leading to a positron-electron
annihilation, exiting along $\ell_{PQ}$, which is detected as a
coincidence event at $P$ and $Q$ in the detector ring.


Let $\ell_{PQ}$ be the line joining the two detector positions
$P$ and $Q$ where photons are simultaneously detected.
With these assumptions we see that by counting up the
coincidences observed at $P$ and $Q$ we are finding an
approximation to the line integral

\[ X\rho(\ell_{PQ}) = \int_{\ell_{PQ}} \rho(p)\, dl, \]

where $dl$ is the arc length along the line $\ell_{PQ}$. This is
nothing other than a sample of the three-dimensional
X-ray transform of $\rho$, which, if it could be approxi-
mately measured with sufficient accuracy, for a suffi-
ciently dense set of lines, could then be inverted to pro-
duce a good approximate value for $\rho$. This is essentially
what is done in X-ray CT [VII.19].

For the moment, we restrict our attention to lines that 
lie in a plane $\pi_0$ intersecting the patient and choose
coordinates $(x, y, z)$ so that $\pi_0 = \{z = z_0\}$. The lines
in this plane are parametrized by an angle $\theta \in [0, \pi]$
and a real number $s$, with

\[ \ell_{\theta,s} = \{(s\cos\theta, s\sin\theta, z_0) + t(-\sin\theta, \cos\theta, 0) : t \in (-\infty, \infty)\} \]

(see figure 2). In terms of $s$ and $\theta$, the two-dimensional
X-ray transform is given by the integral

\[ X\rho(s, \theta, z_0) = \int_{-\infty}^{\infty} \rho\bigl((s\cos\theta, s\sin\theta, z_0) + t(-\sin\theta, \cos\theta, 0)\bigr)\, dt. \]
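
To make the two-dimensional X-ray transform concrete, the sketch below integrates a density numerically along one line, parametrized by the offset s and the angle θ as above, and compares the result with a case that can be done by hand. For the indicator function of a disk of radius R centered at the origin, the integral is the chord length 2√(R² − s²). The midpoint rule, the step count, and the cutoff `t_max` are our choices for illustration, and the fixed z-coordinate of the plane is dropped.

```python
import math

def xray_transform(rho, s, theta, t_max=2.0, n=20000):
    """Midpoint-rule approximation of the integral of rho along the line
    (s cos(theta), s sin(theta)) + t (-sin(theta), cos(theta)), |t| <= t_max.
    Assumes rho vanishes for |t| > t_max."""
    dt = 2.0 * t_max / n
    total = 0.0
    for i in range(n):
        t = -t_max + (i + 0.5) * dt
        x = s * math.cos(theta) - t * math.sin(theta)
        y = s * math.sin(theta) + t * math.cos(theta)
        total += rho(x, y)
    return total * dt

R = 1.0

def disk(x, y):
    """Indicator function of the disk of radius R centered at the origin."""
    return 1.0 if x * x + y * y <= R * R else 0.0

s, theta = 0.6, 0.3
approx = xray_transform(disk, s, theta)
exact = 2.0 * math.sqrt(R * R - s * s)   # chord length at distance s
print(approx, exact)
```

For this rotationally symmetric density the value does not depend on θ, and lines with |s| > R miss the disk and give zero.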

The inverse of this transform is usually represented
as a composition of two operations: a filter acting on
$X\rho(s, \theta, z_0)$ in the $s$ variable, followed by the back-
projection operator. If $g(s, \theta)$ is a function on the space
Figure 2 The lines in the plane $\pi_0$ are labeled by $\theta$, the
angle the normal makes with the x-axis, and $s$, the distance
from the line to the origin.


of lines in a plane, then the filter operation can be rep-
resented by $\mathcal{F}g(s, \theta) = \partial_s \mathcal{H} g(s, \theta)$, where $\mathcal{H}$ is a con-
stant multiple of the Hilbert transform acting in the $s$
variable. The back-projection operator $X^* g$ defines a
function of $(x, y) \in \pi_0$ that is the average of $g$ over all
lines passing through $(x, y)$:

\[ X^* g(x, y) = \frac{1}{\pi} \int_0^{\pi} g(x\cos\theta + y\sin\theta, \theta)\, d\theta. \]

Putting together the pieces, we get the filtered back-
projection (FBP) operator, which inverts the two-dimen-
sional X-ray transform: $\rho(x, y, z_0) = [X^* \circ \mathcal{F}] \cdot X\rho$. By
using this approach for a collection of parallel planes,
the function $\rho$ could be reconstructed in a volume. This
provides a possible method for reconstruction of PET 
images, and indeed the discrete implementations of 
this method have been extensively studied. In the early 
days of PET imaging this approach was widely used, and 
it remains in use today. Note, however, that using only 
data from lines lying in a set of parallel planes is very 
wasteful and leads to images with low signal-to-noise 
ratio. 

Assumption (i) implies that our measurement is 
a good approximation to the X-ray transform of $\rho$,
$X\rho(s, \theta, z_0)$. Because of the very high energies involved
in positron emission radioactivity, only very small 
amounts of short-lived isotopes can be used. The mea- 
sured count rates are therefore low, which leads to mea- 
surements dominated by Poisson noise that are not a 
good approximation to the mean. Because the FBP algo- 
rithm involves a derivative in s, the data must be signif- 
icantly smoothed before this approach to image recon- 
struction can be applied. This produces low-resolution 
images that contain a variety of artifacts due to system- 
atic measurement errors, which we describe below. 

At this point it is useful to have a more accurate 
description of the scanner and the measured data. We 


Figure 3 The detector ring is divided into finitely many
detectors of finite size. Each pair $(d_i, d_j)$ defines a tube $T_{ij}$
in the region occupied by the patient. This region is divided
into boxes $\{b_k\}$.


model the detector as a cylindrical ring surrounding
the patient, which is partitioned into a finite set of
regions $\{d_1, \dots, d_n\}$. The scanner can localize a scintil-
lation event as having occurred in one of these regions,
which we henceforth refer to as detectors. This instru-
ment design suggests that we divide the volume inside
the detector ring into a collection of tubes, $\{T_{ij}\}$, with
each tube defined as the union of lines joining points in
$d_i$ to points in $d_j$ (see figure 3). A measurement $n_{ij}$ is
the number of coincidence events observed by the pair
of detectors $(d_i, d_j)$. The simplest interpretation of $n_{ij}$
is as a sample of a Poisson random variable with mean
proportional to


\[ \int_{\ell_{PQ} \subset T_{ij}} X\rho(\ell_{PQ}). \tag{4} \]


Below we will see that this interpretation requires 
several adjustments. 

Assumption (ii) fails as the photons tend to interact 
quite a lot with the bulk of the patient’s body. Large 
fractions of the photons are absorbed, or scattered, 
with each member of an annihilation pair meeting its 
fate independently of the other. This leads to three 
distinct types of measurement errors. 
Randoms. These are coincidences that are observed by
a pair of detectors but that do not correspond to a
single annihilation event. These can account for
10-30% of the observed events (see figure 4(a)).

Scatter. If one or both photons are scattered and
then both are detected, this may register as a coin-
cidence at a pair of detectors $(d_i, d_j)$, but the anni-
hilation event did not occur at a point lying near $T_{ij}$
(see figure 4(b)).

Attenuation. Most photon pairs (often 95%) are simply 
absorbed, leading to a substantial underestimate of 
the number of events occurring along a given line. 

Below we discuss how the effects of these sorts of mea- 
surement errors can be incorporated into the model 
and the reconstruction algorithm. To get quantita- 
tively meaningful, artifact-free images, these errors 
must be corrected before application of any image- 
reconstruction method. 

Assumption (iii) is false in that the detector array, 
which is usually a ring of scintillation counters, en-
closes only part of the patient. Many lines through the
patient will therefore be disjoint from the detector or only
intersect it at one end. This problem can, to some
extent, be mitigated by using only observations coming 
from lines that lie in planes that intersect the detec- 
tor in a closed curve. If the detector is a section of a 
cylinder, then each point p lies in a collection of such 
planes whose normal vectors $\{\nu_{\psi,\phi}\}$ fill a disk
$D_p$ lying on the unit sphere. If $\rho_{\psi,\phi}(p)$ denotes the
approximate value for $\rho(p)$ determined using the FBP
algorithm in the plane with normal vector $\nu_{\psi,\phi}$, then
an approximate value with improved signal-to-noise ratio
is obtained as the average:

\[ \bar{\rho}(p) = \frac{1}{|D_p|} \int_{D_p} \rho_{\psi,\phi}(p)\, dS(\psi, \phi), \]

where $dS(\psi, \phi)$ is the spherical areal measure. A par-
ticular implementation of this idea that is often used
in PET scanners goes under the name of the “Col- 
sher filter.” Other methods use a collection of paral- 
lel two-dimensional planes to reconstruct an approx- 
imate image from which the missing data for the 
three-dimensional X-ray transform can then be approx- 
imately computed. 

In addition to these inherent physical limitations on 
the measurement process, there is a wide range of
instrumentation problems connected to the detection 
and spatial localization of high-energy photons, as well 
as the discrimination of coincidence events. Effective 




Figure 4 The measurement process in PET scanners is
subject to a variety of systematic errors. (a) Randoms are
detected coincidences that do not result from a single decay
event. (b) Scatter is the result of one or both of the pho-
tons scattering off an object before being detected as a
coincidence event but along the wrong line.

solutions to these problems are central to the success 
of a PET scanner, but they are beyond the scope of this 
article. 

3 Correcting Measurement Errors 

To reconstruct images that are quantitatively meaning- 
ful and reasonably free of artifacts, the measured data 
$\{n_{ij}\}$ must first be corrected for randoms, scatter, and
attenuation (see figure 4). This requires both additional 
measurements and models for the processes that lead 
to these errors. 
3.1 Randoms


We first discuss how to correct for randoms. Let $R_{ij}$
denote the number of coincidences detected on the pair
$(d_i, d_j)$ that are not caused by a decay event in $T_{ij}$. In
practice, coincidences are considered to be two events
that are observed within a certain time window $\tau$ (usu-
ally about 10 nanoseconds). In addition to coincidences
between two detectors, the numbers of single counts,
$\{n_i\}$, observed at $\{d_i\}$ are recorded. In fact, the number
of “singles” is usually one or two orders of magnitude
larger than the number of coincidences. From the mea-
sured number of singles observed over a known period
of time we can infer rates of singles events $\{r_i\}$ for each
detector. Assuming that each of these singles processes
is independent, a reasonable estimate for the number
of coincidences observed on the detector pair $(d_i, d_j)$
over the course of $T$ units of time that are actually ran-
doms is $R_{ij} = \tau T r_i r_j$. A somewhat more accurate esti-
mate is obtained if one accounts for the decay of the
radioactive source.
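
As a worked instance of this estimate, the sketch below computes singles rates from counts and then the expected number of randoms as the product of the coincidence window, the scan duration, and the two singles rates. All numerical values (counts and scan duration) are invented for illustration, and the correction for radioactive decay mentioned above is not included.

```python
def singles_rate(counts, duration):
    """Singles rate (events per second) from a count over a known duration."""
    return counts / duration

def randoms_estimate(r_i, r_j, tau, T):
    """Expected number of random coincidences on one detector pair:
    window tau times duration T times the two singles rates."""
    return tau * T * r_i * r_j

tau = 10e-9      # coincidence window of 10 nanoseconds, in seconds
T = 600.0        # assumed scan duration in seconds
r_i = singles_rate(12_000_000, T)   # assumed singles count at detector i
r_j = singles_rate(18_000_000, T)   # assumed singles count at detector j
R_ij = randoms_estimate(r_i, r_j, tau, T)
print(r_i, r_j, R_ij)
```

Because the estimate is a product of two singles rates, doubling the activity in the field of view quadruples the randoms while the true coincidences only double.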

There are other measurement techniques for esti-
mating $R_{ij}$, though these estimates tend to be rather
noisy. Simply subtracting $R_{ij}$ from $n_{ij}$ can increase
the noise in the measurements and also change their
basic statistical properties. There is a useful technique
for replacing $R_{ij}$ with a lower-variance estimate. Let
$A$ be a collection of contiguous detectors including $d_i$
that are joined to $d_j$, and let $B$ be a similar collection,
including $d_j$, that are joined to $d_i$. Suppose that $R_{mn}$
are estimates for the randoms detected in the pairs
$\{(d_n, d_m) : n \in B,\ m \in A\}$. The expression

\[ \frac{\bigl[\sum_{m \in A} R_{im}\bigr] \bigl[\sum_{n \in B} R_{nj}\bigr]}{\sum_{m \in A,\, n \in B} R_{mn}} \]

provides an estimate for $R_{ij}$ with reduced noise vari-
ance.


3.2 Scatter 

The next source of error we consider is scatter, which
results from one or both photons in the annihilation
pair scattering off some matter in the patient before
being recorded as a coincidence at a pair of detectors
$(d_i, d_j)$. If the scattering angle is not small, then the
annihilation event will not have occurred in the expected
tube $T_{ij}$. Some part of $n_{ij}$, which we denote $S_{ij}$,
therefore corresponds to radioactive decays that did not
occur in $T_{ij}$. Scattered photons tend to lose energy,
so many approaches to estimating the amount of
scatter are connected to measurement of the energies of
detected photons. Depending on the design of the
scanner, scatter can account for 15-40% of the observed
coincidences. There are many methods for estimating
the contribution of scatter but most of them are
related to the specific design of the PET scanner and
are therefore beyond the purview of this article.

3.3 Attenuation 

Once contributions from randoms and scatter have
been removed to obtain corrected observations
$n'_{ij} = n_{ij} - R_{ij} - S_{ij}$, we still need to account for the fact that
many photon pairs are simply absorbed. This process
is described by Beer’s law, which is the basis for X-ray
CT. Suppose that an annihilation event takes place at a
point $p_0$ within the patient and that the photons travel
along the rays $\ell_+$ and $\ell_-$ originating at $p_0$. The
attenuation coefficient is a function $\mu(p)$ defined
throughout the patient’s body such that the probabilities of
detecting photons traveling along $\ell_\pm$ are

$$P_\pm = \exp\Bigl(-\int_{\ell_\pm} \mu \, dl\Bigr).$$

As we “count” only coincidences, and the two photons
are independent, the probability of observing the
coincidence is simply the product

$$P_+ P_- = \exp\Bigl(-\int_{\ell} \mu \, dl\Bigr),$$

where $\ell = \ell_+ \cup \ell_-$. In other words, the attenuation
of coincidence counts due to photons traveling along
$\ell$ does not depend on the location of the annihilation
event along this line! This factor can therefore be
measured by observing the fraction of photons, of the given
energy, emitted outside the patient’s body that pass
through the patient along $\ell$ and are detected on the
opposite side.
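
A small numerical check makes the position independence concrete. The sketch below parametrizes a single line by arc length and uses a hypothetical piecewise-constant attenuation profile (a water-like value of 0.0096 per millimeter over a 200 mm chord, chosen only for illustration); the function names are assumptions introduced here.

```python
import math


def line_integral(mu, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of mu(s) over [a, b]."""
    h = (b - a) / steps
    return h * sum(mu(a + (i + 0.5) * h) for i in range(steps))


def coincidence_probability(mu, p0, start, end):
    """P_+ P_- for an annihilation at position p0 on the line [start, end]."""
    p_minus = math.exp(-line_integral(mu, start, p0))
    p_plus = math.exp(-line_integral(mu, p0, end))
    return p_plus * p_minus


def mu(s):
    # Hypothetical attenuation profile along the line (per mm):
    # uniform tissue for |s| < 100 mm, air elsewhere.
    return 0.0096 if abs(s) < 100.0 else 0.0


# The same factor is obtained wherever the annihilation occurs on the line.
for p0 in (-80.0, 0.0, 60.0):
    print(p0, round(coincidence_probability(mu, p0, -150.0, 150.0), 4))
```

Each single-photon probability changes with `p0`, but their product depends only on the integral of the attenuation coefficient over the whole line.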

For each pair $(d_i, d_j)$ we can therefore determine an
attenuation coefficient $q_{ij}$. The extent of the
intersection of $T_{ij}$ with the patient’s body has a marked effect
on the size of $q_{ij}$, which can range from approximately
1 for tubes lying close to the skin to approximately 0.1
for those passing through a significant part of the body.
The corrected data, which is passed to a reconstruction
algorithm, is therefore

$$\tilde{n}_{ij} = \frac{n_{ij} - R_{ij} - S_{ij}}{q_{ij}}.$$

In addition to the corrections described above, there 
are a variety of adjustments that are needed to account 
for measurement errors attributable to the details of 
the behavior of the detector and the operation of the 
electronics. Applying an FBP algorithm to the corrected 
data, we can obtain a discrete approximation $\rho_{\mathrm{FBP}}(p)$
to $\rho(p)$. In the next section we describe iterative
algorithms for PET image reconstruction. While the FBP
algorithm is linear, and efficient, iterative algorithms 
allow for incorporation of more information about the 
measurement process and are better suited to low 
signal-to-noise ratio data. 

4 Iterative Reconstruction Algorithms 

While filtered back projection provides a good start- 
ing point for image reconstruction in PET, the varying 
statistical properties of different measurements can- 
not be easily incorporated into this algorithm. A vari- 
ety of approaches have been developed that allow for 
the exploitation of such information. To describe these 
algorithms we need to provide a discrete measurement 
model that is somewhat different from that discussed 
above. The underlying idea is that we are developing a 
statistical estimator for the strengths of the Poisson 
processes that produce the observed measurements. 
Note that these measurements must still be corrected 
as described in the previous section. 

In the previous discussion we described the region,
$H$, occupied by the patient as a continuum, with $\rho(p)$
the strength of the radioactive decay processes, a
continuous function of a continuous variable $p \in H$. We
now divide the measurement volume into a finite
collection of boxes, $\{b_1, \ldots, b_B\}$. The radioactive decay of
the tracer in box $b_k$ is modeled as a Poisson process
of strength $\lambda_k$. For each point $p \in H$ and each detector
pair $(d_i, d_j)$, we let $c(p; i, j)$ denote the probability
that a decay event at $p$ is detected as a coincidence in
this detector pair. The patient’s body will produce
scatter and attenuation that will in turn alter the values
of $c(p; i, j)$ from what they would be in its absence,
i.e., the area fraction of a small sphere centered at $p$
intercepted by lines joining points in $d_i$ to points in $d_j$.

In the simplest case, the measurements $\{n_{ij}\}$ would
be interpreted as samples of Poisson random variables
with means

$$\sum_{k=1}^{B} p(k; i, j)\lambda_k,$$

where

$$p(k; i, j) = \frac{1}{V(b_k)} \int_{b_k} c(p; i, j)\, dp$$

is the probability that a decay event in $b_k$ is detected
in the pair $(d_i, d_j)$. Here, $V(b_k)$ is the volume of $b_k$.
Assuming that the attenuation coefficient does not vary
rapidly within the tube $T_{ij}$, we can incorporate
attenuation into this model, as above, by replacing $p(k; i, j)$
with $q_{ij}\, p(k; i, j) = p^a(k; i, j)$. Ignoring
scatter and randoms, the expected value of $n_{ij}$ would
then satisfy

$$E[n_{ij}] = \sum_{k=1}^{B} p^a(k; i, j)\lambda_k = \bar{n}_{ij}.$$

In this model, scatter and randoms are regarded as
independent Poisson processes, with means $\lambda_s(i, j)$
and $\lambda_r(i, j)$, respectively. Including these effects, we see
that the measurement $n_{ij}$ is then a sample of a Poisson
random variable with mean $\bar{n}_{ij} + \lambda_s(i, j) + \lambda_r(i, j)$. The
reconstruction problem is then to infer estimates for
the intensities of the sources $\{\lambda_k\}$ from the
observations $\{n_{ij}\}$. There are a variety of approaches to solving
this problem.

First we consider the reconstruction problem ignoring
the contributions of scatter and randoms. The
measurement model suggests that we look for a solution,
$(\lambda^*_1, \ldots, \lambda^*_B)$, to the system of equations

$$n_{ij} = \sum_{k=1}^{B} p^a(k; i, j)\lambda^*_k.$$

If the array of detectors is three dimensional, there are
likely to be many more detector pairs than boxes in the
volume. The number of such pairs is quadratic in the
number of detectors. This system of equations is
therefore highly overdetermined, so a least-squares solution
is a reasonable choice. That is, $\lambda^*$ could be defined as

$$\lambda^* = \arg\min_{\{y : 0 \le y_k\}} \sum_{i,j} \Bigl(n_{ij} - \sum_{k=1}^{B} p^a(k; i, j)\, y_k\Bigr)^2.$$

Note that we constrain the variables $\{y_k\}$ to be
nonnegative, as this is certainly true of the actual
intensities. The least-squares solution can be interpreted as a
maximum-likelihood (ML) estimate for $\lambda$, when the
likelihood of observing $n$ given the intensities $y$ is given
by the product of Gaussians:

$$L_G(y) = \prod_{i,j} \exp\Bigl[-\Bigl(n_{ij} - \sum_{k=1}^{B} p^a(k; i, j)\, y_k\Bigr)^2\Bigr].$$

It is assumed that the various observations are samples
of independent processes. If we have estimates for the
variances $\{\sigma_{ij}\}$ of these measurements, then we could
instead consider a weighted least-squares solution and
look for

$$\lambda^*_w = \arg\min_{\{y : 0 \le y_k\}} \sum_{i,j} \frac{1}{\sigma_{ij}} \Bigl(n_{ij} - \sum_{k=1}^{B} p^a(k; i, j)\, y_k\Bigr)^2.$$

Because the data tends to be very noisy, in addition
to the “data term” many algorithms include a
regularization term, such as

$$R(y) = \sum_{k=1}^{B} \sum_{k' \in N(k)} |y_k - y_{k'}|^2,$$

where for each $k$, the $N(k)$ are the indices of the boxes
contiguous to $b_k$. The $\beta$-regularized solution is then
defined as

$$\lambda^*_{w,\beta} = \arg\min_{\{y : 0 \le y_k\}} \Bigl[\sum_{i,j} \frac{1}{\sigma_{ij}} \Bigl(n_{ij} - \sum_{k=1}^{B} p^a(k; i, j)\, y_k\Bigr)^2 + \beta R(y)\Bigr].$$

As noted above, these tend to be very large systems of 
equations and are therefore usually solved via iterative 
methods. Indeed, a great deal of the research effort in 
PET is connected with finding data structures and algo- 
rithms to enable solution of such optimization prob- 
lems in a way that is fast enough and stable enough for 
real-time imaging applications. 

Given the nature of radioactive decay, it is perhaps
more reasonable to consider an expression for the
likelihood in terms of Poisson processes. With

$$\nu_{(i,j)}(y) = \sum_{k=1}^{B} p^a(k; i, j)\, y_k$$

we get the Poisson likelihood function

$$L_P(y) = \prod_{i,j} \frac{e^{-\nu_{(i,j)}(y)}\,[\nu_{(i,j)}(y)]^{n_{ij}}}{n_{ij}!}.$$

The expectation-maximization (EM) algorithm provides
a means to iteratively find the nonnegative vector that
maximizes $\log L_P(y)$. After choosing a nonnegative
starting vector $y^{(0)}$, the map from $y^{(m)}$ to $y^{(m+1)}$ is
given by the formula

$$y_k^{(m+1)} = y_k^{(m)} \frac{1}{\sum_{i,j} p^a(k; i, j)} \sum_{i,j} \Bigl[\frac{p^a(k; i, j)\, n_{ij}}{\sum_{k'} p^a(k'; i, j)\, y_{k'}^{(m)}}\Bigr]. \tag{5}$$

This algorithm has several desirable features. Firstly,
if the initial guess $y^{(0)}$ is positive, then the positivity
condition on the components of $y^{(m)}$ is automatic.
Secondly, before convergence, the likelihood is
monotonically increasing; that is, $L_P(y^{(m)}) < L_P(y^{(m+1)})$.
Note that if $n_{ij} = \sum_k p^a(k; i, j)\, y_k^{(m)}$ for all $(i, j)$,
then $y^{(m+1)} = y^{(m)}$. The algorithm defined in (5)
converges too slowly to be practical in clinical
applications. There are several methods to accelerate its
convergence, which also include regularization terms.
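
The EM iteration and its two stated properties (positivity and a nondecreasing likelihood) can be illustrated on a toy problem. In this minimal sketch each detector pair $(i, j)$ is flattened to a single index `d`, so that `P[d][k]` plays the role of $p^a(k; i, j)$; the matrix, the counts, and the function names are made up for illustration.

```python
import math


def em_step(P, n, y):
    """One ML-EM update: P[d][k] ~ p^a(k; d), n[d] counts, y[k] current estimate."""
    pairs, boxes = range(len(n)), range(len(y))
    # Forward projection: expected counts for the current estimate.
    forward = [sum(P[d][k] * y[k] for k in boxes) for d in pairs]
    updated = []
    for k in boxes:
        sensitivity = sum(P[d][k] for d in pairs)
        back = sum(P[d][k] * n[d] / forward[d] for d in pairs)
        updated.append(y[k] * back / sensitivity)
    return updated


def log_likelihood(P, n, y):
    """Logarithm of the Poisson likelihood of the counts n given intensities y."""
    total = 0.0
    for d in range(len(n)):
        mean = sum(P[d][k] * y[k] for k in range(len(y)))
        total += -mean + n[d] * math.log(mean) - math.lgamma(n[d] + 1)
    return total


# Hypothetical system: 4 detector pairs, 3 boxes.
P = [[0.5, 0.1, 0.0],
     [0.2, 0.4, 0.1],
     [0.0, 0.3, 0.5],
     [0.1, 0.1, 0.2]]
n = [12, 20, 25, 9]

y = [1.0, 1.0, 1.0]
history = [log_likelihood(P, n, y)]
for _ in range(50):
    y = em_step(P, n, y)
    history.append(log_likelihood(P, n, y))
print([round(v, 3) for v in y])
```

The update is purely multiplicative, which is why a positive starting vector stays positive, and the recorded values in `history` never decrease.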





Figure 5 Two reconstructions from the same PET data illus- 
trating the superior noise and artifact suppression attain- 
able using iterative algorithms: (a) image reconstructed 
using the FBP algorithm and (b) image reconstructed using 
an iterative ML algorithm. Images courtesy of Dr. Joel Karp, 
Hospital of the University of Pennsylvania. 


We conclude this discussion by explaining how to
include estimates for the contributions of scatter and
randoms to $n_{ij}$ in an ML reconstruction algorithm. We
suppose that $n_{ij}$ can be decomposed as a sum of three
terms:

$$n_{ij} = \sum_{k=1}^{B} p^a(k; i, j)\lambda_k + R_{ij} + S_{ij}. \tag{6}$$

With this decomposition, it is clear how to modify the
update rule in (5):

$$y_k^{(m+1)} = y_k^{(m)} \frac{1}{\sum_{i,j} p^a(k; i, j)} \sum_{i,j} \Bigl[\frac{p^a(k; i, j)\, n_{ij}}{\sum_{k'} p^a(k'; i, j)\, y_{k'}^{(m)} + R_{ij} + S_{ij}}\Bigr].$$

In addition to the ML-based algorithms, there are many 
other iterative approaches to solving these optimiza- 
tion problems that go under the general rubric of 
“algebraic reconstruction techniques,” or ART. Figure 5 
shows two reconstructions of a PET image: part (a) is 
the result of using the FBP algorithm, while part (b) 
shows the output of an iterative ML-EM algorithm. 
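
The update rule modified for randoms and scatter differs from the plain EM step only in the denominator. A minimal sketch follows, again with detector pairs flattened to one index `d` and with made-up values for `P`, `R`, and `S`; the function name is an assumption introduced here.

```python
def em_step_with_background(P, n, y, R, S):
    """EM update in which the expected counts include randoms R and scatter S."""
    pairs, boxes = range(len(n)), range(len(y))
    expected = [sum(P[d][k] * y[k] for k in boxes) + R[d] + S[d] for d in pairs]
    return [
        y[k] * sum(P[d][k] * n[d] / expected[d] for d in pairs)
        / sum(P[d][k] for d in pairs)
        for k in boxes
    ]


# Hypothetical data built exactly according to the three-term decomposition.
P = [[0.5, 0.1], [0.2, 0.4], [0.1, 0.3]]
lam = [10.0, 20.0]
R = [1.0, 2.0, 1.5]
S = [0.5, 0.5, 1.0]
n = [sum(P[d][k] * lam[k] for k in range(2)) + R[d] + S[d] for d in range(3)]

# When the data match the model, the true intensities are a fixed point.
print(em_step_with_background(P, n, lam, R, S))
```

Keeping $R_{ij}$ and $S_{ij}$ inside the model, rather than subtracting them from the data, leaves every term in the update nonnegative.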

5 Outlook 

PET imaging provides a method for directly visualiz- 
ing and spatially localizing metabolic processes. As is 
clear from our discussion, the physics involved in inter- 
preting the measurements and designing detectors is 
rather complicated. In this article we have touched on 
only some of the basic ideas used to model the mea- 
surements and develop reconstruction algorithms. The 
FBP algorithm gives the most direct method for
reconstructing images, but the images tend to have low
resolution, streaking artifacts, and noise. One can easily
incorporate much more of the physics into iterative
techniques based on probabilistic models, and this
should lead to much better images. Because of the large 
number of detector pairs for three-dimensional vol- 
umes, naive implementations of iterative algorithms 
require vast computational resources. At the time of 
writing, both reconstruction techniques and the modal- 
ity as a whole are rapidly evolving in response to the 
development of better detectors and faster computers, 
and because of increased storage capabilities. 

Further Reading 

A comprehensive overview of PET is given in Bailey 
et al. (2005). A review article on instrumentation and 
clinical applications is Muehllehner and Karp (2006). 
An early article on three-dimensional reconstruction is 
Colsher (1980). An early paper on the use of ML algo- 
rithms in PET is Vardi et al. (1985). Review articles on 
reconstruction methods from projections and recon- 
struction methods in PET are Lewitt and Matej (2003) 
and Reader and Zaidi (2007), respectively. 
VII.10 Compressed Sensing

Yonina C. Eldar 


1 Introduction 

Compressed sensing (CS) is an exciting, rapidly growing
field that has attracted considerable attention
in electrical engineering, applied mathematics, statistics,
and computer science. CS offers a framework
for simultaneous sensing and compression of
finite-dimensional vectors that relies on linear
dimensionality reduction. Quite surprisingly, it predicts that
sparse high-dimensional signals can be recovered from
highly incomplete measurements by using efficient
algorithms.

To be more specific, let $x$ be an $n$-vector. In CS we do
not measure $x$ directly but instead acquire $m < n$ linear
measurements of the form $y = Ax$ using an $m \times n$
CS matrix $A$. Ideally, the matrix is designed to reduce
the number of measurements as much as possible while
allowing for recovery of a wide class of signals from
their measurement vectors $y$. Thus, we would like to
choose $m \ll n$.

Since $A$ has fewer rows than columns, it has a
nontrivial null space. This implies that for any
particular signal $x_0$, an infinite number of signals $x$ yield the
same measurements $y = Ax = Ax_0$. To enable
recovery, we must therefore limit ourselves to a special class
of input signals $x$.
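
A tiny example shows the ambiguity. The $2 \times 3$ matrix, the signal, and the null-space vector below are chosen by hand purely for illustration.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


A = [[1, 0, 1],
     [0, 1, 1]]          # m = 2 measurements of an n = 3 vector
x0 = [1, 2, 0]
z = [1, 1, -1]           # a vector in the null space of A: A z = 0

# Every x0 + t*z produces exactly the same measurements as x0.
for t in (-1, 0, 1, 5):
    x = [a + t * b for a, b in zip(x0, z)]
    print(x, matvec(A, x))
```

Note that `t = -1` gives `[0, 1, 1]`, which has as few nonzero entries as `x0` itself, so sparsity alone does not remove the ambiguity without further conditions on the matrix.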

Sparsity is the most prevalent signal structure used
in CS. In its simplest form, sparsity implies that $x$ has
only a small number of nonzero values but we do not
know which entries are nonzero. Mathematically, we
express this condition as $\|x\|_0 \le k$, where $\|x\|_0$ denotes
the $\ell_0$-“norm” of $x$, which counts the number of
nonzeros in $x$ (note that $\|\cdot\|_0$ is not a true norm, since in
general $\|\alpha x\|_0 \ne |\alpha|\,\|x\|_0$ for $\alpha \in \mathbb{R}$). More generally,
CS ideas can be applied when a suitable representation
of $x$ is sparse. A signal $x$ is $k$-sparse in a basis $\Psi$ if there
exists a vector $\theta \in \mathbb{R}^n$ with only $k \ll n$ nonzero entries
such that $x = \Psi\theta$. As an example, the success of many
compression algorithms, such as JPEG 2000 [VII.7 §5],
is tied to the fact that natural images are often sparse
in an appropriate wavelet transform.

Finding a sparse vector $x$ that satisfies the
measurement equation $y = Ax$ can be performed by an
exhaustive search over all possible sets of size $k$. In general,
however, this is impractical; in fact, the task of finding
such an $x$ is known to be NP-hard [I.4 §4.1]. The
surprising result at the heart of CS is that, if $x$ (or a suitable
representation of $x$) is $k$-sparse, then it can be
recovered from $y = Ax$ using a number of measurements
$m$ that is on the order of $k \log n$, under certain
conditions on the matrix $A$. Furthermore, recovery is
possible using polynomial-time algorithms that are robust to
noise and mismodeling of $x$. In particular, the essential
results hold when $x$ is compressible, namely, when it
is well approximated by its best $k$-term representation
$\min_{\|v\|_0 \le k} \|x - v\|$, where the norm in the objective is
arbitrary.

CS has led to a fundamentally new approach to signal 
processing, analog-to-digital converter (ADC) design, 
image recovery, and compression algorithms. Con- 
sumer electronics, civilian and military surveillance, 
medical imaging, radar, and many other applications 
rely on efficient sampling. Reducing the sampling rate 
in these applications by making efficient use of the 
available degrees of freedom can improve the user 
experience; increase data transfer; improve imaging 
quality; and reduce power, cost, and exposure time. 


2 The Design of Measurement Matrices 


The ability to recover $x$ from a small number of
measurements $y = Ax$ depends on the properties of the CS
matrix $A$. In particular, $A$ should be designed so as to
enable unique identification of a $k$-sparse signal $x$. Let
the support $S$ of $x$ be the set of indices over which $x$ is
nonzero, and denote by $x_S$ the vector $x$ restricted to its
support. We similarly denote by $A_S$ the columns of $A$
corresponding to this support, so that $y = Ax = A_S x_S$.
When the support is known, we can recover $x$ from $y$
via $x_S = (A_S^T A_S)^{-1} A_S^T y$, assuming that $A_S$ has full
column rank. The difficulty in CS arises from the fact that
the support of $x$ is not known in advance. Therefore,
determining conditions on $A$ that ensure recovery is
more involved.

As a first step, we would like to choose $A$ such that
every two distinct signals $x$, $x'$ that are $k$-sparse lead to
different measurement vectors $Ax \ne Ax'$. This can be
ensured if the spark of $A$ satisfies $\mathrm{spark}(A) \ge 2k + 1$,
where $\mathrm{spark}(A)$ is the smallest number of columns
of $A$ that are linearly dependent. Since $\mathrm{spark}(A) \in
[2, m + 1]$, this yields the requirement that $m \ge 2k$.
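
For very small matrices the spark can be found by brute force, which also makes its combinatorial cost visible. The sketch below uses exact rational arithmetic for the rank test; the helper names and the example matrices are assumptions made for illustration, and returning infinity when no dependent set exists is a convention adopted here.

```python
from fractions import Fraction
from itertools import combinations


def rank(columns):
    """Rank of the matrix with the given columns, by exact Gaussian elimination."""
    rows = [[Fraction(c[i]) for c in columns] for i in range(len(columns[0]))]
    r = 0
    for col in range(len(columns)):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def spark(A):
    """Smallest number of linearly dependent columns of A (list of rows)."""
    n = len(A[0])
    cols = [[row[j] for row in A] for j in range(n)]
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if rank([cols[j] for j in subset]) < size:
                return size
    return float("inf")  # no dependent subset; only possible when n <= m


# A 3 x 5 Vandermonde matrix built from the distinct scalars 1, ..., 5.
V = [[s ** p for s in (1, 2, 3, 4, 5)] for p in range(3)]
print(spark(V))
```

The loop visits every subset of columns up to the answer's size, so this approach is only feasible for toy dimensions.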

Unfortunately, computing the spark of a general
matrix $A$ has combinatorial computational complexity,
since one must verify that all sets of columns of a
certain size are linearly independent. Instead, one can
provide (suboptimal) recovery guarantees using the
coherence $\mu(A)$, which is easily computable and is defined
as

$$\mu(A) = \max_{1 \le i \ne j \le n} \frac{|a_i^T a_j|}{\|a_i\|_2 \|a_j\|_2},$$

where $a_i$ is the $i$th column of $A$. For any $A$,

$$\mathrm{spark}(A) \ge 1 + \frac{1}{\mu(A)}.$$

Therefore, if

$$k < \frac{1}{2}\Bigl(1 + \frac{1}{\mu(A)}\Bigr), \tag{1}$$

then for any $y \in \mathbb{R}^m$ there exists at most one $k$-sparse
signal $x \in \mathbb{R}^n$ such that $y = Ax$.
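
In contrast to the spark, the coherence needs only the pairwise inner products of the columns. The following sketch computes it for a small hand-picked matrix and evaluates the sparsity level permitted by condition (1); the function names are introduced here for illustration.

```python
import math


def coherence(A):
    """Largest normalized inner product between distinct columns of A."""
    n = len(A[0])
    cols = [[row[j] for row in A] for j in range(n)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in cols]
    return max(
        abs(sum(a * b for a, b in zip(cols[i], cols[j]))) / (norms[i] * norms[j])
        for i in range(n)
        for j in range(i + 1, n)
    )


def max_sparsity(mu):
    """Largest integer k satisfying k < (1 + 1/mu) / 2."""
    return math.ceil(0.5 * (1.0 + 1.0 / mu)) - 1


A = [[1, 0, 1],
     [0, 1, 1]]
mu = coherence(A)
print(mu)                  # 1/sqrt(2) for this matrix
print(1 + 1 / mu)          # lower bound on spark(A); the true spark is 3
print(max_sparsity(mu))    # sparsity level guaranteed unique by (1)
```

For this matrix the bound gives $\mathrm{spark}(A) \ge 2.41$, consistent with the true value of 3, and condition (1) guarantees uniqueness only for 1-sparse signals.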

In order to ensure stable recovery from noisy
measurements $y = Ax + w$, where $w$ represents noise,
more stringent requirements on $A$ are needed. One
such condition is the restricted isometry property (RIP).
A matrix $A$ has the $(k, \delta)$-RIP for $\delta \in (0, 1)$ if, for all
$k$-sparse vectors $x$,

$$(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2.$$

This means that all submatrices of $A$ of size $m \times k$ are
close to an isometry and are therefore distance
preserving. Clearly, if $A$ has the $(2k, \delta)$-RIP with $0 < \delta < 1$, then
$\mathrm{spark}(A) \ge 2k + 1$.

The RIP enables recovery guarantees that are much 
stronger than those based on spark and coherence. 
Another property used to characterize A is the null- 
space condition. This requirement ensures that the null 
space of A does not contain vectors that are concen- 
trated on a small subset of indices. If a matrix sat- 
isfies the RIP, then it also has the null-space prop- 
erty. However, checking whether A satisfies either 
of these conditions has combinatorial computational 
complexity. 

Random matrices $A$ of size $m \times n$ with $m < n$, whose
entries are independent and identically distributed
with continuous distributions, have $\mathrm{spark}(A) = m + 1$
with high probability. When the distribution has
zero mean and finite variance, then in the asymptotic
regime (as $m$ and $n$ grow) the coherence converges
to $\mu(A) = \sqrt{2 \log n / m}$. Random matrices from Gaussian,
Rademacher, or (more generally) a sub-Gaussian
distribution have the $(k, \delta)$-RIP with high probability
if $m = O(k \log(n/k) / \delta^2)$. Similarly, it can be shown
that a partial Fourier matrix with $m = O(k \log^4 n / \delta^2)$
rows, namely a matrix formed from the $n \times n$ Fourier
matrix by taking $m$ of its rows uniformly at random,
satisfies the RIP of order $k$ with high probability. A
similar result holds for random submatrices of orthogonal
matrices.

There are also deterministic matrices that satisfy the
spark and RIP conditions. For example, an $m \times n$
Vandermonde matrix constructed from $n$ distinct scalars
has spark equal to $m + 1$. Unfortunately, these matrices
are poorly conditioned for large values of $n$,
rendering the recovery problem numerically unstable. It
is also possible to construct deterministic CS matrices
of size $m \times n$ that have the $(k, \delta)$-RIP for
$k = O(\sqrt{m} \log m / \log(n/m))$.

3 Recovery Algorithms

Many algorithms have been proposed to recover a 
sparse vector x from measurements y = Ax. When 
the measurements are noise free, and when A satisfies 
the spark requirement, the unique sparse vector x can 
be found by solving the optimization problem 

$$\hat{x} = \arg\min_{x \in \mathbb{R}^n} \|x\|_0 \quad \text{subject to} \quad y = Ax. \tag{2}$$

Solving (2) relies on an exhaustive search, so a vari- 
ety of computationally feasible alternatives have been 
developed. 

One popular approach to obtain a tractable problem
is to replace the $\ell_0$-norm by the $\ell_1$-norm, which is
convex. The resulting adaptation of (2), known as basis
pursuit, is defined by

$$\hat{x} = \arg\min_{x \in \mathbb{R}^n} \|x\|_1 \quad \text{subject to} \quad y = Ax.$$

This algorithm can be implemented as a linear
program [IV.11 §3], making its computational complexity
polynomial in $n$. Basis pursuit is easily modified
to allow for noisy measurements by changing the
constraint to $\|y - Ax\|_2 \le \varepsilon$, where $\varepsilon$ is an appropriately
chosen bound on the noise magnitude. The Lagrangian
relaxation of the resulting problem is given by

$$\hat{x} = \arg\min_{x \in \mathbb{R}^n} \|x\|_1 + \lambda \|y - Ax\|_2^2,$$

and it is known as basis pursuit denoising (BPDN). Many
fast methods have been developed in order to find
BPDN solutions.

An alternative to optimization-based techniques is provided by
greedy algorithms for sparse signal recovery. These
methods are iterative in nature and select columns of A 
according to their correlation with the measurements 
y. Several greedy methods can be shown to have per- 
formance guarantees that match those obtained for 
BPDN. 

For example, the matching-pursuit and orthogonal
matching pursuit algorithms proceed by finding the
column $a_j$ of $A$ most correlated to the signal residual,
where

$$j = \arg\max_i \frac{|a_i^T r|^2}{\|a_i\|_2^2}.$$

The residual $r$ is obtained by subtracting the
contribution of a partial estimate of the signal from $y$:
$r = y - A_S x_S$, where $S$ is the current guess of the
support set. The convergence criterion used to find
sparse representations consists of checking whether
$y = Ax$ exactly or approximately. The difference
between the two techniques is in the coefficient update
stage. While in orthogonal matching pursuit in each
stage all nonzero elements are chosen so as to
minimize the residual error $\|y - A_S x_S\|_2^2$, in matching
pursuit only the component associated with the currently
selected column is updated to $a_j^T r / \|a_j\|_2^2$.
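
The simpler of the two greedy methods, matching pursuit, fits in a few lines. This is a minimal sketch with a fixed iteration count instead of a residual-based stopping rule; the coefficient of the selected column is incremented by $a_j^T r / \|a_j\|_2^2$, which coincides with the update described above the first time a column is chosen. The matrix and data are hand-picked, and orthogonal matching pursuit would additionally re-solve a least-squares problem over the current support.

```python
def matching_pursuit(A, y, iterations):
    """Plain matching pursuit; A is a list of rows, y the measurement vector."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    sq_norms = [sum(v * v for v in c) for c in cols]
    x = [0.0] * n
    r = list(y)                       # residual r = y - A x, with x = 0 initially
    for _ in range(iterations):
        corr = [sum(c_i * r_i for c_i, r_i in zip(c, r)) for c in cols]
        # Column most correlated with the residual.
        j = max(range(n), key=lambda i: corr[i] ** 2 / sq_norms[i])
        step = corr[j] / sq_norms[j]
        x[j] += step
        r = [r_i - step * c_i for r_i, c_i in zip(r, cols[j])]
    return x, r


A = [[1, 0, 1],
     [0, 1, 1]]
y = [3, 3]                            # equals 3 times the third column
x_hat, residual = matching_pursuit(A, y, iterations=3)
print(x_hat, residual)
```

Each step removes from the residual its projection onto one column, so the residual norm can never increase from one iteration to the next.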

Another popular approach is known as iterative hard
thresholding: starting from an initial estimate $x_0 = 0$,
the algorithm iterates a gradient descent step followed
by hard thresholding, i.e.,

$$x_i = H_k\bigl(x_{i-1} + A^T (y - A x_{i-1})\bigr).$$

Here, $H_k(v)$ keeps the $k$ entries of $v$ that are largest
in absolute value and sets the remaining entries to zero.
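
Iterative hard thresholding is equally short to write down. The sketch below uses the unit step size of the formula above and a hand-picked $2 \times 3$ matrix whose columns are short enough for the iteration to converge; in general convergence depends on conditions such as the RIP, which this toy example does not attempt to verify.

```python
def hard_threshold(v, k):
    """Keep the k entries of v largest in absolute value; zero the rest."""
    keep = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k]
    out = [0.0] * len(v)
    for i in keep:
        out[i] = v[i]
    return out


def iht(A, y, k, iterations):
    """Iterative hard thresholding with unit step size, starting from zero."""
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iterations):
        residual = [y[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        gradient = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        x = hard_threshold([x[j] + gradient[j] for j in range(n)], k)
    return x


A = [[0.8, 0.0, 0.4],
     [0.0, 0.8, 0.4]]
y = [0.8, 0.0]                        # measurements of the 1-sparse x = [1, 0, 0]
print(iht(A, y, k=1, iterations=60))
```

In this example the error in the first coordinate shrinks by a factor of 0.36 per iteration, while the thresholding step keeps the other two coordinates at zero.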

Many of the CS algorithms above come with guaran- 
tees on their performance. For example, basis pursuit 
and orthogonal matching pursuit recover a k-sparse 
vector from noiseless measurements when the matrix 
$A$ satisfies (1). There also exist coherence-based
guarantees designed for measurements corrupted with
arbitrary noise. In general, though, results based on
coherence typically suffer from the so-called square-root
bottleneck: they require $m = O(k^2)$ measurements to
ensure good recovery.

Stronger guarantees are available based on the RIP,
which motivates the popularity of random CS matrices.
In particular, orthogonal matching pursuit recovers a
$k$-sparse vector from exact measurements if $A$ has the
$(k + 1, \delta)$-RIP with a small enough value of $\delta$. More
generally, a sparse vector $x$ can be recovered with small
error from noisy measurements using iterative hard
thresholding and BPDN when $A$ has the $(ck, \delta)$-RIP, with
appropriate values of $c$ and $\delta$. These results also hold
when $x$ is not exactly sparse but only compressible. The
recovery error in this case is proportional to that of the
best $k$-sparse approximation of $x$ and to the norm of
the noise. Since random matrices satisfying the RIP can
be constructed as long as $m = O(k \log(n/k))$, it follows
that with high probability on the order of $k \log(n/k)$
measurements suffice to guarantee recovery of sparse
vectors in the noise-free setting and to ensure recovery
with small error in the noisy case.

4 Applications 

4.1 Imaging 

One of the first applications of CS was the single-pixel 
camera. This camera uses a single photon detector (the 
single pixel) to measure m inner products of a desired 
image, represented by a vector $x$ in $\mathbb{R}^n$, and a set of test
vectors. Each vector represents the pattern of a digital
micromirror device, which consists of $n$ tiny mirrors
that are individually oriented in a pseudorandom fash- 
ion either toward the photodiode (representing a 1) or 
away from it (representing a 0). The incident light field 
is reflected off the digital micromirror device, collected 
by a lens, and then focused onto the photodiode, which 
computes the inner product between x and the ran- 
dom digital micromirror device pattern. This process is 
repeated m times with different patterns. Good recov- 
ery of the underlying image has been obtained using 
about 60% fewer random measurements than the num- 
ber of reconstructed pixels, assuming sparsity of the 
image in the wavelet domain. 

Another application in which the measurements are 
performed in the transform domain is magnetic res- 
onance imaging (MRI). In MRI the measurements cor- 
respond to samples of the image’s two-dimensional 
continuous Fourier transform. By exploiting the princi- 
ples of CS, one can recover an MRI image from fewer 
Fourier-domain measurements assuming sparsity of 
the image in an appropriate transform domain. For 
example, magnetic resonance angiography images are 
typically sparse in the pixel domain. The sparsity of 
these images can be increased by considering spatial 
finite differences. Brain MRIs are known to be sparse in 
the wavelet domain, and dynamic MRI is sparse in the 
temporal domain. 

MRI scanning time strictly depends on the number 
of samples taken during acquisition. Therefore, appli- 
cations of CS to MRI offer significant improvement in 
image acquisition speed. As performing an MRI scan 
currently takes at least thirty minutes, rapid MRI will 
reduce patient discomfort and image distortion due 
to patient movement during acquisition. An impor- 
tant factor affecting the performance of CS-based MRI 
recovery is the sampling trajectory chosen in the fre- 
quency domain. Pure random sampling is impractical, 
due to hardware and physiological constraints. This 
directly impacts the RIP and the coherence of the result- 
ing measurement matrix. Different applications of MRI 
impose varying constraints on the possible trajectories, 
which must be taken into account when designing a 
CS-based MRI system. 

4.2 Analog-to-Digital Conversion 

Figure 1 Sub-Nyquist hardware prototypes for
(a) cognitive radio and (b) radar.

To date, essentially all ADCs follow the celebrated
Shannon-Nyquist theorem, which states that, in order
to avoid information loss when converting an analog
signal to a digital one, the sampling rate must be at
least twice the signal bandwidth. Ongoing demand for
data, as well as advances in radio frequency technol- 
ogy and the desire to improve resolution, have pro- 
moted the use of high-bandwidth signals. The resulting 
rates dictated by the Shannon-Nyquist theorem impose 
severe challenges both on the acquisition hardware and 
on subsequent storage and processing. 

Combining ideas from sampling theory with the prin- 
ciples of CS, several new paradigms have been devel- 
oped that allow the sampling and processing of a wide 
class of analog signals at sub-Nyquist rates using prac- 
tical hardware architectures. One such framework is 
referred to as Xampling, and it has led to sub-Nyquist 
prototypes for a variety of problems including cogni- 
tive radio, radar, ultrasound imaging, ultra-wideband 
communication, and more. Two of the hardware boards 
developed for cognitive radio and radar are presented 
in figure 1 . 

In a cognitive radio setting, the signal $x(t)$ is modeled
as a multiband input with sparse spectra, such that its
continuous-time Fourier transform is supported on $N$
frequency intervals with individual widths not exceeding
$B$ Hz. Each interval is centered around an unknown
carrier frequency $f_i$ that is no larger than a maximum
frequency $f_{\max}$. Using the Xampling paradigm, a
sub-Nyquist prototype referred to as the modulated
wideband converter has been developed that can sample
and process such signals at rates as low as $2NB$, despite
the fact that the signal may be spread over a very
wide frequency range. This rate is much lower than the
Nyquist rate, corresponding to $f_{\max}$.

The modulated wideband converter modulates the
incoming signal with a pseudorandom periodic
sequence, applies a lowpass filter to the result, and then
samples the output at a low rate. The mixing operation
aliases the spectrum to baseband with different
weights for each frequency interval. The signal is
recovered using CS techniques that account for the signal
structure. The board in figure 1 samples signals with
a Nyquist rate of 2.4 GHz and a spectral occupancy of
120 MHz at a rate of 280 MHz.

Figure 2 Sub-Nyquist ultrasound imaging. (a) A standard
image and (b) an image formed at a twenty-eighth of the
Nyquist rate.

Another signal class that can be sampled at sub-Nyquist
rates is that of streams of pulses:

$$x(t) = \sum_{\ell=1}^{L} a_\ell\, h(t - t_\ell), \quad t \in [0, \tau],$$

where the time delays $t_\ell$ and amplitudes $a_\ell$ are
unknown. Such signals arise, for example, in
communication channels that introduce multipath fading,
ultrasound imaging, and radar. Here, again, the Xampling
paradigm can be used to sample and process such
signals at rates as low as $2L/\tau$ irrespective of the signal’s
bandwidth.

The board in figure 1 allows, for example, detection 
of radar signals at a thirtieth of the signal’s Nyquist 
rate. Figure 2 demonstrates fast ultrasound imaging 
using Xampling. Part (a) shows an ultrasound frame 
obtained by standard imaging techniques, while the 
image in part (b) is formed from samples at a twenty- 
eighth of the Nyquist rate. All the processing is per- 
formed at this low rate as well. 

5 Extensions 

In recent years, the area of CS has branched out to many 
new fronts and has worked its way into several appli- 
cation areas. This, in turn, necessitates a fresh look at 
many of the basics of CS. 

A significant part of recent work on CS can be clas- 
sified into three major areas. The first of these con- 
sists of theory and applications related to CS matrices 
that are not completely random, or entirely determinis- 
tic, and that often exhibit considerable structure. This 
largely follows from efforts to model the way in which 
samples are acquired in practice, which leads to sens- 
ing matrices that inherit their structure from the real 
world. 


The second area includes signal representations that 
exhibit structure beyond sparsity as well as broader 
classes of signals, such as low-rank matrices and matrix 
completion, exploiting the distribution of the nonzero 
coefficients or other structured knowledge about the 
nonzero entries of x, and continuous-time signals with 
finite- or infinite-dimensional representations. In the 
context of analog signals, large amounts of effort are 
being devoted to the development of efficient ADC pro- 
totypes that achieve sub-Nyquist sampling in practice. 

Finally, a very recent trend in CS is to move away 
from the linear measurement model and consider var- 
ious types of nonlinear measurements. One particular 
example of this is phase-retrieval problems in which 
the measurements have the form $y_i = |a_i^T x|^2$ for
a set of vectors $a_i$. Note that only the magnitude
of $a_i^T x$ is measured here, and not the phase. Phase-
retrieval problems arise in many areas of optics, where 
the detector can measure only the magnitude of the 
received optical wave. Several important applications 
of phase retrieval include X-ray crystallography, trans- 
mission electron microscopy, and coherent diffractive 
imaging. Exploiting sparsity and ideas related to low- 
rank matrix representations results in efficient algorithms
for phase retrieval with provable recovery guarantees.
Quantized measurements are another example of nonlinear
measurements.
VII.11 Programming Languages: An Applied Mathematics View

Nicholas J. Higham 


The purpose of this article is to give an overview of 
computer programming languages from the point of 
view of applied mathematics. The historical develop- 
ment is emphasized because modern languages have 
been strongly influenced by those that came before, and 
indeed the oldest language of all, Fortran, is still widely 
used. Figure 1 shows the major relationships between 
the languages discussed in this article. 

1 The Early Days 

The first digital stored-program computers were pro- 
grammed by directly entering the low-level instruc- 
tions, represented by binary numbers, that the cen- 
tral processing unit understood (see figure 2). This was 
tedious and error prone, so assembly languages were 
developed that allowed the instructions to be entered 
as mnemonics, which were then translated by software 
(the assembler) into the corresponding binary instruc- 
tions. To add 5 to a number stored in a memory loca- 
tion, one might write a sequence of assembly language 
instructions such as LDA P (load the contents of mem- 
ory location P into the accumulator), ADC #5 (add 5 
to the accumulator), STA Q (store the contents of the 
accumulator in memory location Q). Assembly language 
requires the programmer to work at the level of indi- 
vidual machine instructions and is far removed from 
mathematical notation. 

It was a major step forward when John Backus and his 
colleagues at IBM designed the language Fortran and in 
1957 distributed a Fortran compiler for the IBM 704 
computer. A compiler translates a program written in a 
high-level language into a sequence of machine instruc- 
tions that can then be directly executed. Standing for 
“formula translation,” Fortran allowed mathematical 
expressions to be expressed in a natural algebraic nota- 
tion, such as Q = P + 5 for the example above. It also 
included many of the features we take for granted 
in programming languages today, such as loops, con- 
ditional tests, arrays, and elementary functions. For- 
tran was a huge success, and it became the first pro- 
gramming language to be standardized, as American 
National Standards Institute (ANSI) standard Fortran 66 
(where the digits denote the year of adoption of the 


standard). Standardization was important for expand- 
ing the number of compiler implementations of the lan- 
guage and aiding the portability of programs from one 
system to another. Fortran 66 included subroutines and 
functions, and also supported three floating-point data 
types: real, double precision, and complex. 

Subroutines and functions are examples of subpro- 
grams, sequences of code forming essentially separate 
programs that can be called with different input argu- 
ments from a main program or other subprogram. They 
are essential in mathematical computation for encap- 
sulating basic operations such as adding two vectors 
or finding a norm of a vector, as well as for higher-
level tasks such as finding the roots of a polynomial or
solving a differential equation. Arguments to a subpro- 
gram can be passed in at least two ways. In call by value 
the argument is evaluated at the time of the subpro- 
gram call and its value is copied into the formal param- 
eter inside the subprogram. In call by reference the 
address of the parameter is passed, so that the actual 
and formal parameters effectively share the same mem- 
ory locations. An important difference between call by 
value and call by reference is that in the latter case 
any change made to the argument within the subpro- 
gram also changes the actual argument. In Fortran all 
parameters are passed by reference. 
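
The practical difference between the two conventions can be imitated in Python, which is used here purely for illustration. Python itself passes references to objects, so in this minimal sketch an in-place update of a list plays the role of call by reference and an explicit copy plays the role of call by value. The function names are hypothetical.

```python
def scale_in_place(v, alpha):
    """Reference-like behavior: the caller's list is modified."""
    for i in range(len(v)):
        v[i] *= alpha


def scale_copy(v, alpha):
    """Value-like behavior: work on a private copy and return it."""
    w = list(v)  # copy the argument, as call by value would
    for i in range(len(w)):
        w[i] *= alpha
    return w


x = [1.0, 2.0, 3.0]
y = scale_copy(x, 2.0)   # x is unchanged by this call
scale_in_place(x, 10.0)  # x itself is changed by this call
print(x, y)
```

After the two calls the list x has been scaled by 10, while y holds a separate copy scaled by 2.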

The first commercial Fortran textbook was A Guide to 
Fortran Programming by Daniel McCracken, published 
in 1961. The author has stated that only a couple of 
programs in the book had been tested because machine 
time cost $46 per hour! 

Lisp, invented by John McCarthy in 1958, is the sec- 
ond oldest language still in wide use. The name stands 
for “list processor” and, as the name suggests, Lisp is 
based on list data structures. Lisp is well suited to func- 
tional programming, in which programs are entirely 
expressed in terms of mathematical functions (in par- 
ticular, function application is the only control struc- 
ture) and functions have no “side effects”; that is, they 
do not do anything except return a value. Lisp programs 
look completely different to those written in an impera- 
tive language such as Fortran, not least due to the pro- 
fusion of parentheses and the use of prefix notation 
(the sum 1 + 2 + 3 is expressed as (+ 1 2 3); see 
section 5.4). While Lisp is rarely used for floating-point 
computation, it is well suited to symbolic computa- 
tion, and it is the language in which the popular Emacs 
editor is mostly written. Lisp also has intrinsic math- 
ematical interest due to its close relation to Alonzo 
Church’s lambda calculus. The “if-then-else” construct
first appeared in Lisp. Lisp has various dialects, including
Scheme (used particularly for teaching) and Common
Lisp.

Figure 1 Time line of selected programming languages, with major influences denoted by arrows.

Fortran had been developed in an ad hoc manner. 
A different language called Algol 60 was produced in 
1960 through the efforts of an international commit- 
tee. The language was described in a formal notation, 
later called Backus-Naur form. Algol 60 was based on 
nested blocks delimited by begin and end statements,
with the scope of a variable (the region of the program 
in which it is valid) restricted to the enclosing block, 
and it allowed for dynamic arrays, whose size is deter- 
mined during execution of the program. It became the 
“official” language for publishing mathematical soft- 
ware in the 1960s (notably for the first six years of the 
journal Communications of the ACM, which began in
1960) and a strong competitor to Fortran for practi- 
cal use. However, the language ultimately did not suc- 
ceed for a variety of reasons, including the fact that 
it did not define any input-output facilities (making it 
impossible to write a portable “Hello, world!” program). 


Nevertheless, some influential early mathematical soft- 
ware was published in Algol 60, notably linear algebra 
software in the journal Numerische Mathematik, later 
collected into a 1971 volume of the Handbook for Auto- 
matic Computation series. In late 2014 it was reported 
that a language called JOVIAL based on a 1958 version 
of Algol was still in use in the UK air traffic control 
system. 

Algol 60 greatly influenced future languages, such 
as Algol 68 (1968), a more rigorously defined language 
designed by a working group of the International Fed- 
eration for Information Processing, which was mainly 
of interest to computer science researchers, and Pas- 
cal, published in 1971 by Niklaus Wirth, which is a 
much smaller and simpler language than Algol 68. Pas- 
cal was widely taught in universities through the 1980s, 
as it promoted the notion of structured programming 
(see section 5.17) and so provided a way to avoid the 
hard-to-read “spaghetti code” that could easily be produced
in Fortran 66. It also achieved wide use in industry,
thanks to the availability of compilers on early
PCs and Macintosh computers. However, Pascal was not
well suited to numerical programming, not least due to
its support for only one type of floating-point variable
(real) and the absence of an exponentiation operator.

Figure 2 The first program for a stored-program computer:
a version of Tom Kilburn’s highest factor routine that first
ran on the Manchester “Baby” on June 21, 1948. Taken from
G. C. Tootill’s notebook. Copyright of The University of
Manchester.

An influential early textbook was George Forsythe 
and Cleve Moler’s Computer Solution of Algebraic Equa- 
tions, published in 1967. It contained listings of pro- 
grams written in Algol 60, Fortran, and PL/I for solv- 
ing a square linear system of equations Ax = b. 
(PL/I (1964) was a large language that was not widely 
adopted for scientific computing; Edsger Dijkstra said 
that “Using PL/I must be like flying a plane with 7,000 
buttons, switches, and handles to manipulate in the 
cockpit.”) These codes were part of a long sequence that 
led to the Fortran linear system solver in the LINPACK 
(1979) library. 

Basic was invented at Dartmouth College in 1964 by 
John Kemeny and Thomas Kurtz in order to teach pro- 
gramming to students who did not necessarily have a 
science background. At Dartmouth, Basic was used on a 
time-sharing system, which allowed the programmer to 
interact with the computer via a terminal, as opposed 
to the usual batch processing of the time, in which 
jobs were prepared on punched cards and handled by 
computer operators. The original Basic was in some 


respects a simplified Fortran, with only one data type 
(double precision), free-form input, the key word LET 
required before every assignment, numbered lines, and 
a GOTO command whose destination was a line number. 
Many early personal computers, including the IBM PC, 
provided versions of Basic (usually based on Microsoft 
Basic), typically built into the firmware of the machine. 
Visual Basic, introduced by Microsoft in 1991, included 
features to aid in the development of graphical user 
interface (GUI) applications, and it continues to exist 
as part of the .NET framework. Although a language 
often associated with writing games (such as the classic
Star Trek game originating on 1970s minicomputers), 
Basic was a capable language for numerical computa- 
tions, and its accessibility on microcomputers led to it 
being widely used in mathematical research and teach- 
ing, including by this author. Basic was often imple- 
mented with an interpreter, which translates and exe- 
cutes each statement in the source code before going 
on to the next statement. 

Another 1960s development was the language APL,
implemented at IBM in 1965. It takes its name, and 
much of its notation, from the 1962 book A Program- 
ming Language by Kenneth Iverson. It is unusual in 
using non-ASCII characters to represent operators and 
functions, which make possible very concise programs 
that are often criticized as being cryptic. The notation 
$\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ for the floor and ceiling functions originates
in Iverson’s book and is used (as functions ⌊ and ⌈) in
APL. Indeed, APL has been described as an “executable 
notation.” APL has powerful array processing and is 
normally interpreted. It was never widely adopted but 
has been influential and is still in use today. 

2 The Modern Era 

The language C (1973), by Dennis Ritchie, is a com- 
pact language in which the Unix operating system was 
mainly implemented. C has float and double floating-point
data types, corresponding to single and double
precision, respectively. Arguments to C functions
are passed by value, but a pointer can be passed in 
order to achieve call by reference. The syntax is terse 
and powerful. C has been remarkably successful for 
several reasons. First, its operations and types map 
directly to the hardware, making it easy to write pro- 
grams that carry out low-level system tasks and mak- 
ing it possible for compilers to produce very efficient 
code. Second, an ANSI/ISO standard was produced in 
1989 (and revised in 1999 and 2011), aiding portability 
of programs. Third, C has remained more free from
proprietary extensions than other languages. 

The language C++ by Bjarne Stroustrup (1985) is 
a descendant of C that is a superset, but for mi- 
nor details, and adds better type checking, flexi- 
ble data abstraction mechanisms, and support for 
object-oriented programming. Data abstraction allows 
the programmer to specify user-defined types, called 
classes in C++, and isolates how they are represented 
from how they are used; in other words, it hides the 
implementation details within the implementation of 
the types. Object-oriented programming is a method- 
ology based on a hierarchy of classes and objects, 
which are specific instances of the classes with their 
own characteristics. One of the most popular uses of 
object-oriented programming is in developing GUIs. 
C++ also supports “generic programming,” through the 
use of templates, whereby code can be written with 
parametrized types. 

Throughout the history of computing, new program- 
ming languages have regularly been designed, with vari- 
ous goals, including providing a better general-purpose 
language or providing a language tailored to specific 
purposes, such as system programming tasks. 

Java, developed by James Gosling at Sun Microsys- 
tems in 1995, is a widely used object-oriented lan- 
guage with a syntax similar to that of C++. It com- 
piles to a machine-independent bytecode that runs in 
the Java virtual machine (JVM), and a JVM is provided 
for each machine on which Java is to be used. The 
initial version of Java required bitwise reproducibility 
of floating-point arithmetic across different machines. 
While superficially an attractive feature, it inhibited 
common compiler optimizations as well as the use 
of extended precision registers. These over-restrictive 
floating-point semantics were relaxed in later versions 
of Java, but other aspects such as the lack of com- 
plex arithmetic continue to hinder its use for numeri- 
cal computation. JVMs exploit just-in-time compilation, 
in which Java bytecode is compiled into native machine 
code at run time. The JVM has importance beyond Java: 
some more recent languages such as Scala (2003) and 
Clojure (a dialect of Lisp created in 2007) compile to 
JVM bytecode. 

Of the many languages introduced since C++, the 
most important from the computational mathemat- 
ics point of view is Python (1991), designed by Guido 
van Rossum. Python is a dynamic language, which 
means that it lies somewhere between an interpreted 


language and a compiled language, with many fea- 
tures of the latter. It supports several programming 
paradigms, including object orientation and functional 
programming. Its success in scientific computing stems 
to a large extent from its libraries, which provide core 
computational and graphics capabilities (NumPy, SciPy,
and matplotlib), and from its ability to integrate com- 
ponents written in other languages, such as C and For- 
tran. It has been said that “one doesn’t need to switch 
to Python, only to know where to use it.” Python was 
designed to be a readable language, and its expression 
syntax is similar to that of C. 

The newest language discussed here is Julia (2012), 
designed specifically for high-performance scientific 
computing. Julia is a dynamic language that achieves 
speed approaching that of compiled C code, in part 
due to just-in-time compilation using the LLVM com- 
piler infrastructure. A distinctive feature of Julia is its 
exploitation of multiple dispatch, which allows a func- 
tion to exist in several forms operating on different data 
types, with the appropriate version being called at run 
time based on the actual arguments supplied. An inter- 
esting feature of Julia is that it allows the user to view 
the underlying assembly language that the language 
generates. Viewing these low-level operations can pro- 
vide much insight into how the language works and its 
efficiency (see figure 3). 
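
The idea of multiple dispatch can be sketched in a few lines of Python. This is a toy model rather than Julia's mechanism: the registry, the decorator `method`, and the function `call` are hypothetical names introduced for illustration, and only exact matches on the argument types are supported.

```python
_methods = {}  # maps (name, argument types) to an implementation


def method(*types):
    """Register the decorated function for the given argument types."""
    def decorator(fn):
        _methods[(fn.__name__, types)] = fn
        return fn
    return decorator


def call(name, *args):
    """Select an implementation from the run-time types of all arguments."""
    key = (name, tuple(type(a) for a in args))
    if key not in _methods:
        raise TypeError(f"no method {name} for argument types {key[1]}")
    return _methods[key](*args)


@method(int, int)
def f(x, y):
    return ("integer multiply", x * y)


@method(float, float)
def f(x, y):  # same name, registered for different argument types
    return ("floating-point multiply", x * y)


print(call("f", 3, 5))
print(call("f", 3.0, 5.0))
```

The version that runs is chosen from the types of both arguments at the time of the call, which is the essential point. A real implementation would also handle subtypes and ambiguity.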

The Fortran standard has undergone regular revi- 
sions, known as “Fortran xy,” where xy is 77, 90, 95, 
2003, or 2008 and is related to the year of publication 
of the standard. Fortran 77 introduced an if-then-else 
construct, improved input/output, and a character data 
type. Fortran 90 incorporated dynamic array allocation, 
operations on arrays, modules (a mechanism for pack- 
aging data, derived types, subprograms, and interface 
blocks), recursive subprograms, numeric inquiry func- 
tions, and parametrized intrinsic types. Later revisions 
have introduced support for object-oriented program- 
ming and for handling exceptions in IEEE floating-point 
arithmetic, and interoperability with C. 

3 Parallelism 

Most of the languages mentioned above do not include 
facilities for managing execution of codes in parallel, 
that is, for specifying how a computation is to be bro- 
ken up and executed by different processors simul- 
taneously. Various extensions of existing languages 
have been proposed for parallel computing, but gen- 
erally they have not achieved widespread or long-term 
use. Two widely used systems for parallel computing
are the Message Passing Interface (MPI) for distributed
memory systems and OpenMP for shared-memory systems.
Both are implemented as application programming
interfaces (APIs) that can be invoked from languages
such as C, C++, and Fortran. For expressing
parallelism on specialist devices such as graphics processing
units (GPUs), specialist languages are available,
such as CUDA for NVIDIA GPUs and Open Computing
Language (OpenCL) for GPUs and heterogeneous
platforms in general.

    In [1]: f(x,y) = x*y
    Out[1]: f (generic function with 1 method)

    In [2]: @code_native f(3,5)
            .text
    Filename: In[1]
    Source line: 1
            push    RBP
            mov     RBP, RSP
    Source line: 1
            imul    RDI, RSI
            mov     RAX, RDI
            pop     RBP
            ret

    In [3]: @code_native f(3.0,5.0)
            .text
    Filename: In[1]
    Source line: 1
            push    RBP
            mov     RBP, RSP
    Source line: 1
            mulsd   XMM0, XMM1
            pop     RBP
            ret

Figure 3 A short Julia session, run from within a Jupyter
notebook. The text that follows “In [n]:” on a line is user
input. The definition of the function f does not specify the
types of the arguments. Julia generates different x86 assembler
code depending on whether the actual arguments are
integers (as in In [2]) or floating-point numbers (as in In
[3]).

4 Problem-Solving Environments 

Nowadays, a large part of scientific computing is done 
within environments that provide a programming lan- 
guage, an interactive command window with the dis- 
play of graphics, and the ability to export graphics and 
more generally publish documents to HTML, PDF, TeX,
and so on. They usually also have the ability to mix 
numerical and symbolic computing and by default dis- 
play the result of assignments in the command window. 


The term problem-solving environments (PSEs) is used 
for such systems, of which there are many. 

PSEs have dynamic languages that, combined with 
the interactive interface, avoid the edit-compile-run 
cycle of languages such as C and Fortran. They allow 
quick coding without the need to define the types of 
variables before use. Moreover, a PSE language typi- 
cally includes high-level constructs that would corre- 
spond in a traditional language to many lines of code, 
such as a command to find the indices of the largest 
element(s) of an array or to compute the eigensystem 
of a matrix. Since it is generally accepted that a programmer’s
productivity, measured in the number of lines
of code written, is independent of the language, it follows
that using a higher-level language should allow
the programmer to achieve more in a given time. On 
the other hand, PSEs usually do not execute code as 
fast as a compiled language. 
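
As a small illustration of the economy of high-level constructs, the following sketch finds the indices of the largest element(s) of an array in two ways. Python stands in for a PSE language here, and the function names are hypothetical.

```python
def argmax_loop(a):
    """Explicit loop, as one might write in a traditional language."""
    best = 0
    for i in range(1, len(a)):
        if a[i] > a[best]:
            best = i
    return best  # index of the first occurrence of the maximum


def argmax_all(a):
    """High-level version returning every index at which the maximum occurs."""
    m = max(a)
    return [i for i, v in enumerate(a) if v == m]


a = [3, 7, 1, 7, 2]
print(argmax_loop(a), argmax_all(a))
```

The second function does more (it reports ties) in fewer lines, at the cost of two passes over the data.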

The oldest PSE is MATLAB, originally written in For- 
tran in 1978 by Cleve Moler as a means of providing 
students with easy access to the EISPACK and LINPACK 
linear algebra program libraries. Rewritten in C, MAT- 
LAB was released as a commercial product in 1984 by 
The MathWorks. The fundamental data type in MAT- 
LAB is a matrix, and MATLAB fully supports complex 
arithmetic. 

An interesting feature of MATLAB is that much of 
it is written in MATLAB, in the form of M-files contain- 
ing MATLAB commands. Certain key functions are writ- 
ten in C or call vendor-supplied basic linear algebra 
subprograms (BLAS) [IV.10 §13] or LAPACK [IV.10 §13]
codes. MATLAB programs tend to be much shorter 
than their equivalents in compiled languages, and yet, 
depending on the nature of the code, they can run at 
similar speed. Because of the ease and economy of 
coding, and the interactive interface that aids debug- 
ging, MATLAB is often used as a prototyping tool, an 
environment for developing and testing ideas before 
implementing them in a language such as C or Fortran. 

GNU Octave is free software with many of the fea- 
tures of MATLAB and a largely compatible syntax, so 
that carefully coded programs can run in both MAT- 
LAB and Octave. Scilab is another open-source alterna- 
tive to MATLAB, but it is less compatible with MATLAB 
than Octave. 

Maple started out as a computer algebra system 
developed at the University of Waterloo in 1980. It is 
now a commercial product sold by Waterloo Maple and 
has all the usual features of a PSE. 
Mathematica, by Wolfram Research Software, had a
notebook interface from its first release in 1988, show- 
ing program code, output with typeset mathematics, 
and graphics in a single window. It supports proce- 
dural, functional, rule-based, and pattern-based pro- 
gramming paradigms. It is particularly popular in the 
physics community. 

R is a freely available PSE targeted at statistical com- 
puting and data analysis. Many contributed R packages 
are available on the Comprehensive R Archive Network 
(CRAN). 

Sage is an open-source, Python-based PSE that builds 
on many other open-source packages. It has a browser- 
based notebook. 

Project Jupyter (formerly known as IPython) is an 
open-source project that includes a network proto- 
col for interactive computing in any programming lan- 
guage, a browser-based notebook interface, and tools 
for sharing and converting these notebooks into mul- 
tiple output formats, including HTML and PDF. This 
makes the Jupyter Notebook a full-fledged PSE for Julia, 
Python, R, and other languages. 

5 Programming Miscellany 

We now focus on a variety of different aspects of pro- 
gramming that have a particular relevance to applied 
mathematics. 

5.1 Pseudocode 

In the early days of computing it was common to 
include a complete program listing in an article, as can 
be seen in the 1950s issues of the journal Mathemat- 
ical Tables and Aids to Computation. This practice is 
now uncommon, not least because of the ease of dis- 
tributing code over the Web. It is now usual to describe 
in print the underlying algorithm in terms of a pseu- 
docode that the author bases informally on the control 
structures and other syntax of a particular program- 
ming language (MATLAB being a common example). 
A good pseudocode combines precision, brevity, and 
readability. For examples of pseudocode see the article 
on algorithms [I.4].

5.2 Abstraction 

The mathematical concept of abstraction has proved to 
be important in programming, where it refers to sepa- 
rating concepts from implementation details. Subpro- 
grams take input arguments, carry out a computation, 


and then return output. How they do it need not be 
known to the programmer who invokes them, so a sub- 
program is an abstraction of the computation it carries 
out. Abstraction applies to both procedures and data, 
and is used to the full in object-oriented programming. 

5.3 Influence on Mathematics 

While mathematics has had a strong influence on pro- 
gramming language design, programming languages 
have also influenced mathematics. We already noted 
that APL introduced the ceiling and floor notation. The 
array subscripting (or slicing) notation A(i:j, p:q),
used in Algol 68, MATLAB, and other languages to
denote the subarray comprising the intersection of
rows i to j and columns p to q of the two-dimensional
array A, is now widely used in numerical linear algebra,
especially in pseudocode. 

In a 1928 paper, Kurt Hensel suggested the notation 
A\B for $A^{-1}B$ and A/B for $AB^{-1}$, but it did not catch on.
Cleve Moler independently introduced the notation in 
MATLAB, and the term “backslash” is now commonly 
used to mean solving a matrix equation. 

5.4 Notation for Expressions 

In mathematics we normally write expressions in the 
conventional infix notation illustrated by a + b(c - d),
using parentheses and the usual precedence rules to
specify the order of operations. In Lisp and related 
languages, the expression above is written 

(+ a (* b (- c d)))    (1)

in which each arithmetic operator is followed by its two 
arguments. This prefix notation (also called Polish nota- 
tion) is easier for computers to parse. The evaluation 
proceeds left to right, with the arguments of each oper- 
ator evaluated recursively (in practice, using a stack), 
and no knowledge of the precedence of the operators 
is necessary. 

The parentheses in (1) are not strictly necessary, but 
they are required in Lisp because operators can take 
multiple arguments: (+ 1 2 3) evaluates to 6.

In reverse Polish notation (RPN) the operator follows 
rather than precedes the operands (as in the expression 
n! for a factorial). The expression (1) is written 

a b c d - * + 

which is again evaluated left to right, with the variables 
a and b set aside until it is time to use them. An alternative
way to write the expression that mingles the data
and the operators is

c d - b * a + 

RPN is used in the languages Forth and PostScript, and 
on HP pocket calculators. 
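
The left-to-right, stack-based evaluation of RPN described above can be sketched as follows. This is a minimal illustration in Python, assuming space-separated tokens, binary operators only, and variable values supplied in a dictionary. The name `eval_rpn` is hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def eval_rpn(expr, env):
    """Evaluate a space-separated RPN expression left to right with a stack."""
    stack = []
    for token in expr.split():
        if token in OPS:
            right = stack.pop()  # the second operand is on top of the stack
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(env[token])  # a variable: set it aside for later
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


env = {"a": 1, "b": 2, "c": 7, "d": 4}
print(eval_rpn("a b c d - * +", env))  # a + b*(c - d)
print(eval_rpn("c d - b * a +", env))  # the alternative ordering
```

Both orderings give the same value. The first lets the stack grow to four entries before any operator is applied, whereas the second never holds more than two.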

5.5 Syntactic Peculiarities and Pitfalls 

While there is much commonality between different 
languages, certain differences can catch the unwary 
programmer out. In most languages a single equals 
sign denotes an assignment: x = 1. A test for equal- 
ity is written with a double equals in C and MATLAB: 
if (x == y). If the test is written if (x = y) then in 
C this results in y being assigned to x and the if test
being passed if x is nonzero because (x = y) evaluates
as a true Boolean expression. Algol and Pascal use :=
for assignment, but this is not common in modern lan- 
guages. R has two assignment operators, <- and =, of 
which only the former can be used anywhere in a program.
The test for “not equal” is even more varied: ~=
in MATLAB, != in C, R, and Python, .NE. in Fortran 77,
/= in Fortran 90, and <> in Basic and Pascal.

A common operation is to increment a variable, 
which is typically done using a statement such as x = x 
+ 1. Some languages provide a shorthand notation for 
this operation: in Python it is x += 1 and in C, C++, and 
Java it is x++. A subtlety is illustrated by the C code 

i = 1; j = 1; a = i++; b = ++j;

which results in a = 1 and i = j = b = 2 because the
assignment is done before the incrementation with i++
and after for ++j.

Another aspect of syntax that varies among lan- 
guages is operator precedence in expressions. An 
expression a*b + c is interpreted as $ab + c$ in most
languages, but as $a(b + c)$ in APL, which does not have
any operator precedence and always evaluates right 
to left. However, it is for relational and logical oper- 
ators that differences are most common. An expres- 
sion of the form x or y and z (with symbols such as 
| and & replacing or and and in many languages) can 
mean x or (y and z) or (x or y) and z depending 
on the language. In Lisp, expressions must be fully 
parenthesized, so they always have an unambiguous 
mathematical meaning. 
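
The two possible readings of x or y and z can give different results, as this short Python check shows. In Python, and has higher precedence than or, so the unparenthesized form takes the first reading. Other languages may differ.

```python
x, y, z = True, False, False

unparenthesized = x or y and z  # Python parses this as x or (y and z)
and_first = x or (y and z)
or_first = (x or y) and z

print(unparenthesized, and_first, or_first)
```

With these values the first two expressions are true and the third is false, so adding explicit parentheses is the safest habit when moving between languages.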

5.6 Booleans 

A Boolean, or logical, data type contains two possible 
values: true and false. Many languages denote these 


values true and false. Exceptions include Lisp (t and
nil) and Fortran (.true. and .false.).

C does not have a Boolean data type and instead 
regards any nonzero numerical value as representing 
true and zero as representing false. In MATLAB, logi- 
cal values are converted to 0 (for false) or 1 (for true) 
in numerical expressions, and this can be useful in a 
one-line expression such as 

(exp(x) - 1 + (x == 0))/(x + (x == 0))

which evaluates to $(e^x - 1)/x$ when $x \ne 0$ and to
$1 = \lim_{x \to 0} (e^x - 1)/x$ when $x = 0$, avoiding what would
otherwise be a division by zero. 
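
The same device carries over to Python, where True and False behave as 1 and 0 in arithmetic. The sketch below is for a scalar argument only, and the function name is hypothetical.

```python
import math


def expm1_over_x(x):
    """(e^x - 1)/x, with the limiting value 1 at x = 0, via the Boolean trick."""
    is_zero = (x == 0)  # True counts as 1 and False as 0 in arithmetic
    return (math.exp(x) - 1 + is_zero) / (x + is_zero)


print(expm1_over_x(0.0), expm1_over_x(1.0))
```

For nonzero x close to zero the subtraction in the numerator still suffers from cancellation. The trick removes the division by zero, not the loss of accuracy.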

5.7 Array Storage and Array Indexing 

Fortran stores arrays in column major order, meaning 
that a two-dimensional array is stored sequentially in 
memory, with the elements of the first column being 
followed by those of the second, and so on. C and many 
other languages store arrays in row major order. This 
difference is inconvenient when calling Fortran codes 
from other languages. Knowledge of the storage format 
is crucial because for efficiency it is important to access 
elements of arrays in the order in which they are stored. 
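
The two storage orders amount to two different maps from a pair of indices to a position in memory. The following Python sketch uses zero-based indices and hypothetical function names, and lays out the same 2 × 3 array both ways.

```python
def offset_column_major(i, j, nrows):
    """Position of element (i, j) when columns are stored contiguously."""
    return i + j * nrows


def offset_row_major(i, j, ncols):
    """Position of element (i, j) when rows are stored contiguously."""
    return i * ncols + j


nrows, ncols = 2, 3
A = [[11, 12, 13],
     [21, 22, 23]]

col_major = [None] * (nrows * ncols)
row_major = [None] * (nrows * ncols)
for i in range(nrows):
    for j in range(ncols):
        col_major[offset_column_major(i, j, nrows)] = A[i][j]
        row_major[offset_row_major(i, j, ncols)] = A[i][j]

print(col_major)  # [11, 21, 12, 22, 13, 23]
print(row_major)  # [11, 12, 13, 21, 22, 23]
```

A loop nest that varies the row index fastest walks through the column major layout contiguously, while one that varies the column index fastest suits the row major layout.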

Programming languages differ as to the starting 
index for arrays. In Fortran and MATLAB, for example, 
arrays start at index 1 (a(1), a(2), ...), whereas in
C and Python the first index is 0 (a[0], a[1], ...).
Note that the type of brackets used for array indices, 
round or square, also varies, as illustrated. Mathemati- 
cal descriptions of an algorithm may use 0 or 1 as the 
starting index, depending on the notation in effect. 

The syntax for array slices also differs between languages.
While in Fortran and MATLAB a(i:j) extracts
a(i), a(i+1), ..., a(j), in Python a[i:j] extracts
a[i], a[i+1], ..., a[j-1], so a[j] is omitted. These
differences can be a cause of confusion and bugs. One 
needs to be aware of them and program with care. 
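
A short Python check of the half-open slice convention, with an inclusive slice written out for comparison:

```python
a = [10, 20, 30, 40, 50]  # a[0], ..., a[4]

i, j = 1, 3
half_open = a[i:j]      # a[i], ..., a[j-1]; a[j] is omitted
inclusive = a[i:j + 1]  # a[i], ..., a[j]; the upper limit is included

print(half_open, inclusive)
```

When translating an algorithm stated with inclusive ranges and indices starting at 1, both the lower and the upper limits of each slice need to be checked.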

5.8 Complex Arithmetic 

Computations with complex numbers are ubiquitous in 
applied mathematics. From its earliest versions Fortran 
has had a complex data type that can be used in expres- 
sions such as a + b*c, just like the real and double- 
precision data types. In some other languages, func- 
tions implementing complex arithmetic can be written 
but then expressions must be converted to a sequence 
of function calls, such as cadd(a,cmult(b,c)). The 
PSEs mentioned above all support complex arithmetic, 
as do C (introduced in the 1999 standard), C++, Julia
(which uses im rather than i for the imaginary unit),
and Python (which uses j for the imaginary unit). 

It cannot necessarily be assumed that the compiler 
or interpreter implements complex arithmetic in the 
most accurate and robust way. For example, if the modulus
of a complex number is computed as
$|a + ib| = (a^2 + b^2)^{1/2}$, then the intermediate sum of squares can
overflow even when $|a + ib|$ is representable as a finite
floating-point number. The possibility of overflow is
easily avoided by evaluating $|a|(1 + (b/a)^2)^{1/2}$ when
$|a| \ge |b|$ and an analogous expression when $|b| > |a|$.
Operations such as complex division and evaluation 
of complex elementary functions are more difficult to 
implement reliably. 
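
The overflow issue and its remedy can be sketched in Python as follows. The function names are illustrative only, and the scaled version is a minimal sketch of the formula above rather than a production routine (library functions such as math.hypot address the same problem).

```python
import math

def naive_modulus(a, b):
    """|a + ib| via the textbook formula; a*a + b*b may overflow."""
    return math.sqrt(a * a + b * b)

def scaled_modulus(a, b):
    """|a + ib| via |a| * sqrt(1 + (b/a)^2), with |a| >= |b| after swapping."""
    a, b = abs(a), abs(b)
    if a < b:
        a, b = b, a          # ensure a is the larger magnitude
    if a == 0.0:
        return 0.0           # both parts are zero
    r = b / a                # 0 <= r <= 1, so r*r cannot overflow
    return a * math.sqrt(1.0 + r * r)

big = 1e200
print(naive_modulus(big, big))    # intermediate squares overflow to inf
print(scaled_modulus(big, big))   # finite, about 1.414e200
```

The scaled form costs an extra division and comparison, which is the usual price of robustness here.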

5.9 Variable Names 

In mathematics, variable names are usually one letter, 
Greek or Roman, in lowercase or uppercase. Since For- 
tran introduced the possibility of variable names hav- 
ing more than one letter (albeit limited to six letters in 
Fortran 77 and earlier versions), multiletter names have 
been common. Due to the use of long variable names 
comprising several words joined together, several nam- 
ing conventions have been introduced, illustrated by 
endOfFile or EndOfFile (camel case), end-of-file,
and end_of_file (pothole case). Of course, which
characters are allowed in variable names depends on 
the language. The use of long variable names is facili- 
tated by text editors that allow autocompletion. 

5.10 Floating-Point Semantics 

Many mathematical relations fail to hold in floating-point
arithmetic [II.13] because of the effects of
rounding errors. For example, (a+b)+c and a+(b+c) 
will in general evaluate to results differing at the round- 
off level. Unfortunately, much more subtle issues can 
cause mathematical relations to break down. Intel x86 
chips have 80-bit registers whose precision exceeds 
that of 64-bit double-precision variables. After the 
assignment x = 1.0/3.0 to a double-precision variable
x, a test if x == 1.0/3.0 can return false with some
optimizing compilers if 1.0/3.0 is temporarily stored
in an extended precision register. 
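
The failure of associativity is easy to observe. The following Python snippet, which uses IEEE double precision, is a minimal illustration with values chosen for convenience.

```python
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)

print(x == y)        # the two groupings give different results
print(abs(x - y))    # the difference is at the roundoff level
```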

Some processors offer a fused multiply-add (FMA) 
instruction that evaluates an expression x*y + z with 
just a single rounding error; that is, the result is the 
exact value of x*y + z rounded to the target precision. 
The behavior of a program can then depend on the
compiler in subtle ways. For example, the discriminant
$b^2 - 4ac$ of a quadratic equation can evaluate as negative
when $b^2 \ge 4ac$ if an FMA is used. These kinds of
behavior make it very difficult to prove rigorous cor- 
rectness results for computer programs executed in 
floating-point arithmetic. 
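
The discriminant example can be reproduced without FMA hardware by emulating the single rounding with exact rational arithmetic. In the Python sketch below, fma_emulated is a hypothetical helper written for this illustration, and the values of a, b, and c are chosen so that $b^2 = 4ac$ holds exactly.

```python
from fractions import Fraction

def fma_emulated(x, y, z):
    """Return x*y + z with a single rounding, using exact rationals."""
    return float(Fraction(x) * Fraction(y) + Fraction(z))

b = 1.0 + 2.0**-30
a = c = b / 2.0      # halving is exact, so b*b equals 4*a*c mathematically

# Without FMA: both products are rounded, and the rounding errors cancel.
disc_plain = b * b - 4.0 * a * c

# With FMA: b*b is rounded first, then -4ac is added exactly to it,
# so the result is the (negative) rounding error committed in b*b.
disc_fma = fma_emulated(-4.0 * a, c, b * b)

print(disc_plain, disc_fma)
```

Here the plainly evaluated discriminant is zero, while the emulated FMA returns a tiny negative number, so a test such as "disc < 0" would behave differently in the two cases.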

5.11 Floating-Point Parameters 

Programs that perform floating-point computation often
need to use parameters of the floating-point arithmetic,
such as the unit roundoff [II.13] (typically in
a convergence test) or the overflow level. Some lan- 
guages, such as Fortran, provide direct access to these 
parameters via intrinsic functions. For those that do 
not, there are ways to compute them at run time, 
though these may not be entirely reliable when used 
with optimizing compilers. 
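
As an illustration in Python, the standard library exposes these parameters through sys.float_info, and the classic run-time loop for estimating the machine epsilon is shown alongside for comparison. The function name machine_epsilon is illustrative. Note that sys.float_info.epsilon is the spacing between 1 and the next larger double, so with round to nearest the unit roundoff is half of it.

```python
import sys

eps = sys.float_info.epsilon      # 2**-52 for IEEE double precision
unit_roundoff = eps / 2           # 2**-53 with round to nearest
overflow_level = sys.float_info.max

def machine_epsilon():
    """Estimate eps at run time by repeated halving."""
    e = 1.0
    while 1.0 + e / 2.0 > 1.0:
        e /= 2.0
    return e

print(eps, unit_roundoff, overflow_level)
print(machine_epsilon() == eps)
```

In a compiled language the loop can be defeated by extended-precision registers or aggressive optimization, which is the unreliability referred to above.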

5.12 High-Precision Computations 

The IEEE floating-point arithmetic standard defines 
single- and double-precision formats corresponding 
to about eight and sixteen significant decimal dig- 
its, respectively. Most programming languages support 
two floating-point data types that map onto these formats.
A 2008 revision of the IEEE standard added a 128-bit
quadruple-precision format, which corresponds to
about thirty-two significant decimal digits. Quadruple 
precision is not yet available in hardware, so arith- 
metic of precision higher than double must currently 
be provided in software. 

In Fortran 90 and later versions of Fortran the avail- 
ability of different precisions can be queried, through 
the selected_real_kind function. This allows access
to quadruple precision if it is supported by the com- 
piler. 

A number of open-source libraries are available 
that implement arbitrary precision floating-point arith- 
metic. The GNU MPFR library is a C library that provides 
correctly rounded arithmetic and mathematical func- 
tions, and it is used by Julia’s BigFloat data type. The 
GNU MPC library builds on MPFR to handle complex 
arithmetic. Mpmath is a Python library for arbitrary 
precision floating-point arithmetic. 

High-precision arithmetic has many uses, including
in experimental applied mathematics [VIII.6]
and for obtaining accurate solutions to ill-conditioned 
problems. For a researcher developing or testing a 
numerical algorithm, high precision provides a way to 
compute reference solutions that allow the accuracy of
the algorithm to be tested. 
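
The use of high precision to obtain a reference solution can be sketched with Python's standard decimal module (chosen here only because it needs no installation; MPFR or Mpmath would serve the same purpose). The test function $(e^x - 1)/x$ and the 50-digit working precision are assumptions made for this illustration.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50            # 50 significant decimal digits

x = 1e-12
xd = Decimal(x)                   # exact conversion of the double x

# Reference value of (exp(x) - 1)/x computed in high precision.
ref = float((xd.exp() - 1) / xd)

# Two double-precision algorithms for the same quantity.
naive = (math.exp(x) - 1.0) / x   # suffers from cancellation
stable = math.expm1(x) / x        # uses a function designed for small x

print(abs(naive - ref))           # large error, roughly 1e-4
print(abs(stable - ref))          # error at the roundoff level
```

The reference value makes the loss of accuracy in the naive formula immediately visible, which is difficult to see from double-precision results alone.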

5.13 Types 

A number of subtle issues in programming languages 
revolve around the data type of a variable or expres- 
sion: integer, floating point, logical, string, and so on. 
Some languages, such as C and Java, require the type 
of a variable to be explicitly declared before an assign- 
ment is made to that variable. For example, in C the 
statement double x = 1.1 both declares x to be a 
double-precision variable and gives it the value 1.1. 
Some languages make specifying the type of a variable 
optional or not possible at all. Fortran uses implicit typ- 
ing: if the type is not specified then a default type is 
assigned based on the first letter of the variable (integer 
for i to p and real otherwise). However, it is regarded 
as good practice to turn off this implicit typing with the 
statement implicit none. PSEs tend to determine the
type at the point of assignment.

The type of a variable or expression might be fixed 
or it might be able to change during the execution of 
a program. For example, some languages allow a string 
to be added to a number and define the result to be 
either a string or a number. The terms weakly typed and 
strongly typed are often used in this context to charac- 
terize a language’s type system, but these terms have 
no commonly agreed definition. 

Type systems have an important influence on pro- 
grams in at least two main ways. First, many program- 
ming errors are caused by variables (or constants) hav- 
ing an incorrect type. An apocryphal story tells of the 
loss of a 1960s NASA rocket due to the Fortran 66 
software controlling the rocket having a line of the 
form DO 10 k = 1.3 instead of the intended DO 10 k
= 1,3, which starts a loop. The mistyping of a period
for a comma in the former statement causes the Fortran
compiler to interpret it as the assignment of 1.3
to the variable DO10k, since spaces are unimportant in
Fortran 66 source code, and the implicit typing of Fortran
causes the variable DO10k to be created with real
type. 

The second influence of a type system is on efficiency, 
since the speed at which a code executes will depend on
how much the compiler or interpreter knows about the 
types of the variables. The computation of x*y will run 
much slower than it might if at run time the types of x 
and y must be checked to decide whether to issue an 
integer multiplication or a floating-point multiplication
instruction. Figure 3 illustrates the point, but in this
instance the decision is made at compile time, with no 
loss of efficiency. 

5.14 Complexity Analysis 

Several measures of the complexity of a code have been 
proposed. They can be used to estimate the probability 
of bugs, the difficulty of testing, and the cost of main- 
tenance of the code. The metrics apply to individual 
components such as functions, subroutines, and proce- 
dures, and a large complexity measure can be reduced 
by breaking the component into smaller pieces. 

The simplest metric is the number of executable 
lines of source code. The cyclomatic complexity, or 
McCabe complexity, of a code is defined in terms of 
the directed graph [II.16] that has nodes given by
blocks of code containing no decisions or branches 
and edges corresponding to branches between nodes. 
The cyclomatic complexity is given by the formula 
edges - nodes + 2, and it turns out to be equal to 
one plus the number of predicates (logical tests). The 
Npath metric is the number of possible execution paths 
through the code, which can be much larger than the 
cyclomatic complexity. 
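
The formula can be checked on a small example. The Python sketch below represents a control-flow graph as a list of directed edges and applies edges - nodes + 2. The graph shown is a hand-built model of an if/elseif/else chain with two tests, and the node labels are hypothetical.

```python
def cyclomatic_complexity(edges):
    """Return edges - nodes + 2 for a control-flow graph given as edge pairs."""
    nodes = {v for edge in edges for v in edge}
    return len(edges) - len(nodes) + 2

# if test1: A  elseif test2: B  else: C, followed by a common exit.
if_chain = [
    ("test1", "A"), ("test1", "test2"),
    ("test2", "B"), ("test2", "C"),
    ("A", "exit"), ("B", "exit"), ("C", "exit"),
]

straight_line = [("start", "exit")]   # no decisions at all

print(cyclomatic_complexity(if_chain))       # 7 - 6 + 2 = 3
print(cyclomatic_complexity(straight_line))  # 1 - 2 + 2 = 1
```

The first value agrees with one plus the number of predicates (two tests), as stated above.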

Tools are needed to compute these metrics. In MAT- 
LAB the function checkcode computes the cyclomatic 
complexity. 

5.15 Formatting of Source Code 

Mathematicians are used to having complete freedom 
in how they lay out their written mathematics on the 
page. Programming languages vary in their prescrip- 
tiveness of the layout of the source code. Most impose 
few restrictions and allow one to collapse a program 
block onto a single very long line provided comments 
are removed and (if necessary) statement separators 
are added. When computers had small memories, such 
a transformation would sometimes be done in order 
to save having to store the carriage return and line 
feed characters. Sometimes further code obfuscation 
is done in order to conceal the purpose of a code, for 
security reasons. 

Fortran 77 requires code to lie between columns 7 
and 72, with columns 1-5 reserved for statement num- 
bers and column 6 for indicating a continuation line. 
These restrictions stem from the punched cards used to 
enter programs into early computers and were removed 
in Fortran 90. 

Some text editors provide automatic indentation tailored
to the language being edited, and various pretty
printing tools are available to format code for readabil- 
ity or to impose a particular house style. The use of 
such tools can aid debugging and make it simpler to 
compare different versions of a program with a diff 
command. Python is unusual in that it uses indenta- 
tion to define if statements, for loops, while loops, and 
so on, whereas most languages use braces, brackets, or 
key words to delimit code blocks. 

5.16 Readability 

There are often several ways to write a piece of code. 
A balance needs to be struck between length of code, 
efficiency, and understandability. In C++, for an integer 
variable n one can compute the expression 2*n+1 as
n << 1 | 1, where << is the bit-shift left operator and
| is the bitwise or. The latter version is, however, rather
inscrutable and may not be any faster than the former 
under a good compiler. 

Sometimes one needs to make a variable cycle 
between several values. If the values are 0 and 1 then 
the assignment n = 1 - n flips between them and the 
purpose of the assignment is reasonably clear. Sup- 
pose, though, that we wish to make n take on the val- 
ues 1, 2, 3, repeatedly. If we can find a polynomial p 
such that p(1) = 2, p(2) = 3, and p(3) = 1, then the
assignment n = p(n) will do the trick. Such a p is a
polynomial interpolant [I.3 §3.1] to the given data,
and the p of lowest degree is the quadratic
$p(x) = -\frac{3}{2}x^2 + \frac{11}{2}x - 2$. However, the purpose of the
assignment with p is not obvious, and its correctness is not
trivial to check. An if statement of the form

    if n == 1
        n = 2
    elseif n == 2
        n = 3
    else
        n = 1
    end

does the job in a more transparent fashion. Alterna- 
tively, an assignment replacing n by (n mod 3) + 1 could 
be used, supplemented by a comment explaining its 
purpose. 
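
The three alternatives can be compared directly. The following Python sketch implements each one (the function names are illustrative) and confirms that they agree on the values 1, 2, 3; exact rational arithmetic is used for the polynomial so that no rounding enters.

```python
from fractions import Fraction

def next_poly(n):
    """Quadratic interpolant p(n) = -3/2 n^2 + 11/2 n - 2."""
    return int(Fraction(-3, 2) * n**2 + Fraction(11, 2) * n - 2)

def next_if(n):
    """Explicit branching: transparent but longer."""
    if n == 1:
        return 2
    elif n == 2:
        return 3
    else:
        return 1

def next_mod(n):
    """Cycle 1 -> 2 -> 3 -> 1 using modular arithmetic."""
    return n % 3 + 1

for n in (1, 2, 3):
    print(n, next_poly(n), next_if(n), next_mod(n))
```

All three produce the cycle 1, 2, 3, 1, ..., but only the last two make that intent apparent to a reader.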

5.17 Structured Programming 

In Fortran a go to statement causes a jump to a labeled 
statement anywhere in the program. In 1968 Edsger
Dijkstra wrote a letter to the editor of the journal
Communications of the ACM in which he claimed that the
use of go to statements, which were very common 
in Fortran 66 programs, represented poor program- 
ming practice. The letter was published with the title 
“Go to statement considered harmful.” The notion of 
structured programming subsequently became popu- 
lar. Structured programming enforces a logical struc- 
ture on the program that makes it easier to under- 
stand and modify through the use of certain canon- 
ical control structures together with modular com- 
position of programs. A long 1974 paper by Don- 
ald Knuth titled “Structured programming with go to 
statements” presents a balanced analysis of the pros 
and cons of go to statements. 

5.18 Literate Programming 

In the 1980s Knuth championed the idea of literate pro- 
gramming, in which a document contains a combina- 
tion of source code and documentation for the code 
(in TeX format), and both the code and the documentation
can be generated from it. He used this approach
to great effect in writing TeX and associated programs
using his WEB system (which has no connection with 
the World Wide Web, which it predates). Nowadays, lit- 
erate programming is mainly used in two forms. In the 
first, documentation is embedded in comment lines of 
a program’s source code and documentation genera- 
tion tools are used to extract it to HTML, PDF, etc. In 
the second form, a document contains code that car- 
ries out the computational experiments needed for a 
paper, and a separate “preprocessor” executes the code 
and inserts its output (numeric or graphical) back into 
the source document. This approach facilitates reproducible
research [VIII.5] and is typically done with
“weave” tools available for R and Python or in Emacs 
Org mode. 

5.19 Interoperability 

Interoperability refers to the ability to call a program 
written in one language from a program written in a 
different language. Historically, the degree of interop- 
erability that is available has depended on which oper- 
ating system and compiler is in use, as well as on the 
languages themselves. Even when cross-language calls 
are possible there are pitfalls to watch out for, such 
as the potentially different ways in which multidimen- 
sional arrays are stored in different languages (see sec- 
tion 5.7). There is a strong trend to mixed-language 
programming, encouraged by languages such as C++,
Julia, and Python that have been designed with inter- 
operability in mind, by the support provided in PSEs 
for calling or being called by another language, and by 
languages that are built on the same virtual machine 
(such as Java, Scala, and Clojure). 

5.20 Domain-Specific Languages 

A domain-specific language (DSL) is a language focused 
on a particular problem domain, examples being HTML 
for Web pages, SQL for databases, and TeX for mathematical
typesetting [VIII.4 §1]. An important benefit
of a DSL is that it can allow programming at a high
level of abstraction that fully exploits knowledge of the 
problem domain and thereby reduces the total time to 
deliver a solution to a problem. 

Applied mathematics has a variety of DSLs, and these 
often involve symbolic manipulation as part of the 
code-generation process. The General Algebraic Mod- 
eling System (GAMS) is a high-level modeling system 
for mathematical optimization. It includes a DSL in 
which optimization problems of several different types 
can be specified. A number of DSLs are associated 
with software for solving partial differential equations. 
For example, the Unified Form Language in the FEn- 
iCS project is a DSL for finite-element discretizations, 
implemented as a Python module. 

DSLs for plotting graphics are plentiful, even being 
built on top of other DSLs (e.g., the various graphics 
packages for LaTeX).

5.21 Translation between Languages 

In mathematics we are used to translating between dif- 
ferent notations and moving from one space or basis 
to another. It is natural to ask whether a program can 
be transformed from one language to another with- 
out any change in its behavior. One reason for want- 
ing to do so is to convert programs that were written 
many years ago but are still used today (legacy codes) 
into a more modern language. Such translation tools 
are available, but they are used out of necessity rather 
than as a standard tool. A tool called f2c written at Bell 
Labs in the 1990s could convert Fortran 77 codes to C, 
though the resulting code was not meant to be readable 
by humans. There are more recent tools for converting 
Fortran to C++ that produce more readable code. 


Table 1 Extract from the TIOBE Programming Community 
Index for February 2015. Clojure, Forth, Mathematica, and 
OpenCL are all ranked in the range 51-100. 


Language        Rank
C                  1
Java               2
C++                3
Python             8
Visual Basic       9
MATLAB            17
R                 18
Pascal            19
PostScript        24
Fortran           31
Lisp              32
Scheme            38
Scala             41
PL/I              45


5.22 Popularity of Languages 

An interesting question is which are the most popu- 
lar programming languages. This question is both hard 
to define precisely and hard to answer. One attempt is 
provided by the TIOBE Programming Community Index 
(www.tiobe.com), which is produced once a month 
based on “the number of skilled engineers world-wide, 
courses and third party vendors,” as found via popu- 
lar search engines. Table 1 shows a ranking of most 
of the languages mentioned in this article. These rank- 
ings are quite volatile and should not be taken too seri- 
ously, but an interesting implication is that old lan- 
guages such as Fortran and Lisp continue to compete 
with their younger counterparts. 

5.23 Language of the Future 

An old joke goes, “I don't know what language we'll 
be using in fifty years time, but it will be called For- 
tran.” Fortran has been under attack since the 1960s 
but shows no signs of dying, as noted in the previ- 
ous subsection. The frequent revisions to the Fortran 
standard have kept the language up to date, while the 
huge amount of legacy code means that in many appli- 
cations it is difficult or impossible to switch to alter- 
native languages. The improved interoperability of lan- 
guages and compilers enables binaries of compiled For- 
tran libraries such as LAPACK and the commercial NAG 
Library to be readily called from other languages and 
even Excel spreadsheets. Perhaps the future is inher- 
ently multilingual, with programs being written in a 
modern language such as C++ or Python and calling kernels
written in C, Fortran, or assembly language tuned
for particular processors by the manufacturer. 

One thing we can be sure of is that new languages 
will continue to be developed, each trying to combine 
the best features of existing languages with new ideas 
that resonate with developments in hardware and soft- 
ware. However, it is important that language designers 
remember the lessons of the past and contemplate the 
comment of Tony Hoare about Algol 60: “Here is a lan- 
guage so far ahead of its time, that it was not only an 
improvement on its predecessors, but also on nearly all 
its successors.” 

Further Reading 

The following list is very selective and merely provides 
a starting point for further exploration. 

Abelson and Sussman (1996) is a classic introduc- 
tion to programming based on Scheme that empha- 
sizes ideas such as abstraction and recursion over syn- 
tax. It has many interesting mathematical examples, 
including symbolic differentiation. 

The longevity of Fortran, with its multiple revisions, 
is such that its history, as told by Metcalf (2011), 
provides a prism into the history of programming 
languages. 

The Turing Award of the Association for Computing 
Machinery (ACM) is an annual award that is to computer 
science what the Fields Medal is to mathematics. The 
book of lectures from the first twenty years of awards is 
full of insights into programming languages. It includes 
lectures by, among those mentioned in this article, 
Backus, Dijkstra, Iverson, Knuth, McCarthy, Ritchie, and 
Wirth. 

An excellent source for the history of programming 
languages (and computing) is the journal IEEE Annals 
of the History of Computing. 

Abelson, H., and G. J. Sussman. 1996. Structure and Inter- 
pretation of Computer Programs, 2nd edn. Cambridge, 
MA: MIT Press. 

ACM. 1987. ACM Turing Award Lectures: The First Twenty
Years, 1966-1985. Reading, MA: Addison-Wesley. 

Bentley, J. L. 1988. More Programming Pearls: Confessions 
of a Coder. Reading, MA: Addison-Wesley. 

Kernighan, B. W., and P. J. Plauger. 1978. The Elements of 
Programming Style, 2nd edn. New York: McGraw-Hill.

Kernighan, B. W., and D. M. Ritchie. 1988. The C Program- 
ming Language, 2nd edn. Englewood Cliffs, NJ: Prentice- 
Hall. 


Knuth, D. E. 2003. Selected Papers on Computer Languages.
Stanford, CA: Center for the Study of Language and 
Information. 

Metcalf, M. 2011. The seven ages of Fortran. Journal of 
Computer Science and Technology 11(1):1-8.

Stroustrup, B. 2013. The C++ Programming Language, 4th 
edn. Upper Saddle River, NJ: Addison-Wesley. 


VII.12 High-Performance Computing 

Jack Dongarra 


1 Historical Overview 

Looking back on the last four decades, high-perfor- 
mance computing (HPC) has been characterized by 
rapid change in vendors, architectures, technologies, 
algorithms, software, and system usage. Despite all 
these changes, performance, as measured in terms of 
the number of flops¹ per second, has evolved steadily.
Often cited in this context is Moore’s law, which states 
that the number of transistors on integrated circuits 
doubles approximately every two years. Figure 1 plots 
the peak performance of various computers over the 
last six decades, all supercomputers of their time, and 
demonstrates how well Moore’s law holds for perfor- 
mance for nearly the entire lifespan of modern com- 
puting. 

The initial success of vector computers in the 1970s, 
which could carry out operations on whole vectors at 
a time, was driven by raw performance. The introduc- 
tion of this type of computer system started the mod- 
ern supercomputing era. In the 1980s the availability of 
standard development environments and application 
software packages became more important. In addi- 
tion to performance, these criteria determined the suc- 
cess of multiprocessor vector systems, especially with 
industrial customers. 

Massively parallel computers, which share the work
among a large number of processors, became successful
in the early 1990s due to their better price-performance
ratios, which were made possible by the
improving performance of “off the shelf” microprocessors.
At the lower end of the market and for mid-priced
systems, massively parallel processing computers
were replaced by microprocessor-based symmetric
multiprocessing systems (systems in which identical
processors share the same memory) in the middle of
the 1990s. The success of microprocessor-based symmetric
multiprocessing systems, even for very high-end
systems, was the basis for the emergence of cluster
concepts in the early 2000s. During the first half of
the decade clusters of personal computers and workstations
became the prevalent architecture for many
application areas. However, the Japanese Earth Simulator
vector system (2002) demonstrated that many scientific
applications could benefit greatly from a different
computer architecture, creating renewed interest
in new architectures and new programming paradigms
within the scientific HPC community.

1. A flop is an elementary floating-point operation: addition,
subtraction, multiplication, or division.

Figure 1 Peak performance of the fastest computer systems over the last six decades.

The IBM Roadrunner system at Los Alamos National 
Laboratory, which employs a hybrid design built from 
commodity parts, broke the petaflops ($10^{15}$ flops per
second) threshold in June 2008. The next major target
is exascale computing ($10^{18}$ flops per second), a thousandfold
increase over petascale, which is not expected
to be achieved before 2020. 

2 Challenges 

Science priorities lead to scientific models, and models
are implemented in the form of algorithms. Algorithm
selection is based on various criteria, such as
accuracy, verification, convergence, performance, parallelism,
and scalability. Models and associated algorithms
are not selected in isolation but must be evaluated
in the context of the existing computer hardware
environment. Algorithms that perform well on one type 
of computer hardware may become obsolete on newer 
hardware, so selections must be made carefully and 
may change over time. Moving forward to exascale com- 
puting will put heavier demands on algorithms in at 
least two areas: the need for increasing amounts of 
data locality in order to perform computations effi- 
ciently and the need to obtain much higher factors of 
fine-grained parallelism as high-end systems support 
increasing numbers of compute threads. As a conse- 
quence, parallel algorithms must adapt to this environ- 
ment, and new algorithms and implementations must 
be developed to exploit the computational capabilities
of the new hardware. The transition from current
sub-petascale and petascale computing to exascale
computing will be at least as disruptive as the
transition from vector to parallel computing was in the 
1990s. 

We now describe some of the particular challenges 
that lie ahead in the use of HPCs. 

2.1 New Algorithms for Multicore Architectures

Multicore processors, in which a single chip contains 
two or more independent processing units called cores, 
are now ubiquitous from the desktop through to HPC 
systems. Scalable multicore systems will increase the 
cost of communication relative to computation. Within 
a node (a single multicore processor) data transfer 
between cores is relatively inexpensive, but across 
nodes the cost of data transfer is becoming very large. 
This trend is addressed by new approaches such as 
communication-avoiding algorithms (see section 2.4), 
algorithms that support simultaneous computation 
and communication, and algorithms that vectorize well 
and have a large volume of functional parallelism. 

2.2 Adaptive Response to Load Imbalance 

Adaptive multiscale algorithms are an important part 
of many applications because they apply computa- 
tional power precisely where it is needed. However, 
they introduce dynamically changing computation that 
results in processor workload imbalances because the 
distribution of tasks is static. As we move toward sys- 
tems with billions of processors, even naturally load- 
balanced algorithms on homogeneous hardware will 
present many of the same daunting problems with 
adaptive load balancing that are observed in today’s 
adaptive codes. For example, software-based recov- 
ery mechanisms for fault tolerance or energy man- 
agement will create substantial load imbalances as 
tasks are delayed by rollback to a previous state or 
correction of detected errors. Scheduling based on a 
directed acyclic graph also requires new approaches 
to optimize resource utilization without compromising 
spatial locality. These challenges require development 
and deployment of sophisticated software approaches 
to rebalance computation dynamically in response to 
changing workloads and conditions of the operating 
environment. 

2.3 Multiple-Precision Algorithms and Software 

One instance of the increasingly adaptive nature of 
libraries is the capability to recognize and exploit 
the presence of mixed-precision arithmetic. Motivation 
comes from the fact that, on modern architectures, 32- 
bit (single-precision) floating-point operations can exe- 
cute at least twice as fast as 64-bit (double-precision) 
operations. The performance of algorithms for solving
linear systems or computing eigenvalues or singular
values can be significantly enhanced by applying a
given method in single precision and then using a few
steps of iterative refinement [IV.10 §2] in double 
precision to elevate the accuracy of the result from sin- 
gle to double precision. This technique can be applied 
not only to conventional processors but also to other 
technologies such as graphics processing units, and 
it can therefore utilize heterogeneous hardware more 
effectively. The use of mixed precision exploits not only 
the greater speed of single-precision arithmetic but 
also the fact that there is a reduction in the amount 
of storage needed and in the amount of information 
moved from memory for 32-bit or single-precision data 
when compared with 64-bit or double-precision arrays. 
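
The idea of solving in single precision and refining in double can be sketched in pure Python. In the illustration below, single precision is emulated by rounding every operation through the struct module, Gaussian elimination is done without pivoting for brevity, and the low-precision solve is simply repeated rather than reusing stored factors. The function names and the small test system are assumptions made for this sketch, which is not tuned for performance.

```python
import struct

def f32(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve_single(A, b):
    """Gaussian elimination (no pivoting), each operation rounded to single."""
    n = len(b)
    A = [[f32(v) for v in row] for row in A]
    b = [f32(v) for v in b]
    for k in range(n):
        for i in range(k + 1, n):
            m = f32(A[i][k] / A[k][k])
            for j in range(k, n):
                A[i][j] = f32(A[i][j] - f32(m * A[k][j]))
            b[i] = f32(b[i] - f32(m * b[k]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = b[i]
        for j in range(i + 1, n):
            s = f32(s - f32(A[i][j] * x[j]))
        x[i] = f32(s / A[i][i])
    return x

def refine(A, b, steps=5):
    """Iterative refinement: residuals and updates in double precision."""
    x = solve_single(A, b)
    for _ in range(steps):
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(A, b)]
        d = solve_single(A, r)            # correction from the cheap solver
        x = [xi + di for xi, di in zip(x, d)]
    return x

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
exact = [2 / 9, 1 / 9, 13 / 9]            # solution of Ax = b

err_single = max(abs(u - v) for u, v in zip(solve_single(A, b), exact))
err_refined = max(abs(u - v) for u, v in zip(refine(A, b), exact))
print(err_single, err_refined)
```

For this well-conditioned system the single-precision error is at the level of single-precision roundoff, while a few refinement steps bring the error down to roughly double-precision level, even though all the elimination work is done in the lower precision.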

2.4 Communication-Avoiding Algorithms 

Algorithmic complexity is usually expressed in terms 
of the number of operations performed rather than 
the quantity of data movement within memory. However,
in modern systems memory movement is increasingly
expensive compared with the cost of computation.
It is therefore necessary to develop algorithms
that reduce communication to a minimum while not 
unduly increasing the amount of computation. A gen- 
eral approach is to derive bandwidth and latency lower 
bounds for various dense and sparse linear algebra 
algorithms on parallel and sequential machines, e.g., by 
extending the well-known lower bounds for the usual 
$O(n^3)$ matrix multiplication algorithm, and then to
seek new algorithms that (nearly) attain these lower 
bounds. The study of communication-avoiding algo- 
rithms is in its infancy, but it is already leading to new 
algorithmic ideas and approaches. 

2.5 Auto-tuning 

Numerical libraries need to be able to adapt to the pos- 
sibly heterogeneous environment in which they have to 
operate in order to achieve good performance, energy 
efficiency, load balancing, and so on. The objective is to 
provide a consistent library interface that remains the 
same for users independent of scale and processor het- 
erogeneity but that achieves good performance and effi- 
ciency by binding to different underlying code, depend- 
ing on the configuration. In addition, the auto-tuning 
has to be extended to frameworks that go beyond 
library limitations and are able to optimize data layout 
(such as blocking strategies for sparse matrix kernels), 
stencil auto-tuners (since stencil kernels, which update 
array elements according to a fixed pattern, are diverse 
and not amenable to library calls), and even tuning of 
the optimization strategy for multigrid solvers (optimizing
the transition between the multigrid coarsening
cycle and the coarse grid solver to minimize run
time). Adding heuristic search techniques and com- 
bining them with traditional compiler techniques will 
enhance the ability to address generic problems. 

2.6 Fault Tolerance and Robustness for Large-Scale 
Systems 

Modern personal computers may run for weeks with- 
out rebooting and most data servers are expected to 
run for years. However, because of their scale and com- 
plexity, today’s supercomputers run for only a few 
days before a reboot is needed. The major challenge in 
fault tolerance is that faults in extreme-scale systems, 
with their millions of processors, will be continuous 
rather than exceptional events. This requires a major 
shift from today’s software infrastructure. On today’s 
supercomputers every failure kills the application run- 
ning on the affected resources. These applications have 
to be restarted from the beginning or from their last 
checkpoint. The checkpoint/restart technique will not 
scale to highly parallel systems because a new fault will 
occur before the application can be restarted, causing 
the application to become stuck in a state of constant 
restarts. New fault-tolerant paradigms need to be devel- 
oped and integrated into both the system software and 
user applications. 

2.7 Building Energy Efficiency into Algorithm 
Foundations 

Energy consumption is becoming a major issue in HPC, 
with energy costs for some of the largest machines 
already exceeding a million dollars per year. Minimiz- 
ing power consumption must now be added to the tra- 
ditional goals of algorithm design: namely, correctness 
and performance. The emerging metric of merit is per- 
formance per watt. Energy reduction depends on soft- 
ware as well as hardware, so it is essential to build 
power and energy awareness, control, and efficiency 
into the foundations of numerical libraries. 

2.8 Sensitivity Analysis 

As the high-fidelity solution of models becomes pos- 
sible, the next challenge is to study the sensitivity of 
the model to parameter variability and uncertainty and 
to seek an optimal solution over a range of parameter 
values. The most basic form of analysis (the forward
method for either local or global sensitivity analysis)
simultaneously runs many instances of the model or 
its linearization, leading to an embarrassingly paral- 
lel execution model. Such high-throughput computing 
tasks are well suited to using spare cycles on pools 
of personal computers, e.g., running at night or over 
weekends. 

2.9 Numerical Pitfalls 

Problems that warrant the use of the fastest comput- 
ers are necessarily among the largest problems ever 
to be solved, according to any appropriate measure of 
problem dimension. Various mathematical or numeri- 
cal difficulties can potentially arise as dimensions grow 
ever larger, including slower convergence of an itera- 
tive method that has performed well for smaller prob- 
lems, computed results having lower accuracy due to 
an increased number of rounding errors, and overflow 
of intermediate results. A good example of what can 
go wrong concerns the use of random number gen- 
erators [VI.12] to construct the matrix A and vec- 
tor b for the linear system Ax = b to be solved by 
Gaussian elimination with partial pivoting for bench- 
marking purposes. The obvious approach is to fill the 
columns of the matrix A, one by one, with the out- 
put from a pseudorandom number generator. A few 
years ago, after a computation of this form lasting 
20 hours, the computed result was found to be incor- 
rect. The cause was eventually identified as a singular 
matrix A: the number of matrix elements exceeded the 
period of the random number generator, with the result 
that columns repeated and the matrix was singular. By 
itself, singularity should not affect the computation, 
since rounding errors usually ensure that the matrix 
is numerically nonsingular. However, the presence of 
exactly repeated columns eventually leads to “zero piv- 
ots,” which cause algorithm failure. The moral of the 
story is that code that has worked perfectly up to a 
certain problem size can fail in subtle ways for larger 
problems. 
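
The failure mode described above can be reproduced at toy scale. The following is a minimal sketch, not the benchmark code from the anecdote: the generator constants, the matrix size, and the names `lcg_stream` and `fill_columns` are all illustrative assumptions, chosen so that the generator's period (16) is smaller than the number of matrix elements (64).

```python
import numpy as np

def lcg_stream(seed, a=5, c=3, m=16):
    """Tiny linear congruential generator (illustrative constants, period 16)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m

def fill_columns(n, gen):
    """Fill an n-by-n matrix column by column from the generator."""
    A = np.empty((n, n))
    for j in range(n):
        for i in range(n):
            A[i, j] = next(gen)
    return A

# 8 x 8 = 64 elements exceed the period of 16, so columns repeat every 2 columns.
A = fill_columns(8, lcg_stream(seed=1))
repeated = np.array_equal(A[:, 0], A[:, 2])
rank = np.linalg.matrix_rank(A)
print("column 0 equals column 2:", repeated)
print("numerical rank:", rank, "of", A.shape[0])
```

Because every value here is an exact multiple of 1/16, the repeated columns are bitwise identical, which is the situation that produces exact zero pivots.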

One desirable numerical property of extreme-scale 
computing is bitwise reproducibility of results for any 
fixed processor count. But current computing frame- 
works and libraries do not guarantee reproducibility. 
The nonreproducibility is usually caused by a paral- 
lel reduction operation. While the corresponding oper- 
ation is mathematically associative, associativity may 
not hold in floating-point arithmetic. For example, the 
natural way to evaluate the sum a + b + c + d is from 
left to right, but alternatives are (a + b) + (c + d) and
(a + c) + (b + d), which are trivial examples of a par- 
allel reduction operation, and these three expressions 
will usually produce different results in floating-point 
arithmetic. In general, one cannot make assumptions 
about the order in which reduction operations are car- 
ried out in parallel, so the values computed in floating- 
point arithmetic may depend on the number of threads 
of execution. This makes it much harder to debug programs.
At extreme scale it may be possible to construct
faster algorithms if the order of evaluation is not
prespecified: through the use of dynamic task scheduling,
for example. Thus, there may be trade-offs between
speed and reproducibility. Furthermore, it may be
possible to ensure a bound on the variability between
different runs more cheaply than to guarantee strict
reproducibility, by using extra precision in selected parts
of an algorithm, for example. Many users may prefer
nonreproducible results produced very quickly along with
a bound on the variability.
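
The dependence on evaluation order is easy to observe in IEEE double precision. The sketch below uses four values chosen for illustration (they are not from the text) so that the three orderings of a + b + c + d mentioned above give three different answers; `math.fsum` is used only as a correctly rounded reference.

```python
import math

# Illustrative values: 1.0 is below the spacing of doubles near 1e16.
a, b, c, d = 1e16, 1.0, -1e16, 1.0

left_to_right = ((a + b) + c) + d   # sequential evaluation
pairwise_1 = (a + b) + (c + d)      # one possible parallel reduction tree
pairwise_2 = (a + c) + (b + d)      # another reduction tree
exact = math.fsum([a, b, c, d])     # correctly rounded sum for reference

print(left_to_right, pairwise_1, pairwise_2, exact)
```

Only the third ordering happens to recover the exact sum here, because it cancels the two large terms before the small ones are added.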

3 Outlook 

The move to extreme-scale computing will require col- 
laboration between hardware architects, systems soft- 
ware experts, designers of programming models, and 
implementers of the science applications that provide 
the rationale for these systems. The various issues dis- 
cussed in this article will need to be considered from a 
whole-system perspective, and the different tools will 
need to interoperate. As new ideas and approaches 
are identified and pursued, some will fail. As with 
past experience, there may be breakthroughs in hard- 
ware technologies that result in different micro- and 
macro-architectures becoming feasible and desirable, 
and these will require rethinking of algorithms and 
system software. 
VII.13 Visualization

Kristin Potter and Chris R. Johnson 

1 Introduction 

Schroeder, Martin, and Lorensen have offered the fol- 
lowing useful definition of visualization: 

Scientific visualization is the formal name given to the 
field in computer science that encompasses user inter- 
face, data representation and processing algorithms, 
visual representations, and other sensory presentation 
such as sound or touch. The term data visualization is 
another phrase to describe visualization. Data visual- 
ization is generally interpreted to be more general than 
scientific visualization, since it implies treatment of 
data sources beyond the sciences and engineering. . . . 
Another recently emerging term is information visualization.
This field endeavors to visualize abstract information
such as hyper-text documents on the World
Wide Web, directory/file structures on a computer, or 
abstract data structures. 

The field of visualization is focused on creating 
images that convey salient information about under- 
lying data and processes. In recent decades, there 
has been unprecedented growth in computational and 
acquisition technologies, and this has resulted in an 
increased ability to sense the physical world in precise 
detail and to model and simulate complex physical phe- 
nomena. As such, visualization plays a crucial role in 
our ability to comprehend such large amounts of com- 
plex data and to convey insight into diverse scientific 
applications. 

Shown in figure 1, the “visualization pipeline” is 
one way to describe the process of visualization. 1 
The filtering step involves processing raw data and 
includes operations such as resampling, compression, 
and other image-processing algorithms such as feature- 
preserving noise suppression. In what can be consid- 
ered the core of the visualization process, the map- 
ping stage transforms the preprocessed filtered data 
into geometric primitives along with additional visual 
attributes, such as color or opacity, determining the
visual representation of the data. Rendering utilizes
computer graphics techniques to generate the final
image using the geometric primitives from the mapping
process.

1. The figures in this article are all reproduced with the permission
of the SCI Institute, apart from plate 21, which is reproduced with
the permission of Miriah Meyer. Citations in figure captions refer to
sources of further information on the techniques and methods for
creating the visualization.

While the range of visualization applications is vast, 
it can be useful to classify techniques based on the 
type of data to be presented. These types can be 
summarized as whether they are 

• scalar fields (temperature, voltage, density, vector 
magnitudes, most image data), 

• vector fields (pressure, velocity, electric field, mag- 
netic field), or 

• tensor fields (diffusion, electrical and thermal con- 
ductivity, stress, strain). 

2 Scalar Field Visualization 

Scalar fields are among the most common data sets in 
scientific visualization and have therefore received the 
most research attention. 

2.1 Direct Volume Rendering 

Direct volume rendering is a method of displaying 
three-dimensional volumetric scalar data as two-di- 
mensional images and is probably one of the simplest 
ways to visualize volume data. As shown in plate 19,
the individual values in the data set are made visible
by a transfer function that assigns them optical
properties, such as color and opacity, which are
then projected and composited to form an image. As
a tool for scientific visualization, the appeal of direct 
volume rendering is that no intermediate geometric 
information need be calculated, so the process maps 
from the data set “directly” to an image. This is in con- 
trast to other rendering techniques such as isosurfac- 
ing or segmentation, in which one must first extract ele- 
ments from the data before rendering them. To create 
an effective visualization with direct volume rendering 
the researcher must find the right transfer function to 
highlight regions and features of interest. 
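
The mapping from data values to an image can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: rays are orthographic and aligned with the first array axis, the image is grayscale, and `transfer_function` is a hypothetical choice (not one from the article) that makes values below 0.5 transparent.

```python
import numpy as np

def transfer_function(s):
    """Hypothetical transfer function: scalar in [0, 1] -> (gray level, opacity)."""
    color = s                                      # brighter for larger values
    opacity = np.clip(2.0 * (s - 0.5), 0.0, 1.0)   # values below 0.5 are transparent
    return color, opacity

def render_along_axis(volume):
    """Front-to-back compositing of a 3D scalar array along axis 0."""
    image = np.zeros(volume.shape[1:])
    transmittance = np.ones(volume.shape[1:])  # fraction of light not yet absorbed
    for slab in volume:                        # slabs ordered front to back
        color, alpha = transfer_function(slab)
        image += transmittance * alpha * color
        transmittance *= 1.0 - alpha
    return image

volume = np.zeros((3, 2, 2))
volume[0, 0, 0] = 1.0    # opaque voxel at the front hides everything behind it
volume[1, 0, 0] = 0.75
volume[1, 1, 1] = 0.75   # semi-transparent voxel
image = render_along_axis(volume)
print(image)
```

No geometry is extracted at any point; changing `transfer_function` alone changes which regions of the volume become visible.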

2.2 Isosurface Extraction 

Isosurface extraction is a powerful tool for investigat- 
ing volumetric scalar fields. The isosurface in a scalar 
volume is the surface on which the data value is con- 
stant, separating regions of higher and lower value. 
Given the physical or biological significance of the 
data value, the position of an isosurface, as well as its 
relation to other neighboring isosurfaces, can provide 


clues to the underlying structure of the scalar field. 
A dynamic use of isosurfaces can provide better visu- 
alization of complex space- or time-dependent behav- 
iors in many scientific applications. Another powerful 
technique for analyzing and computing isosurfaces and 
moving fronts is level set methods. 

3 Vector Field Visualization 

Visualizing vector field data is challenging because 
no existing natural representation can visually convey 
large amounts of three-dimensional directional infor- 
mation. Vector field methods must balance the con- 
flicting goals of displaying large amounts of direc- 
tional information while maintaining an informative 
and uncluttered display. Researchers have developed 
a number of vector field visualization techniques using 
iconic representations, particle tracing methods, and 
stream constructions. These methods are useful for 
showing certain field characteristics, but they inher- 
ently result in visual clutter when applied globally. 

In physical fluid flow experiments, external materi- 
als such as dye, hydrogen bubbles, or heat energy are 
injected into the flow. As these external materials are 
carried through the flow, scientists can track them visu- 
ally, using them to examine the underlying flow struc- 
ture. For example, to understand the flow patterns of 
river currents, scientists might release dye into the river 
to expose currents, eddies, and turbulence. Similarly, to 
understand the air flow around an aircraft wing, scien- 
tists may release smoke into wind tunnel experiments 
to provide visual information about flow patterns, tur- 
bulent regions, and vortex formation. Analogs to these 
experimental techniques have been adopted by scien- 
tific visualization researchers, particularly in the com- 
putational fluid dynamics field. These researchers have 
used numerical methods and three-dimensional com- 
puter graphics techniques to produce graphical icons 
such as arrows, motion particles, and other representa- 
tions that highlight different aspects of the flow. Advec- 
tion methods numerically integrate along paths defined 
by the vector field. For example, the researcher can cre- 
ate streamlines by advecting points through an instant- 
aneous, static vector field and tracing their path. They 
create pathlines, on the other hand, by advecting points 
through a dynamic, time-varying vector field. 
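
A streamline computation of this kind can be sketched as follows. The vector field (a rigid rotation), the step size, and the choice of the classical fourth-order Runge-Kutta integrator are assumptions made for illustration; the text does not prescribe a particular integration scheme.

```python
import numpy as np

def velocity(p):
    """Hypothetical static 2D vector field: rigid rotation about the origin."""
    x, y = p
    return np.array([-y, x])

def streamline(seed, step=0.01, n_steps=1000, field=velocity):
    """Advect a seed point through a static field with classical RK4."""
    path = [np.asarray(seed, dtype=float)]
    for _ in range(n_steps):
        p = path[-1]
        k1 = field(p)
        k2 = field(p + 0.5 * step * k1)
        k3 = field(p + 0.5 * step * k2)
        k4 = field(p + step * k3)
        path.append(p + step * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0)
    return np.array(path)

path = streamline([1.0, 0.0])
radii = np.linalg.norm(path, axis=1)
print("points traced:", len(path), "max radius drift:", np.abs(radii - 1.0).max())
```

For a pathline the same loop would apply, with the field evaluated at the current time as well as the current position.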

4 Tensor Field Visualization 

Figure 1 The visualization pipeline.

In physical and biological systems, representing intrinsic
material properties is an essential part of accurate
modeling and imaging. Electrical conductivity and
molecular diffusivity are examples of material proper- 
ties that describe the ability of particles (such as elec- 
trons and water molecules) to pass through a given 
material. In the simplest situation, the property is con- 
stant in all directions, a condition called isotropy. On 
the other hand, anisotropic materials are those that 
exhibit directional sensitivity in the rate of transport: 
for example, water diffuses through paper faster in one 
direction than another. In real-world problems, material
properties are often inhomogeneous, varying as a
function of position within the material. Thus, proper 
modeling of conductivity and diffusivity requires a field 
of tensor values sampled in three dimensions. Gaining 
insight into the structure of three-dimensional tensor 
fields is a significant and ongoing problem in tensor 
visualization. 

Creating meaningful images or models from diffu- 
sion tensor data is challenging because each sample 
point has six independent degrees of freedom. As with 
vector visualization, simple attempts at encoding all 
the tensor variables at all sample locations rapidly pro- 
duce unintelligible visual clutter. In some application 
instances, such as diffusion tensor magnetic resonance 
imaging of nerve tissue, the degree of anisotropy has 
a biological significance relating to the white matter 
structure, as shown in plate 20(b). An effective way to 
avoid clutter is, therefore, to display only those tensors 
that exhibit anisotropy of a certain degree or greater, 
often using graphical icons such as three-dimensional 
quadrics to indicate the six degrees of freedom through 
shape and size. Another popular technique of feature 
extraction is fiber tractography, which seeks to create 
pathways to illustrate directional tissue structure. Stan- 
dard visualization methods for this type of data use 
hyper-streamlines to illustrate the directional path of a 
fiber bundle, as shown in plate 20(a). Such methods are 


often augmented with edge bundling to reduce clutter, 
or blurring to indicate areas of noise or uncertainty. 
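
The idea of displaying only sufficiently anisotropic tensors can be sketched as below. The article does not say which anisotropy measure is used; fractional anisotropy, a common choice in diffusion imaging, is assumed here, and the threshold value and function names are illustrative.

```python
import numpy as np

def fractional_anisotropy(tensors):
    """Fractional anisotropy of symmetric 3x3 tensors, array shape (..., 3, 3)."""
    lam = np.linalg.eigvalsh(tensors)                 # eigenvalues, shape (..., 3)
    mean = lam.mean(axis=-1, keepdims=True)
    num = np.sqrt(((lam - mean) ** 2).sum(axis=-1))
    den = np.sqrt((lam ** 2).sum(axis=-1))
    return np.sqrt(1.5) * num / np.where(den > 0, den, 1.0)

def select_anisotropic(tensors, threshold=0.3):
    """Boolean mask of tensors to draw (the threshold is an arbitrary choice)."""
    return fractional_anisotropy(tensors) >= threshold

samples = np.array([
    np.eye(3),                    # isotropic: same transport in all directions
    np.diag([1.0, 0.0, 0.0]),     # transport along a single direction only
])
fa = fractional_anisotropy(samples)
mask = select_anisotropic(samples)
print(fa, mask)
```

Only the tensors selected by the mask would then be passed on to glyph or tractography rendering.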

5 Applications 

The most important aspect of visualization is its role 
in analysis, exploration, and discovery in the scientific 
process. State-of-the-art technologies are combined in 
creative ways to facilitate data understanding. It is the 
role of the visualization researcher to understand how 
to combine appropriate visualization techniques with 
hypotheses about the data to reveal answers to scien- 
tific questions. Here, we briefly discuss three real-world 
applications of visualization. 

5.1 Genomics 

MizBee is a multiscale synteny browser for exploring 
conservation relationships in comparative genomics 
data (see plate 21). Synteny refers to the presence of 
two or more genes on the same chromosome and can be 
used to answer questions about evolution and genomic 
function by comparing the genomes of different species 
to find regions of shared sequences. Using side-by-side 
linked views, MizBee enables efficient data browsing 
across a range of scales, from the genome to the gene. 
To present pairwise comparisons of similar regions 
between different chromosomes, concentric circular 
layouts are used to show two genes and lines are drawn 
between regions to show pairs. On the right-hand side 
of the figure, the chromosome view presents a detailed 
look at user-selected blocks of interest, along with sta- 
tistical information and layered annotations. The block 
view (rightmost) is the most detailed view, providing 
information about the conservation relationships of 
features within the selected block related to proxim- 
ity/location, size, orientation, and similarity. Each view 
is optimized for viewing the different types of rele- 
vant information; the views can communicate relevant 
information and degree of similarity for each region.
The design of MizBee is grounded in perceptual prin- 
ciples, and it includes several techniques, such as edge 
bundling and layering, to enhance visual cues about 
relationships and the consistent use of an 8-channel 
color map to effectively distinguish regions. 

5.2 Climate and Weather 

The EnsembleVis framework was developed to explore 
ensembles of weather forecast models. This work uses 
summary statistics to manage the display of the large, 
time-varying data set. Because of its richness, it was 
important to tease out a number of summaries across 
various dimensions of the data. Rather than deciding
upon a single visualization technique to show all
aspects of the data, the system links multiple displays 
to allow the user to simultaneously explore a variety 
of data characteristics. Thus, the EnsembleVis frame- 
work, as shown in plate 22, combines a number of 
aggregation windows, each designed to answer a spe- 
cific question. The main window presents a summary 
of the data across the spatial domain. Alongside the 
main window, a number of smaller windows show sum- 
maries across other dimensions, such as a filmstrip 
view of individual time steps, graphs of the individual 
model responses for a data subset, and a query dialog 
to filter results according to specific parameters. This 
collection of visualizations reveals insight into general 
trends of the model ensemble and highlights outlier 
model runs. This system provides weather scientists 
with the ability to explore the outputs of the simula- 
tion to understand the consensus of the ensemble, the 
probability of the consensus outcome, and model con- 
figurations that lead to outlying results and to iden- 
tify biased models that may hint at errors in the model 
construction or missing atmospheric phenomena. 

5.3 Bioelectric Fields 

Bioelectric fields are produced by the living cells 
responsible for the action of muscles and the transmission
of information in nerves. Many different imaging
modalities have been developed to assess this activity,
including electrocardiography of the heart and
electroencephalography and magnetoencephalography of the
brain. In addition to their clinical applications, these 
imaging techniques provide the basis for a computa- 
tional reconstruction of the physiological activity at the 
origin of the electric signal. This allows researchers to 
gain insight into the mechanisms and consequences of 


conditions such as myocardial ischemia: a shortfall in 
blood supply that can, in extreme cases, lead to a com- 
plete blockage of blood, resulting in a heart attack. To 
explore the electric fields and electric current densities 
associated with the cardiovascular and cerebral activity 
in humans, advanced vector visualization techniques 
for visual analysis have been developed. 

To understand the bioelectric field in the direct vicin- 
ity of the epicardium of the heart, stream surfaces are 
used to capture the geometry of the current induced by 
the cardiac source (see plate 23(a)). The surfaces also 
provide an effective representation of the interconnec- 
tions that exist between different regions on the heart’s 
surface. A rainbow color map is used along each curve 
to visualize the stretching of the return current as it 
propagates through the torso. 

The three-dimensional electric current within the 
brain can be visualized using texturing techniques 
to show the asymmetry of the electric patterns (see 
plate 23(b)). Textures, computed on a clipping plane, 
reveal the dipolar source of electric current and its 
interaction with the surrounding tissue. The electric 
current is clearly diverted by the presence of white mat- 
ter tracts that lie close to the source. The field also 
changes direction very rapidly as it approaches the 
skull just beneath the surface of the head. 

Focusing efforts on vector-field visualizations of 
electric current, rather than scalar representations of 
potentials, offers a more meaningful global depiction 
of the continuous flow. This permits a deeper under- 
standing of the three-dimensional shape of the bio- 
electric sources and their fields. Such an approach pro- 
vides new insight into the impact of tissue character- 
istics, such as directional dependence, on the resulting 
bioelectric fields. 

6 Outlook for Visualization 

The greatest successes in visualization come when 
researchers are able to explore much more informa- 
tion than previously possible. Often, tools designed for 
specific applications turn out to address general prob- 
lems and can be applied to other domains and appli- 
cations. With the many software packages, prototype 
examples, and tool suites that are available, visualiza- 
tion is becoming accessible at a variety of levels, and we 
expect to see it become an integral part of the scientific 
work flow. 
VII.14 Electronic Structure Calculations
(Solid State Physics) 

Eric Cancès


1 Introduction 

The study of the electronic properties of solids has 
led to major scientific discoveries (superconductivity, 
the quantum Hall effect, giant magnetoresistance), as 
well as to a large number of applications that have rev- 
olutionized our daily lives: computers and electronic 
devices being prime examples. 

Besides its importance in terms of applications, the 
modeling of the electronic structure of solids is an inex- 
haustible source of fascinating problems in mathemat- 
ical physics, the analysis of partial differential equa- 
tions, numerical analysis, and scientific computing. 

In solid state physics, the fundamental components 
of matter are atomic nuclei and electrons in Coulomb 


interaction. Starting with the chemical composition of
a material (that is, the number of atoms of each chemical
element that are present), it is possible to write
down the $n$-body Schrödinger equation encoding most
of its physical, and all of its chemical, properties.
Unfortunately, this equation is much too complicated to
be solved: indeed, it reads as a $3n$-dimensional partial
differential equation, where $n$ is the total number
of particles (atomic nuclei and electrons) in the
material under consideration. For a macroscopic solid,
$n$ is of the order of $10^{20}$ or larger, so the numerical
solution of the $n$-body Schrödinger equation is
way out of reach of even the most powerful computers
conceivable at the present time. Several approximations
are therefore adopted. The first, called the
Born-Oppenheimer approximation, is based on the fact 
that nuclei are thousands of times heavier than elec- 
trons, and it can be mathematically justified by means 
of adiabatic limits, letting the mass ratio between elec- 
trons and nuclei go to zero. According to the Born- 
Oppenheimer approximation, it is possible to compute 
the electronic structure of the system at each time t 
by solving a time-independent Schrödinger equation,
parametrized by the (time-dependent) positions of the
nuclei. The second step consists of replacing the electronic
time-independent Schrödinger equation by simpler
models that are amenable to numerical simulation.
Electronic structure calculation is concerned with the 
design and simulation of such models. 

The purpose of this article is to introduce the reader 
to 

• independent-particle models, which enable us to 
qualitatively understand the electronic structure of 
crystals, and 

• the Kohn-Sham model, based on density functional 
theory. 

The latter model allows us to run numerical simula- 
tions that are in quantitative agreement with experi- 
mental data. It can be used to predict the properties 
of new molecules, materials, and nanostructures and 
therefore has an extremely broad range of applications. 
For instance, in the field of energy technology, it can be 
used to design new materials for nuclear power plants, 
fuel cells, or solar cells. 

2 Electronic States 

Any quantum system is characterized by a Hamilto- 
nian: that is, a self-adjoint operator H that acts on some 
Hilbert space [I.2 §19.4]. For independent-electron
models, as well as for mean-field models derived from
density functional theory, $H$ is a Schrödinger operator
of the form

$$H = -\tfrac{1}{2}\Delta + V(r)$$

that acts on the Hilbert space $L^2(\mathbb{R}^3)$ of complex-valued
square-integrable functions on $\mathbb{R}^3$, where

$$\Delta = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2}$$

is the Laplace operator and where $V$ is a real-valued
function on $\mathbb{R}^3$. We adopt the system of atomic units
here, in which $\hbar = 1$, $e = 1$, $m_e = 1$, and $4\pi\varepsilon_0 = 1$,
where $\hbar$ is the reduced Planck constant, $e$ the elementary
charge, $m_e$ the mass of the electron, and $\varepsilon_0$ the
dielectric permittivity of the vacuum.

The spectral theory of self-adjoint operators
[IV.8] is the key mathematical tool in electronic structure
calculation. In some sense, a self-adjoint operator
on $L^2(\mathbb{R}^3)$ can be seen as an infinite-dimensional
Hermitian matrix. A finite-dimensional Hermitian matrix
$A \in \mathbb{C}^{n \times n}$ has exactly $n$ real eigenvalues (taking
multiplicities into account), and can be diagonalized in an
orthonormal basis set. By definition, the spectrum of $A$
is the set $\sigma(A)$ of the complex numbers $\lambda$ such that
$\lambda I - A$ is noninvertible. A finite-dimensional square
matrix is noninvertible if and only if it is noninjective.
Therefore, $\sigma(A)$ is also the set of the eigenvalues
of $A$: namely, the set of the complex numbers $\lambda$ for
which there exists $x \in \mathbb{C}^n \setminus \{0\}$ such that $Ax = \lambda x$.
On the other hand, as $L^2(\mathbb{R}^3)$ is an infinite-dimensional
Hilbert space, the set of the eigenvalues of $H$ (the point
spectrum of $H$) is, in general, a strict subset of the
spectrum $\sigma(H)$ of $H$. For instance, the point spectrum
$\sigma_p(H_0)$ of the kinetic energy operator $H_0 = -\frac{1}{2}\Delta$,
acting on $L^2(\mathbb{R}^3)$, is empty (there is no nonzero function
$\phi \in L^2(\mathbb{R}^3)$ such that $H_0\phi = E\phi$ for some $E \in \mathbb{C}$), while
its spectrum $\sigma(H_0)$ is equal to $[0, +\infty)$. The elements
of $\sigma(H_0)$ can be interpreted as generalized eigenvalues
since, for each $k \in \mathbb{R}^3$, the plane wave
$r \mapsto e_k(r) = e^{i k \cdot r}$ satisfies the equation
$H_0 e_k = E e_k$ with $E = \frac{1}{2}|k|^2$, but it is not a
true eigenfunction, since it does not belong to the
space $L^2(\mathbb{R}^3)$. It is called a generalized eigenfunction. In
physics terminology, true eigenmodes are called bound
states, while generalized eigenmodes are called scattering
states. The nature and the location of the spectrum
of a Schrödinger operator $H = -\frac{1}{2}\Delta + V$ depend on the
potential $V$. For $V(r) = \frac{1}{2}\omega^2 |r|^2$ (a three-dimensional
harmonic oscillator), the spectrum of $H$ is pure point:
that is, it is composed of only the true eigenvalues
$E_n = (n + \frac{3}{2})\omega$, $n \in \mathbb{N}$. For $V(r) = -Z/|r|$, $Z \in \mathbb{N}^*$
(a hydrogen-like ion), the spectrum of $H$ consists of an
infinite sequence of true eigenvalues $E_n = -Z^2/(2n^2)$,
$n \in \mathbb{N}^*$, and of a continuum $[0, +\infty)$ of generalized
eigenvalues.
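
The analogy between a self-adjoint operator and a Hermitian matrix can be made concrete by discretization. The sketch below treats a one-dimensional analog of the harmonic oscillator (an assumption made to keep the matrix small), for which the exact eigenvalues are known to be $(n + \frac{1}{2})\omega$; the grid size, the domain width, and the function name are illustrative.

```python
import numpy as np

def oscillator_levels(omega=1.0, half_width=10.0, n_points=800, n_levels=4):
    """Lowest eigenvalues of a finite-difference -1/2 d^2/dx^2 + 1/2 omega^2 x^2."""
    x = np.linspace(-half_width, half_width, n_points)
    h = x[1] - x[0]
    # Second-order central difference for -1/2 d^2/dx^2 (zero boundary values).
    kinetic = (np.diag(np.full(n_points, 1.0 / h**2))
               - np.diag(np.full(n_points - 1, 0.5 / h**2), k=1)
               - np.diag(np.full(n_points - 1, 0.5 / h**2), k=-1))
    potential = np.diag(0.5 * omega**2 * x**2)
    # The discretized Hamiltonian is a real symmetric (Hermitian) matrix.
    return np.linalg.eigvalsh(kinetic + potential)[:n_levels]

levels = oscillator_levels()
print(levels)  # close to 0.5, 1.5, 2.5, 3.5 for omega = 1
```

In three dimensions the operator separates into three such one-dimensional problems, which is where the zero-point term $\frac{3}{2}\omega$ comes from.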

The purpose of electronic structure calculation is to 
identify the electronic ground state, and possibly the 
lowest-energy excited states, of a molecular system. In 
the framework of independent-electron models, these 
states are easily obtained from the spectral decompo- 
sition of the Hamiltonian H. Assume for simplicity that 
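the bottom of the spectrum of $H$ consists of an increasing
sequence of eigenvalues $\varepsilon_1 \le \varepsilon_2 \le \cdots \le \varepsilon_p$ (taking
multiplicities into account) and that the molecular system
of interest contains an even number $N = 2p$ of
electrons. The ground state energy and the density are
then given by

$$E_0 = 2 \sum_{i=1}^{p} \varepsilon_i \quad \text{and} \quad \rho_0(r) = 2 \sum_{i=1}^{p} |\phi_i(r)|^2,$$

respectively, where $(\phi_i)_{1 \le i \le p}$ is an orthonormal set
of eigenfunctions of $H$ associated with the eigenvalues
$\varepsilon_1, \ldots, \varepsilon_p$:

$$H\phi_i = \varepsilon_i \phi_i \quad \text{and} \quad \int_{\mathbb{R}^3} \overline{\phi_i(r)}\, \phi_j(r)\, dr = \delta_{ij}.$$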



In physics, the eigenvalues $\varepsilon_i$ are called energy levels.
The ground state is therefore obtained by putting
the electrons in the lowest energy levels, under the
constraint that there are at most two electrons per
energy level. This constraint, called the Pauli principle,
is related to the fact that electrons are fermions
of spin $\frac{1}{2}$. We will not elaborate further on the concept
of spin here and simply mention that, from a mathematical
perspective, the spin labels the projective representations
of the rotation group SO(3). Excited states
correspond to other distributions of the $N$ electrons
in the energy levels. For instance, if $\varepsilon_{p+1} > \varepsilon_p$, the
ground state is nondegenerate and the energy of the
first excited state is $E_1 = 2\sum_{i=1}^{p-1} \varepsilon_i + \varepsilon_p + \varepsilon_{p+1}$ (see
figure 1).
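
The filling rule can be written out directly. The following sketch assumes the energy levels are already known and listed with multiplicity; the function name and the sample levels are hypothetical.

```python
def ground_and_first_excited(levels, n_electrons):
    """Energies E_0 and E_1 for N = 2p electrons, at most two per level.

    `levels` lists the eigenvalues with multiplicity and must contain
    at least p + 1 entries.
    """
    p, remainder = divmod(n_electrons, 2)
    if remainder or p < 1 or p >= len(levels):
        raise ValueError("need an even N = 2p with 1 <= p < number of levels")
    eps = sorted(levels)
    e0 = 2 * sum(eps[:p])                           # p doubly occupied levels
    e1 = 2 * sum(eps[:p - 1]) + eps[p - 1] + eps[p] # one electron promoted
    return e0, e1

# Hypothetical levels, N = 8 electrons (p = 4 pairs), as in figure 1.
e0, e1 = ground_and_first_excited([1.0, 2.0, 3.0, 4.0, 5.0], 8)
print(e0, e1)
```

The excitation energy $E_1 - E_0$ reduces to $\varepsilon_{p+1} - \varepsilon_p$, the spacing between the highest occupied and lowest unoccupied levels.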


3 Noninteracting Electrons in Crystals 

As in solid state physics textbooks, we will first focus 
on the case of perfect crystals, which are periodic 
arrangements of atoms. We will show in particular that 
a simple model of noninteracting electrons in a perfect
crystal allows us to qualitatively understand why
some crystalline materials are insulators while others
are conductors. The mathematical framework for investigating
the electronic structure of perfect crystals is
Bloch’s theory, the basics of which are sketched below.
We will then see how to model crystals with defects and
disordered materials such as doped semiconductors or
alloys.

Figure 1 (a) The ground state and (b) the first excited
state for N = 8 electrons (p = 4 electron pairs).

In noninteracting electron models, a perfect crystal is
characterized by a lattice $\mathcal{R}$ (a discrete subgroup of $\mathbb{R}^3$)
and an $\mathcal{R}$-periodic potential $V_{\mathrm{per}} : \mathbb{R}^3 \to \mathbb{R}$. For the sake
of simplicity we will deal with the case when $\mathcal{R} = \mathbb{Z}^3$,
so that

$$\forall R \in \mathbb{Z}^3, \quad \forall r \in \mathbb{R}^3, \quad V_{\mathrm{per}}(r + R) = V_{\mathrm{per}}(r),$$

and we denote the unit cell by $\Gamma = [-\frac{1}{2}, \frac{1}{2})^3$. Introducing
the translation operators $\tau_R$ defined by $(\tau_R \phi)(r) = \phi(r - R)$,
the above periodicity condition can be formulated
as $\tau_R V_{\mathrm{per}} = V_{\mathrm{per}}$ for all $R \in \mathbb{Z}^3$. The quantum
Hamiltonian

$$H_{\mathrm{per}} = -\tfrac{1}{2}\Delta + V_{\mathrm{per}}$$

is called a periodic Schrödinger operator. Under some
local integrability assumptions on $V_{\mathrm{per}}$, it is self-adjoint
on $L^2(\mathbb{R}^3)$. It has no true eigenvalues ($\sigma_p(H_{\mathrm{per}}) = \emptyset$),
and its spectrum is bounded below and composed of a
countable number of possibly overlapping intervals of
the real line, called bands. To establish this fundamental
result, which is key to explaining insulating, semiconducting,
and conducting behaviors, we need Bloch’s
theory.

Denoting by $\Gamma^* = [-\pi, \pi)^3$ the Brillouin zone associated
with the lattice $\mathbb{Z}^3$, any function $u \in L^2(\mathbb{R}^3)$ can
be decomposed as

$$u(r) = \frac{1}{(2\pi)^3} \int_{\Gamma^*} u_q(r)\, e^{i q \cdot r}\, dq,$$

where

$$u_q(r) = \sum_{R \in \mathbb{Z}^3} u(r + R)\, e^{-i q \cdot (r + R)}.$$

For each $q \in \Gamma^*$, the function $u_q$ is in $L^2_{\mathrm{per}}$, the space of
complex-valued, $\mathbb{Z}^3$-periodic, locally square-integrable
functions. This decomposition, which is reminiscent of
the Fourier transform, is called the Bloch-Floquet transform.
Bloch’s theorem states that, if an operator $A$ on
$L^2(\mathbb{R}^3)$ commutes with the translations of the lattice $\mathbb{Z}^3$
(i.e., if it satisfies $A\tau_R = \tau_R A$ for all $R \in \mathbb{Z}^3$), then there
exists a family of operators $(A_q)_{q \in \Gamma^*}$ on $L^2_{\mathrm{per}}$ such that

$$(Au)(r) = \frac{1}{(2\pi)^3} \int_{\Gamma^*} (A_q u_q)(r)\, e^{i q \cdot r}\, dq.$$

The above formula means that the operator $A$ is block
diagonalized by the Bloch-Floquet transform. The operators
$(H_{\mathrm{per}})_q$ have simple expressions:

$$(H_{\mathrm{per}})_q = -\tfrac{1}{2}\Delta - i q \cdot \nabla + \tfrac{1}{2}|q|^2 + V_{\mathrm{per}}.$$


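For each $q \in \Gamma^*$, the operator $(H_{\mathrm{per}})_q$ can be diagonalized
in an orthonormal basis set of $L^2_{\mathrm{per}}$,

$$(H_{\mathrm{per}})_q \phi_{n,q} = \varepsilon_{n,q} \phi_{n,q}, \qquad \int_{\Gamma} \overline{\phi_{m,q}(r)}\, \phi_{n,q}(r)\, dr = \delta_{mn},$$

and the sequence $(\varepsilon_{n,q})_{n \ge 1}$ of its eigenvalues converges
to $+\infty$. Besides, as the mapping $q \mapsto (H_{\mathrm{per}})_q$
is analytic, the eigenvalues $\varepsilon_{n,q}$ can be indexed in such
a way that $\varepsilon_{1,q} \le \varepsilon_{2,q} \le \cdots$ and the mapping $q \mapsto \varepsilon_{n,q}$
is even and continuous (and, in fact, analytic in each
direction). As a consequence (see figure 2),

$$\sigma(H_{\mathrm{per}}) = \bigcup_{q \in \Gamma^*} \bigcup_{n \ge 1} \{\varepsilon_{n,q}\} = \bigcup_{n \ge 1} [\Sigma_n^-, \Sigma_n^+]$$

with

$$\Sigma_n^- = \min_{q \in \Gamma^*} \varepsilon_{n,q} \quad \text{and} \quad \Sigma_n^+ = \max_{q \in \Gamma^*} \varepsilon_{n,q}.$$

The interval $[\Sigma_n^-, \Sigma_n^+]$ is called the $n$th band of the spectrum
of $H_{\mathrm{per}}$. If $\Sigma_{n+1}^- > \Sigma_n^+$, the interval $(\Sigma_n^+, \Sigma_{n+1}^-)$ is
called a spectral gap. It is easily checked that the functions
$\psi_{n,q}(r) = \phi_{n,q}(r)\, e^{i q \cdot r}$ are generalized eigenfunctions
of $H_{\mathrm{per}}$:

$$H_{\mathrm{per}} \psi_{n,q} = \varepsilon_{n,q} \psi_{n,q}.$$

The functions $\psi_{n,q}$ are not periodic but quasiperiodic,
in the sense that for each $R \in \mathbb{Z}^3$, $(\tau_R \psi_{n,q})(r) =
e^{-i q \cdot R} \psi_{n,q}(r)$. They are called Bloch waves.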

Figure 2 The electronic structure of an insulating
crystal with N = 4 electrons per unit cell.
[Figure labels: valence bands; conduction bands.]

The electronic ground state is then obtained by filling
the lowest energy levels. In this particular setting, this
means that, if the crystal contains $N$ electrons per unit
cell, the ground state energy per unit volume and the
ground state density are, respectively, given by


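$$E_0 = 2 \sum_{n \ge 1} \frac{1}{(2\pi)^3} \int_{\Gamma^*} \mathcal{H}(\varepsilon_{\rm F} - \varepsilon_{n,q})\, \varepsilon_{n,q}\, dq,$$

$$\rho_0(r) = 2 \sum_{n \ge 1} \frac{1}{(2\pi)^3} \int_{\Gamma^*} \mathcal{H}(\varepsilon_{\rm F} - \varepsilon_{n,q})\, |\phi_{n,q}(r)|^2\, dq,$$

where $\mathcal{H}$ is the Heaviside function ($\mathcal{H}(x) = 0$ if $x < 0$;
$\mathcal{H}(x) = 1$ if $x \ge 0$). In the above formulas, $\varepsilon_{\rm F}$ is a real
number, called the Fermi level, such that $2\mathcal{N}(\varepsilon_{\rm F}) = N$,
where the function

$$\mathcal{N}(E) = \sum_{n \ge 1} \frac{1}{(2\pi)^3} \int_{\Gamma^*} \mathcal{H}(E - \varepsilon_{n,q})\, dq$$

denotes the so-called integrated density of states.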

Filling one band therefore corresponds to putting
two electrons into each unit cell. One of two situations
then arises (see figure 3). The first corresponds to the
case when $N = 2p$ is even and there is a gap $g :=
\Sigma_{p+1}^- - \Sigma_p^+ > 0$ between the $p$th and the $(p+1)$st band.
In this case, the lowest $p$ bands (the valence bands) are
completely filled, while the other bands (the conduction
bands) are empty. The minimum energy required to
excite an electron from the valence bands to the conduction
bands, in which electrons are free to travel, is
then equal to $g$. In the second situation ($N = 2p - 1$, or
$N = 2p$ and the $p$th and $(p+1)$st bands overlap), the
$p$th band is not completely filled and an infinitesimal
amount of energy is sufficient to create an electronic
excitation. In the former case the crystal behaves as an
insulator (large gap) or a semiconductor (small gap),
while in the latter case it behaves as a metal.

Figure 3 Spectra of (a) an insulating perfect crystal with
N = 4 electrons per unit cell, (b) a conducting perfect crystal
with N = 6 electrons per unit cell, and (c) an insulating
crystal with a local defect.
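The filling rule just described can be written as a small classifier. The sketch below is only illustrative: the function name `classify_crystal`, the list of band edges passed to it, and the threshold `small_gap` separating "insulator" from "semiconductor" are assumptions introduced here, since the text gives no numerical boundary between a large and a small gap.

```python
def classify_crystal(band_edges, n_electrons, small_gap=3.0):
    """Classify a perfect crystal from its band edges and electron count.

    band_edges  : list of (sigma_minus, sigma_plus) pairs, one per band,
                  ordered by band index (band 1 first).
    n_electrons : N, the number of electrons per unit cell.
    small_gap   : hypothetical threshold (same energy unit as band_edges)
                  below which a gapped crystal is called a semiconductor.
    Returns a pair (label, gap).
    """
    # Odd N: the last occupied band is only half filled.
    if n_electrons % 2 == 1:
        return "metal", 0.0
    p = n_electrons // 2
    if p < 1 or p >= len(band_edges):
        raise ValueError("need band edges for bands 1..p+1")
    # g = Sigma^-_{p+1} - Sigma^+_p (lists are 0-based, bands are 1-based).
    gap = band_edges[p][0] - band_edges[p - 1][1]
    if gap <= 0.0:
        # Bands p and p+1 overlap, so band p is not completely filled.
        return "metal", 0.0
    label = "semiconductor" if gap < small_gap else "insulator"
    return label, gap
```

For example, `classify_crystal([(0.0, 1.0), (2.0, 3.0), (5.0, 6.0)], 4)` fills the two lowest bands and reports the gap between the second and third bands.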

Perfect crystals do not exist in nature. Besides the 
fact that real crystals are of course finite, the arrange- 
ment of atoms in such materials is not perfectly peri- 
odic. Indeed, real crystals contain both local defects 
(vacancies, interstitial atoms, impurities, dislocation 
loops) and extended defects (dislocation lines, grain 
boundaries). For the sake of brevity we focus on local 
defects in this article. 

A single local defect (or a finite number of them) is
modeled by a perturbed periodic Schrödinger operator
$H = -\tfrac{1}{2}\Delta + V_{\rm per} + W$, where $W$ is a potential that
vanishes at infinity. The spectrum of $H$ contains the
spectrum of $H_{\rm per}$ but may also contain discrete eigenvalues
located in the spectral gaps of $H_{\rm per}$ (see figure 3). These
eigenvalues correspond to bound states localized in the 
vicinity of the defects, and they play an important role 
in physics. 

Doped semiconductors are key materials in electron- 
ics. Their properties are due to the presence of impuri- 
ties randomly distributed throughout the crystal. A few 
impurities per million atoms can dramatically modify 
the electronic properties of the crystal. Doped semiconductors
can be modeled by random Schrödinger
operators of the form $H_\omega = -\tfrac{1}{2}\Delta + V_\omega$, with

$$V_\omega(r) = \sum_{R \in \mathbb{Z}^3} \big[ (1 - \omega_R)\, v(r - R) + \omega_R\, w(r - R) \big],$$

where $v$ and $w$ are compactly supported functions and
the $\omega_R$ are independent, identically distributed random
variables. Consider, for instance, the case when
$\omega_R$ is a Bernoulli random variable, meaning that $\omega_R =
1$ with probability $p$ and $\omega_R = 0$ with probability
$1 - p$. If $p = 0$, the potential is periodic (equal to
$\sum_{R \in \mathbb{Z}^3} v(r - R)$). For $0 < p < 1$, on the other hand,
the operator $H_\omega$ depends on the realization of the random
variables $\omega_R$. Similar stochastic models can be used to describe
alloys. The study of the spectral properties of random
Schrödinger operators is an active field of research. It
is known that the spectrum of $H_\omega$ is almost surely
independent of the realization $\omega = (\omega_R)_{R \in \mathbb{Z}^3}$. This
result is in fact a consequence of the Birkhoff ergodic 
theorem of probability theory. A fundamental result 
is that disorder tends to localize the electrons. While 
there are no bound states for p = 0, true eigenvalues 
that correspond to bound states appear in the vicinity
of the edges of the bands as soon as $p > 0$. This
phenomenon, called Anderson localization, is central
to physics. Many mathematical questions about the
spectral properties of random Schrödinger operators
remain open.

4 Density Functional Theory 

Let us finally turn to the modeling of interacting elec- 
trons in crystals. We limit ourselves to the case of per- 
fect crystals. Extending these models to the cases of 
crystals with defects and alloys is at the edge of current 
research. 

The introduction of the Kohn-Sham model in the 
mid-1960s revolutionized the modeling and simula- 
tion of the electronic structure of solids. Indeed, this 
model allows us to obtain simulation results that are 
in quantitative agreement with experimental data. 

The Kohn-Sham model is a mean-field model that
allows computation of the ground state density. For a
perfect crystal, the Kohn-Sham Hamiltonian reads as
follows:

$$H_{\rm per}^{\rho_0} = -\tfrac{1}{2}\Delta + V_{\rm per}^{{\rm C},\rho_0} + V_{\rm per}^{{\rm xc},\rho_0},$$

where $V_{\rm per}^{{\rm C},\rho_0}$ and $V_{\rm per}^{{\rm xc},\rho_0}$ are, respectively, the Coulomb
and exchange-correlation potentials. The former is the
electrostatic potential generated by the total (nuclear
and electronic) charge distribution. It is obtained by
solving the Poisson equation

$$-\Delta V_{\rm per}^{{\rm C},\rho_0} = 4\pi(\rho_0 - \rho^{\rm nuc}), \qquad V_{\rm per}^{{\rm C},\rho_0}\ \ \mathbb{Z}^3\text{-periodic},$$

where $\rho^{\rm nuc}$ is the periodic nuclear charge distribution
and $\rho_0$ is the periodic electronic ground state density.
Several expressions for $V_{\rm per}^{{\rm xc},\rho_0}$ have been formulated by
physicists and chemists, based on theoretical physics
arguments, and parametrized by quantum Monte Carlo
simulations of the homogeneous (interacting) electron
gas. The simplest model for $V_{\rm per}^{{\rm xc},\rho_0}$ is the X$\alpha$ potential
introduced by Slater, for which

$$V_{\rm per}^{{\rm xc},\rho_0}(r) = -C \rho_0(r)^{1/3},$$

where $C$ is a given positive constant. Much more elaborate
forms for $V_{\rm per}^{{\rm xc},\rho_0}$ are used in practice. As in the
noninteracting case, the electronic ground state density
is obtained by filling up the lowest bands of the
periodic Schrödinger operator $H_{\rm per}^{\rho_0}$. But the crucial
difference is that, this time, the Hamiltonian $H_{\rm per}^{\rho_0}$ depends
on the ground state density. The Kohn-Sham model is
therefore a nonlinear eigenvalue problem. Note that the
Kohn-Sham equations are the Euler equations (the first-order
optimality conditions) of a constrained optimization
problem consisting of minimizing the Kohn-Sham
energy functional on the set of admissible electronic
states.
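To see what "nonlinear eigenvalue problem" means in practice, it may help to look at a deliberately tiny analogue. The sketch below is not the Kohn-Sham model: it assumes a hypothetical two-site Hamiltonian $H(\rho) = \begin{pmatrix} g\rho_1 & -t \\ -t & g\rho_2 \end{pmatrix}$ whose diagonal depends on the density $\rho$ of its own ground state, and it looks for a self-consistent $\rho$ by damped fixed-point iteration. The names `ground_state_density` and `scf`, the coupling `g`, the hopping `t`, and the mixing parameter are all illustrative choices.

```python
import math


def ground_state_density(rho, g=1.0, t=1.0):
    """Site densities of the lowest eigenvector of
    H(rho) = [[g*rho[0], -t], [-t, g*rho[1]]] (toy model, t > 0)."""
    delta = g * (rho[0] - rho[1])           # difference of the diagonal entries
    # Closed form for a real symmetric 2x2 matrix: the site with the
    # higher on-site energy carries the smaller share of the ground state.
    s = delta / math.hypot(delta, 2.0 * t)
    return (0.5 * (1.0 - s), 0.5 * (1.0 + s))


def scf(rho0, g=1.0, t=1.0, mixing=0.5, tol=1e-12, max_iter=200):
    """Damped fixed-point iteration rho <- (1 - mixing)*rho + mixing*F(rho),
    where F maps a density to the ground state density of H(rho)."""
    rho = tuple(rho0)
    for iteration in range(1, max_iter + 1):
        new = ground_state_density(rho, g, t)
        residual = max(abs(new[0] - rho[0]), abs(new[1] - rho[1]))
        if residual < tol:
            return rho, iteration
        rho = tuple((1.0 - mixing) * r + mixing * n for r, n in zip(rho, new))
    raise RuntimeError("self-consistency not reached")
```

Starting from an asymmetric guess such as `scf((0.9, 0.1))`, the iteration relaxes toward the symmetric density, which is a fixed point of this toy model by symmetry. Practical codes use more elaborate mixing schemes; the damping used here is only the simplest stabilizing choice.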

From a numerical point of view, the ground state
can be obtained either by minimizing the Kohn-Sham
energy functional or by solving the Kohn-Sham equations
by an iterative procedure called a self-consistent
field algorithm. The Kohn-Sham model for perfect crystals
is usually discretized as follows. First, the Brillouin
zone $\Gamma^*$ is meshed using a regular grid $\mathcal{Q}_L$ of
step length $\Delta q = 2\pi/L$, $L \in \mathbb{N}^*$. Then, for each
$q \in \mathcal{Q}_L = \Delta q\, \mathbb{Z}^3 \cap \Gamma^*$, the lowest-energy eigenvalues
and eigenfunctions of the operator $(H_{\rm per})_q$
are computed numerically by a Rayleigh-Ritz approximation
in the space spanned by the Fourier modes
$(e_k)_{k \in 2\pi\mathbb{Z}^3,\, |k| \le \sqrt{2E_c}}$, where $E_c$ is an energy cutoff.
This amounts to diagonalizing the Hermitian matrix
$[\langle e_k | (H_{\rm per})_q | e_{k'} \rangle]_{k,k' \in 2\pi\mathbb{Z}^3,\, |k| \le \sqrt{2E_c},\, |k'| \le \sqrt{2E_c}}$, where

$$\langle e_k | (H_{\rm per})_q | e_{k'} \rangle = \int_\Gamma e_k(r)^*\, \big((H_{\rm per})_q e_{k'}\big)(r)\, dr = \frac{|k+q|^2}{2}\, \delta_{k,k'} + \int_\Gamma V_{\rm per}(r)\, e^{i(k'-k)\cdot r}\, dr.$$

When the two numerical parameters of the simulation,
$L$ and $E_c$, go to infinity, the numerical results (the
ground state energy per unit volume and the ground
state density) converge to the exact solution of the
Kohn-Sham model.
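The discretization steps above can be illustrated in one space dimension, where the lattice is $\mathbb{Z}$, the Brillouin zone is $[-\pi, \pi)$, and the Fourier modes are $e_k(x) = e^{ikx}$ with $k \in 2\pi\mathbb{Z}$. The following sketch assumes a fixed (not self-consistent) model potential $V(x) = 2 v_0 \cos(2\pi x)$, whose only nonzero Fourier coefficients couple modes with $k - k' = \pm 2\pi$. The function names, the potential, and the use of a mode count `n_modes` in place of the energy cutoff (roughly $n_{\rm modes} = \lfloor \sqrt{2E_c}/2\pi \rfloor$) are assumptions made for illustration.

```python
import numpy as np


def bloch_matrix(q, v0, n_modes):
    """Matrix of (H_per)_q in the basis e_k(x) = exp(i k x),
    k = 2*pi*j for j = -n_modes..n_modes, for V(x) = 2*v0*cos(2*pi*x)."""
    k = 2.0 * np.pi * np.arange(-n_modes, n_modes + 1)
    h = np.diag(0.5 * (k + q) ** 2)          # kinetic part |k + q|^2 / 2
    off = v0 * np.ones(2 * n_modes)          # <e_k|V|e_k'> = v0 if k - k' = +-2*pi
    return h + np.diag(off, 1) + np.diag(off, -1)


def band_structure(v0, n_q=20, n_modes=10, n_bands=3):
    """Lowest n_bands eigenvalues on a regular grid of n_q points in [-pi, pi)."""
    q_grid = -np.pi + 2.0 * np.pi * np.arange(n_q) / n_q   # step 2*pi/L, L = n_q
    bands = np.array([
        np.linalg.eigvalsh(bloch_matrix(q, v0, n_modes))[:n_bands]
        for q in q_grid
    ])
    return q_grid, bands


def band_edges(bands):
    """Pairs (min over q, max over q) for each computed band."""
    return [(float(b.min()), float(b.max())) for b in bands.T]
```

With `v0 = 0` the lowest band reduces to the free-electron parabola $q^2/2$, and a nonzero `v0` opens a gap at the zone boundary. Increasing `n_q` and `n_modes` plays the role of letting $L$ and $E_c$ go to infinity.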

Further Reading 

Kohn, W. 1999. Nobel Lecture: Electronic structure of 
matter— wave functions and density functionals. Reviews 
of Modern Physics 71:1253-66. 

Martin, R. M. 2004. Electronic Structure: Basic Theory
and Practical Methods. Cambridge: Cambridge University 
Press. 

Reed, M., and B. Simon. 1978-1981. Methods of Modern 
Mathematical Physics, volumes I-IV. New York: Academic 
Press. 
VII.15 Flame Propagation

Moshe Matalon 


1 Introduction 


A fundamental problem in combustion theory is the 
determination of the propagation speed of a premixed 
flame through a gaseous combustible mixture. In a pre- 
mixed system the fuel and oxidizer are already mixed, 
so following ignition a chemical reaction takes place 
and spreads into the remaining unburned mixture until 
one of the reactants is completely depleted. The propa- 
gation speed depends on whether the combustible mix- 
ture is quiescent or in a state of motion and on whether 
the underlying flow is laminar or turbulent. Its practical 
importance is evident, allowing, for example, the deter- 
mination of the mean fuel consumption rate in a com- 
bustor or an estimation of the spread of, and damage 
caused by, an explosion in a combustible system. 

It is seldom the case that the original reactants in 
a combustion system interact among themselves in 
a single step and produce the final product. In most 
cases there is a large number of steps with intermediate 
species involved before the final formation of products. 
It is convenient, however, to use a global representation 

$$\text{fuel} + \nu\, \text{oxidizer} \to (1 + \nu)\, \text{products} + \{Q\}$$

for the chemical description. Accordingly, a mass $\nu$ of
oxidizer is consumed for each unit mass of fuel, producing
a mass $(1 + \nu)$ of products and releasing a thermal
energy $Q$. The differential equation describing the
mass balance of fuel is

$$\frac{DY_F}{Dt} - \mathcal{D}_F \nabla^2 Y_F = -\frac{\omega}{\rho}, \tag{1}$$

where the operator $D/Dt = \partial/\partial t + v \cdot \nabla$ is the convective
derivative, i.e., $D\phi/Dt$ is the rate of change of the property
$\phi$ of a material element taken while following the
fluid motion, with $v$ the velocity vector and $\nabla$ the gradient
operator taken with respect to the three spatial
coordinates (the dot signifies the inner product). Here,
$\rho$ is the density of the entire mixture, the fuel mass fraction
$Y_F$ is the ratio of the mass of fuel to the total mass
of the mixture, and the coefficient $\mathcal{D}_F$ is the diffusivity
of fuel relative to the bulk (nitrogen for combustion in
air). In lean mixtures, where the reaction consumes only
a very small amount of oxidizer, the oxidizer mass fraction
$Y_O$ can be treated as constant; otherwise, it is an
additional variable, and a similar equation to (1) must
be written for its consumption. The differential equation
for the energy balance expressed in terms of the
temperature $T$ is

$$\frac{DT}{Dt} - \alpha \nabla^2 T = \frac{Q\, \omega}{\rho c_p}, \tag{2}$$

where $\alpha$ is the thermal diffusivity and $c_p$ is the specific
heat (at constant pressure) of the mixture. Equations
(1) and (2) state that fuel is consumed and energy
released (fuel and energy are the two main ingredients
of any combustion system) at a rate $\omega$. The reaction
rate typically obeys an Arrhenius law of the form

$$\omega = B \rho Y_F\, e^{-E/\mathcal{R}T},$$

with $E$ the activation energy (the minimum energy
required for the reaction to be possible) and $\mathcal{R}$ the
universal gas constant.

The large temperature variations within the com- 
bustion field produce large density variations, which, 
in turn, modify the flow field. Since flames propagate 
very slowly compared with sound waves, the process 
is nearly isobaric and the density is inversely proportional
to the temperature. The velocity $v$ obeys the
Navier-Stokes equations

$$\frac{D\rho}{Dt} + \rho \nabla \cdot v = 0, \tag{3}$$

$$\rho \frac{Dv}{Dt} = -\nabla p + \mu \nabla \cdot \Sigma, \tag{4}$$

describing conservation of mass and momentum. Here,
$p$ is the dynamic pressure (the small deviations from
the ambient pressure), $\Sigma$ is the viscous stress tensor,
and $\mu$ is the viscosity of the mixture.

Despite the numerous simplifications introduced in 
the above formulation, its mathematical complexity is 
apparent, involving coupled partial differential equa- 
tions that, in addition to the quadratic nonlinearity 
of ordinary fluid flow problems, contain the highly 
nonlinear exponential term. Recent advances in com- 
putational capabilities permit such problems to be 
addressed. However, the computations are quite inten- 
sive and are usually carried out for a particular set 
of parameters. Moreover, when dealing with multidi- 
mensional flows and turbulence, they are often unable 
to provide the required spatial and temporal resolu- 
tions. Fundamental understanding has been achieved 
primarily by analyzing simplified mathematical mod- 
els that elucidate the physical interactions taking place 
between the various mechanisms involved in a given 
process. Because of the disparities between the spa- 
tial scales and timescales involved in combustion prob- 
lems, the techniques of perturbation and asymptotic 
methods have become the primary tools of mathematical
analysis.


2 The Planar Adiabatic Flame 


One of the simplest problems in combustion is a planar
flame propagating into a quiescent mixture of temperature
$T_u$, under adiabatic conditions. The solution, of
the form $\phi = \phi(x + S_L t)$, corresponds to a combustion
wave propagating from right to left along the $x$-axis at
a speed $S_L$ that is uniquely determined by the thermodynamics
and the transport and chemical kinetic properties
of the combustible mixture. The determination
of $S_L$, known as the laminar flame speed, has been the
subject of a large number of theoretical, numerical, and
experimental investigations. An analytical solution is
not available, but an asymptotic approximation can be
obtained when $E/\mathcal{R}T_u \gg 1$, i.e., when the activation
energy of the chemical reaction is much larger than
the energy of the fresh mixture, a condition that characterizes
most combustion systems. In this limit, the
reaction is confined to a thin region like that illustrated
in figure 1. Elsewhere, the chemical reaction rate $\omega$ is
negligible either because the exponential factor is vanishingly
small (the preheat zone) or because the fuel
has been entirely consumed (the post-flame zone). The
asymptotic solution is then constructed by obtaining
solutions of the simplified equations in each of the
separate regions. Asymptotic matching then yields an
expression for the flame speed that, for a lean mixture
under the adopted simplifications, is of the form


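$$S_L = \frac{\rho_b}{\rho_u} \sqrt{\frac{2\alpha^2 B}{\mathcal{D}_F}}\; \frac{\mathcal{R}T_a^2}{E(T_a - T_u)}\; e^{-E/2\mathcal{R}T_a}, \tag{5}$$

where

$$T_a = T_u + Q Y_{F_u}/c_p \tag{6}$$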
is the (adiabatic) flame temperature, a direct conse- 
quence of energy conservation. The subscripts “u” and 
“b” identify conditions in the unburned and burned 
states, respectively. The flame speed and temperature 
given by (5), (6) are two of the most important proper- 
ties that characterize a premixed flame. The flame tem- 
perature increases with increasing heat release Q, and 
its appearance in the highly temperature-dependent 
exponential in (5) implies that it exerts the strongest 
influence on the flame speed, meaning that reactions 
with larger values of heat release propagate faster. The 
flame speed is also affected by preheating the mixture,
i.e., increasing $T_u$, or diluting the gas with an inert substance
whose properties can change the thermal and/or
mass diffusivities markedly.
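The sensitivity of the flame speed to the heat release can be checked numerically. The sketch below evaluates (6) directly and, for (5), assumes the large-activation-energy form $S_L = (\rho_b/\rho_u) \sqrt{2\alpha^2 B/\mathcal{D}_F}\, [\mathcal{R}T_a^2 / E(T_a - T_u)]\, e^{-E/2\mathcal{R}T_a}$, together with $\rho_b/\rho_u = T_u/T_a$ for a nearly isobaric gas. The function names, the unit system (SI, with $E$ in J/mol), and all parameter values in the usage note are illustrative assumptions, not data from the text.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)


def adiabatic_flame_temperature(t_u, q, y_fu, c_p):
    """Equation (6): T_a = T_u + Q * Y_Fu / c_p."""
    return t_u + q * y_fu / c_p


def laminar_flame_speed(t_u, t_a, alpha, d_f, b, e_act):
    """Assumed form of equation (5) for a lean mixture.

    alpha, d_f : thermal and fuel diffusivities (m^2/s)
    b          : pre-exponential factor B (1/s)
    e_act      : activation energy E (J/mol)
    """
    rho_ratio = t_u / t_a   # rho_b / rho_u, density inversely proportional to T
    # Dimensionless group E (T_a - T_u) / (R T_a^2) appearing in (5).
    beta = e_act * (t_a - t_u) / (R_GAS * t_a ** 2)
    prefactor = math.sqrt(2.0 * alpha ** 2 * b / d_f)
    return rho_ratio * prefactor * math.exp(-e_act / (2.0 * R_GAS * t_a)) / beta
```

For instance, with the made-up values `t_u=300.0`, `y_fu=0.04`, `c_p=1200.0`, `alpha=d_f=2e-5`, `b=1e9`, and `e_act=1.25e5`, raising `q` from `4.5e7` to `5.7e7` raises the flame temperature and, through the exponential, increases the computed flame speed.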


Figure 1 The structure of a planar premixed flame.
[Figure labels: unburned gas, $v = 0$; reaction zone; burned gas,
$v = (\sigma - 1)S_L$; flame temperature $T_a$.]

The mechanism of propagation is associated with the 
heat conducted back from the reaction zone, precipi- 
tating a rise in temperature in the adjacent gas layer 
that triggers the chemical reaction. Since, in a one- 
dimensional flow, the mass flux relative to the wave is 
constant, the gas traveling through the flame expands 
and speeds up. The hot burned gas moves away in the 
direction opposite to flame propagation at a velocity
larger than $S_L$ by a factor $(\sigma - 1)$, where $\sigma = \rho_u/\rho_b$ is
the thermal expansion parameter.

The realization nearly fifty years ago that the pla- 
nar propagation problem could be solved by means 
of asymptotic techniques paved the way for much of 
the theoretical development that has taken place in 
recent years, particularly the mathematical description 
of unsteady multidimensional laminar and turbulent 
flames, which is discussed next. 

3 Multidimensional Laminar Flames 

Although planar flames can be observed in the laboratory
if appropriate measures are taken, real flames
are seldom flat. A Bunsen flame, for example, maintains
a conical shape when the gas velocity at the exit
of the burner, $V$, is greater than $S_L$. The geometry then
determines the cone opening angle $\theta = 2\sin^{-1}(S_L/V)$.
The shape of the flame stabilized in the laboratory
around a porous sphere is affected by natural convection
and resembles a teardrop that lacks spherical
symmetry. Buoyancy plays a significant role when the
representative Froude number ${\rm Fr} = V^2/gL$ is small;
here, $V$ is a characteristic flow velocity, $L$ is a measure
of the flame height, and $g$ is gravitational acceleration.
A more fundamental reason why flames are
often corrugated and propagate in an unsteady manner
is associated with instabilities, the most prominent
of which, in premixed flames, is the hydrodynamic,
or Darrieus-Landau (DL), instability. The gas expansion
induces hydrodynamic disturbances that tend to
enhance perturbations of the flame front. Diffusion
effects that often have stabilizing influences act only
on the short-wavelength disturbances, so large flames
acquire cusp-like conformations with elongated intrusions
pointing toward the burned gas region, as shown
in figure 2. These structures are stable and, because
of their larger surface area, propagate at a speed $U_L$
that is substantially larger than the laminar flame speed
$S_L$. The dashed curve in the figure is the flame front,
the solid curves are selected streamlines, and the various
shades of gray correspond to regions of increased
velocity. The flow pattern demonstrates the deflection
of streamlines upon crossing the flame front, which is
a consequence of gas expansion, and the induced vortical
motion in the unburned gas that is otherwise at
rest, which is responsible for sustaining the cellular
structure by “pushing” the crests upward.

Figure 2 A steadily propagating corrugated flame.
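Returning briefly to the Bunsen flame, the cone-angle relation $\theta = 2\sin^{-1}(S_L/V)$ and the Froude number are easy to evaluate. The helper names below and the definition ${\rm Fr} = V^2/(gL)$ (the dimensionless combination of the three quantities named in the text) are assumptions of this sketch.

```python
import math


def bunsen_cone_angle(s_l, v_exit):
    """Full opening angle (radians) of a conical Bunsen flame,
    theta = 2*asin(S_L / V); requires V > S_L."""
    if v_exit <= s_l:
        raise ValueError("a conical flame requires V > S_L")
    return 2.0 * math.asin(s_l / v_exit)


def froude_number(v, length, g=9.81):
    """Fr = V^2 / (g L), with V in m/s, L in m, and g in m/s^2."""
    return v ** 2 / (g * length)
```

Doubling the exit velocity relative to the flame speed, as in `bunsen_cone_angle(0.4, 0.8)`, gives a full opening angle of about 60 degrees; the cone narrows as $V/S_L$ grows.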

The mathematical description of multidimensional
flames exploits the disparity among the length scales
associated with the fluid dynamic field, the diffusion
processes, and the highly temperature-sensitive reaction
rate. On the largest scale, the entire flame, consisting
of the preheat and reaction zones, may be
viewed as a surface of discontinuity separating burned
from unburned gases, described by $\psi(x,t) = 0$.
Equations (3)-(4), with densities $\rho_u$, $\rho_b$, must then be
solved on either side of the flame front with appropriate
jump conditions for the pressure and velocities
across $\psi = 0$. The internal flame structure facilitated
by the large-activation-energy assumption is resolved
on the smaller diffusion length scale, and through
asymptotic matching it provides the aforementioned
jump relations in the form of generalized Rankine-Hugoniot
relations [V.20 §2.3], as well as giving us
an expression for the flame speed. It is customary in
combustion to define the flame speed $S_f$ relative to the
unburned gas, namely, $S_f = v_n^* - V_f$, where $v_n = v \cdot n$
denotes the normal component of the gas velocity and
$V_f = -\psi_t/|\nabla\psi|$ is the propagation speed with respect
to a fixed coordinate system; here, the unit normal $n$
is taken positive when pointing toward the burned gas
region, the asterisk superscript indicates that the velocity
is to be evaluated at the flame front on its unburned
side, and the subscript $t$ stands for time differentiation.
The expression for the flame speed takes the form

$$S_f = S_L - \mathcal{L}K, \tag{7}$$

where K is the stretch rate, a measure of the flame front 
deformation that results from the normal propagating 
motion and the nonuniform underlying flow field. A 
spherically expanding flame is stretched because of the 
increase in its surface area that occurs at a rate propor- 
tional to the instantaneous curvature. A planar flame 
stabilized in a stagnation point flow is stretched as a 
result of the diverging flow at a rate proportional to the 
hydrodynamic strain. For a general surface, the local
stretch rate can be obtained from kinematic considerations
and is found to depend on the (mean) curvature of
the flame surface $\kappa$ and the underlying hydrodynamic
strain, namely,

$$K = \kappa S_L - n \cdot \mathbf{E} \cdot n,$$

where $\mathbf{E}$ is the strain rate tensor. The coefficient $\mathcal{L}$,
known as the Markstein length, is of the order of the
flame thickness and incorporates the effects of diffusion
and chemical reaction. It can take positive or negative
values depending on the mixture composition. In
an experimental setting, changes in $\mathcal{L}$ could be accommodated
by varying the fuel type and mixture composition
or the system’s pressure. The mathematical formulation
is thus composed of a nonlinear free-boundary
problem for the pressure and velocity fields with the
free surface (the flame front) determined by solving

$$\psi_t + v^* \cdot \nabla\psi = S_f |\nabla\psi|, \tag{8}$$
with $S_f$ given by (7). This formulation, known as the
hydrodynamic theory, has been a useful framework for
understanding intricate flame-flow interactions and, in
particular, the development of flame instabilities. The
flame shown in figure 2, for example, was obtained
using a variable-density Navier-Stokes solver in conjunction
with a front-capturing technique.
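As a small worked instance of the stretch relation, consider a spherically expanding flame in a gas with no imposed strain. The sketch below evaluates $K = \kappa S_L - n \cdot \mathbf{E} \cdot n$ and $S_f = S_L - \mathcal{L}K$ under the assumption, introduced here, that the curvature of a sphere of radius $R$ is $\kappa = 2/R$ (the sum of the principal curvatures). The function names and sample numbers are illustrative.

```python
def stretch_rate(curvature, s_l, normal_strain=0.0):
    """K = kappa * S_L - n.E.n, where normal_strain stands for n.E.n (1/s)."""
    return curvature * s_l - normal_strain


def stretched_flame_speed(s_l, markstein_length, stretch):
    """Equation (7): S_f = S_L - L * K."""
    return s_l - markstein_length * stretch


def expanding_sphere_speed(s_l, markstein_length, radius):
    """Flame speed of a spherically expanding flame with no imposed strain,
    assuming kappa = 2 / R."""
    kappa = 2.0 / radius
    return stretched_flame_speed(s_l, markstein_length, stretch_rate(kappa, s_l))
```

With `s_l = 0.4` m/s, a Markstein length of `1e-4` m, and `radius = 0.01` m, the stretch rate is 80 1/s and the flame speed drops slightly below $S_L$. As the radius grows the correction vanishes, and a negative Markstein length reverses its sign.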

4 Turbulent Flames 

Analogous to the definition of the laminar flame speed,
the turbulent flame speed $S_T$ can be defined as the
mean propagation speed of a premixed flame within
an isotropic homogeneous turbulent field of zero mean
velocity. If the incident fluid velocity is decomposed
into a mean (denoted by an overline) and a fluctuating
component, and the flame is held statistically stationary
by adjusting the mean longitudinal flow velocity $\bar{u}_1$,
then the turbulent flame speed is $S_T = \bar{u}_1$, as illustrated
in figure 3. The mean mass flow rate through the entire
flame shown in the figure is given by $m = \rho_u A S_T$. Since
all the reactants pass through the wrinkled flame of
area $A_f$, it can also be calculated from the total contributions
of mass flowing through the differential segments
comprising the wrinkled flame, assuming each
segment propagates normal to itself at a speed $S_L$. We
then have $m = \rho_u A_f S_L$, implying that $S_T/S_L = A_f/A$;
i.e., the increase in the speed of the turbulent flame
is due to the increase in the surface area of the flame
front. This relation was first noted by Damköhler, who
resorted to geometrical arguments with analogy to a
Bunsen flame to further deduce an explicit dependence
on the turbulence intensity $v'_c$, defined as the root mean
square of the velocity fluctuations. His result and those
of numerous other phenomenological studies conform
to expressions of the form

$$S_T/S_L = 1 + C (v'_c/S_L)^n, \tag{9}$$

with various constants $C$ and adjustable exponents
$n$. Although the experimental record exhibits a wide
scatter due to the variable accuracy of the methods
and the varied operating conditions, the data appear
to confirm that $S_T$ increases with increasing intensity
of turbulence for low to moderate values of $v'_c$. At
higher turbulence intensities, however, $S_T$ increases
only slightly and then levels off, an observation known
as the bending effect.
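Damköhler's area argument can be made concrete with a toy wrinkled front. The sketch below assumes a two-dimensional front $y = a \sin(2\pi x/\lambda)$, so that the "area" ratio $A_f/A$ is the arc length per unit width, and evaluates it with a midpoint rule; it also evaluates the correlation (9) for given $C$ and $n$. The front shape, the function names, and the parameter values are illustrative assumptions.

```python
import math


def wrinkled_area_ratio(amplitude, wavelength, n_points=2000):
    """A_f / A for the front y = amplitude * sin(2*pi*x / wavelength):
    the average of sqrt(1 + y'(x)^2) over one wavelength (midpoint rule)."""
    k = 2.0 * math.pi / wavelength
    total = 0.0
    for i in range(n_points):
        x = (i + 0.5) * wavelength / n_points
        slope = amplitude * k * math.cos(k * x)
        total += math.sqrt(1.0 + slope * slope)
    return total / n_points


def speed_ratio_correlation(intensity_ratio, c, n):
    """Equation (9): S_T / S_L = 1 + C * (v'_c / S_L)^n."""
    return 1.0 + c * intensity_ratio ** n
```

For small wrinkles the ratio behaves like $1 + (ak)^2/4$ with $k = 2\pi/\lambda$, so if the wrinkle amplitude scales with the turbulence intensity, the area argument alone already suggests a quadratic dependence at low intensity.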

Figure 3 A schematic of a statistically
stationary turbulent premixed flame.

The hydrodynamic model provides a more rigorous
approach to the determination of the turbulent
flame speed. The thin-flame assumption implies that
the internal structure of the flame is not disturbed
by the turbulence and retains its laminar structure.
The results of this model are, strictly speaking, applicable
to only the wrinkled and corrugated flamelet
regimes of turbulent combustion, which encompass
many combustion applications and most laboratory
experiments. The model excludes, in particular, the
distributed-reaction regime, where the basic structure of a flame
no longer exists and the notion of turbulent flame
speed becomes ambiguous. Since in the hydrodynamic
description every segment of the wrinkled flame propagates
at a speed $S_f$ that depends on the local mixture
composition and flow conditions, as given by (7), the
mean mass flowing through the entire flame is now
$m = \rho_u \overline{S_f A_f}$. This yields

$$S_T = \overline{S_f |\nabla\psi|}, \tag{10}$$

which must be determined through calculations similar
to the one used to generate figure 2 after replacing
the quiescent setting with a turbulent flow field.
A pregenerated homogeneous isotropic turbulent flow,
characterized by intensity $v'_c$ and integral scale $\ell$, is
thus fed as an inflow at the bottom of the integration
domain, as shown in figure 4(a). The flame is retained
at a prescribed location, on average, by controlling the
mean inflow velocity. Figure 4 illustrates such calculations.
The turbulent nature of the flow is elucidated
by the clockwise/counterclockwise (dashed/solid) vorticity
contours, and the flame front is represented by
the solid dark curve. One notes the vorticity generated
downstream near the “cusp” of the flame front by baroclinic
torque and a significant decrease in the vorticity
elsewhere; this is the result of volumetric expansion.

Figure 4 Flame propagation in turbulent flow; the instantaneous
portrait shows a pocket of unburned gas that
pinched off from a folded segment of the flame surface.

If the local flame speed is assumed to be constant and
equal to the laminar flame speed, (10) yields $S_T/S_L =
\overline{|\nabla\psi|}$, implying that the increase in the speed of the
turbulent flame is due to the increase in the flame surface
area, as in Damköhler’s proposition. Equation (7)
in conjunction with (10) demonstrates that in turbulent
propagation an important role is played by the
mean stretching of the flame $\overline{K}$, which influences the
mean local flame speed $\overline{S_f}$ and, consequently, $S_T$. Scaling
laws accounting for stretch effects were recently
obtained using numerical simulations within the context
of the hydrodynamic theory, for “two-dimensional”
turbulence and for $\mathcal{L} > 0$. Although two-dimensional
flow lacks some features of real turbulence, the results
extend the current understanding of turbulent flame
propagation and yield expressions for $S_T$ that are
free of any turbulence modeling assumptions and/or
ad hoc adjustment parameters. The condition $\mathcal{L} > 0$
excludes the development of the thermodiffusive instabilities
that are observed in rich hydrocarbon-air or
lean hydrogen-air mixtures; these instabilities may further
contaminate the flame surface with small structures,
increasing its overall surface area and its propagation
speed. Three regimes have been identified depending
on the mixture composition, the thermal expansion
coefficient, and the turbulence intensity:

(i) a regime in which, on average, the flame brush 
remains planar (i.e., has zero mean curvature) and 
is unaffected by the DL instability; 

(ii) a regime in which the DL effects, which are respon- 
sible for frequent intrusions of the flame front into 
the burned gas region, have a marked influence on 
the flame front, which remains partially resilient to 
turbulence; and 

(iii) a highly turbulent regime in which the influences of 
the DL instability play a limited role and the flame 
propagation is totally controlled by the turbulence. 


The third regime is affected by frequent folding of the 
flame front and formation of pockets of unburned gas 
that detach from the main flame surface and are rapidly 
consumed, causing a significant reduction in the turbu- 
lent flame speed. Expressions for the turbulent flame 
speed of the form 


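$$\frac{S_T}{S_L} = \left[ 1 - \frac{\mathcal{L}\overline{K}}{S_L} \right] \left[ a + \frac{b(\ell, \mathcal{L})\, c(\sigma)}{(\mathcal{L}/L)^m} \left( \frac{v'_c}{S_L} \right)^n \right]$$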

have been obtained, with coefficients that depend on
the functional parameters in the various operating
regimes. The constant $a$ depends on the nature of the
stable laminar flame when $v'_c = 0$; it is equal to 1 when
the stable flame is planar and equal to $U_L/S_L$ when the
stable flame is the cusp-like conformation (shown in figure 2)
resulting from the DL instability. The coefficient $c$
increases with increasing $\sigma$ and plateaus as the thermal
expansion reaches sufficiently high values. The coefficient
$b$ is of relatively small magnitude, reaching a maximum
at an intermediate scale that disturbs the flame
most effectively. Variations in the mixture composition
and ambient conditions, exhibited through the Markstein
length $\mathcal{L}$, appear as a power law with exponent $m$
less than 1. Of greatest significance is the dependence
of $S_T$ on turbulence intensity; at low $v'_c$ the dependence
of $S_T$ on turbulence intensity is quadratic ($n = 2$),
in accordance with Damköhler’s heuristic result, but
at higher turbulence levels the dependence is sublinear
($n < 1$), which explains the bending effect that is
observed experimentally.
VII.16 Imaging the Earth Using Green’s Theorem

Roel Snieder 

1 Introduction 

The Earth is a big place: its radius is about 6400 km. 
In comparison, the deepest boreholes so far drilled are 
about 10 km deep. We therefore have little opportu- 
nity to take direct measurements or samples inside 
the Earth: it is mostly inaccessible. And even for the 
upper 10 km that we can sample, the cost of drilling 
deep boreholes is very high. This means that inferences 
about the Earth’s interior are largely based on physical 
and chemical measurements taken at the Earth’s sur- 
face, or even from space. Investigating the inside of the 
Earth therefore resembles that classical black-box prob- 
lem: determine the contents of a closed box when you 
can do anything except open the box. 

If one could measure physical fields, such as the gravitational
field or the elastic wave field, inside the Earth,
one could infer the local properties of the Earth by
inserting the measured field into the equation that governs
that field and then extracting the physical parameters,
such as the mass density, from the field equation.
However, the fields are measured at, or sometimes even
above, the Earth’s surface. One therefore needs a recipe
for propagating the measured field from its surface of
observation into the Earth’s interior. This is a problem
where mathematics comes to the rescue in the form of
Green’s theorem. This relates measurements taken at
a surface bounding a volume to the fields inside that
volume: a principle called downward continuation.

In what follows we apply Green’s theorem to a large 
class of physical systems and show that the theorem 
only relates measurements at a surface to measure- 
ments in the interior when the equations are the same 
regardless of whether one moves toward the future or 
toward the past. Such equations are said to be invari- 
ant for time reversal. We focus in particular on seismic 
imaging because this is the technique that provides the 
highest spatial resolution. 


2 Green’s Theorem for General Systems 


Consider physical systems that satisfy the following
partial differential equation for a field u(r,t) that is
excited by sources q(r,t):

$$\sum_{n=0}^{N} a_n(r)\,\frac{\partial^n u(r,t)}{\partial t^n} = \nabla\cdot\bigl(B(r)\,\nabla u(r,t)\bigr) + q(r,t). \tag{1}$$


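This equation captures many specific equations. An
example is the wave equation:

$$\frac{1}{\kappa(r)}\frac{\partial^2 u}{\partial t^2} + \gamma(r)\frac{\partial u}{\partial t} = \nabla\cdot\Bigl(\frac{1}{\rho(r)}\nabla u\Bigr) + q(r,t), \tag{2}$$

where $\kappa$ is the bulk modulus, $\gamma$ is a damping parameter,
and $\rho$ is the density. Another example of equation (1)
is the diffusion equation:

$$\frac{\partial u(r,t)}{\partial t} = \nabla\cdot\bigl(D(r)\,\nabla u(r,t)\bigr) + q(r,t), \tag{3}$$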


with D(r) the diffusion constant. This equation is used
to describe flow in porous media such as aquifers
and hydrocarbon reservoirs. It also accounts for heat
conduction and for diffusive spreading of pollutants.
A variant of (3) is the Schrödinger equation, which
accounts for the dynamics of microscopic particles:

$$i\hbar\,\frac{\partial\psi(r,t)}{\partial t} - V(r)\,\psi(r,t) = -\frac{\hbar^2}{2m}\nabla^2\psi(r,t). \tag{4}$$

Here $\hbar$ is Planck’s constant divided by $2\pi$, m is the
mass of the particle, and V(r) is the real potential in
which the particle moves. Yet another example is the
gravitational potential, which plays an important role
in geophysics because it helps constrain the mass density
inside the Earth. The gravitational field satisfies
Poisson’s equation:

$$0 = \nabla^2 u(r) - 4\pi G\rho(r), \tag{5}$$


where G is the gravitational constant. This equation 
does not depend on time. 

Note that the applications (2)-(5) are special forms
of the general equation (1). In these applications B(r)
is real, hence we use $B = B^*$ in the following, with the
asterisk denoting complex conjugation. We also use the
Fourier convention $f(t) = \int f(\omega)\,e^{-i\omega t}\,d\omega$. With this,
the general equation (1) reduces to

$$\sum_{n=0}^{N} (-i\omega)^n a_n(r)\,u(r,\omega) = \nabla\cdot\bigl(B(r)\,\nabla u(r,\omega)\bigr) + q(r,\omega). \tag{6}$$

Each time derivative is replaced by a multiplication
by $-i\omega$. The treatment that follows is valid in the
frequency domain. For brevity we omit the frequency
dependence of variables.
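
The replacement of each time derivative by a factor $-i\omega$ can be checked numerically. The sketch below builds a field from a few harmonic components $e^{-i\omega t}$, following the sign convention above, and compares the spectral derivative with a central finite difference. The amplitudes, frequencies, and function names are illustrative assumptions, not part of the article.

```python
import numpy as np

def synthesize(amplitudes, omegas, t):
    """Sum of harmonics A_k * exp(-i * omega_k * t), using the sign convention above."""
    t = np.asarray(t, dtype=float)
    return sum(a * np.exp(-1j * w * t) for a, w in zip(amplitudes, omegas))

def spectral_time_derivative(amplitudes, omegas, t):
    """Differentiate in time by multiplying each component by -i * omega_k."""
    scaled = [(-1j * w) * a for a, w in zip(amplitudes, omegas)]
    return synthesize(scaled, omegas, t)

# Hypothetical test field with three frequency components.
amplitudes = [1.0, 0.5 - 0.2j, 0.1j]
omegas = [1.0, 2.5, 5.0]
t = np.linspace(0.0, 10.0, 201)

h = 1e-4  # step of the central finite difference used as a reference
finite_diff = (synthesize(amplitudes, omegas, t + h)
               - synthesize(amplitudes, omegas, t - h)) / (2 * h)
spectral = spectral_time_derivative(amplitudes, omegas, t)
max_error = np.max(np.abs(finite_diff - spectral))
```

The two derivatives agree up to the truncation error of the finite difference, which is the content of the step from (1) to (6).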
Let us consider two different field states, $u_P(r)$ and
$u_Q(r)$, excited by sources $q_P(r)$ and $q_Q(r)$, respectively.
Take equation (6) for state P, multiply by $u_Q^*$,
and integrate over volume. Next take the complex conjugate
of (6) for state Q, multiply by $u_P$, and integrate
over volume. Subtracting these two volume integrals
and applying Green’s theorem to the terms containing
B(r) gives

$$\sum_{n=0}^{N} (-i\omega)^n \int A_n\,u_P u_Q^*\,dV = \oint B\Bigl(\frac{\partial u_P}{\partial n}\,u_Q^* - u_P\,\frac{\partial u_Q^*}{\partial n}\Bigr)dS + \int \bigl(q_P u_Q^* - u_P q_Q^*\bigr)\,dV, \tag{7}$$

where $\partial/\partial n$ denotes the outward normal derivative to
the surface S that bounds the volume over which we
integrate, and

$$A_n(r) = a_n(r) - (-1)^n a_n^*(r). \tag{8}$$


The Green function $G(r,r_0)$, defined as the solution
of (6) to a point excitation, $q(r) = \delta(r - r_0)$, plays a
key role in what follows. An important property of G is
reciprocity: G(r,r') = G(r',r). Under suitable boundary
conditions, this property is valid for all applications
that follow.

Consider the case where $u_P$ is source free ($q_P = 0$)
and $u_Q$ is excited by a point source, $q_Q(r) = \delta(r - r_Q)$.
Then $u_Q(r) = G(r,r_Q) = G(r_Q,r)$. Using this in (7),
denoting $u_P$ by u, and replacing r with r' and $r_Q$ with
r gives

$$u(r) = -\sum_{n=0}^{N} (-i\omega)^n \int A_n(r')\,G^*(r,r')\,u(r')\,dV' + \oint B(r')\Bigl(G^*(r,r')\,\frac{\partial u}{\partial n'} - \frac{\partial G^*(r,r')}{\partial n'}\,u\Bigr)dS'. \tag{9}$$


3 Moving the Field into the Interior 

Equation (9) is a powerful tool for propagating mea- 
surements taken at the boundary of a system into the 
interior of that system. This is of particular importance 
in earth science. First we illustrate this principle for 
the acoustic wave equation (2), which is a prototype of 
the equations that govern seismic imaging. In the notation
of (1), the wave equation (2) has N = 2, $a_2 = 1/\kappa$,
$a_1 = \gamma$, $a_0 = 0$, and $B = 1/\rho$. The coefficients $a_n$ enter
(9) in a volume integral through the term $A_n$ defined in
equation (8), which equals

$$A_n = \begin{cases} 2i\,\mathrm{Im}(a_n) & \text{for } n \text{ even},\\ 2\,\mathrm{Re}(a_n) & \text{for } n \text{ odd}, \end{cases} \tag{10}$$

where Re and Im denote the real and imaginary parts,
respectively. According to (10), $a_2$ does not contribute
because $\kappa$ is real. Consider first the case when there is
no attenuation, so that $a_1 = \gamma = 0$ and (9) reduces to
the representation theorem:

$$u(r) = \oint \frac{1}{\rho(r')}\Bigl(G^*(r,r')\,\frac{\partial u}{\partial n'} - \frac{\partial G^*(r,r')}{\partial n'}\,u\Bigr)dS'. \tag{11}$$


This expression relates measurements at the surface in 
the integral on the right-hand side to the wave field in 
the interior on the left-hand side. 

What happens if there is attenuation? In that case,
$a_1 = \gamma > 0$, and according to (9) and (10), equation (11)
must be extended with a volume term
$2i\omega\int\gamma(r')\,G^*(r,r')\,u(r')\,dV'$ on the right-hand side. This
term contains the wave field in the interior that we seek
to determine, so that this field does not follow from
measurements at the surface only. In principle, attenuation
makes seismic imaging impossible, but in practice,
attenuation in the Earth is weak and the offending
volume integral can therefore be ignored.

Can diffuse fields be imaged? For the diffusion equation
(3) the only nonzero terms in (1) are $a_1 = 1$ and
B = D. Inserting these into (9) gives

$$u(r) = 2i\omega\int G^*(r,r')\,u(r')\,dV' + \oint D(r')\Bigl(G^*(r,r')\,\frac{\partial u}{\partial n'} - \frac{\partial G^*(r,r')}{\partial n'}\,u\Bigr)dS'.$$

Just as for attenuating acoustic waves, the right-hand 
side contains the unknown field in the interior. This 
means that measurements of diffusive fields taken at 
the surface cannot be used for imaging using Green’s 
theorem. 

The Schrödinger equation (4) is first order in time,
and for this reason one might think that, as for the
diffusion equation, one cannot infer the field values within
a volume from measurements taken at the boundary.
For this equation, N = 1, $a_1 = i\hbar$, $a_0 = -V$, and
$B = -\hbar^2/(2m)$. According to (9) and (10), and assuming
that the potential V is real, the volume integral
depends on $\mathrm{Im}(a_0) = \mathrm{Im}(-V) = 0$ for n = 0 and
on $\mathrm{Re}(a_1) = \mathrm{Re}(i\hbar) = 0$ for n = 1. The volume integral
therefore vanishes and field values in the interior
can be determined from field values measured at the
boundary.
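
The bookkeeping in these examples can be summarized in a few lines of code. The sketch below evaluates $A_n = a_n - (-1)^n a_n^*$ from equation (8) for each system discussed above and reports whether every volume term vanishes. The parameter values are dimensionless placeholders and the function names are assumptions made for illustration.

```python
def volume_coefficients(a):
    """Return A_n = a_n - (-1)**n * conj(a_n) for coefficients a = [a_0, ..., a_N]."""
    return [complex(a_n) - (-1) ** n * complex(a_n).conjugate()
            for n, a_n in enumerate(a)]

def boundary_data_suffice(a, tol=1e-12):
    """True when every A_n vanishes, so that no interior volume integral remains in (9)."""
    return all(abs(A_n) < tol for A_n in volume_coefficients(a))

# Illustrative, dimensionless parameter values (assumptions for this sketch).
kappa, gamma, hbar, V = 2.0, 0.3, 1.0, 0.7

systems = {
    "lossless wave":   [0.0, 0.0, 1.0 / kappa],    # a_0, a_1, a_2
    "attenuated wave": [0.0, gamma, 1.0 / kappa],
    "diffusion":       [0.0, 1.0],                 # a_0, a_1
    "Schrodinger":     [-V, 1j * hbar],
    "Poisson":         [0.0],
}
results = {name: boundary_data_suffice(a) for name, a in systems.items()}
```

Only the attenuated wave equation and the diffusion equation leave a nonzero $A_1$, in agreement with the discussion above. The check concerns the volume term only; it says nothing about whether the region is source free.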

For the gravitational potential, field equation (5), all
$a_n = 0$ and B = 1, so that in a source-free region the
field satisfies

$$u(r) = \oint\Bigl(G(r,r')\,\frac{\partial u}{\partial n'} - \frac{\partial G(r,r')}{\partial n'}\,u(r')\Bigr)dS'. \tag{12}$$
The potential field does not depend on time, and as a
result both the field u and the Green function G are
real functions; there are therefore no complex conjugates
in (12). This expression makes it possible to infer
the gravitational field above the Earth when the field
is known at the Earth’s surface. Expression (12) can
be used for upward continuation, where one infers the
gravitational field at higher elevations from measurements
taken at the Earth’s surface. This can be used, for
example, to compute the trajectories of satellites. Similarly,
one can use this expression for downward continuation,
where one computes the gravitational field at
lower elevations from measurements taken higher up.
An application of downward continuation is to infer the
gravitational field at the Earth’s surface from measurements
taken from satellites or aircraft. It is, however,
not possible to use (12) to compute the gravitational
field inside the Earth. In the interior, the mass density
$\rho(r)$ is nonzero, and according to (5), the source q(r)
is nonzero. This violates the assumption $q_P = 0$ used in
the derivation of (9). For this reason, Green’s theorem
cannot be used to infer the mass density in the Earth
from measurements taken at the surface.
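
For data on a flat measurement level, upward continuation is often carried out in the wavenumber domain rather than by evaluating the surface integral (12) directly: each Fourier component of a source-free potential field decays as $e^{-|k|h}$ over a height h. The sketch below uses this standard filter, which is an assumption of the example and is not derived in the text, on a two-dimensional test field. The grid, the test function, and the function names are illustrative.

```python
import numpy as np

def continue_profile(u, dx, h):
    """Shift a sampled potential-field profile by height h (h > 0: upward).

    Each wavenumber component is scaled by exp(-|k| h). A negative h gives
    downward continuation, which amplifies short wavelengths and noise.
    """
    k = 2.0 * np.pi * np.fft.fftfreq(u.size, d=dx)
    return np.fft.ifft(np.fft.fft(u) * np.exp(-np.abs(k) * h)).real

def harmonic_profile(x, z):
    """Harmonic test field z / (x**2 + z**2), with z the height above a line source."""
    return z / (x**2 + z**2)

dx = 0.1
x = np.arange(-200.0, 200.0, dx)
measured = harmonic_profile(x, 1.0)           # "data" at height 1
upward = continue_profile(measured, dx, 1.0)  # predicted field at height 2
exact = harmonic_profile(x, 2.0)
max_error = np.max(np.abs(upward - exact))
```

The small residual comes from truncating the profile to a finite, periodically repeated window. Continuing downward through the level of the sources is not meaningful, for the reason given above.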

In general, the property that the field in the interior
follows from field measurements taken at the boundary
is valid for systems that are invariant for time reversal.
These are systems that obey equations that are
invariant when time is reversed and t is replaced by
-t. This is true for the wave equation in the absence
of attenuation, but attenuation breaks the symmetry
between past and future. The diffusion equation is not
invariant under time reversal; heat diffuses away when
moving forward in time. Like the diffusion equation,
Schrödinger’s equation is first order in time, and one
might think it is not invariant for time reversal. One can
show, however, that, when $\psi(r,t)$ is a solution, then so
is $\psi^*(r,-t)$. According to the principles of quantum
mechanics, one cannot make a distinction between the
wave function and its complex conjugate, and therefore
the equation is effectively invariant for time reversal
and, as we have seen, measurements at the surface
suffice to determine the field in the interior.

4 Seismic Imaging 

In this section we discuss the application of the
representation theorem (11) to seismic imaging. A typical
marine seismic experiment is shown in figure 1. In the
figure a ship tows a streamer (shown by the dashed
line), which is a long tube with hydrophones (pressure
sensors) and/or geophones (motion sensors) that act
as recording devices. An air gun (a device delivering an
impulsive bubble of air) acts as a seismic source just
behind the ship. The waves reflected by layers in the
Earth are recorded by sensors in the streamer.

Figure 1 The geometry of a marine seismic survey.

The water surface is a free surface, hence the pressure
p vanishes there: p(z = 0) = 0. However, the particle
motion does not vanish. According to Newton’s
law, the acceleration a is related to the pressure by
$\rho a = -\nabla p$. The vertical component of this expression
is given by

$$\rho a_z = -\frac{\partial p}{\partial z}. \tag{13}$$

We use this relation in the representation theorem (11)
for the pressure p. For the boundary we take the combination
of the sea surface $S_0$ and a hemisphere $S_\infty$ with
radius R (figure 1). In the presence of a tiny amount of
attenuation, the pressure p and Green function G decay
as $e^{-\alpha R}$, with $\alpha$ an attenuation coefficient, and the
contribution of $S_\infty$ vanishes as $R \to \infty$. The closed surface
integral thus reduces to the contribution of the free
surface $S_0$. Since that surface is horizontal, the normal
derivative is just the derivative in the -z-direction. (We
have chosen a coordinate system with positive z pointing
down.) As the pressure vanishes at the free surface,
expression (11) reduces to

$$p(r) = -\int_{S_0} \rho^{-1}(r')\,G^*(r,r')\,\frac{\partial p(r')}{\partial z'}\,dS'.$$

Eliminating $\partial p/\partial z'$ using (13) gives

$$p(r,\omega) = \int_{S_0} G^*(r,r',\omega)\,a_z(r',\omega)\,dS', \tag{14}$$

having restored the frequency dependence. This formula
relates the pressure in the subsurface to the
motion recorded at the sea surface.
The reader may have wondered why the complex
conjugate of G was used, since most of the expressions
also hold when the complex conjugation
is not applied. The time-domain Green function is
related to the frequency-domain Green function by
$G(r,r',t) = \int G(r,r',\omega)\,e^{-i\omega t}\,d\omega$, and hence the
time-reversed Green function satisfies
$G(r,r',-t) = \int G^*(r,r',\omega)\,e^{-i\omega t}\,d\omega$. This means that $G^*(r,r',\omega)$
corresponds in the time domain to the time-reversed
Green function G(r,r',-t). As a consequence, equation
(14) corresponds in the time domain to

$$p(r,t) = \int_{S_0} G(r,r',-t) * a_z(r',t)\,dS', \tag{15}$$

where the star ( * ) denotes convolution. The Green function
G(r,r',t) is causal, meaning that it is only nonzero
after the point source acts at t = 0. It then moves the
waves forward in time away from the point of excitation.
Consequently, the time-reversed Green function
G(r,r',-t) is nonzero only for t < 0, and it propagates
the wave backward in time. In (15), the time-reversed
Green function G(r,r',-t) is convolved with
the recorded acceleration. This means that it takes the
waves that are recorded at the streamer and propagates
them backward in time. This is a desirable property:
in order to find the reflectors in the Earth, one needs
to know the wave field at the moment when it was
reflected off the reflectors. The recorded waves thus
need to be propagated back in time so that we know
them at earlier times when they were reflecting inside
the Earth. This is the reason why the time-reversed
Green function is used, and ultimately this is the reason
why the theory presented here used the complex conjugate
$G^*(r,r',\omega)$ instead of $G(r,r',\omega)$. If we had used
G(r,r',t) instead of G(r,r',-t), equation (15) would
have given the pressure field inside the Earth after it
has been recorded at the receivers. This field does not
give information about the interaction of waves with
reflectors before the waves propagated to the surface
where they are recorded. For this reason the theory in
section 3 is based on G* rather than G.
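
The correspondence between complex conjugation in the frequency domain and time reversal can be verified on sampled data. The sketch below uses NumPy's discrete Fourier transform, whose sign convention differs from the one in the text; for a real signal the property holds under either convention. The pulse is an arbitrary illustrative choice, and in the periodic discrete setting negative times wrap around to the end of the array.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
g = np.zeros(n)
g[5:20] = rng.standard_normal(15)   # a real, "causal" pulse that starts after sample 0

# Conjugating the spectrum of a real signal and transforming back
# returns the signal with time reversed: sample m moves to sample -m (mod n).
reversed_via_spectrum = np.fft.ifft(np.conj(np.fft.fft(g))).real
reversed_directly = np.roll(g[::-1], 1)
```

The two arrays agree to rounding error, which is the discrete counterpart of the statement that $G^*(r,r',\omega)$ corresponds to G(r,r',-t).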

5 A Chicken and Egg Problem 

As shown here, Green’s theorem makes it possible to 
infer the value of a physical field in the interior of 
the Earth from measurements taken at, or above, the 
Earth’s surface. There is, however, a catch. In order 
to downward continue fields measured at the Earth’s 
surface, one must know the Green function: see, for 
example, (14). For the wave equation (2), the space and
time derivatives of the field are multiplied by the
reciprocals of the mass density and bulk modulus of the
Earth, respectively. The
Green function needed for downward continuation of 
seismic waves thus depends on the properties of the 
Earth, but it is these properties that one seeks to deter- 
mine. We therefore need the properties of the Earth to 
determine the properties of the Earth! 

Fortunately, there is a way out of this conundrum. It 
turns out that for seismic imaging it suffices to have 
an estimate of the Green function that positions the 
wavefronts at more or less the correct location. Such an 
estimated Green function is computed from a smooth 
velocity model. The velocity used is obtained from a 
procedure called velocity estimation, where one deter- 
mines a smooth velocity model from measured arrival 
times from reflected seismic waves. The success of the 
seismic method in the hydrocarbon industry shows 
that this procedure works in practice. 
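
The tolerance to an approximate Green function can be illustrated with a much-simplified stand-in for the back-propagation in (15): a delay-and-sum stack in which each recorded trace is shifted back by a predicted travel time. The sketch assumes straight rays in a constant-velocity medium, ignores geometrical spreading, and takes the origin time of the wave as known; the survey geometry, pulse shape, and function names are all hypothetical.

```python
import numpy as np

def record_traces(receivers_x, source_z, velocity, t, width):
    """Synthetic surface recordings: one Gaussian pulse per receiver, delayed by
    the straight-ray travel time from a point at depth source_z below x = 0."""
    travel_time = np.hypot(receivers_x, source_z) / velocity
    return np.exp(-0.5 * ((t[None, :] - travel_time[:, None]) / width) ** 2)

def focus_depth(traces, receivers_x, t, depths, velocity_model):
    """Shift every trace back by the travel time predicted by velocity_model and
    stack; return the trial depth (below x = 0) with the largest stacked amplitude."""
    stack = []
    for z in depths:
        predicted = np.hypot(receivers_x, z) / velocity_model
        samples = [np.interp(tau, t, trace) for tau, trace in zip(predicted, traces)]
        stack.append(np.sum(samples))
    return depths[int(np.argmax(stack))]

# Hypothetical survey: all numbers are assumptions chosen for illustration.
t = np.arange(0.0, 3.0, 0.001)                  # time samples (s)
receivers_x = np.arange(-2000.0, 2001.0, 50.0)  # receiver positions (m)
depths = np.arange(500.0, 1501.0, 25.0)         # trial depths (m)
true_velocity, true_depth = 2000.0, 1000.0      # m/s, m

traces = record_traces(receivers_x, true_depth, true_velocity, t, width=0.01)
depth_exact = focus_depth(traces, receivers_x, t, depths, true_velocity)
depth_approx = focus_depth(traces, receivers_x, t, depths, 1.05 * true_velocity)
```

With the correct velocity the stack peaks at the true depth. With a velocity that is 5% too high the energy still collects into a single peak, but at a somewhat greater depth, which is the sense in which an estimated velocity model places the wavefronts at more or less the correct location.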
VII.17 Radar Imaging

Margaret Cheney and Brett Borden 


1 Background 

“Radar” is an acronym for radio detection and ranging. 
Radar was originally developed as a technique for 
detecting objects and determining their positions by 
means of echolocation, and this remains the principal 
function of modern radar systems. Radar can provide 
very accurate distance (range) measurements, and can 
also measure the rate at which this range is changing.
However, radar systems have evolved over more than 
seven decades to perform an additional variety of very 
complex functions; one such function is imaging. 

Radar imaging has much in common with optical 
imaging: both processes involve the use of electro- 
magnetic waves to form images. The main difference 
between the two is that the wavelengths of radar are 
much longer than those of optics. Because the resolving 
ability of an imaging system depends on the ratio of the 
wavelength to the size of the aperture, radar imaging 
systems require an aperture many thousands of times 
larger than optical systems in order to achieve com- 
parable resolution. Since kilometer-sized antennas are 
not practicable, fine-resolution radar imaging has come 
to rely on so-called synthetic apertures, in which a small 
antenna is used to sequentially sample a much larger 
measurement region. 

There are many advantages to using radar for remote 
sensing. Unlike many optical systems, radar systems 
can be used day or night. Because the long radar wave- 
lengths pass through clouds, smoke, etc., radar systems 
can be used in all weather conditions. Moreover, some 
radar systems can penetrate foliage, buildings, dry soil, 
and other materials. 

Radar waves scatter mainly from objects and features 
whose size is on the same order as the wavelength. 
This means that radar is sensitive to objects whose 
length scales range from centimeters to meters, and 
many objects of interest are in this range. 

Radar has many applications, both military and civil- 
ian. Radar systems are widely used in aviation and 
transportation, for navigation, for collision avoidance, 
and for low-altitude flight. Most of us are familiar with 
police radar for monitoring vehicle speed. Radar is 
also used to monitor weather, including Doppler mea- 
surements of precipitation and wind velocity. Imaging 
radar is used for land-use monitoring, for agricultural 
monitoring, and for environmental monitoring. Radar- 
based techniques are used to map the Earth's surface 
topography and dynamic evolution. Medical microwave 
tomography is currently under development. 

2 Mathematical Modeling 

Mathematical modeling is based on Maxwell’s equations
[III.22], or, more commonly in the case of propagation
through dry air, the scalar approximation:

$$\Bigl(\nabla^2 - \frac{1}{c^2}\frac{\partial^2}{\partial t^2}\Bigr)\mathcal{E}(t,x) = s(t,x), \tag{1}$$

where $\mathcal{E}(t,x)$ denotes the (electric) field transmitted
and measured by the radar, x and t are position and
time variables, and c denotes the speed of light in
vacuum. In constant-wave-velocity radar problems, the
source s is a sum of two terms, $s = s^{\mathrm{in}} + s^{\mathrm{sc}}$, where
$s^{\mathrm{in}}$ models the source due to the transmitting antenna,
and $s^{\mathrm{sc}}$ models the effects of target scattering. The solution
$\mathcal{E}$ to equation (1), which is written as $\mathcal{E}^{\mathrm{tot}}$, therefore
splits into two parts: $\mathcal{E}^{\mathrm{tot}} = \mathcal{E}^{\mathrm{in}} + \mathcal{E}^{\mathrm{sc}}$. The first
term, $\mathcal{E}^{\mathrm{in}}$, satisfies the wave equation for the known,
prescribed source $s^{\mathrm{in}}$, usually corresponding to the current
density on an antenna. This part we call the incident
field; it is the field in the absence of scatterers. The
second part of $\mathcal{E}^{\mathrm{tot}}$ is due to the presence of scattering
targets, and this part is called the scattered field.

One approach to finding the scattered field is to 
simply solve (1) directly using, for example, numerical 
time-domain techniques. For many purposes, however, 
it is convenient to reformulate the scattering problem 
in terms of an integral equation. 

In scattering problems the source term $s^{\mathrm{sc}}$ represents
the target’s response to an incident field. This part of
the source function will generally depend on the geometric
and material properties of the target and on the
form and strength of the incident field. Consequently,
$s^{\mathrm{sc}}$ can be quite complicated to describe analytically.

Usually, one makes the Born or single-scattering
approximation, namely

$$s^{\mathrm{sc}}(t,x) = \int V(x)\,\mathcal{E}^{\mathrm{in}}(t',x)\,dt', \tag{2}$$

where V(x) is called the reflectivity function and
depends on target orientation. This results in a linear
formula for $\mathcal{E}^{\mathrm{sc}}$ in terms of V:

$$\mathcal{E}^{\mathrm{sc}}(t,x) \approx \mathcal{E}_B(t,x) = \iint g(t-\tau,\,x-z)\,V(z)\,\mathcal{E}^{\mathrm{in}}(\tau,z)\,d\tau\,dz, \tag{3}$$

where g is the outgoing fundamental solution or (outgoing)
Green function:

$$g(t,x) = \frac{\delta(t - |x|/c)}{4\pi|x|} = \int \frac{e^{-i\omega(t-|x|/c)}}{8\pi^2|x|}\,d\omega. \tag{4}$$

Here, $|x| = \sqrt{x\cdot x}$.

The Born approximation is very useful because it 
makes the scattering problem linear. It is not, however, 
always a good approximation. 

The incident field in (3) is typically of the form
of an antenna beam pattern multiplied by the convolution
g * p, where g denotes the Green function
(4) and p denotes the waveform fed to the antenna.
Consequently, we obtain the following model for the
scattered field:

$$\mathcal{E}^{\mathrm{sc}}_B(t,x^0) = \int \frac{p(t - 2|x^0 - z|/c)}{(4\pi|x^0 - z|)^2}\,V(z)\,dz, \tag{5}$$

where we have neglected the antenna beam pattern.
Under the Born approximation, the scattered field can
be viewed as a superposition of scattered fields from
targets that are point-like (i.e., $V(z') \propto \delta(z - z')$) in
the sense that they scatter isotropically. No shadowing,
obscuration, or multiple scattering effects are included.

Equation (5) is an expression for the field, which is
proportional to the voltage measured on a receiving
antenna. Note that the received power, which is proportional
to the square of the voltage, is proportional
to $1/R^4$, where $R = |x^0 - z|$ is the distance between the
antenna at position $x^0$ and the scatterer at position z.
This $1/R^4$ dependence is the reason that radar signals
are typically extremely weak and are often swamped
by thermal noise in the receiving equipment. Radar
receivers typically correlate the incoming signal with
the transmitted pulse, a process called pulse compression
or matched filtering, in order to separate the signal
from the noise.
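
A minimal sketch of pulse compression follows. It assumes a linear chirp as the transmitted waveform, white Gaussian noise, and a single echo; these choices, the sample counts, and the variable names are illustrative and are not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(1)

# Transmitted waveform: a linear chirp (an illustrative choice of pulse).
n_pulse = 1000
k = np.arange(n_pulse)
pulse = np.cos(2.0 * np.pi * (0.05 * k + 0.1 * k**2 / n_pulse))

# Received signal: a weak, delayed copy of the pulse buried in noise.
n_total, true_delay, echo_amplitude = 5000, 1234, 0.5
received = rng.standard_normal(n_total)
received[true_delay:true_delay + n_pulse] += echo_amplitude * pulse

# Matched filter: correlate the received signal with the known pulse.
compressed = np.correlate(received, pulse, mode="valid")
estimated_delay = int(np.argmax(compressed))
```

Sample by sample the echo is weaker than the noise, yet the correlation adds the echo coherently over the whole pulse while the noise adds incoherently, so the peak of `compressed` marks the delay and hence the range.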

Radar data do not normally consist simply of the 
backscattered field. Radar systems typically demod- 
ulate the scattered field measurements to remove 
the rapidly oscillating carrier signal and convert the 
remaining real-valued voltages to in-phase (I) and 
quadrature (Q) components, which become the real and 
imaginary parts of a complex-valued analytic signal (for 
which the signal phase is well defined). For the pur- 
poses of this article, however, we ignore the effects 
of this processing and work simply with the scattered 
field. 


3 A Survey of Radar Imaging Methods 

Synthetic-aperture radar (SAR) imaging relies on a num- 
ber of very specific simplifying assumptions about 
radar scattering phenomenology and data-collection 
scenarios. 

• Most imaging radar systems make use of the start- 
stop approximation, in which both the radar sensor 
and the scattering object are assumed to be sta- 
tionary during the time interval in which the pulse 
interacts with the target. 

• The target or scene is assumed to behave as a rigid 
body. 

• SAR imaging methods assume a linear relationship 
between the data and the scene. 


Different geometrical configurations for the sensor 
and target are associated with different terminology. 

3.1 Inverse Synthetic-Aperture Radar

A fixed radar system staring at a rotating target is 
equivalent (by change of reference frame) to a station- 
ary target viewed by a radar moving (from pulse to 
pulse) on a circular arc. This circular arc will define, 
over time, a synthetic aperture, and sequential radar 
pulses can be used to sample those data that would be 
collected by a much larger radar antenna. Radar imag- 
ing based on such a data-collection configuration is 
known as inverse synthetic-aperture radar (ISAR) imag- 
ing. This imaging scheme is typically used for imag- 
ing airplanes, spacecraft, and ships. In these cases, the 
target is relatively small and is usually isolated. 

The small-scene approximation, namely

$$|x - y| = |x| - \hat{x}\cdot y + O\Bigl(\frac{|y|^2}{|x|}\Bigr), \tag{6}$$

where $\hat{x}$ denotes a unit vector in the direction x, is often
applied to situations in which the scene to be imaged
is small in comparison with its average distance from
the radar. This approximation is valid for $|x| \gg |y|$.

Using (6) in (5) and shifting the time origin show 
that, when the small-scene approximation is valid, the 
radar data are approximately the Radon transform of 
the reflectivity function. 
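
The size of the term neglected in the small-scene approximation can be checked directly. The sketch below compares the exact range with the first-order expression for one hypothetical geometry; the positions and function names are assumptions for illustration.

```python
import numpy as np

def exact_range(x, y):
    """Distance |x - y| between antenna position x and scene point y."""
    return np.linalg.norm(np.asarray(x, float) - np.asarray(y, float))

def small_scene_range(x, y):
    """First-order approximation |x| - x_hat . y, with x_hat = x / |x|."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    norm_x = np.linalg.norm(x)
    return norm_x - np.dot(x / norm_x, y)

# Hypothetical geometry: antenna 10 km away, scene point about 14 m from the origin.
x = np.array([8000.0, 6000.0, 0.0])
y = np.array([10.0, -10.0, 0.0])

error = exact_range(x, y) - small_scene_range(x, y)
bound = np.dot(y, y) / np.linalg.norm(x)   # the |y|^2 / |x| scale of the neglected term
```

For this geometry the error is about a centimeter over a range of 10 km. Because the approximate range depends on y only through the projection $\hat{x}\cdot y$, all scene points on a plane perpendicular to $\hat{x}$ share the same delay, which is how the Radon transform arises.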

3.2 Synthetic-Aperture Radar 

SAR involves a moving antenna, and usually the an- 
tenna is pointed toward the Earth. For an antenna view- 
ing the Earth, we need to include a model for the 
antenna beam pattern, which describes the directivity 
of the antenna. For highly directive antennas, we often 
simply refer to the antenna “footprint,” which is the 
illuminated area on the ground. 

Most SAR systems use a single antenna for both
transmitting and receiving. For a pulsed system, we
assume that pulses are transmitted at times $t_n$, and we
denote the antenna position at time $t_n$ by $\gamma_n$. The data
can then be written in the form

$$\mathcal{E}^{\mathrm{sc}}_B(t,n) = \iint e^{-i\omega[t - 2|\gamma_n - y|/c]}\,A(\omega,n,y)\,d\omega\,V(y)\,dy, \tag{7}$$

where A incorporates the geometrical spreading factors
$|x^0 - y|^{-2}$, the transmitted waveform, and the
antenna beam pattern. (More details can be found in
Cheney and Borden (2009).)
Because the timescale on which the antenna moves is
much slower than the timescale on which the electromagnetic
waves propagate, the timescales have been
separated into a slow time, which corresponds to the n
of $t_n$, and a fast time t.

The goal of SAR is to determine V from radar data
that are obtained from the scattered field $\mathcal{E}^{\mathrm{sc}}$ by I/Q
demodulation (mentioned at the end of section 2) and
matched filtering. Again, for the purposes of this article,
we neglect the processing done by the radar system
and work simply with the scattered field.

Assuming that $\gamma$ and A are known, the scattered field
(7) depends on two variables, so we expect to form a
two-dimensional image. For typical radar frequencies,
most of the scattering takes place in a thin layer at the
surface. We therefore assume that the ground reflectivity
function V is supported on a known surface. For
simplicity we take this surface to be a flat plane, so that
$V(x) = V(\boldsymbol{x})\,\delta(x_3)$, where $\boldsymbol{x} = (x_1, x_2)$.

SAR imaging comes in two basic varieties: spotlight 
SAR and stripmap SAR. 

3.2.1 Spotlight SAR 

Spotlight SAR is illustrated in figure 1. Here, the moving
radar system stares at a specific location (usually on
the ground) so that at each point in the flight path the
same target is illuminated from a different direction.
When the ground is assumed to be a horizontal plane,
the constant-range curves are large circles whose centers
are directly below the antenna at $\gamma_n$. If the radar
antenna is highly directional and the antenna footprint
is sufficiently far away, then the circular arcs within the
footprint can be approximated as lines. Consequently,
the imaging method is mathematically the same as that
used in ISAR.

As in the ISAR case, the time-domain formulation of 
spotlight SAR leads to a problem of inverting the Radon 
transform. 

3.2.2 Stripmap SAR 

Stripmap SAR is illustrated in figure 2. Just as the time- 
domain formulations of ISAR and spotlight SAR reduce 
to inversion of the Radon transform, which is a tomo- 
graphic inversion of an object from its integrals over 
lines or planes, stripmap SAR also reduces to a tomo- 
graphic inversion of an object from its integrals over 
circles or spheres. 



Figure 1 In spotlight SAR the radar is trained on a particular 
location as the radar moves. In this figure the equirange 
circles (dotted lines) are formed from the intersection of 
the radiated spherical wavefront and the surface of a (flat) 
Earth. 



Figure 2 Stripmap SAR acquires data without staring. The 
radar typically has fixed orientation with respect to the 
flight direction and the data are acquired as the beam 
footprint sweeps over the ground. 


3.2.3 Interferometric SAR 

Interferometric SAR is a sort of binocular radar imag- 
ing system that can provide height information. These 
systems use two antennas that create separate SAR 
images. These images are complex, and height infor- 
mation is encoded in the phase difference between the 
two images. 
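
The phase difference between two complex images is usually formed by multiplying one image by the complex conjugate of the other. The sketch below does this for two small synthetic images; the image size, magnitudes, and phases are invented for illustration, and the conversion from phase to height, which depends on the antenna geometry, is not attempted.

```python
import numpy as np

rng = np.random.default_rng(2)
shape = (4, 5)

# Two hypothetical complex SAR images of the same scene: equal magnitudes,
# with the second image carrying an extra phase at each pixel.
magnitude = rng.uniform(0.5, 2.0, shape)
common_phase = rng.uniform(-np.pi, np.pi, shape)
extra_phase = rng.uniform(-1.0, 1.0, shape)   # stands in for the height-dependent phase

image_1 = magnitude * np.exp(1j * common_phase)
image_2 = magnitude * np.exp(1j * (common_phase + extra_phase))

# Interferogram: multiply one image by the complex conjugate of the other.
interferogram = image_2 * np.conj(image_1)
phase_difference = np.angle(interferogram)
```

The common phase cancels and only the extra phase remains. The result is known only modulo $2\pi$, so larger phase differences would need to be unwrapped.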

4 Future Directions for Research 

In the decades since the invention of SAR imaging, there 
has been much progress, but many open problems still 
remain. In particular, as outlined at the beginning of 
section 3, SAR imaging is based on specific assumptions
that may not be satisfied in practice. When they
are not satisfied, artifacts appear in the image. Consequently,
a large number of the outstanding problems
can be grouped into two major areas.

Problems related to unmodeled motion. Both SAR 
and ISAR are based on known relative motion be- 
tween target and sensor, e.g., including the assump- 
tion that the target behaves as a rigid body. When 
this is not the case, the images are blurred or unin- 
terpretable. 

Problems related to unmodeled scattering physics. 

The Born approximation leaves out many physical 
effects, including not only multiple scattering and 
creeping waves but also shadowing, obscuration, and 
polarization changes. Neglecting these effects can 
lead to image artifacts. But without the Born approx- 
imation (or the Kirchhoff approximation, which is 
similar), the imaging problem is nonlinear. 
VII.18 Modeling a Pregnancy Testing Kit

Sean McKee 


1 Device Description 

In this article we describe a general medical diagnostic
tool, based on antibody/antigen technology, whose
principal, and certainly most lucrative, application is
as a pregnancy testing kit. The fluorescence capillary- 
fill device (FCFD) consists of two plates of glass sepa- 
rated by a narrow gap. The lower plate is coated with 
an immobilized layer of specific antibody and acts like 
an optical waveguide. The upper plate has an attached 
reagent layer of antigen (or hapten¹) labeled with a
fluorescent dye. When the sample is presented at one 
end of the FCFD, it is drawn into the gap by capil- 
lary action and dissolves the reagent. The fluorescently 
labeled antigen in the reagent now competes with the 
sample antigen for the limited number of antibody sites 
on the lower glass plate (see figure 1). The FCFD plate 
structure may be regarded as a composite waveguide: 
the intensities of the distinct optical paths, depending 
on whether they originate from a fluorescent molecule 
that is free in the solution or from a molecule bound 
close to the surface of the plate, are picked up by a 
photodetector. 

We shall not be concerned with the optical aspects 
of this device but rather with the competitive reaction 
between the antigen and the fluorescent antigen for 
those antibody sites, and we shall focus on its use as a 
pregnancy testing kit. When a woman is pregnant the 
presence of a “foreign body” causes the production of 
antigen molecules (X), which are then countered by specific
antibodies (Y). Denoting the fluorescently labeled
antigen by $X_F$, figure 2 (representing the “blow-ups” in
figure 1) displays the situation when a woman is, and
is not, pregnant both at time t = 0 (when the sample
is first presented to the device) and at $t = t_f$ (when the
reaction is finished). Upon completion, a light is shone
down the lower plate, and, if the woman is pregnant, 
the beam of greater light intensity will be detected at a 
larger angle to the plate (as shown in figure 1). 

2 Mathematical Model 

1. Haptens are low-molecular-weight molecules that contain an
antigenic determinant; these have substantially larger diffusion
coefficients than the antigen molecules.

Figure 1 A schematic of the fluorescent capillary-fill
device. More details can be found in Badley et al. (1987).

When a sample (urine in this case) is presented to
the FCFD, it dissolves the labeled antigen but not the
antibody that is fixed on the lower wall. The device
is then left in a stationary position to allow the antigen
and the labeled antigen to diffuse across the gap
and compete for the antibody sites. Thus the mathematical
model consists of two one-dimensional diffusion
equations coupled through nonlinear and nonlocal
boundary conditions, and a number of conservation
relationships. Since the equations themselves are linear,
Laplace transforms may be employed to recharacterize
the system as two coupled nonlinear Volterra
integro-differential equations: in nondimensional form
they are

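$$\frac{dw_1}{dt}(t) = \gamma_1\Bigl[\bigl(1 - w_1(t) - w_2(t)\bigr)\Bigl(\mu - m\int_0^t \frac{dw_1}{d\tau}(\tau)\,K(\delta(t-\tau))\,d\tau\Bigr) - L_1 w_1(t)\Bigr], \tag{1}$$

$$\frac{dw_2}{dt}(t) = \gamma_2\Bigl[\bigl(1 - w_1(t) - w_2(t)\bigr)\Bigl(g(t) - m\int_0^t \frac{dw_2}{d\tau}(\tau)\,K(t-\tau)\,d\tau\Bigr) - L_2 w_2(t)\Bigr]. \tag{2}$$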


Here, $w_1(t)$ and $w_2(t)$ denote the concentrations of the
complexes $XY$ and $X_F Y$, respectively, while $\delta$, $\mu$, $m$, $\gamma_1$,
and $\gamma_2$ are nondimensional constants. The functions
$g(t)$ and $K(t)$ are given by

\[
g(t) = 1 - e^{-\lambda t} + 2\lambda \sum_{n=1}^{\infty} \frac{(-1)^n}{\lambda - n^2\pi^2} \bigl( e^{-n^2\pi^2 t} - e^{-\lambda t} \bigr),
\]

\[
K(t) = \frac{1}{\sqrt{\pi t}} \Bigl( 1 + 2 \sum_{n=1}^{\infty} e^{-n^2/t} \Bigr),
\]

where $\lambda$ is a further nondimensional constant representing
the dissolution rate of the bound $X_F$.
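As a numerical sanity check, the sketch below evaluates truncated versions of the two series, assuming the forms $K(t) = (\pi t)^{-1/2}\bigl(1 + 2\sum_{n \ge 1} e^{-n^2/t}\bigr)$ and $g(t) = 1 - e^{-\lambda t} + 2\lambda \sum_{n \ge 1} (-1)^n (e^{-n^2\pi^2 t} - e^{-\lambda t})/(\lambda - n^2\pi^2)$. It also compares $K$ with the equivalent cosine-mode expansion $1 + 2\sum_{n \ge 1} e^{-n^2\pi^2 t}$, which follows from the Poisson summation formula. The function names and truncation levels are illustrative choices and are not taken from the original model code.

```python
import math

def K_images(t, terms=60):
    """Kernel as a sum over images; converges quickly for small t."""
    tail = sum(math.exp(-n * n / t) for n in range(1, terms + 1))
    return (1.0 + 2.0 * tail) / math.sqrt(math.pi * t)

def K_modes(t, terms=60):
    """Same kernel as a sum over cosine modes; converges quickly for large t."""
    return 1.0 + 2.0 * sum(math.exp(-((n * math.pi) ** 2) * t)
                           for n in range(1, terms + 1))

def g(t, lam, terms=200):
    """Truncated series for g(t); assumes lam != n^2 pi^2 for every n."""
    total = 1.0 - math.exp(-lam * t)
    for n in range(1, terms + 1):
        k2 = (n * math.pi) ** 2
        total += (2.0 * lam * (-1) ** n / (lam - k2)
                  * (math.exp(-k2 * t) - math.exp(-lam * t)))
    return total

# The two representations of K should agree to rounding error.
for t in (0.05, 0.3, 2.0):
    print(t, K_images(t), K_modes(t))
```

The image form behaves like $1/\sqrt{\pi t}$ for small $t$, which is the weak singularity referred to in the text, while the mode form shows that $K(t) \to 1$ as $t \to \infty$.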


Figure 2 (a) Woman is pregnant. (b) Woman is not pregnant.
Each panel shows the device at $t = 0$ and at $t = t_f$.
Here, X denotes the sample antigen, $X_F$ the antigen or hapten
with a fluorescent label, and Y the specific antibody for
the antigen or hapten.


Equations (1) and (2) admit a regular perturbation 
solution for small m. 

Extending results of Jumarhon and McKee, a further 
recharacterization may be obtained in the form of a 
system of four coupled, nonlinear, (weakly) singular 
Volterra integral equations of the second kind: the four 
dependent variables in this case, in nondimensional 
form, are 

\[
[X](1, t), \quad [X_F](1, t), \quad \int_0^1 [X](x, t)\, dx, \quad \text{and} \quad \int_0^1 [X_F](x, t)\, dx,
\]

where [X] denotes the concentration of the antigen X, 
etc. It is from this system that one is able to deduce 
(global) existence and uniqueness of a solution of the 
original diffusion problem, though the proofs are not 
trivial. 

It is also from these characterizations that one is able
to obtain small- and large-$t$ asymptotic results; interestingly,
the large-$t$ results require the explicit solution of
a quartic.


3 Design Considerations 

The objective of this work was to provide a quantitative
design tool for the bioscientists. Indeed, this model (or,
more precisely, an earlier, considerably simplified model)
was ultimately employed in the development of Clearblue,
the well-known pregnancy testing kit. The model (and its
associated code) allowed bioscientists to see that the
device could be made small (and, consequently, be produced
very cheaply) and in large batches, suitable for hospital
use. Furthermore, it provided an indication of both the
plate separation distance and how much antibody and labeled
antigen were required to be affixed to the plate surfaces.
In short, the tool obviated a great deal of experimentation,
thus saving time and allowing Unilever Research, through
a company that was then called Unimed, to bring the
product to market early.

Further Reading 
P. R. Stephenson. 1987. Optical biosensors for immuno-
nonlinear and nonlocal boundary conditions. Journal of
Mathematical Analysis and Applications 190:806-20.
VII.19 Airport Baggage Screening with
X-Ray Tomography

W. R. B. Lionheart 


1 The Security Screening Problem 

In airport security, baggage carried in an aircraft’s hold 
needs to be screened for explosive devices. Tradition- 
ally, an X-ray machine is used that gives a single two- 
dimensional projection of the X-ray attenuation of the 
contents of the bag. In some systems, several views 
are used to reveal threats that may be obscured by 
large dense objects. An extension of this idea is to use 
a large number of projections to reconstruct a three- 
dimensional volume image of the luggage. This is the 
X-ray computed tomography (CT) technology that is 
familiar from medical applications (see medical imag- 
ing [VII.9]). The image can then be viewed by an oper- 
ator from any desired angle and threat-detection soft- 
ware can be used that, for example, segments the vol- 
ume image, identifying objects that have a similar X-ray 
attenuation to explosives. In some airports, a two-stage 
system is used in which bags that cannot be cleared 
by an automatic system analyzing a two-dimensional 
projection are then passed to a much slower X-ray 
tomography system. 

Airport baggage handling systems operate using con- 
veyor belts traveling at around 0.5 m/s. Medical CT 
machines use a gantry supporting the X-ray source
and an array of detectors that rotates in a vertical
plane while the patient is translated in the direction
of the rotation axis. Relative to the patient, the source 
describes a helical trajectory. By contrast, small labora- 
tory CT machines rotate and translate the sample while 
the source and detector remain fixed. Neither of these 
is practical for scanning luggage at the desired speed: 
the mass of the gantry is too great to rotate fast enough 
and rotating the bag would displace the contents. 

2 Real-Time Tomography 

The company Rapiscan Systems has developed a sys- 
tem called real-time tomography (RTT) that uses multi- 
ple X-ray sources fixed in a circular configuration that 
can be switched electronically, removing the need for a 
rotating gantry. A cylindrical array of detectors is used, 
and this is coaxial with the sources, but the X-rays can- 
not penetrate the detectors, so the detectors are offset 
relative to the sources (see figure 1). 

A reasonable mathematical model of X-ray tomogra- 
phy is that the line integrals of the linear attenuation 
are measured for all lines joining sources to detectors. 
For a helical source trajectory and detectors covering a 
set called the Tam-Danielsson window, there is an exact
reconstruction algorithm due to Katsevich expressed in 
terms of derivatives and integral operators applied to 
the data. Most medical and industrial CT machines use 
an approximation of this using overdetermined data. 

The RTT presents several mathematical challenges. 

• The data is incomplete, resulting in an ill-posed
inverse problem [IV.15] to solve.

• The reconstruction must be completed quickly to 
ensure the desired throughput of luggage. 

• The sources can be fired in almost any order. In
fact, sequential firing that approximates a rotating
gantry and a single-threaded helix trajectory
is the most difficult due to heat dissipation issues
in clusters of sources. What is the optimal firing
order?

3 Sampling Data and Sufficiency 

Figure 1 A cartoon of an RTT scanner showing
the source circle and detector cylinder.

As the RTT was originally conceived as a fast helical
scan machine, it is natural to think of the sources firing
sequentially at equal time intervals as a discrete
approximation to a curve. A firing sequence in which
a fixed number of sources is skipped at each time step
approximates a multithreaded helix. Figure 2 illustrates
two possible firing orders, the first approximating a
multithreaded helix. The second could be interpreted
as a multithreaded helix (with two different pitches),
but it is more natural to view it as a triangular lattice
that samples the two-dimensional surface of a cylinder
rather than a curve.
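The idea of skipping a fixed number of sources per time step can be sketched in a few lines. The indexing convention, the function names, and the belt-advance parameter below are hypothetical and are not a description of the actual RTT control scheme; the sketch only illustrates that such a sequence visits every source exactly when the step is coprime to the number of sources.

```python
from math import gcd

def firing_order(n_sources, skipped, n_steps):
    """Source index fired at each time step when `skipped` sources are
    passed over between consecutive firings (skipped=0 is sequential)."""
    step = skipped + 1
    return [(i * step) % n_sources for i in range(n_steps)]

def visits_all_sources(n_sources, skipped):
    """Every source fires within one cycle iff the step is coprime to n_sources."""
    return gcd(skipped + 1, n_sources) == 1

def source_positions(n_sources, skipped, n_steps, belt_advance_per_step):
    """(angular index, axial position) pairs on the source cylinder,
    as seen from the bag moving on the belt."""
    order = firing_order(n_sources, skipped, n_steps)
    return [(s, i * belt_advance_per_step) for i, s in enumerate(order)]

print(firing_order(8, 0, 8))   # sequential firing: a single-threaded helix
print(firing_order(8, 2, 8))   # skipping two sources per step
```

Plotting the output of `source_positions` for different values of `skipped` gives point sets on the cylinder that can be read either as several interleaved helices or as a lattice.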

The manifold of lines in three-dimensional space is
four-dimensional, and data from the X-ray transform of
a function of three variables satisfies a consistency
condition called John’s ultrahyperbolic equation. In
conventional helical CT, the lines through the helix in the
Tam-Danielsson window are sufficient data to solve the
reconstruction problem (and consequently solve the
Dirichlet problem for John’s equation). We can interpret
the RTT data in figure 1 intersected with the detector
array as a discrete sampling scheme for an open
subset of the four-dimensional space of lines. Using
the Fourier slice theorem for the Radon plane transform
along with the Paley-Wiener theorem, we can
see that for continuum data the inverse problem has
a unique solution, but an inversion using this method
would be highly unstable. In a practical problem with
a discrete sampling scheme and noisy data, we would
expect to need some regularization for a numerically
stable solution.

4 Inversion 

The inversion of the RTT data can be considered as the
solution of a sparse linear system of equations relating
the attenuation coefficients in voxels within the
region of interest to the measured data. As such we
can employ the standard techniques of numerical linear
algebra and linear inverse problems, including iterative
solution methods and regularization. The matrix
of the system we solve is generally too large for direct
solution, but one method that works well for this problem
is conjugate gradient least squares applied to the
generalized Tikhonov regularized system.

Figure 2 (a), (b) Two multithreaded helix source firing
sequences; (b) is a lattice on the cylinder of source positions.
(c), (d) The number of rays intersecting each voxel for the
firing orders above. The second sequence ((b), (d)) has more
uniform coverage of the cylinder and produces a more
uniform distribution of rays.

This approach allows us to apply a systematic choice
of regularization penalty, equivalent to an assumption
about the covariance of the prior distribution in
Bayesian terms. Although the full system is too large
for direct solution with the computing hardware
available, by looking at scaled-down systems we have
found that the condition number of the (unregularized)
matrix is smallest for firing orders that uniformly
sample space. Another way to assess the merits of a
firing order is to look at the distribution of directions
of rays intersecting each voxel, and this criterion leads
to the same conclusion as the studies of condition
number (see figure 2). The resulting reconstructions
also demonstrate an improvement with the
lattice source firing.
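A minimal sketch of conjugate gradient least squares applied to a Tikhonov-regularized problem, $\min_x \|Ax - b\|^2 + \alpha^2 \|Lx\|^2$, is given below. It is not the production RTT code: a small dense random matrix stands in for the sparse system matrix, the penalty operator `L` is taken to be the identity in the usage note, and all names are illustrative. The regularized problem is solved by running CGLS on the stacked system $[A; \alpha L]\,x \approx [b; 0]$, so only products with $A$, $L$, and their transposes are needed.

```python
import numpy as np

def cgls(matvec, rmatvec, b, n, iters=200, tol=1e-12):
    """CGLS for min ||A x - b||_2, using only products with A and A^T."""
    x = np.zeros(n)
    r = np.asarray(b, dtype=float).copy()   # residual b - A x
    s = rmatvec(r)                          # normal-equations residual A^T r
    p = s.copy()
    gamma = s @ s
    gamma0 = gamma
    for _ in range(iters):
        if gamma <= (tol ** 2) * gamma0:
            break
        q = matvec(p)
        alpha = gamma / (q @ q)
        x += alpha * p
        r -= alpha * q
        s = rmatvec(r)
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x

def tikhonov_cgls(A, b, L, alpha, **kw):
    """Solve min ||A x - b||^2 + alpha^2 ||L x||^2 via the stacked system
    [A; alpha L] x ~ [b; 0], without forming the normal equations."""
    m = A.shape[0]
    matvec = lambda x: np.concatenate([A @ x, alpha * (L @ x)])
    rmatvec = lambda y: A.T @ y[:m] + alpha * (L.T @ y[m:])
    rhs = np.concatenate([b, np.zeros(L.shape[0])])
    return cgls(matvec, rmatvec, rhs, A.shape[1], **kw)
```

For example, with `A` of shape `(30, 10)`, `L = np.eye(10)`, and `alpha = 0.5`, the result can be compared against a direct solve of the regularized normal equations `(A.T @ A + alpha**2 * L.T @ L) x = A.T @ b`, which is feasible only because the test problem is tiny.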
Figure 3 (a) Optimal two-sheet surface with an exaggerated
axial scale. (b) Reconstruction of a bass guitar from data
collected using a prototype RTT80 baggage scanner.


Even using contemporary graphics processing units, 
iterative methods are too slow to solve the reconstruc- 
tion problem in real time. Rebinning methods interpo- 
late data from a (multithreaded) helix to approximate 
data taken on a plane. Radon transform inversion on 
the planes can be performed very quickly using filtered 
back projection. A variation of this method is surface 
rebinning. For each source location $\lambda$ the image values
on a surface $z = \zeta_\lambda(x, y)$ (where $z$ is the axial direction)
are approximated using two-dimensional Radon trans- 
form inversion of the data corresponding to rays close 
to that surface (see figure 3). For conventional helical 
CT, the optimal surface to use is close to a plane, but for 
the RTT geometry a good solution is to find a surface 
with two sheets and reconstruct the sum of the atten- 
uation coefficients at points on those sheets with the 
same $(x, y)$ coordinates. There is then a separate very
sparse linear system to solve to find the values of the 
attenuation coefficient at a voxel, which will depend on 
values on the upper and lower sheet for different source 
positions. The optimal surface for a given geometry is 
conveniently found as a fixed point of a contraction 
mapping. 

Further Reading 

Betcke, M. M., and W. R. B. Lionheart. 2013. Multi-sheet sur- 
face rebinning methods for reconstruction from asym- 
metrically truncated cone beam projections. I. Approxi- 
mation and optimality. Inverse Problems 29:115003. 
Ramm, A. G., and A. I. Katsevich. 1996. The Radon Trans- 
form and Local Tomography. Boca Raton, FL: CRC Press. 
Thompson, W. M., W. R. B. Lionheart, and D. Oberg. 2013. 
Reduction of periodic artefacts for a switched-source 
X-ray CT machine by optimising the source firing pat- 
tern. In Proceedings of the 12th International Meeting on 
Fully Three-Dimensional Image Reconstruction in Radiol- 
ogy and Nuclear Medicine, Lake Tahoe, CA. 


VII.20 Mathematical Economics 

Ivar Ekeland 


Economic theory is traditionally divided into microeconomics
and macroeconomics. Microeconomics deals
with individuals, macroeconomics with society. At the
present time, microeconomics is seen as foundational, 
meaning that society is understood as no more than 
a contract (implicit or explicit) between individuals, 
so that macroeconomics should be derived from the 
behavior of individuals, just as all the laws of physics 
should be derived from the behavior of atoms. Let me 
hasten to add that we are no closer to this grand uni- 
fication in economics than we are in physics. Macro- 
economics and microeconomics are largely separate 
fields, with the latter being much more conceptually 
mature, whereas the basic principles of the former are 
still under discussion. 

1 Individuals 

Individuals are seen as utility maximizers. Each of us 
lives in an environment in which decisions are to be 
made: choosing a point $x \in A$, say, where $A$ is a closed
subset of Euclidean space. My preferences are characterized
by a continuous function $U : A \to \mathbb{R}$: I prefer $x$ to
$y$ if $U(x) > U(y)$, and I am indifferent if $U(x) = U(y)$.
Note that the preference relation thus defined is tran- 
sitive: if I prefer x to y and y to z, then I prefer x to z. 
As we shall see later on, this transitivity holds only for 
individuals, not for groups. The utility function charac- 
terizes the individual: I have mine; you have yours; they 
are different. Note that it is taken as given that some 
of us are selfish, some are not, some of us are drug 
addicts, some of us prefer guns; it will all show up in
our utility. At this point, economic theory is positive, 
meaning that it does not tell people what they should 
aim for; it tells them how best to reach their goals. 

It is assumed that each individual chooses the point 
that maximizes his or her utility function U over the 
set of possible choices A. Of course, we need this 
maximizer to be unique; otherwise we would have to 
choose between all possible maximizers, and the deci- 
sion problem would not be solved. In order to achieve 
uniqueness, the set A is assumed to be convex and 
the function U concave, one of them strictly so. The 
admissible set $A$ is usually defined by budgetary
constraints. For instance, we may consider an economy
with $D$ goods, with $x = (x^1, \dots, x^D)$ being a goods
bundle (meaning that $x^d$ is the quantity of good $d$ in the
bundle). If the price of good $d$ is $p_d$, the price of the
bundle $x$ is given by

\[
px = \sum_{d=1}^{D} p_d x^d.
\]

Consumers are then characterized by their utility 
function U and their wealth w, and they will choose 
the bundle x that they prefer among all those they 
can afford. This translates into the convex optimization 
problem 


\[
\max U(x), \quad px \le w, \quad x^d \ge 0 \ \ \forall d. \tag{1}
\]
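To make problem (1) concrete, the following sketch assumes a particular utility function, the Cobb-Douglas form $U(x) = \sum_d a_d \log x^d$ with positive weights $a_d$ summing to 1. This choice, the prices, and the wealth level are illustrative assumptions, not part of the text. For this utility the maximizer is known in closed form: the consumer spends the fraction $a_d$ of wealth on good $d$.

```python
import numpy as np

def utility(x, a):
    """Cobb-Douglas utility in log form: sum_d a_d log x^d (concave)."""
    return float(np.sum(a * np.log(x)))

def demand(p, w, a):
    """Closed-form maximizer of `utility` subject to p.x <= w, x >= 0,
    valid when the weights a_d are positive and sum to 1."""
    return a * w / p

a = np.array([0.5, 0.3, 0.2])   # assumed preference weights
p = np.array([1.0, 2.0, 4.0])   # assumed prices
w = 10.0                        # assumed wealth
x_star = demand(p, w, a)        # the share a_d of wealth goes to good d

# Compare against random bundles that exhaust the same budget.
rng = np.random.default_rng(1)
others = [rng.dirichlet(np.ones(3)) * w / p for _ in range(1000)]
best_other = max(utility(y, a) for y in others)
print(x_star, utility(x_star, a), best_other)
```

The random comparison is only a spot check; optimality of the closed form follows from the first-order conditions of the concave problem.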


Strange as it may seem, this model has testable con- 
sequences, and the data supports it. It is worth going 
into this in a little more detail. One first needs to distinguish
what is observable from what is not. The utility
function $U$, for instance, cannot be observed: if I am
asked what my utility function is, I do not know how
to answer. However, the map $p \mapsto x(p)$ (the demand
function) that associates with each price system the
corresponding maximizer in problem (1) is observable:
it is conceivable that my consumption pattern could be 
observed under price changes. The demand function is 
known to satisfy the so-called Slutsky relations. These 
are a system of first-order partial differential equations 
(PDEs) that can be expressed compactly by stating that 
the matrix S(x) defined by 


\[
S(x)_{m,n} := \frac{\partial x^m}{\partial p_n} - \frac{x^n}{w} \sum_k p_k \frac{\partial x^m}{\partial p_k}
\]


is symmetric and negative semidefinite. Browning and Chiappori
have tested the Slutsky relations on Canadian
data and found them to be satisfied, so the con- 
sumer model stands. Unfortunately, the model is too 
restricted: it is not enough to weigh one decision 
against another when the consequences are immediate 
and certain. Most of the time, the consequences will 
occur more or less far into the future, and they are 
affected by various degrees of uncertainty. 
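The Slutsky relations can be checked numerically for a given demand function. The sketch below uses the common textbook form of the Slutsky matrix with wealth as an explicit argument, $S_{mn} = \partial x^m / \partial p_n + x^n\, \partial x^m / \partial w$, and an assumed Cobb-Douglas demand $x^d = a_d w / p_d$; both the demand function and the numerical values are illustrative assumptions, and the derivatives are approximated by central differences.

```python
import numpy as np

def cobb_douglas_demand(p, w, a):
    """Assumed demand function: the share a_d of wealth is spent on good d."""
    return a * w / p

def slutsky_matrix(demand, p, w, h=1e-5):
    """S[m, n] = d x^m / d p_n + x^n * d x^m / d w, by central differences."""
    D = len(p)
    x = demand(p, w)
    dx_dw = (demand(p, w + h) - demand(p, w - h)) / (2 * h)
    S = np.empty((D, D))
    for n in range(D):
        e = np.zeros(D)
        e[n] = h
        dx_dpn = (demand(p + e, w) - demand(p - e, w)) / (2 * h)
        S[:, n] = dx_dpn + dx_dw * x[n]
    return S

a = np.array([0.5, 0.3, 0.2])
p = np.array([1.0, 2.0, 4.0])
w = 10.0
S = slutsky_matrix(lambda p_, w_: cobb_douglas_demand(p_, w_, a), p, w)
print(np.round(S, 4))
print(np.linalg.eigvalsh((S + S.T) / 2))
```

For this demand the matrix comes out symmetric with nonpositive eigenvalues; one eigenvalue is zero because $Sp = 0$, a consequence of the budget constraint and the fact that demand is unchanged when prices and wealth are scaled together.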

The standard way to take into account uncertainty 
goes back to von Neumann and Morgenstern (VNM). 
Assume that uncertainty is modeled by a space $\Omega$ of
events, with a $\sigma$-algebra $\mathcal{A}$, and that the individual
(who already has a utility function $U$) puts on $(\Omega, \mathcal{A})$
a probability $P$. The utility of a random variable $X$ will
then be its expected utility, namely, $E[U(X)]$. In this
framework, the concavity of U can be interpreted as 
risk aversion. 
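A two-outcome lottery is enough to see why concavity corresponds to risk aversion. The sketch below assumes the utility $U(x) = \sqrt{x}$ and a fair coin flip between 0 and 100; both are illustrative choices. By Jensen's inequality the expected utility of the lottery is below the utility of its mean, so the sure amount that the individual finds equivalent to the lottery (the certainty equivalent) is below the mean payoff.

```python
import math

def expected_utility(outcomes, probs, U):
    """E[U(X)] for a discrete random variable X."""
    return sum(q * U(x) for x, q in zip(outcomes, probs))

def certainty_equivalent(outcomes, probs, U, U_inv):
    """Sure amount whose utility equals the expected utility of the lottery."""
    return U_inv(expected_utility(outcomes, probs, U))

outcomes, probs = [0.0, 100.0], [0.5, 0.5]
eu = expected_utility(outcomes, probs, math.sqrt)
mean = sum(q * x for x, q in zip(outcomes, probs))
ce = certainty_equivalent(outcomes, probs, math.sqrt, lambda u: u * u)
print(eu, math.sqrt(mean), ce, mean)
```

Here the lottery has mean 50 but is worth only 25 for certain to this individual; the gap between the two is the risk premium.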


The standard way to account for late consequences 
is to discount future utilities. More precisely, it is
assumed that there is some $\delta > 0$ (the psychological
rate of time preference) such that if $x$ is consumed
at time $t \ge 0$, the resulting utility at time $t = 0$ is
$e^{-\delta t} U(x)$. Combining both VNM utility and discounting,
we can give the present value of an uncertain
consumption flow (a stochastic process) $X_t$, $t \ge 0$, as

\[
E\Bigl[ \int_0^\infty e^{-\delta t} U(X_t)\, dt \Bigr]. \tag{2}
\]

One more equation is needed to describe how the 
consumption flow $X_t$ is generated or paid for. With
this model one can then describe how people allo- 
cate consumption and investment over time. It is the 
workhorse of economic theory (both micro and macro) 
and finance. The mathematical tools are optimal con- 
trol, both stochastic and deterministic, both from the 
PDE point of view (leading to the Hamilton-Jacobi- 
Bellman equation) and from the point of view of the 
Pontryagin maximum principle (leading, in the stochas- 
tic case, to a backward stochastic differential equation). 
All economics textbooks are replete with examples. 

This model is now facing criticism due to accumu- 
lated psychological evidence. 

On the one hand, people do not seem to maximize 
expected utility when facing uncertainty, e.g., they give 
undue importance to events with a small probability, 
and they are more sensitive to losses than to gains. 
Allais (very early on) and (later) Ellsberg have pointed 
out paradoxes, that is, experiments in which the actual 
choices could not be explained by VNM utilities. Var- 
ious alternative models have been suggested to take 
these developments into account, the most popular of 
which is the prospect theory of Kahneman and Tver- 
sky. This is a modification of the VNM model in which 
probabilities are distorted before the expectation is 
computed. 

On the other hand, people do discount future utilities
but not at a constant rate: the present value of receiving
$x$ at a time $t$ is $h(t)U(x)$, where $h(t)$ is decreasing
from 1 to 0, but there is no reason it should be an
exponential, $h(t) = e^{-\delta t}$. Psychological evidence suggests
that it looks more like $h(t) = (1 + kt)^{-1}$ (hyperbolic
discounting). This means that (2) should be replaced
by

\[
E\Bigl[ \int_0^\infty h(t) U(X_t)\, dt \Bigr].
\]

Unfortunately, this implies that the preferences of the
decision maker now depend on time! More precisely,
suppose that different flows ($X_t$ and $Y_t$, say) are given
for $t \ge T$ (assume they are deterministic; uncertainty
has nothing to do with it), and for $0 \le s \le T$ set

\[
I_s(X) := \int_T^\infty h(t - s) U(X_t)\, dt.
\]

If $h(t)$ is not an exponential, it may well be the case that,
for $s_1 < s_2 < T$, we have $I_{s_1}(X) < I_{s_1}(Y)$ but $I_{s_2}(X) >
I_{s_2}(Y)$. This phenomenon is called time inconsistency.
It implies that seeking an optimal solution is useless, 
since the decision maker will change his or her idea of 
optimality as time goes by. 
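The preference reversal can be reproduced with a toy calculation. As a simplifying assumption, the continuous flows are replaced here by single payments: $X$ delivers utility 1 at time $T = 10$ and $Y$ delivers utility 1.5 at time 11. The discount parameters $k = 1$ and $\delta = 0.3$ and all function names are likewise illustrative.

```python
import math

def hyperbolic(t, k=1.0):
    """Hyperbolic discount factor h(t) = 1 / (1 + k t)."""
    return 1.0 / (1.0 + k * t)

def exponential(t, delta=0.3):
    """Exponential discount factor h(t) = exp(-delta t)."""
    return math.exp(-delta * t)

def present_value(payments, s, h):
    """Value at time s of utilities u delivered at later times t."""
    return sum(u * h(t - s) for t, u in payments)

X = [(10.0, 1.0)]   # smaller utility, delivered sooner
Y = [(11.0, 1.5)]   # larger utility, delivered later

def prefers_X(s, h):
    return present_value(X, s, h) > present_value(Y, s, h)

for s in (0.0, 9.5):
    print(s, prefers_X(s, hyperbolic), prefers_X(s, exponential))
```

Seen from $s = 0$ the hyperbolic discounter ranks $Y$ above $X$, but at $s = 9.5$ the ranking flips. With exponential discounting the ratio $h(11 - s)/h(10 - s) = e^{-\delta}$ does not depend on $s$, so the ranking never changes.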

Non-VNM utilities and nonexponential discounts are 
both the subject of active research because they seem 
to stick closer to the actual behavior of economic 
agents, notably investors, than do VNM utilities. This 
creates major mathematical challenges, since it intro- 
duces nonconvexities in optimization and control prob- 
lems (prospect theory), and challenges the very concept 
of optimality (time inconsistency), which then has to be 
replaced by a Nash equilibrium of the game between 
successive decision makers. 

2 Groups 

The main thing to understand about groups is that they 
are not individuals; in particular, they do not have a
utility function. If there are $N$ members in a group, each
of them with a utility function $U_n$, $1 \le n \le N$, and a
decision has to be made, then one cannot maximize all
the $U_n$ at the same time, except in very particular cases.
Each member has his or her own preferred choice $x_n$,
and the collective choice will be the result of a decision 
process. There is a large literature on collective choice, 
which is mainly axiomatic: one specifies certain prop- 
erties that such a process should satisfy, and then one 
seeks to identify the solution, if there is one. The main 
result in this direction is the Arrow impossibility the- 
orem, which states the following. Suppose there is a 
procedure that transforms any set of individual rank- 
ings into a collective ranking. Suppose this procedure 
satisfies two very mild conditions, namely: 

• if every voter prefers alternative X over alternative 
Y, then the group prefers X over Y; and 

• if another alternative Z is introduced and if ev- 
ery voter’s preference between X and Y remains 
unchanged, then the group’s preference between 
X and Y will also remain unchanged (even if vot- 
ers’ preferences between other pairs, like X and Z, 
Y and Z, or Z and W, change). 


The procedure then consists of conforming in every 
circumstance to the ranking of a fixed member of the 
group (the dictator). The importance of Arrow’s theorem
lies in emphasizing that in any situation
where a collective decision is to be made, the result 
will be as much a function of the procedure chosen as 
of the individual preferences of the members. In other 
words, if the premises of microeconomic theory are 
to be accepted, there is no such thing as the common 
good. In any society there are only individual interests, 
and as soon as they do not agree, there is no overriding 
concern that would resolve the conflict; there are only 
procedures, and one is as good as another. A rich liter- 
ature on collective choice has sprung from this well. It 
is axiomatic in nature, i.e., one seeks procedures that 
will satisfy certain axioms posited a priori. For instance, 
one might want to satisfy some criterion of fairness. 
Unfortunately, there is not currently a formal model of 
fairness or a clear understanding of distributive justice. 
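A classical illustration of why collective rankings are delicate is pairwise majority voting, which respects unanimity and depends only on how voters rank the two alternatives being compared, yet can fail to produce a transitive ranking at all. The three-voter profile below is the standard Condorcet example; the function name is illustrative.

```python
def majority_prefers(rankings, x, y):
    """True if a strict majority of voters rank x above y."""
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return 2 * votes > len(rankings)

# Each list is one voter's ranking, best alternative first.
profile = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

cycle = [("A", "B"), ("B", "C"), ("C", "A")]
for x, y in cycle:
    print(x, "beats", y, ":", majority_prefers(profile, x, y))
```

Each of the three comparisons is won two votes to one, so the group "prefers" A to B, B to C, and C to A. Individual preferences are transitive; the majority relation built from them is not.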

However, there is a generally accepted notion of efficiency.
Choose coefficients $\lambda_n \ge 0$, with $\sum_n \lambda_n = 1$, and
maximize $\sum_n \lambda_n U_n(x)$. The point $x$ obtained in this way
will have the property that one cannot find another
point $y$ such that $U_n(y) > U_n(x)$ for some $n$, and
$U_n(y) \ge U_n(x)$ for all $n$. It is called a Pareto optimum,
and of course it depends on the choice of the coefficients
$\lambda_n$. If $x$ is not a Pareto optimum, then resources
are being wasted, since one could give more to one person
without hurting anyone else. Note that if $x$ is a
Pareto optimum, the choice $x$ is efficient, since there
is no waste, but it need not be fair: taking $\lambda_1 = 1$ and
$\lambda_n = 0$ for $n \ne 1$ gives a Pareto optimum.
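The weighted-sum construction can be tried on a toy problem. Assume two people share one unit of a resource, with allocations $(x_1, x_2)$, $x_1 + x_2 \le 1$, and utilities $U_1 = \sqrt{x_1}$, $U_2 = \sqrt{x_2}$; these utilities and the grid used for the dominance check are illustrative assumptions. Maximizing $\lambda \sqrt{x_1} + (1 - \lambda)\sqrt{x_2}$ gives $x_1 = \lambda^2 / (\lambda^2 + (1 - \lambda)^2)$ from the first-order condition.

```python
import math

def utilities(alloc):
    x1, x2 = alloc
    return math.sqrt(x1), math.sqrt(x2)

def weighted_optimum(lam):
    """Maximizer of lam*sqrt(x1) + (1-lam)*sqrt(x2) subject to x1 + x2 <= 1."""
    x1 = lam ** 2 / (lam ** 2 + (1.0 - lam) ** 2)
    return x1, 1.0 - x1

def is_dominated(alloc, candidates):
    """True if some candidate makes nobody worse off and somebody better off."""
    u1, u2 = utilities(alloc)
    for other in candidates:
        v1, v2 = utilities(other)
        if v1 >= u1 and v2 >= u2 and (v1 > u1 or v2 > u2):
            return True
    return False

# Grid of feasible allocations, including wasteful ones with x1 + x2 < 1.
n = 50
feasible = [(i / n, j / n) for i in range(n + 1) for j in range(n + 1 - i)]

print(is_dominated((0.3, 0.3), feasible))               # wasteful allocation
print(is_dominated(weighted_optimum(0.7), feasible))    # weighted optimum
print(weighted_optimum(1.0))                            # efficient but not fair
```

The wasteful allocation $(0.3, 0.3)$ is dominated, whereas the weighted optima are not. With $\lambda = 1$ the first person receives everything, which wastes nothing but can hardly be called fair.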

Groups are ubiquitous in economic theory: the small- 
est economic unit is not the individual but the house- 
hold, two or more people living together and shar- 
ing resources. How the resources are shared depends 
on cultural, social, and psychological constraints, and 
these would be extremely hard to model. However, 
one can get a long way with the analysis by making 
the simple assumption that the sharing of resources 
within the group is efficient, i.e., Pareto optimal. For 
an efficient household, composed of $H$ members, each
one with his/her own utility function $U_h(x_h)$, the
consumption problem (1) becomes

\[
\max \Bigl\{ \sum_h \lambda_h(p) U_h(x_h) \Bigm| \sum_h p x_h \le w \Bigr\},
\]

where $x_h$ is the consumption of each member, $p$ is
the price system, and the $\lambda_h$ are the Pareto coefficients
that reflect power within the group (and depend on the
price system). The econometrician observes the sum
$\sum_h x_h(p)$, that is, he or she observes the aggregate
consumption of the group, not the individual consumption
of each member. It is possible to extend the Slutsky
relations to this situation and identify the individual
preferences of each member. This is done by using
exterior differential calculus, notably the Darboux and
Cartan-Kähler theorems.

3 Markets 

Markets are groups that have chosen a particular way to 
allocate resources. The standard model for competitive 
markets is due to Arrow and Debreu. There are global 
resources, namely a bundle of goods $x \in \mathbb{R}^D$, that
should either be shared outright (an exchange economy)
or used to produce more goods, which will then be
shared (a production economy) among the $N$ members
of the group, each of whom has his or her own utility
function $U_n$. There is also a production technology,
consisting of $K$ firms, each of which is characterized by
a production set $Y_k$: if $(x, y) \in Y_k$, then firm $k$ can
produce the bundle $x$ by consuming the bundle $y$. Time
runs from $t = 0$ to infinity, and there is some uncertainty
about the future, modeled by a set $\Omega$ of possible
states of the world.

Each good is characterized by its date of delivery
$t \ge 0$ and is contingent on some state of the world
$\omega$. I pay now, and I get the good delivered at time $t$ if
$\omega$ occurs (so, in effect, one can insure oneself against
any possible event). We start from an initial allocation
(sharing) of the global resources: each individual $n$ in
the economy starts with a bundle $x_n \in \mathbb{R}^D$, so that
$\sum_n x_n = x$. However, this initial allocation may not be
Pareto optimal, hence the need for trading. The main
result of Arrow and Debreu states that, provided utilities
are concave and production sets are convex,
there is an equilibrium, namely, a price system such
that all markets clear (demand equals supply for
each good). The main mathematical tool used to prove
the Arrow-Debreu result is the Brouwer fixed-point
theorem, usually in its multivalued variant, the Kakutani
fixed-point theorem. Of course, the equilibrium that is
reached depends on the initial allocation.
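In very small examples an equilibrium price can be computed directly. The sketch below assumes a static exchange economy with two goods and two consumers, no production and no uncertainty, and Cobb-Douglas preferences in which consumer $n$ spends the fraction $a_n$ of wealth on good 1; these assumptions and all names are illustrative. Good 2 serves as numeraire ($p_2 = 1$), and the price of good 1 is found by bisection on the excess demand for good 1.

```python
def demand_good1(p1, a, endowment):
    """Cobb-Douglas demand for good 1 with good 2 as numeraire (p2 = 1)."""
    e1, e2 = endowment
    wealth = p1 * e1 + e2
    return a * wealth / p1

def excess_demand_good1(p1, agents):
    """Total demand for good 1 minus total endowment of good 1."""
    return sum(demand_good1(p1, a, e) - e[0] for a, e in agents)

def equilibrium_price(agents, lo=1e-6, hi=1e6, iters=200):
    """Bisection on the excess demand for good 1, which is decreasing in p1
    for these preferences."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess_demand_good1(mid, agents) > 0:
            lo = mid      # good 1 is too cheap: raise its price
        else:
            hi = mid
    return 0.5 * (lo + hi)

# (share of wealth spent on good 1, (endowment of good 1, endowment of good 2))
agents = [(0.6, (1.0, 0.0)), (0.3, (0.0, 1.0))]
p1 = equilibrium_price(agents)
print(p1)
```

For these endowments the market-clearing condition can also be solved by hand, giving $p_1 = 0.3 / 0.4 = 0.75$. Changing the endowments changes the equilibrium, in line with the remark that the equilibrium depends on the initial allocation.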

Note that the theory is not entirely satisfactory, even 
within its own set of assumptions. On the one hand, 
there may be several equilibria, and the theory has 
nothing to say about which one is chosen. On the other, 
the theory is purely static: it does not say how an equi- 
librium is reached from a nonequilibrium situation. 


Mathematicians have expended much effort on find- 
ing dynamics that would always converge to an equilib- 
rium, and economists have tried to find procedures by 
which market participants would infer the equilibrium 
prices, but progress has so far been limited. 

4 How Markets Fail 

The idea that markets are always right has permeated 
all economic policy since the 1970s, and it inspired 
the wave of deregulation that has swept through the 
U.S. and U.K. economies since the Reagan era. Eco- 
nomic theory, though, does not support this view. The 
strength of the market mechanism lies in the first and 
second welfare theorems, which state that every equi- 
librium is an efficient allocation of resources and that, 
conversely, every efficient allocation of resources can 
be achieved as a no-trade equilibrium (if it is chosen 
as the initial allocation, it is also an equilibrium allo- 
cation for an appropriate price system). However, the 
conditions required for these theorems to hold, and 
indeed for all of Arrow-Debreu theory, are very rarely 
achieved in practice; in fact, I cannot think of a single 
real-world market in which they are. A huge part of eco- 
nomic theory is devoted to studying the various ways 
in which markets fail and to how they can be fixed. 

First of all, markets may exist and not be competitive. 
All major markets (weapons, oil, minerals, cars, phar- 
maceutical firms, operating systems, search engines) 
are either monopolistic or oligopolistic, meaning that 
producers charge noncompetitive prices, so the result- 
ing allocation of resources is not efficient. In addition, 
Arrow-Debreu theory eliminates uncertainty by assum- 
ing that one can insure oneself against any future event, 
and that is not the case. Note also that classical eco- 
nomic theory assumes that consumption goods are 
homogeneous (a Coke is a Coke is a Coke), whereas 
many actually differ in subtle ways (Romanée-Conti is
not just another wine, and the 2000 vintage is not the 
same as the 2001 vintage). Prices then have to take 
quality into account, and consumers tend to behave 
differently: if you become richer, you do not necessar- 
ily buy more wine, but you certainly buy better wine. 
Such markets are called hedonic, and interesting con- 
nections have been found with the mathematical theory 
of optimal transportation. 

Second of all, markets do not price externalities. An 
externality occurs when the fact that I consume some- 
thing affects you. For instance, the competitive market 
price for cigarettes would take into account the demand 
of smokers and the supply of tobacco manufacturers
but not the inconvenience to nonsmokers nor the dam- 
age to their health, resulting in an inefficient alloca- 
tion of resources. To restore efficiency, one would have 
to charge the smokers for smoking and transfer the 
money to nonsmokers, a nonmarket mechanism. Mar- 
kets do not price public goods either. A public good is, 
according to an old saying, “like a mother's love: every- 
one has a share, and everyone enjoys it fully.” More 
precisely, it is a good that you can consume even if 
you did not pay for it or contribute to producing it. 
Sunshine, clean air, and a temperate climate are public 
goods: even the worst polluter can enjoy them. Other 
goods, like roads, education, or health, can be either 
public or private: if they are free, they are public goods; 
if you put a toll on a road, if you charge tuition fees, or 
if you have to pay for medical treatment, they become 
private goods. The problem with making them private 
is that they have large positive externalities (the road 
supports economic activity, having an educated work- 
force is good for business, living with healthy individu- 
als reduces one’s chances of falling ill) that the market 
will not price, and the resulting situation will therefore 
be inefficient. 

5 Asymmetry of Information 

Another imperfection of the Arrow-Debreu model is
that it assumes full information on the goods that are
traded: a loaf of bread is a loaf of bread; there can
hardly be a surprise. However, there are many situ- 
ations in which information is asymmetric: typically, 
the seller, who already owns the good, or who has 
produced it, knows more about it than the buyer. Is 
that important? For a long time it was thought that 
only marginal adjustments to the Arrow-Debreu model 
would be required to take care of asymmetry of infor- 
mation, but then a seminal 1970 paper by Akerlof 
showed that, in fact, this asymmetry is so important 
that the Arrow-Debreu model breaks down entirely. 

The Akerlof example is so important that I will sketch 
it here. Consider a population of $2N$ people. Half of
them have a used car and want to sell it; the other half
do not have a car and want to buy one. So there are $N$
cars for sale. The quality of each car is modeled by a
number $x$. A car of quality $x$ is worth $x$ to the seller,
and $\tfrac{3}{2}x$ to the buyer.

The Arrow-Debreu situation is the case when buyer
and seller both know the value of x. Cars of different
quality will then be considered as different goods and
traded at different prices: quality x will trade at any
price p ∈ [x, (3/2)x]. The market then functions and all
cars are sold. However, this is not realistic: as we all
know, the seller, who has used the car for some time,
knows more about the quality than the buyer, who will
only get to drive it around the block. Suppose, then, that
the seller knows x but all the buyer knows is its probability
law, say that x is uniformly distributed between 0
and 1. Since buyers cannot distinguish between them,
all cars must have the same price p. At that price, only
owners of cars with x ≤ p are willing to sell, so if the
buyer buys, the expected quality he gets is x̄ = p/2, which
to him is worth (3/4)p < p, which is less than he paid. So
no one will buy, at any price, and the market simply 
does not function: there are people willing to buy, peo- 
ple willing to sell, both would be better off trading, and 
yet trade does not occur. 
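
The unraveling argument can be checked numerically. The following is a minimal sketch, assuming that a car of quality x is worth 3x/2 to the buyer and that, at a common price p, only owners with x ≤ p offer their cars; the function names and the Monte Carlo setup are illustrative and do not come from the text.

```python
import random


def expected_quality_on_offer(price, n_cars=100_000, seed=0):
    """Monte Carlo estimate of the mean quality of the cars offered at `price`.

    Assumptions: quality x is uniform on [0, 1], and an owner offers the car
    only if the price covers what the car is worth to him (x <= price).
    """
    rng = random.Random(seed)
    offered = [x for x in (rng.random() for _ in range(n_cars)) if x <= price]
    return sum(offered) / len(offered) if offered else 0.0


def buyer_surplus(price, buyer_factor=1.5):
    """Expected value of a purchased car to the buyer, minus the price paid."""
    return buyer_factor * expected_quality_on_offer(price) - price


if __name__ == "__main__":
    for p in (0.2, 0.5, 0.8, 1.0):
        print(f"price {p:.1f}: expected buyer surplus {buyer_surplus(p):+.3f}")
```

Under these assumptions the estimated surplus is negative at every price tried, which is the numerical counterpart of the statement that no one will buy.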

The lesson from the Akerlof example is that compet- 
itive markets cannot function when there is asymme- 
try of information. Since that paper, economists have 
sought alternative procedures to allocate resources, 
but at the time of writing, after many years of effort, 
dealing with groups of three or more agents at once 
under asymmetry of information is still beyond reach. 
The standard model therefore deals with only two: the 
principal and the agent. The agent has information that 
the principal does not have, but the principal knows 
that the agent has that information. Trading proceeds 
as follows: the principal makes an offer (the contract) 
to the agent, and the agent takes it or leaves it (no 
bargaining). 

Contract theory is divided into two main branches: 
adverse selection and moral hazard. Adverse selection 
occurs when the agent has a characteristic that he or 
she knows but can do nothing about and that the prin- 
cipal would dearly like to know but does not. This is 
typically the case in insurance, but it occurs in other 
situations as well. If I am looking for health insurance, 
insurers would like to know as much as possible about 
my medical condition and my genetic map, which may 
be precisely what I do not want them to know. Air- 
lines provide basically the same service to all passen- 
gers (leaving from point A at time t and arriving at 
point B at time T), and they would like to know how 
much each passenger is willing to pay. Since they can- 
not access that information, they separate economy 
class from business class in the hope that customers 
will sort themselves out. Moral hazard occurs when the 
principal contracts the agent to perform an action that 
he or she, the principal, cannot monitor. The contract 
then has to be devised in such a way that the agent finds
it in his or her own interest to comply, even if noncom- 
pliance cannot be observed and therefore cannot be 
punished. This is typically the case in the finance indus- 
try. If I entrust my wealth to a money manager, how am 
I to know that he or she is competent, or even that he 
or she works hard? The track record of the financial 
industry is not encouraging in that respect. The solu- 
tion, if you can afford it (this is what happens in hedge 
funds, for instance), is to give the money manager such 
a large part of your earnings that your interests become 
aligned with theirs. It would be even better to have 
him or her share the losses, but this is prohibited by 
legislation (limited liability rules). 

Informational asymmetry has spurred many math- 
ematical developments. Adverse selection translates 
into problems in the calculus of variations with con- 
vexity constraints. A typical example is the following, 
due to Rochet and Choné. Given a square Q = [a, b]^2,
with 0 < a < b, and a number c > 0, solve the problem

\[
\min \int_Q \Bigl( \frac{c}{2} |\nabla u|^2 - x \cdot \nabla u + u \Bigr) \, dx,
\qquad u \in H^1, \quad u \geq 0, \quad u \text{ convex}.
\]

If it were not for the convexity constraint, this would 
be a straightforward obstacle problem. As it is, we have 
existence, uniqueness, and C^1-regularity, but we do not
yet have usable optimality conditions. 

For a long time, there was no comparable progress on 
moral hazard problems, but in 2008 there was a break- 
through by Sannikov, who showed how to solve them 
in a dynamic setting. His basic model is quite simple. 
Consider a stream of revenue 

\[
dX_t = a_t \, dt + \sigma \, dB_t,
\]

where σ is a constant and B_t is Brownian motion. The
revenue accrues to the principal, but a_t depends on
the agent. More precisely, at each time t the agent
decides to perform an effort a_t ≥ 0, at a personal cost
h(a_t), with h(0) = 0 and h increasing with a. The
effort is unobservable by the principal; all he or she
can observe is dX_t, so that the agent’s effort is hidden
in the randomness. The problem is then for the principal
to devise a compensation scheme for the agent,
depending only on the past history of X_t, that will
maximize the principal’s profit. Sannikov solves it by a clever
use of the martingale representation theorem, thereby 
opening up a new avenue for research. 
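
To see how the effort is hidden in the randomness, the revenue stream can be simulated. This is only a sketch of the revenue equation, not of Sannikov's solution: it assumes a deterministic effort path supplied as a function of time, and the name `simulate_revenue` and all parameter values are illustrative.

```python
import math
import random


def simulate_revenue(effort, sigma, horizon=1.0, n_steps=50, rng=None):
    """Simulate dX_t = a_t dt + sigma dB_t from X_0 = 0; return X at `horizon`.

    `effort` is a function t -> a_t standing in for the agent's hidden effort.
    Brownian increments are drawn as sqrt(dt) times a standard normal.
    """
    if rng is None:
        rng = random.Random()
    dt = horizon / n_steps
    x = 0.0
    for k in range(n_steps):
        t = k * dt
        x += effort(t) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


if __name__ == "__main__":
    rng = random.Random(1)
    hard = [simulate_revenue(lambda t: 1.0, 1.0, rng=rng) for _ in range(5)]
    idle = [simulate_revenue(lambda t: 0.0, 1.0, rng=rng) for _ in range(5)]
    print("effort 1:", [round(x, 2) for x in hard])
    print("effort 0:", [round(x, 2) for x in idle])
```

With σ comparable to the effort level, individual paths for high and zero effort overlap heavily, so a single observed revenue path does not reveal a_t; only the average over many paths does.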

6 Conclusion 

I hope this brief survey has shown the breadth of mathematics
used in modern economic theory: optimal
control, both deterministic and stochastic; diffusion
theory and Malliavin calculus; exterior differential cal- 
culus; optimal transportation; and the calculus of varia- 
tions. Economic theory is still trying to assimilate all the 
consequences of informational asymmetry. This rev- 
olution is far from over; the theory will probably be 
very different in 2050 from what it is today. These are 
exciting times.

VII.21 Mathematical Neuroscience

Bard Ermentrout 


1 Introduction 

Mathematical neuroscience is a branch of applied math- 
ematics and, more specifically, an area of nonlinear 
dynamics concerned with the analysis of equations that 
arise from models of the nervous system. The basic 
units of the nervous system are neurons, which are cells 
that reside in the brain, spinal cord, and throughout 
many other regions of the bodies of animals. Indeed, 
the presence of neurons is what separates animals from 
other living organisms such as plants and fungi. Neu- 
rons and their interactions are what allow organisms 
to respond to a rapidly changing and unpredictable 
environment. The number of neurons in an organism 
ranges from just a few hundred in the nematode (a 
small worm) to around 100 billion in the human brain. 
Most neurons are complex cells that consist (roughly
speaking) of a cell body, an axon, and dendrites. A typ-
ical neuron will generate transient changes in its mem- 
brane potential on the order of 100 millivolts (called 
action potentials) that communicate electrochemically 
with other neurons through synapses. Neurons are typ- 
ically threshold devices, so that if their membrane 
potential exceeds a particular value, they will “fire” an 
action potential. The synapses from one neuron can 
be either excitatory or inhibitory: either they promote 
action potentials in the receiving neuron or they pre- 
vent them. There are about 100 trillion synapses in the 
human brain. The properties of individual neurons are 
reasonably well characterized as are the dynamics of 
individual synapses. However, the behavior of even a 
single neuron can be very complex and so, of course, is 
the behavior of even a small piece of brain. 

The cell membrane of a neuron is what separates 
the neuron from the surrounding medium, and it is 
through this membrane that the complex firing pat- 
terns of neurons are mediated. Various ions such as 
sodium, potassium, calcium, and chloride are con- 
tained within the neuron and in the space between the 
neurons. The membrane is dotted with thousands of 
proteins called channels that selectively allow one or 
more ions to pass into and out of the cell. By controlling 
the instantaneous permeability of specific channels, 
the cell is able to create large transient fluctuations 
in its potential (the above-mentioned action potential). 
While perhaps not at the level of Newton’s laws or the 
Navier-Stokes equations for fluids, there are standard 
models for the neuronal membrane, for the propaga- 
tion of action potentials down axons, for the dynamics 
of synapses, and for how all this information is inte- 
grated into the cell body. Mathematical neuroscience 
analyzes the dynamics of these models using various 
tools from applied mathematics. 

2 Single-Cell Dynamics 

The modeling of a single neuron ranges from a simple 
scalar ordinary differential equation (ODE) to a system 
of many coupled partial differential equations (PDEs). 
For simulation purposes, the latter model is simplified 
to a large number of coupled ODEs representing the 
different parts of the neuron, which are called compart- 
ments. For example, the soma or cell body (see figure 1) 
might be represented by one compartment, while the 
dendrites and possibly the axon might be broken into 
dozens of compartments. A neuron represented by a 
single compartment is called a point neuron and that
is where we will focus our attention. A point neuron
is modeled by ODEs representing the permeabilities 
of one or more channels whose properties have been 
experimentally measured. The easiest way to think of 
a single compartment is through an equivalent circuit 
(see figure 1) defined by the transmembrane voltage, 
V(t), the membrane capacitance, C, and the conductances
of the various channels, g_j(t). The first neuronal
membrane to be mathematically modeled was that of 
the squid giant axon, for which Hodgkin and Hux- 
ley received a Nobel Prize. The differential equations 
represented by the figure have the form 

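\[
\begin{aligned}
C \frac{dV}{dt} &= -\bar{g}_{\mathrm{K}} n^4 (V - E_{\mathrm{K}}) - g_{\mathrm{L}} (V - E_{\mathrm{L}})
  - \bar{g}_{\mathrm{Na}} m^3 h (V - E_{\mathrm{Na}}) + I(t), \\
\frac{dn}{dt} &= \frac{n_\infty(V) - n}{\tau_n(V)}, \\
\frac{dm}{dt} &= \frac{m_\infty(V) - m}{\tau_m(V)}, \\
\frac{dh}{dt} &= \frac{h_\infty(V) - h}{\tau_h(V)}.
\end{aligned}
\tag{1}
\]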


Here, I(t) represents either the electrical current that 
the experimenter can inject into the neuron or the cur- 
rent from synapses from other neurons (see below). 
The functions n_∞(V), etc., were measured through very
clever experiments and then fitted to particular functions.
The constants g_L, E_L, etc., were also experimentally
determined. While Hodgkin and Huxley (HH) did
this work nearly sixty years ago, the basic formulation
has not changed, and the thousands of papers devoted
to the modeling of channels and neurons follow this
general format: for each compartment in the representation
of the neuron, there is a voltage, V(t), and auxiliary
variables such as n(t), m(t), h(t) whose time evolution
depends on V. One goal in mathematical neuroscience
is to characterize the dynamics of these nonlinear
ODEs as some of the parameters vary. In particular,
a typical experiment is to inject a constant current, I,
into the neuron and to then look at its dynamics. It is
also possible to pharmacologically block or reduce the
magnitude of the channels, so that the parameters ḡ_K,
ḡ_Na, etc., may also be manipulated. The most common
way to analyze the dynamics of equations like (1) is to
compute their fixed points and then study their stability
by linearizing about a steady state. For systems like the
HH equations, the equilibria are easy to find by solving
for the variables m, h, n as functions of voltage, obtaining
a single equation of the form I = F(V), where I is
the constant current. Plotting V against I allows one
to identify all the equilibria. Whether or not the equilibria
are stable is determined by the eigenvalues of
the 4 × 4 Jacobian matrix. This procedure allows one
to find bifurcations [IV.21] to limit cycles through
the Andronov-Hopf bifurcation. In general, numerical
methods are used to find the solutions and their stability.
Figure 2(a) shows the bifurcation diagram for the HH
equations as the applied current varies. There seems to be
only a single equilibrium point, which loses stability via
an Andronov-Hopf bifurcation and spawns a family of
periodic orbits. The stable periodic solutions represent
repetitive firing of the neuron. Note that there is a range
of currents (shown by the dashed lines) for which a stable
equilibrium and a stable periodic orbit coexist. This
bistability was experimentally confirmed in the 1970s.

Figure 1 Parts of neurons. Three neurons are shown, with
the input (dendrites), cell body, and output (axon) labeled
on one neuron. Dashed arrows show the direction of
information flow. The two insets show a magnified schematic of
the cell membrane and its equivalent electrical circuit.
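
The steady-state curve I = F(V) and the repetitive firing described above can be explored with a short numerical sketch. The text does not give the fitted functions or constants, so the code below assumes the commonly quoted squid-axon values (rest shifted to about -65 mV), writes each gating equation in the equivalent form dn/dt = α_n(V)(1 - n) - β_n(V)n with n_∞ = α_n/(α_n + β_n), and uses plain forward Euler; all names are illustrative.

```python
import math

# Assumed parameters (not given in the text): commonly quoted squid-axon
# values in mV, ms, mS/cm^2, uF/cm^2 and uA/cm^2.
C = 1.0
G_NA, G_K, G_L = 120.0, 36.0, 0.3
E_NA, E_K, E_L = 50.0, -77.0, -54.387


def _ratio(x, scale):
    """x / (1 - exp(-x / scale)), continued by its limit `scale` at x = 0."""
    if abs(x) < 1e-7:
        return scale
    return x / (1.0 - math.exp(-x / scale))


def rates(v):
    """Opening (alpha) and closing (beta) rates of n, m, h at voltage v."""
    a_n = 0.01 * _ratio(v + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    a_m = 0.1 * _ratio(v + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    return (a_n, b_n), (a_m, b_m), (a_h, b_h)


def steady_state(v):
    """(n_inf, m_inf, h_inf) at voltage v."""
    return tuple(a / (a + b) for a, b in rates(v))


def steady_state_current(v):
    """F(V): the constant current for which V is an equilibrium of (1)."""
    n, m, h = steady_state(v)
    return (G_K * n**4 * (v - E_K) + G_L * (v - E_L)
            + G_NA * m**3 * h * (v - E_NA))


def count_spikes(current, t_end=100.0, dt=0.01, v0=-65.0):
    """Forward-Euler integration of (1); counts upward crossings of 0 mV."""
    v = v0
    n, m, h = steady_state(v0)
    spikes = 0
    for _ in range(int(t_end / dt)):
        (a_n, b_n), (a_m, b_m), (a_h, b_h) = rates(v)
        dv = (-G_K * n**4 * (v - E_K) - G_L * (v - E_L)
              - G_NA * m**3 * h * (v - E_NA) + current) / C
        n += dt * (a_n * (1.0 - n) - b_n * n)
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        v_new = v + dt * dv
        if v < 0.0 <= v_new:
            spikes += 1
        v = v_new
    return spikes


if __name__ == "__main__":
    print("F(-65 mV) =", round(steady_state_current(-65.0), 4))
    for i_app in (0.0, 15.0):
        print(f"I = {i_app}: {count_spikes(i_app)} spikes in 100 ms")
```

Tabulating `steady_state_current` over a range of voltages gives the equilibrium curve; the stability analysis via the Jacobian is not attempted here.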

The other common approach to the analysis of equa- 
tions like (1) is to exploit the differences in timescales. 
For example, the dynamics of V, m are very fast com- 
pared with the dynamics of n, h, so it becomes pos- 
sible to use singular-perturbation methods to analyze 
the dynamics. Because of this difference in timescales, 
even the simple HH model can show complex dynam- 
ics. For example, figure 2 shows that reducing the rate 
of change of h by half results in a completely dif- 
ferent type of dynamics (compare the top and bot- 
tom plots of the voltage, V). By treating this as a 
singularly perturbed system, it is possible to explain 
the complex spiking patterns seen in the HH model.

Figure 2 (a) A bifurcation curve for the HH model as
the applied current varies, showing stable equilibria (SE), 
unstable equilibria (UE), stable and unstable periodic orbits 
(SPO/UPO), and the Hopf bifurcation (HB). (b) Small changes 
in the rate of change of h have drastic effects on the dynam- 
ics. (c) Bursting behavior in a simple model where V(t) is 
the fast voltage dynamics (black line) and Ca(t) is the slow 
calcium dynamics (represented by the gray line).

Singular perturbation also plays an important role in
the study of so-called bursting dynamics, where the 
voltage shows periods of high-frequency spiking fol- 
lowed by periods of quiescence; an example of this 
can be seen in part (c) of the figure. To understand 
the dynamics of, say, the bursting, we exploit the 
differences in timescales to write the system as 
\[
\frac{dx}{dt} = f(x, y), \qquad \frac{dy}{dt} = \varepsilon g(x, y),
\]

where x represents the fast variables (typically, voltage 
and possibly other variables) and y represents the slow 
variable(s) (typically, a slow current). One then looks at 
the dynamics of x treating y as a constant parameter. 
This leads to parametrized dynamics, x(t;y), which 
is then substituted into the y equation before aver- 
aging is performed (if x is time-periodic) to resolve 
the dynamics of y. If the x dynamics is bistable, say, 
between a limit cycle and a fixed point, then, as y 
slowly changes (at the timescale εt), the x dynamics
can vary between equilibrium behavior and oscillations, 
thus producing the burst. Figure 2(c) shows the slow 
dynamics y (the thicker gray line, representing the slow 
change in calcium concentration of the model) and the 
fast dynamics x (the thinner black line, representing 
voltage) for a simple bursting model. 

3 Networks 

The complexity of the dynamics of even a simple cell 
dictates that some type of simplification is necessary 
in order to study large networks of neurons. There 
are two commonly taken paths: either one uses sim- 
pler models for individual neurons or one takes “mean- 
field” approaches, where neurons are approximated by 
an abstract variable called the firing rate, or activity. 
In the former case, a classic model for simplifying the 
dynamics of a single neuron is the leaky integrate-and- 
fire model, in which τV' = −V + V_0, along with the
condition that, if V(t^-) = V_1, then V(t^+) = V_2, where
V_1 > V_2. Each reset event results in a simulated action
potential or spike of the neuron. (While the dynamics 
of this class of models is quite simple, their discontin- 
uous nature presents many mathematical challenges.) 
With this model, a general network has the form 

\[
\tau_i \frac{dV_i}{dt} = -V_i + U_i(t) - \sum_j w_{ij}(t) (V_i - E_j), \tag{2}
\]

where E_j, τ_i are constants, U_i(t) are inputs (often
consisting of broadband random signals), and w_ij(t) is
some prescribed function of the time, t, since neuron j
fired that decays to zero for large positive t and is zero
for t < 0. For example, w_ij(t) = W_ij (t − t_j)^+ exp(−(t − t_j)^+),
where x^+ is the positive part of x and W_ij is a
constant. If E_j < 0 (E_j > 0), then the synapse is called
inhibitory (excitatory). If w_ij(t) depends only on i − j,
then we can regard the network as being spatially distributed.
In this case, various patterns such as traveling
waves and spatially structured dynamics are possible.
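
For a single uncoupled cell, the leaky integrate-and-fire rule described above is easy to simulate and to check against its closed-form period. The sketch below assumes a constant drive V_0 above the threshold V_1 and no synaptic input; the argument names `v0`, `v1`, `v2` mirror V_0, V_1, V_2, and the step size is an arbitrary choice.

```python
import math


def lif_spike_times(v0, v1, v2, tau=1.0, t_end=10.0, dt=1e-4):
    """Forward-Euler integration of tau V' = -V + v0 with reset v1 -> v2."""
    v = v2
    t = 0.0
    spikes = []
    while t < t_end:
        v += dt * (-v + v0) / tau
        t += dt
        if v >= v1:  # threshold reached: record the spike and reset
            spikes.append(t)
            v = v2
    return spikes


def lif_period(v0, v1, v2, tau=1.0):
    """Exact interspike interval tau * ln((v0 - v2) / (v0 - v1)) for v0 > v1."""
    if v0 <= v1:
        return math.inf  # the voltage relaxes to v0 and never fires
    return tau * math.log((v0 - v2) / (v0 - v1))


if __name__ == "__main__":
    times = lif_spike_times(1.5, 1.0, 0.0)
    gaps = [b - a for a, b in zip(times, times[1:])]
    print("simulated interval:", round(gaps[0], 4))
    print("exact interval:    ", round(lif_period(1.5, 1.0, 0.0), 4))
```

The period formula follows from solving the linear equation between resets; the discontinuity at threshold is what the simulation handles with the explicit reset.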
The analysis of these networks is difficult, but meth- 
ods from statistical physics have been applied when 
the number of neurons becomes large and there is no 
structure in the connectivity. With certain assumptions 
about the connectivity patterns and strengths, the fir- 
ing of individual neurons in (2) is chaotic and the corre- 
lations between the firing times of neurons are nearly 
zero. This state is called the asynchronous state, and in 
this case it makes sense to define a population firing 
rate, r(t). As the number of neurons in a given population
(say, excitatory cells) tends to infinity, the population
rate, r(t)Δt, is defined as the fraction of the
population that has fired an action potential between t
and t + Δt. One of the important questions in theoretical
neuroscience is to relate r(t) to the dynamics of the
network of complex spiking neurons and, thus, derive 
the dynamics of r(t). One can use various approaches 
to reduce scalar models such as (2) to a PDE for the dis- 
tribution of the voltages that can have discontinuities 
and has the form 

The firing rate is proportional to the flux through the 
firing threshold, V_1. This PDE can be very difficult to
solve; if there are several different populations, then 
the same number of coupled PDEs must be solved. 
Thus, there have been many attempts at arriving at sim- 
plified models for the rate equations. Here, we offer a 
heuristic description of a rate model for several interacting
populations of neurons. Let I_p(t) be the current
entering a representative neuron from population
p. If I_p is large enough, the neuron will fire at a rate
u_p(t) = F_p(I_p(t)) (see, for example, figure 2, which
gives the firing rate of the HH neuron as the current
changes). Each time a neuron from population p fires,
it will produce current in population q, so that the new
current into q from p is

\[
I_{qp}(t) = \int_0^\infty \eta_{qp}(t') \, u_p(t - t') \, dt'.
\]

η_qp is positive for excitatory currents and negative for
inhibitory ones. Dale's principle states that the sign of
η_qp depends only on p. Summing up the currents from
all populations gives us a closed system:

\[
I_q(t) = \sum_p \int_0^\infty \eta_{qp}(t') \, F_p(I_p(t - t')) \, dt'. \tag{3}
\]

Often, an equation for the rates is desired, so that 

\[
u_q(t) = F_q\Bigl( \sum_p \int_0^\infty \eta_{qp}(t') \, u_p(t - t') \, dt' \Bigr).
\]

In either case, the result is a set of coupled integro- 
differential equations. If the functions η_qp(t) are sums
of exponentials, then (3) can be inverted to create an
ODE. For example, if η_qp(t) = W_qp exp(−t/τ_q)/τ_q,
then (3) becomes

\[
\tau_q \frac{dI_q}{dt} = -I_q + \sum_p W_{qp} F_p(I_p). \tag{4}
\]

In the case where W_qp is symmetric, it is easy to
construct a Lyapunov function for this system and
prove that all solutions converge to fixed points. Networks
such that the sign of W_qp depends only on p
are called excitatory-inhibitory networks, and they also
obey Dale’s principle. If W_qp > 0 (W_qp < 0), then population
p is called excitatory (inhibitory). Networks with
both excitatory and inhibitory populations have been 
shown to exhibit complex dynamics such as limit cycle 
oscillations and chaos. 
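
The convergence to a fixed point in the symmetric case can be observed directly by integrating (4). The sketch below is illustrative only: it assumes a logistic gain F for every population, two excitatory populations with a symmetric weight matrix, and forward Euler time stepping; none of these choices comes from the text.

```python
import math


def sigmoid(x):
    """Assumed gain function F; the text leaves F_p unspecified."""
    return 1.0 / (1.0 + math.exp(-x))


def rate_rhs(currents, weights, taus, gain=sigmoid):
    """Right-hand side of (4): dI_q/dt = (-I_q + sum_p W_qp F(I_p)) / tau_q."""
    rates = [gain(i) for i in currents]
    return [(-currents[q] + sum(w * r for w, r in zip(weights[q], rates)))
            / taus[q]
            for q in range(len(currents))]


def integrate(currents, weights, taus, t_end=100.0, dt=0.01):
    """Forward-Euler integration of (4) from the given initial currents."""
    state = list(currents)
    for _ in range(int(t_end / dt)):
        rhs = rate_rhs(state, weights, taus)
        state = [s + dt * r for s, r in zip(state, rhs)]
    return state


if __name__ == "__main__":
    weights = [[2.0, 1.0], [1.0, 2.0]]  # symmetric, all excitatory
    taus = [1.0, 1.0]
    final = integrate([0.0, 5.0], weights, taus)
    residual = max(abs(r) for r in rate_rhs(final, weights, taus))
    print("final currents:", [round(i, 4) for i in final])
    print("residual of (4):", residual)
```

Replacing one column of the weight matrix by negative entries gives an excitatory-inhibitory network in the sense above, for which convergence is no longer guaranteed.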

4 Spatially Extended Networks 

Equations of the form (4) have an obvious extension to 
spatially distributed populations: 

\[
\tau_e \frac{\partial I_e(x,t)}{\partial t} = W_{ee}(x) * F_e(I_e(x,t)) - W_{ei}(x) * F_i(I_i(x,t)) - I_e(x,t), \tag{5}
\]
\[
\tau_i \frac{\partial I_i(x,t)}{\partial t} = W_{ie}(x) * F_e(I_e(x,t)) - W_{ii}(x) * F_i(I_i(x,t)) - I_i(x,t), \tag{6}
\]

where W(x) * U(x) is a convolution over the spatial
domain of the network (often the infinite line or plane;
or on a circle or torus), and I_e, I_i are the respective
excitatory and inhibitory currents for the neuron at
position x. The functions W_qp(x) generally depend only on
|x| and decay rapidly for large arguments. These models
have been used to explain spatiotemporal patterns
of activity found in experiments. While the equations 
represent a great simplification, they have been quite 
successful in explaining various experimental findings. 

4.1 Wavefronts

Figure 3 Three examples of spatial activity in a population
model: (a) a traveling wavefront, (b) a stationary pulse, and
(c) spatially periodic activity.

When inhibitory circuits in the brain are damaged or
pharmacologically blocked, activity can pathologically
spread across your cortex. Experiments have shown
that this activity takes the form of a traveling wave.
A natural starting point for the study of equations of
the form (5), (6) is, therefore, to consider models with
no inhibitory population. In this case, it is possible to
prove the existence of traveling fronts under certain
circumstances. In particular, let w_ee = ∫_R W_ee(x) dx
and

(i) suppose that g(u) = −u + w_ee F_e(u) has three
positive roots u_1 < u_2 < u_3,

(ii) let W_ee(r) be a decreasing nonnegative function for
r ≥ 0, and

(iii) assume that g'(u_1) < 0, g'(u_3) < 0, and g'(u_2) > 0.

There is then a unique constant-speed wavefront joining
u_1 with u_3. This wave is analogous to the wave seen
in the scalar bistable reaction-diffusion equation.
Figure 3(a) shows an example of a traveling wave when W_ee
is an exponentially decaying function of space.
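
A front of this kind can be reproduced with a small simulation of the excitatory-only equation τ ∂I/∂t = −I + W_ee * F_e(I). The sketch below is a rough illustration under several assumptions not made in the text: a steep logistic F_e with threshold 0.4, a kernel proportional to exp(−|x|) truncated and normalized so that w_ee = 1, a finite grid with values extended as constants beyond the ends, and forward Euler time stepping.

```python
import math


def firing_rate(u, threshold=0.4, steepness=20.0):
    """Assumed sigmoidal F_e; the text does not fix a specific form."""
    return 1.0 / (1.0 + math.exp(-steepness * (u - threshold)))


def front_position(current, dx, level=0.5):
    """Length of the region where the current exceeds `level`."""
    return dx * sum(1 for u in current if u > level)


def simulate_front(n=100, dx=0.25, dt=0.1, t_end=30.0, tau=1.0, reach=20):
    """Integrate tau dI/dt = -I + W_ee * F_e(I) on a one-dimensional grid.

    Returns the front position at the start and at the end of the run.
    """
    offsets = range(-reach, reach + 1)
    kernel = [math.exp(-abs(k) * dx) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]  # discrete weights sum to w_ee = 1
    # Initial condition: active (I = 1) for x < 5, quiet elsewhere.
    current = [1.0 if i * dx < 5.0 else 0.0 for i in range(n)]
    start = front_position(current, dx)
    for _ in range(int(t_end / dt)):
        rate = [firing_rate(u) for u in current]
        new = []
        for i in range(n):
            # Indices are clamped, so end values are extended as constants.
            drive = sum(w * rate[min(max(i + k, 0), n - 1)]
                        for k, w in zip(offsets, kernel))
            new.append(current[i] + dt * (-current[i] + drive) / tau)
        current = new
    return start, front_position(current, dx)


if __name__ == "__main__":
    start, end = simulate_front()
    print(f"front moved from x = {start:.2f} to x = {end:.2f}")
```

With the threshold below one half, the active state invades the quiet one; recording the front position at several times would give an estimate of the wave speed.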

4.2 Stationary Pulses 

Suppose that you are asked to remember the location 
of a flash of light for a short period of time, and after 
that time you must move your eyes to the place where 
you recalled the light. This form of memory is called 
short-term or working memory. By making recordings 
of the activity in certain brain regions of monkeys while 
they perform such a task, experimentalists have shown 
that the neural correlate of working memory consists of 
persistent neural activity in a spatially localized region. 
In the context of the model equations (5), (6), working
memory is represented by a stationary spatially
localized region of activity. Solutions to (5) can be
constructed with the assumptions that W_ei(x) = 0, F_e(I)
is a step function, and W_ee(x) has the lateral inhibition
property. That is, let q(x) = ∫_0^x W_ee(y) dy. The
lateral inhibition property states that q(x) is increasing
up to x = a > 0 and then decreases, approaching
a finite value as x → ∞. While it is possible
to relax the assumptions on the shape of F to some
extent, there is still not a general existence theorem
for stationary pulses for this class of equations.
Figure 3(b) shows a simulation of (5) when W_ee(x) is
the difference between two exponential functions. One
can formally construct stationary pulse solutions to
the full excitatory-inhibitory network, (5), (6), using
singular-perturbation theory.

4.3 Regular Patterns 

Numerous authors have suggested that the neural ana- 
logue of simple geometric visual hallucinations is spon- 
taneous spatially periodic activity in the visual cortex. 
The connectivity of the visual system is such that under 
certain circumstances, the spatially uniform solution 
loses stability due to spatially periodic perturbations 
in a restricted range of spatial frequencies. This phenomenon
is generally known as a symmetry-breaking
bifurcation and, in biological contexts, a Turing bifur- 
cation. Figure 3(c) shows an example of such a bifurca- 
tion in equations (5), (6) in one spatial dimension. In the 
late 1970s bifurcation methods were applied to prove 
the existence of stable spatial patterns in two spatial 
dimensions that are the analogue of simple geometric 
visual hallucinations. 

5 Plasticity 

One of the hallmarks of the nervous system is that it 
is almost infinitely reconfigurable. If part of a region 
is lost due to injury, other regions will rewire their 
connections to compensate. The connection strengths 
between neurons are believed to be the physiologi- 
cal correlates of learning and memory. This ability 
to change over time and alter connection strengths 
is called synaptic plasticity; it can take several forms,
over timescales that range from fractions of a sec- 
ond to decades. For simplicity, we can divide plas- 
ticity into short-term and long-term plasticity. To 
make things concrete, we consider two neurons A and 
B, with A sending the signal (the presynaptic neu- 
ron) and B receiving the signal (the postsynaptic neu- 
ron). Short-term plasticity typically involves the weak- 
ening or strengthening of connections in a usage- 
dependent manner and depends only on the presynaptic
activity. In contrast, long-term plasticity, presumably
responsible for learning and lifetime memory,
depends on both presynaptic and postsynaptic activity.

5.1 Short-Term Plasticity 

In short-term plasticity, only the activity of the presy- 
naptic neuron matters, and two types of phenom- 
ena occur: depression and facilitation. With depression 
(respectively, facilitation), the strength of the synapse 
weakens (respectively, strengthens) with each successive
spike of neuron A and then recovers back to baseline
if there are no subsequent spikes. Let t* denote
the time of a presynaptic spike and define two variables
f(t) and q(t) corresponding to facilitation and depression,
respectively. Let τ_f and τ_d be the respective time
constants, f_0 and d_0 the starting values, and a_f and a_d
the degree of facilitation and depression. We then have

\[
\frac{df}{dt} = \frac{f_0 - f}{\tau_f} + a_f \, \delta(t - t^*) (1 - f),
\]
\[
\frac{dq}{dt} = \frac{d_0 - q}{\tau_d} - a_d \, \delta(t - t^*) \, q.
\]

Typically, f_0 = 0, d_0 = 1, with τ_d, τ_f of the order of
hundreds of milliseconds. The strength of the synapse
is then multiplied by q(t)f(t). For example, if there is
no facilitation, then f_0 = 1, a_f = 0, and if there is no
depression, d_0 = 1, a_d = 0.
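
Because the two equations are linear between spikes, they can be advanced exactly from one presynaptic spike to the next. The following sketch does this under one stated assumption: each delta-function jump is evaluated with the values of f and q just before the spike. The function name and the parameter values in the example are illustrative.

```python
import math


def short_term_plasticity(spike_times, f0, d0, a_f, a_d, tau_f, tau_d):
    """Return the factor q * f in force just before each presynaptic spike.

    Between spikes f and q relax exponentially toward f0 and d0. At a spike
    f jumps by a_f * (1 - f) and q drops by a_d * q (pre-spike values used).
    """
    f, q = f0, d0
    last = None
    factors = []
    for t in spike_times:
        if last is not None:
            gap = t - last
            f = f0 + (f - f0) * math.exp(-gap / tau_f)
            q = d0 + (q - d0) * math.exp(-gap / tau_d)
        factors.append(q * f)
        f += a_f * (1.0 - f)
        q -= a_d * q
        last = t
    return factors


if __name__ == "__main__":
    # Depression only (f0 = 1, a_f = 0); times in milliseconds.
    train = [0.0, 10.0, 20.0, 30.0, 5000.0]
    for t, s in zip(train, short_term_plasticity(
            train, f0=1.0, d0=1.0, a_f=0.0, a_d=0.3,
            tau_f=200.0, tau_d=300.0)):
        print(f"t = {t:7.1f} ms  synaptic factor = {s:.3f}")
```

In the depression-only example the factor falls during the rapid train and has recovered to its baseline by the time of the isolated late spike.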

5.2 Long-Term Plasticity 

Long-term plasticity depends on the activity of both the 
presynaptic and postsynaptic neurons and is based on 
an idea first proposed by Donald Hebb that “neurons 
that fire together, wire together.” This type of Hebbian 
plasticity is believed to be responsible for both wiring 
up the nervous system as it matures and for the cre- 
ation of new long-term memories (in contrast to the 
working memory described above). If w_BA is the weight
or strength of the connection of neuron A to neuron B
and r_A, r_B are the firing rates of the two neurons over
some period, then, typically,

\[
\frac{dw_{BA}}{dt} = \varepsilon_1 (r_A - \theta_A)(r_B - \theta_B) + \varepsilon_2 (\bar{w}_{BA} - w_{BA}), \tag{7}
\]

with the additional proviso that, if the terms multiplied
by ε_1 are both negative, the term is set to 0. This model
says that the weight will increase if both A and B are
firing at sufficiently high rates. If A is firing at a high
rate and B is firing at a low rate (or vice versa), then the
weight will decrease. Learning equations of the form
(7) are used to adjust the connectivities in rate models
such as equation (4) and are capable of encoding many
“memories” in the strength of the weights. The term
with ε_2 represents a tendency to decay to some baseline
weight.
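
The rule in (7), including the proviso, amounts to a few lines of code. This is a sketch with illustrative names; `w_base` stands for the baseline weight toward which the ε_2 term relaxes.

```python
def hebbian_rate(w, r_a, r_b, theta_a, theta_b, eps1, eps2, w_base):
    """Right-hand side of (7) for the weight from neuron A to neuron B."""
    da = r_a - theta_a
    db = r_b - theta_b
    # Proviso from the text: no Hebbian change when both factors are negative.
    hebb = 0.0 if (da < 0 and db < 0) else eps1 * da * db
    return hebb + eps2 * (w_base - w)


if __name__ == "__main__":
    common = dict(theta_a=5.0, theta_b=5.0, eps1=0.1, eps2=0.0, w_base=0.5)
    print("both high:  ", hebbian_rate(0.5, 10.0, 10.0, **common))
    print("A high only:", hebbian_rate(0.5, 10.0, 1.0, **common))
    print("both low:   ", hebbian_rate(0.5, 1.0, 1.0, **common))
```

The three cases printed correspond to the three situations described above: growth, decay, and no Hebbian change.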

Another form of Hebbian plasticity incorporates 
causality in the sense that the weight from A to B will 
be increased (decreased) if A fires slightly before (after) 
B. A typical equation for the synaptic weight takes the 
form 

\[
\frac{dw_{BA}}{dt} = F(t_B - t_A) \, g(w_{BA}) + \varepsilon_2 (\bar{w}_{BA} - w_{BA}),
\]

where t_A, t_B are the times of the spikes of neurons A and
B. F(t) is often an exponentially decaying function of |t|
and is positive for t > 0 (A fires before B) and negative
for t < 0 (B fires before A). The function g(w) can be
constant, linear in w, or of a more complicated form. 
This form of plasticity is called spike-time-dependent 
plasticity and allows a group of neurons to connect up 
sequentially and thus learn sequential tasks. 
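
A causal learning window of this kind can be sketched as follows. The convention assumed here is that the window is evaluated at t = t_B − t_A, so that A firing shortly before B gives t > 0 and an increase in the weight from A to B; the exponential shape, the amplitudes, and the time constant are illustrative choices rather than values from the text.

```python
import math


def stdp_window(t, a_plus=1.0, a_minus=1.0, tau=20.0):
    """Assumed window F(t), where t = t_B - t_A (post minus pre spike time)."""
    if t > 0:
        return a_plus * math.exp(-t / tau)    # A before B: potentiation
    if t < 0:
        return -a_minus * math.exp(t / tau)   # B before A: depression
    return 0.0


def stdp_rate(w, t_a, t_b, eps2=0.0, w_base=0.0, g=lambda w: 1.0):
    """Weight change F(t_B - t_A) g(w) plus relaxation toward `w_base`."""
    return stdp_window(t_b - t_a) * g(w) + eps2 * (w_base - w)


if __name__ == "__main__":
    print("A fires 2 ms before B:", round(stdp_rate(0.5, 10.0, 12.0), 3))
    print("A fires 2 ms after B: ", round(stdp_rate(0.5, 12.0, 10.0), 3))
```

Passing a different `g`, for instance one linear in w, gives the weight-dependent variants mentioned above.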

6 Summary 

Mathematics plays a prominent role in the under- 
standing of the behavior of the nervous system. Meth- 
ods from nonlinear dynamics have been used to 
explain both normal and pathological behavior (such 
as epilepsy, Parkinson’s disease, and schizophrenia). 
Conversely, models that arise in neuroscience provide 
a wide range of problems that are awaiting rigorous 
mathematical treatment.

1 Introduction

Biological organisms are complex interconnected systems
comprising an enormous number of components
that interact in highly choreographed ways. These
systems, which are constantly evolving, have been 
designed to carry out numerous tasks under diverse 
environmental conditions. Because of limitations in 
terms of the available experimental tools and technol- 
ogy, biologists have for many years focused primarily 
on how the individual components in these systems 
work. More recently, advancements in genomics, pro- 
teomics, metabolomics, and live tissue imaging have 
led to an explosion of data at increasingly fine spatial 
and temporal scales. In order to describe and under- 
stand our increasingly complex and interconnected 
view of the biological world, which is far more intri- 
cate than any physical system, an integrated systems 
approach is needed. 

Systems biology, which initially gained momentum 
in the 1990s, focuses on studying interactions between 
components that give rise to emergent behavior. In 
this view, a system is more than just the sum of its 
parts. To better understand such systems, an interdis- 
ciplinary approach that brings together experimental 
tools and theoretical methodologies from diverse dis- 
ciplines must be developed. In 2009 the US National 
Research Council published A New Biology for the 21st 
Century, which recommends a “new biology” approach 
and asks for “greater integration within biology, and 
closer collaboration with physical, computational, and 
earth scientists, mathematicians and engineers.” In 
many ways, this new biology is systems biology. 

In many cases, the goals and questions of interest
in systems biology are different from those in clas- 
sical biology. First, in addition to asking “how” (e.g., 
“How does a transcription factor lead to a feedback
loop between two regulators?”), one often asks “why” 
(e.g., “Why is such a loop useful and what is its bene- 
fit?”). Second, a primary goal of systems biology is to 
uncover common principles governing many different 
molecular mechanisms inferred from building blocks 
in diverse organisms. Toward this goal, biological com- 
ponents are scrutinized at different spatial or temporal 
scales to explore novel and irreducible emergent prop- 
erties, the biological equivalent of “first principles,” 
arising from their interactions. 

Figure 1 Mathematics in systems biology. All arrows
require various mathematical and computational tools that
are critical in systems biology.

Mathematics through modeling plays a critical role
in systems biology (figure 1). Useful models span
the range from simple to highly complex. In some
cases, simple models of two components may be sufficient
to capture the key behavior of a network composed
of thousands of components, whereas in other
cases a twenty-component model might be required
to unravel complex dynamics and account for critical
features such as variability between gene products.
Furthermore, these pieces can either be conceptual
or they can be identified with specific biological
components. These complexities introduce many chal- 
lenges for mathematics. Moving from a descriptive to 
a predictive model requires incorporation of massive 
experimental data to constrain parameters. Further- 
more, elucidating complex interactions to obtain laws 
and emerging properties often requires exploration of 
these models. The biological questions of interest dic- 
tate the choice of models, which in turn demand dif- 
ferent mathematical and computational tools. Here, we 
illustrate this systems biology approach and the math- 
ematics associated with it at three fundamental biologi- 
cal scales: gene, cell, and tissue. We also discuss its role 
in elucidating the function of an often-overlooked but 
important element that permeates all scales: noise. 

2 The Information Flows from and to the Gene 

The genetic information coded in deoxyribonucleic acid 
(DNA) needs to be processed into ribonucleic acid 
(RNA) and proteins in order to function in a cell. 
This process, in which DNA makes RNA and RNA 
makes protein, is referred to in biology as the cen- 
tral dogma. Through “transcription,” genetic informa- 
tion contained in parts of DNA is transferred to mes- 
senger RNA (mRNA). mRNA then interacts with a ribo- 
some, a complex molecular machine that builds pro- 
teins, to “translate” RNA into protein. Transcription 
and translation are two major steps for passing genetic 
codes to cellular functions. Transcription factors, a 
special group of proteins that facilitate the transcription 
process, are critical in linking one genetic code to 
another, in many cases creating loops of gene-to-gene 
regulation. Different genes make different proteins that 
may regulate various transcriptions to control produc- 
tion of other proteins. 

Modeling transcription and translation requires a 
mechanistic description of gene action and regulation 
from the sequences in regulatory regions of DNA. Sta- 
tistical physics provides a natural approach to examine 
how a gene is activated and repressed through tran- 
scription factors. Stochastic and probabilistic methods 
(e.g., stochastic differential equations [IV.14] and 
Markov process models [II.25]) are the foundations 
for this approach. In a typical transcription model, all 
possible states of an enhancer with a statistical weight 
are first catalogued. The action of a gene, estimated 
through a probability, depends on whether a regula- 
tory region (e.g., binding site) is bound or unbound, 
and this probability is estimated as a fraction of the 
states bound to activators. Transcriptional regulations 
such as cooperativity, competitive interactions among 
transcription factors, inhibition of activators by repres- 
sors, and other epigenetic regulations (e.g., chromatin 
structure and modification, or DNA methylation) can 
be incorporated into this type of probabilistic model. 
This approach has been used to identify numerous reg- 
ulatory motifs, to elucidate the function of regulatory 
regions of DNA, and to predict novel targets in the 
genome. 
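
The state-cataloguing idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name, the per-site weight conc / K, and the cooperativity factor omega for adjacent bound sites are hypothetical choices made here, not details taken from the text. The sketch enumerates every bound/unbound configuration of a small enhancer, assigns each a statistical weight, and returns the fraction of the total weight carried by states with at least one activator bound.

```python
from itertools import product


def activation_probability(conc, K, omega=1.0, n_sites=2):
    """Fraction of statistical weight in states with an activator bound.

    conc  : activator concentration (arbitrary units, assumed)
    K     : dissociation constant of a single site (assumed identical sites)
    omega : cooperativity factor for each pair of adjacent bound sites
    """
    q = conc / K                      # weight contributed by one bound site
    total = 0.0                       # partition function (sum of all weights)
    active = 0.0                      # weight of states with >= 1 site bound
    for state in product((0, 1), repeat=n_sites):
        n_bound = sum(state)
        adjacent_pairs = sum(1 for a, b in zip(state, state[1:]) if a and b)
        weight = (q ** n_bound) * (omega ** adjacent_pairs)
        total += weight
        if n_bound > 0:
            active += weight
    return active / total


if __name__ == "__main__":
    for c in (0.1, 1.0, 10.0):
        print(c,
              activation_probability(c, K=1.0),
              activation_probability(c, K=1.0, omega=10.0))
```

Repressors, competition for overlapping sites, or chromatin states could be added in the same way, by enlarging the set of catalogued states and adjusting their weights.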

3 Signal Transduction into and out of the Cell 

The cell, a “building block” of life, is the smallest 
information-processing unit that is able to respond to 
environmental changes and make decisions on its own. 
Proteins outside of a cell, which often serve as a stim- 
ulus, can bind receptors on the cell membrane or be 
trafficked into the cell to initiate cascades of biochem- 
ical reactions [V.7]. Through these cascades, a spe- 
cific stimulus activates one or a group of genes that 
contribute to a specific program or task (e.g., cell divi- 
sion for replication or changing cell fates). This process 
of turning an external signal into an internal response 
is often called signal transduction. One such example 
is the MAPK/ERK pathway that transmits growth fac- 
tor stimuli through cell surface receptors to induce cell 
growth and division. Beyond understanding the pro- 
cess of signal transduction, the critical factors that 
lead to the pathway’s malfunction are also of inter- 
est since, in many cases, this leads to cancer and other 
disorders. 

Modeling signal transduction requires an accurate 
description of biochemical reactions for different types 
of biomolecules. When the number of molecules is 
relatively small (e.g., smaller than ten), one can use 
discrete, probabilistic approaches based on chemical 
master equations. When a relatively large number of 
molecules are present in a “well-stirred system,” chem- 
ical concentrations provide a good approximation of 
the numbers of molecules, and a continuum model can 
describe the system. Common, fundamental concepts 
in this approach include rate equations, Michaelis- 
Menten kinetics, and the law of mass action. These 
models lead to systems of ordinary differential equa- 
tions that are often stiff when drastically different reac- 
tion rates are present. Numerical simulations of such 
models require specialized algorithms with good absolute 
stability properties, such as implicit methods. 
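
To make the stiffness issue concrete, here is a small sketch comparing an explicit and an implicit time-stepping scheme on a mass-action model. The reaction scheme (a fast reversible step A <-> B followed by a slow step B -> C), the rate constants, and the step size are assumptions chosen for illustration. With a step size that is reasonable for the slow reaction, explicit Euler blows up, while backward (implicit) Euler remains stable.

```python
import math


def rate_matrix(k1=1000.0, k2=1000.0, k3=1.0):
    # Mass action for A <-> B (fast, k1 and k2) and B -> C (slow, k3):
    #   da/dt = -k1*a + k2*b
    #   db/dt =  k1*a - (k2 + k3)*b
    return ((-k1, k2), (k1, -(k2 + k3)))


def explicit_euler(y, M, h, steps):
    a, b = y
    for _ in range(steps):
        da = M[0][0] * a + M[0][1] * b
        db = M[1][0] * a + M[1][1] * b
        a, b = a + h * da, b + h * db
    return a, b


def implicit_euler(y, M, h, steps):
    # Each step solves (I - h*M) y_new = y_old; for a 2x2 system we can
    # use Cramer's rule directly.
    m00 = 1.0 - h * M[0][0]
    m01 = -h * M[0][1]
    m10 = -h * M[1][0]
    m11 = 1.0 - h * M[1][1]
    det = m00 * m11 - m01 * m10
    a, b = y
    for _ in range(steps):
        a, b = (m11 * a - m01 * b) / det, (m00 * b - m10 * a) / det
    return a, b


if __name__ == "__main__":
    M = rate_matrix()
    h, steps = 0.01, 100            # integrate to t = 1
    print("explicit:", explicit_euler((1.0, 0.0), M, h, steps))
    print("implicit:", implicit_euler((1.0, 0.0), M, h, steps))
    print("slow-scale estimate of a + b:", math.exp(-0.5))
```

The implicit result tracks the slow decay of a + b (roughly exp(-t/2) for these rates), even though the step size is far too large to resolve the fast equilibration.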

Regulatory interactions often exist in these signaling pathways. 
Feedback represents interactions in which downstream 
signals influence upstream signaling elements, whereas 
feedforward regulation refers to connections that pass 
upstream signals directly to the downstream signal- 
ing components. Mathematically, feedback and feed- 
forward regulations can be described by Hill functions, 
which encode saturating tunable response curves. Sim- 
ilar to a thermostat that adjusts the influx of heat 
to stabilize the temperature around a prescribed one, 
negative feedback is often used to control homeosta- 
sis (i.e., stable steady state) in biology. In such a con- 
troller, two distinct timescales for a fast “detector” 
and a slow “reactor” are critical to maintaining stable 
homeostasis. When these two timescales become close, 
oscillatory dynamics arise, leading to periodic solu- 
tions. Positive feedbacks (e.g., excitation-contraction 
coupling in the heart, cellular differentiation, interac- 
tion between cytokines and immune cells) can amplify 
signals leading to bistable or ultrasensitive responses. 
In a bistable system one input can lead to two distinct 
outputs depending on the state of the system, a behavior 
commonly referred to as hysteresis. Bifurcation analysis 
[IV.21] (e.g., through pseudo-arc-length continuation 
methods) is a major tool for studying bistability. In 
feedforward regulations, a coherent loop has two par- 
allel paths of the same regulation type (both positive or 
both negative), whereas an incoherent loop has differ- 
ent types of regulations in the two pathways. Positive 
coherent loops can induce delay in response to an ON 
signal without putting any delay effect on the OFF sig- 
nal. On the other hand, incoherent loops can speed up 
the response to ON signals as well as produce adap- 
tation in which a response can disappear even though 
the signal is still present; this is a common feature in 
sensory systems. 
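
As an illustration of how a Hill-type positive feedback produces bistability and hysteresis, consider the following sketch. The one-variable model dx/dt = s + x^2/(1 + x^2) - r*x, the decay rate r = 0.4, and the input values are assumptions made here for illustration only. The input s is swept slowly up and then down, and the state is relaxed to steady state at each value by simple time stepping rather than by a continuation method.

```python
def hill(x, K=1.0, n=2):
    """Saturating Hill function x^n / (K^n + x^n)."""
    return x ** n / (K ** n + x ** n)


def dxdt(x, s, r=0.4):
    # input s + positive feedback through a Hill function - linear decay
    return s + hill(x) - r * x


def relax(x, s, dt=0.05, steps=4000):
    """Integrate to (approximate) steady state with forward Euler."""
    for _ in range(steps):
        x += dt * dxdt(x, s)
    return x


def sweep(inputs, x0):
    """Follow the steady state as the input is changed step by step."""
    x, branch = x0, []
    for s in inputs:
        x = relax(x, s)
        branch.append(x)
    return branch


inputs = [0.02 * i for i in range(6)]             # s = 0.0, 0.02, ..., 0.1
up = sweep(inputs, 0.0)                           # start in the OFF state
down = sweep(list(reversed(inputs)), up[-1])      # come back from the ON state

if __name__ == "__main__":
    for s, x in zip(inputs, up):
        print(f"up   s={s:.2f}  x={x:.3f}")
    for s, x in zip(reversed(inputs), down):
        print(f"down s={s:.2f}  x={x:.3f}")
```

At the same input s = 0 the system sits near x = 0 on the way up and near x = 2 on the way down, which is the history dependence described above.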

Many pathways consist of hundreds of elements and 
connections that often share components. While this 
“cross-talk” between pathways allows the limited num- 
ber of components in a cell to perform multiple tasks, 
insulating mechanisms must exist to enable specificity 
and fidelity of signals and avoid paradoxical situations 
in which an input specific to one pathway activates 
another pathway’s output or responds to another path- 
way’s input more strongly than its own. To understand 
how a cell directs information flows from diverse extra- 
cellular stimuli to multiple gene responses, new math- 
ematics on modularity, coarse-graining, and sensitivity 
are needed to map the dynamics of systems consisting 
of many components, widely separated temporal scales, and 
stochastic effects. 

4 Communication and Organization 

How cells effectively and correctly recognize and re- 
spond to their microenvironment is critical to devel- 
opment, repair, and homeostasis of tissue, and to the 
immunity of the animal body. Different tissues, such 
as nerves, muscle, and epithelium, are derived from 
different parts of embryos. Errors made during commu- 
nication between cells cause abnormal development or 
lead to disease. Typically, cells communicate with other 
cells, either through releasing diffusive molecules into 
the extracellular space for signaling or by direct cell- 
to-cell contact. In the first mechanism, the molecules 
interact with the target cells through their receptors 
on the cell membrane, which in turn activate path- 
ways leading to different cellular functions. The action 
range of diffusive molecules can be short (e.g., within 
a distance of a few cells) or long (e.g., a distance of 
tens of cells) depending on the relative relationship 
between diffusion and reactions. These spatial dynam- 
ics are naturally described by reaction-diffusion equa- 
tions when tissues are modeled as continuum media. 
Analysis of such equations is possible through linear 
or weakly nonlinear stability analysis when the systems have 
fewer than three components, but numerical inves- 
tigations of their dynamics are required when more 
components or higher spatial dimensions are involved. 
To create complex spatial patterns in tissue, such as 
stripes and spots, systems often need to utilize mul- 
tiple types of signaling molecules. One primary strat- 
egy is to use two diffusive molecules of drastically dif- 
ferent diffusion speeds, e.g., a slow activator and a 
fast inhibitor. In this short-range activation/long-range 
inhibition mechanism, small initial disturbances can 
spontaneously create complex patterns that describe 
processes such as pigmentation, branching morpho- 
genesis, or the skeletal shape in the limb. 
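
The role of unequal diffusion speeds can be checked with a linear stability calculation. The sketch below is a minimal illustration: the Jacobian entries of the reaction terms and the diffusion coefficients are hypothetical values chosen here, not parameters from the text. For a two-component system linearized about a homogeneous steady state, a perturbation with wave number k grows at a rate given by the eigenvalues of J - k^2 D, and the function returns the largest real part over a range of k.

```python
import cmath


def max_growth_rate(J, Du, Dv, k_values):
    """Largest real part of the eigenvalues of J - k^2 * diag(Du, Dv)."""
    (fu, fv), (gu, gv) = J
    best = float("-inf")
    for k in k_values:
        q = k * k
        trace = fu + gv - q * (Du + Dv)
        det = (fu - q * Du) * (gv - q * Dv) - fv * gu
        # principal square root has non-negative real part, so this is the
        # eigenvalue with the larger real part
        lam = (trace + cmath.sqrt(trace * trace - 4.0 * det)) / 2.0
        best = max(best, lam.real)
    return best


# Assumed activator-inhibitor kinetics: u activates itself and v, v inhibits u.
J = ((1.0, -1.0), (3.0, -2.0))        # stable without diffusion (trace < 0, det > 0)
k_values = [0.05 * i for i in range(101)]

if __name__ == "__main__":
    print("equal diffusion  :", max_growth_rate(J, 1.0, 1.0, k_values))
    print("fast inhibitor   :", max_growth_rate(J, 1.0, 20.0, k_values))
```

With equal diffusion coefficients every mode decays, whereas a sufficiently fast inhibitor makes a band of wave numbers unstable, so that small disturbances can grow into a spatial pattern.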

Cell-to-cell contact through interactions between 
membrane-bound proteins in different cells is another 
major mechanism for patterning. A common exam- 
ple of such signaling is lateral inhibition, which often 
results in clusters of cells of one fate surrounded by 
cells of different fates. This mechanism, which often 
involves the Notch signaling pathway, can increase the 
contrast and sharpness of responses (e.g., in the visual 
system) and is important in the central nervous sys- 
tem, in angiogenesis, and in endocrine development. 
Active motility driven by, for example, a chemical gradi- 
ent (chemotaxis), a light gradient (phototaxis), or a stiff- 
ness gradient (durotaxis) is another patterning mecha- 
nism that is often employed in wound healing, early 
development (e.g., sperm swimming toward an egg), 
and migration of neurons. 
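
Lateral inhibition can be illustrated with a two-cell sketch. The functional forms F and G, the variable names, and all parameter values below are assumptions chosen for illustration, in the style of commonly used Delta-Notch models; they are not taken from the text. Each cell's Notch activity is driven by its neighbor's Delta level, and each cell's Delta level is repressed by its own Notch activity, so a small initial difference between the cells is amplified.

```python
def F(d, a=0.01, k=2):
    """Notch activation by the neighbor's Delta (increasing, saturating)."""
    return d ** k / (a + d ** k)


def G(n, b=100.0, h=2):
    """Delta production repressed by the cell's own Notch (decreasing)."""
    return 1.0 / (1.0 + b * n ** h)


def simulate_pair(d1=0.08, d2=0.07, n0=0.35, nu=1.0, dt=0.01, steps=10000):
    """Forward-Euler integration of two mutually inhibiting cells."""
    n1 = n2 = n0
    for _ in range(steps):
        dn1 = F(d2) - n1
        dn2 = F(d1) - n2
        dd1 = nu * (G(n1) - d1)
        dd2 = nu * (G(n2) - d2)
        n1 += dt * dn1
        n2 += dt * dn2
        d1 += dt * dd1
        d2 += dt * dd2
    return (n1, d1), (n2, d2)


if __name__ == "__main__":
    cell1, cell2 = simulate_pair()
    print("cell 1 (Notch, Delta):", cell1)
    print("cell 2 (Notch, Delta):", cell2)
```

Starting from nearly identical cells, the cell with slightly more Delta ends up with high Delta and low Notch while its neighbor adopts the opposite state, the two-cell analogue of one fate surrounded by another.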

Hybrid models that combine ordinary differential 
equations for the temporal dynamics of signaling 
and discrete and probabilistic descriptions of cells 
are important in capturing these patterning mecha- 
nisms when the role of individual cells is important. 
Subcellular element methods and cellular Potts mod- 
els are effective for simulating collective dynamics 
and emerging properties, both of which arise from 
these kinds of direct cell-to-cell communications. The 
multiscale features of these systems (which involve 
molecules in extracellular spaces, signal pathways with 
feedbacks inside cells, mechanics in the surrounding 
tissues, and their physiological consequences) 
present great challenges for modeling and computa- 
tion, requiring innovation and transformative mathe- 
matical developments. 

5 Noise: Detrimental or Beneficial 

Noise and randomness exist at all scales of living organ- 
isms. The small numbers of binding sites on DNA, 
the fluctuations in biochemical reactions, the complex 
physical structures of intracellular and extracellular 
spaces, and the noise associated with environmental 
inputs all introduce uncertainty. Information that flows 
from one level to the next (e.g., from gene to cell) may 
be distorted if the noise propagates or is amplified. 
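
The fluctuations that arise from small molecule numbers can be sketched with a stochastic simulation of a single gene product. This is a minimal illustration using Gillespie's direct method on an assumed birth-death model (constant production at rate k, first-order degradation at rate gamma per molecule); the model, parameter values, and function name are choices made here and do not come from the text.

```python
import random


def gillespie_birth_death(k=10.0, gamma=1.0, x0=0, t_end=2000.0, seed=0):
    """Simulate production/degradation; return time-averaged mean and variance."""
    rng = random.Random(seed)
    t, x = 0.0, x0
    area = 0.0        # integral of x over time
    area_sq = 0.0     # integral of x^2 over time
    while True:
        birth = k
        death = gamma * x
        total = birth + death
        tau = rng.expovariate(total)          # waiting time to next reaction
        if t + tau >= t_end:
            hold = t_end - t
            area += x * hold
            area_sq += x * x * hold
            break
        area += x * tau
        area_sq += x * x * tau
        t += tau
        # choose which reaction fires, in proportion to its propensity
        if rng.random() * total < birth:
            x += 1
        else:
            x -= 1
    mean = area / t_end
    variance = area_sq / t_end - mean * mean
    return mean, variance


if __name__ == "__main__":
    mean, variance = gillespie_birth_death()
    print("mean copy number:", mean)
    print("variance / mean :", variance / mean)
```

For this simple model the stationary distribution is Poisson with mean k/gamma, so the variance is close to the mean and the relative size of the fluctuations grows as the copy number shrinks.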

What are the general principles that give rise to noise 
attenuation? For a switching system consisting of ON 
and OFF states (e.g., calcium signaling and p53 regu- 
lation) stimulated by a temporal pulse input, a criti- 
cal intrinsic quantity, termed the signed activation time 
(SAT), succinctly captures the system’s ability to main- 
tain a robust ON state under noise disturbances. When 
the value of the SAT, defined as the difference between 
the deactivation and activation times, is small, the sys- 
tem is susceptible to noise in the ON state, whereas sys- 
tems with a large SAT buffer noise in the ON state. This 
theory, which was developed based on fluctuation dis- 
sipation theory and multi-timescale analysis, can also 
be used to scrutinize the noise in the OFF state through 
the concept of iSAT (input-dependent SAT). 

While noise is typically thought to impose a threat 
that cells must carefully eliminate, noise is also found 
to be beneficial in biology. For example, noise in gene 
expression can induce switching of cell fates, enabling 
the sharpening of boundaries between gene expres- 
sion domains. Without such gene noise, there would 
be an undesirable salt-and-pepper patterning region 
caused by noisy morphogen signals. Noise has other 
benefits as well: it can induce bimodal responses in 
positive transcriptional feedback loops without 
bistability, it can synchronize oscillations in 
cell-to-cell signaling, rapid signal fluctuations can 
lead to stochastic focusing, and signaling noise can 
enhance the chemotactic drift of cells. 
Life, it seems, finds ways to deal with noise: 
control and attenuate it whenever possible, but exploit 
it otherwise. Systematic exploration of noise in complex 
interconnected biological systems requires new tools in 
stochastic analysis and computation. 

6 Conclusion 

The rapid increase in the amount of available 
biological data provides a tremendous opportunity 
for systems biology. Integration of new and old 
data on multiple scales, abstracting connections among 
key components, considering randomness and noise, 
and deriving principles from commonality will soon be 
at the center of scientific discovery. Systems biology, 
which has stimulated the development of many new 
mathematical and computational methods, is becom- 
ing an increasingly important approach in the era of 
big data. As systems biology holds the key to greater 
understanding of life, what systems biology will do for 
mathematics in the twenty-first century will be similar 
to what physics and mechanics did in the nineteenth 
century: it will give mathematics a new life. 

1 Introduction 

A communication network is generally defined as a 
collection of objects (e.g., devices, people, cells) that 
use communication media such as optical fiber cables, 
sound waves, or signaling pathways to exchange infor- 
mation with one another using a common set of proto- 
cols (i.e., rules and conventions that specify the minute 
details of the actual information exchange). A particu- 
larly well-known, and arguably the largest, man-made 
communication network is the Internet, and it will fea- 
ture prominently in this article. This is not to say that 
other examples, especially some of biology’s communi- 
cation networks, are not interesting or that they are less 
important than the Internet. But studying the Internet 
has unique and appealing advantages when contrasted 
with biology. For one, we have full access to the Inter- 
net designers’ original intentions and to an essentially 
complete record of the entire evolutionary process. We 
also know in detail how the network’s individual com- 
ponents should work and how the components inter- 
connect to create system-level behavior. Moreover, the 
Internet offers unique opportunities for measurement 
and for collecting massive and often detailed data sets 
that can be used for an in-depth study of the network’s 
properties and features. 

As such, it is the design, operation, and evolution 
of this large-scale, highly engineered system that we 
focus on in this article. We use it as a concrete exam- 
ple for illustrating the types of mathematical concepts, 
approaches, and theories that are being developed 
in support of a rigorous first-principles treatment of 
large-scale communication networks. 

2 Internet Hourglass Architecture 

The early reasoning behind the design philosophy 
that shaped the Internet’s architecture consists of two 
main arguments. First, the primary objective was inter- 
networking, that is, the development of an effective 
technique for multiplexed utilization of already exist- 
ing interconnected (but typically separately adminis- 
tered) networks. Second, the primary requirement was 
robustness; that is, the network needed to be resilient 
to uncertainty in its components (e.g., router failures) 
and usage patterns (e.g., traffic-demand variations), and 
(on even longer timescales) to unanticipated changes in 
networking technologies and offered services. 

To achieve the stated key objective of internetwork- 
ing, the fundamental structure of the original architec- 
ture was the result of a combination of known technolo- 
gies, conscientious choices, and visionary thinking. It 
led to a packet-switched network in which a number of 
separate networks are connected together using packet 
communications processors called gateways. Commu- 
nication is based on a single universal logical address- 
ing scheme with a simple (net, host) hierarchy, and 
the gateways or routers implement a store-and-forward 
packet-forwarding algorithm. 

Similarly, to satisfy the crucial requirement for 
robustness, the early developers of the Internet relied 
heavily on two proven guidelines for system design, 
namely, the layering principle and the end-to-end argu- 
ment. To illustrate this, in the context of packet- 
switched networks the layering principle argues against 
implementing a complex task (e.g., a file transfer 
between two end hosts) as a single module. Instead, 
it favors breaking the task up into subtasks, each of 
which is relatively simple and can be implemented sep- 
arately. The modules corresponding to the different 
subtasks can then be thought of as being arranged in a 
vertical stack, where each layer in the stack is responsi- 
ble for performing a well-defined set of functionalities. 
The role of the end-to-end argument is then to help with 
the specifications of these functionalities and guide 
their placement among the different layers. It achieves 
these tasks by expressing a clear bias against low-layer 
function implementation and arguing for bottom layers 
that are kept as general and simple as possible. 

This design process was largely responsible for 
defining the main suite of protocols used in today’s 
Internet, a suite known as the five-layer transmission 
control protocol/Internet protocol (TCP/IP) protocol 
stack. This “vertical” decomposition encompasses 
(from the bottom up) the physical, link, internetwork, 
transport, and application layers. The same design 
process was also instrumental in creating the closely 
related “hourglass” metaphor for the architecture of 
the Internet: an arrangement of the multilayer suite 
of protocols in which each protocol interfaces only 
with those in the layers immediately above and below 
it, and where IP, the protocol for the internetwork- 
ing layer of TCP/IP, occupies a separate layer at the 
hourglass’s waist, where it provides a generic packet- 
delivery service. 
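
The layering idea can be sketched as a toy encapsulation scheme. The bracketed text headers used below are purely illustrative assumptions and bear no relation to real protocol headers; only the five layer names come from the text. On the way down the stack each layer wraps whatever it receives from the layer above, and on the way up each layer removes only its own header, so that a layer never needs to understand the contents it carries.

```python
# Layer names from the top of the stack to the bottom.
LAYERS = ["application", "transport", "internetwork", "link", "physical"]


def send(payload: bytes) -> bytes:
    """Pass data down the stack; each layer prepends its own header."""
    frame = payload
    for layer in LAYERS:
        frame = f"[{layer}]".encode() + frame
    return frame


def receive(frame: bytes) -> bytes:
    """Pass data up the stack; each layer strips only its own header."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame


if __name__ == "__main__":
    wire = send(b"GET /index.html")
    print(wire)
    print(receive(wire))
```

Because every layer touches only its own header, any one layer can be replaced without changing the others, provided the interfaces to its neighbors stay the same.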

This abstract bit-level network service at the hour- 
glass’s waist consists of a minimal set of widely agreed- 
upon features. These features have to be implemented 
according to an Internet-wide standard, they must be 
supported by all the routers in the network, and they 
are key to enabling communication across the global 
Internet. In this context, global robustness and scalability 
of routing, a key objective for IP, is achieved 
and implemented in a fully decentralized and asyn- 
chronous manner in TCP/IP (“horizontal” decomposi- 
tion). The layers below the waist deal with the wide 
variety of existing transmission and link technologies 
and provide the protocols for running IP over whatever 
bit-carrying network infrastructure is in place (“IP over 
everything”). Above the waist is where enhancements 
to IP (e.g., reliable packet delivery) are provided that 
simplify the process of writing application-level proto- 
cols (e.g., the Hypertext Transfer Protocol (HTTP) for 
the World Wide Web) through which users ultimately 
interact with the Internet (“everything over IP”). 

From today’s perspective, “IP over everything and 
everything over IP” is nothing but an ingenious heu- 
ristic solution to an extremely challenging engineer- 
ing problem that the early Internet designers faced. 
This problem manifested itself in the form of great 
uncertainty — uncertainty about the ways technologi- 
cal advances challenge or do away with conventional 
wisdom, about the manner in which the network will 
be used in the future, and about the ways a net- 
work comprised in part of unreliable components can 
fail. Indeed, during its transition from a small-scale 
research network some fifty years ago to a critical com- 
ponent of today’s world economy and a social phe- 
nomenon, the Internet has experienced drastic changes 
with respect to practically all imaginable aspects of 
networking. At the same time, its ability to scale (i.e., 
adapt to explosive growth), to support innovation (i.e., 
foster technological progress below and above the 
hourglass waist), and to ensure flexibility (i.e., follow 
a design that can evolve over time in response to 
changing conditions) has been remarkable. In short, 
the past fifty years have been testimony to the inge- 
nuity of the early designers of the Internet architec- 
ture. The fact that, by and large, their original design 
has been maintained and has guided the development 
of the network through a “sea of change” to what 
we call today’s Internet is an astounding engineering 
achievement. 

3 Mathematics Meets Engineering 

Despite the Internet’s enormous success, numerous 
developments since its inception have called into 
question some of the fundamental architectural 
design choices that were made by its early design- 
ers. For one, the trust model assumed as part of the 
original design turned out to be the very opposite of 
what is required today (i.e., “trust no one” instead of 
“trust everyone”). In fact, this part of the design is the 
main culprit behind many of the security problems that 
plague today’s Internet. Moreover, some of the prob- 
lems that have been encountered have a tendency to 
create a patchwork of technical solutions that, while 
addressing particular short-term needs, may severely 
restrict the future use of the network and force it down 
an evolutionary dead end. To rule out such undesir- 
able “solutions” and prevent the problems from occur- 
ring in the first place, a mathematical language for 
reasoning about the design aspects and evolutionary 
paths of Internet-like systems is needed. The following 
examples illustrate recent progress toward formulat- 
ing such a language and providing the foundations for 
a rigorous theory of Internet-like systems. 

3.1 Internet (Big) Data 

The Internet is, by and large, a collection of intercon- 
nected computers, and computers excel in perform- 
ing measurements. This inherent ability for measure- 
ment has turned the Internet into an early source 
for what is nowadays referred to as “big data.” Read- 
ily accessible public data sets and carefully guarded 
proprietary data have transformed the scientific stud- 
ies of many Internet-related problems into prime 
examples of measurement-driven research activities. 
In turn, they have provided unprecedented oppor- 
tunities to test the soundness of proposed Internet 
theories or check the validity of highly publicized 
“emergent phenomena”: measurement-driven discov- 
eries that come as a complete surprise, baffle the 
experts (at first), can be neither explained nor predicted 
within the framework of well-established mathemati- 
cal theories, and typically invite significant subsequent 
scientific scrutiny. 

One such discovery was made in the early 1990s 
when the first data sets of measured Internet traffic — 
in the form of high-quality (i.e., high-time-resolution 
time stamps), complete (i.e., no missing data), and 
high-volume (i.e., over hours and days) packet traces— 
became available for analysis. The picture that emerged 
from mining this data challenged conventional wisdom 
because the observed traffic was the precise opposite 
of what was assumed. Rather than being smooth, 
with either no correlations or ones that decayed expo- 
nentially quickly, measured traffic rates on network 
links (i.e., the total number of packets per time unit) 
were “bursty” over a wide range of timescales. Impor- 
tantly, the observed burstiness was fully consistent 
with (asymptotic second-order) self-similarity, or, equiv- 
alently, long-range dependence, that is, autocorrela- 
tions that decay algebraically or like a power law. 
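
For reference, the two decay regimes contrasted above can be written down as follows. The notation (a stationary traffic-rate process X_t with autocorrelation function r(k), decay exponent beta, Hurst parameter H, and aggregation level m) is introduced here for illustration and does not appear in the text.

```latex
% Short-range dependence: exponentially decaying, summable autocorrelations
\[
  r(k) = \operatorname{corr}(X_t, X_{t+k}) \sim c\,\rho^{k}, \quad 0 < \rho < 1,
  \qquad \sum_{k} r(k) < \infty .
\]
% Long-range dependence: power-law decay, non-summable autocorrelations
\[
  r(k) \sim c\,k^{-\beta} \quad (k \to \infty), \qquad 0 < \beta < 1,
  \qquad \sum_{k} r(k) = \infty .
\]
% Aggregated process over blocks of length m and its variance
\[
  X^{(m)}_i = \frac{1}{m} \sum_{j=(i-1)m+1}^{im} X_j ,
  \qquad
  \operatorname{Var}\bigl(X^{(m)}\bigr) \sim c'\, m^{-\beta} = c'\, m^{2H-2},
  \qquad H = 1 - \beta/2 .
\]
```

Under short-range dependence the variance of the aggregated process decays like 1/m, so averaging over longer time units smooths the traffic quickly; under long-range dependence it decays more slowly, which is why burstiness persists across timescales.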

A second example of an emergent phenomenon was 
reported in the late 1990s when different data sets for 
studying the various types of connectivity structures 
that are enabled by the layered architecture of the Inter- 
net became available. While mathematicians focused 
primarily on more “virtual,” or higher-layer, connec- 
tivity structures, such as the Web graph (i.e., the link 
structure between virtual entities, in the form of Web 
pages), networking researchers were mainly interested 
in more “physical,” or lower-layer, structures, such as 
the Internet’s router topology, that is, the Internet’s 
actual physical link structure that connects its physi- 
cal components (i.e., routers and switches). Using the 
available data to infer this physical router-level Inter- 
net led to the surprising discovery that the observed 
node degree distributions had a completely unexpected 
characteristic. Instead of exhibiting the expected expo- 
nentially fast decaying right tail behavior that essen- 
tially rules out the occurrence of high-degree nodes in 
the Internet, the right tail of the measured node degree 
distribution decayed algebraically, like a power law. As 
a result, while most nodes have small degrees, high- 
degree nodes are bound to exist and have orders of 
magnitude more neighbors than a “typical” node in the 
Internet. 


3.2 A Tale of Two Discoveries 

Self-Similarity and Heavy Tails 

Given that there was no explicit mechanism in the orig- 
inal design of the Internet that predicted the observed 
self-similarity property of actual Internet traffic, its dis- 
covery fueled the development of new mathematical 
models that could explain this phenomenon. Of partic- 
ular interest were generative models that could explain 
how the observed scaling behavior can arise naturally 
within the confines of the Internet’s hourglass architec- 
ture. Some of the simplest such models were inspired 
by the pioneering works of Mandelbrot on renewal 
reward processes and Cox on birth-immigration pro- 
cesses. When applied to modeling Internet traffic, they 
describe the aggregate traffic rate on a network link (i.e., 
the total number of packets per time unit) as a super- 
position of many individual on-off processes. To a first 
approximation, each on-off process can be thought 
of as describing the activity of a single user as seen 
at the IP layer, and the user’s activity is assumed to 
alternate between sending packets at regular intervals 
when active (or “on”) and not sending any packets when 
inactive (or “off”). 

An important distinguishing feature of these models 
is that under suitable conditions on the individual on- 
off processes, and when properly rescaled, the aggre- 
gate traffic rate processes converge to a limiting pro- 
cess as the length of the time unit and the number 
of users tend to infinity. Provided that, for example, 
the length of a “typical” on period is described by a 
heavy-tailed distribution with infinite variance, this limiting 
process can be shown to be fractional Gaussian 
noise, the unique stationary Gaussian process that is 
long-range dependent or, equivalently, exactly (second- 
order) self-similar. The appeal of these constructive 
models is that they identify a basic characteristic of 
individual user behavior as the main reason for aggre- 
gate Internet traffic being self-similar. Intuitively, this 
basic characteristic implies that the activity of indi- 
vidual users is highly variable: while most on periods 
are very short and see only a few packets, there are a 
few very long on periods during which the bulk of an 
individual user’s traffic is sent. 
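
The on-off construction can be sketched as a small simulation. This is an illustrative sketch only: the Pareto tail index 1.5 for the on periods, the exponentially distributed off periods, the number of sources, and the slot-based time grid are assumptions made here. Each source emits one packet per time slot while on and none while off; the aggregate rate is the sum over sources, and the variance of the rate is then compared across aggregation levels.

```python
import random


def pareto(rng, alpha, x_min=1.0):
    """Inverse-transform sample with P(X > x) = (x_min / x)^alpha."""
    return x_min / (1.0 - rng.random()) ** (1.0 / alpha)


def on_off_source(rng, n_slots, alpha_on=1.5, mean_off=5.0):
    """0/1 activity per slot: heavy-tailed on periods, exponential off periods."""
    trace = []
    on = rng.random() < 0.5
    while len(trace) < n_slots:
        duration = pareto(rng, alpha_on) if on else rng.expovariate(1.0 / mean_off)
        length = min(max(1, int(duration)), n_slots - len(trace))
        trace.extend([1 if on else 0] * length)
        on = not on
    return trace


def aggregate_rate(n_sources=30, n_slots=10000, seed=1):
    """Packets per slot summed over all sources."""
    rng = random.Random(seed)
    sources = [on_off_source(rng, n_slots) for _ in range(n_sources)]
    return [sum(slot) for slot in zip(*sources)]


def aggregated_variance(series, m):
    """Variance of the series after averaging over blocks of length m."""
    blocks = [sum(series[i:i + m]) / m
              for i in range(0, len(series) - m + 1, m)]
    mean = sum(blocks) / len(blocks)
    return sum((b - mean) ** 2 for b in blocks) / len(blocks)


if __name__ == "__main__":
    rates = aggregate_rate()
    v1 = aggregated_variance(rates, 1)
    for m in (1, 10, 100):
        print(m, aggregated_variance(rates, m) / v1)
```

For uncorrelated traffic the printed ratio would fall roughly like 1/m; with heavy-tailed on periods it is expected to fall more slowly, which is the signature of burstiness over a wide range of timescales.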

These generative traffic models have been instrumen- 
tal in adding long-range dependence and heavy-tailed 
distributions to the mathematical modeler’s toolkit. 
They have also sparked important subsequent research 
efforts that have significantly advanced our under- 
standing of these two mathematical concepts, which 
were viewed as esoteric and of no real practical value 
just two decades ago. Moreover, they have raised the 
bar with respect to model validation by identifying new 
types of measurements at the different layers within 
the TCP/IP protocol stack that can be used to test for 
the presence of heavy-tailed behavior in the packet 
traces generated by individual users. Not only has the 
initial discovery of self-similarity in Internet traffic 
withstood the test of time, but some twenty years later, 
the ubiquitous nature of the heavy-tailed characteris- 
tic of individual traffic components is now viewed as 
an invariant of Internet traffic. Similarly, the existence 
of long-range dependence in measured traffic, which 
reveals itself in terms of self-similar scaling behavior, 
is considered a quintessential property of modern-day 
Internet traffic. 

Engineered versus Random 

In contrast to the discovery of self-similarity in Inter- 
net traffic, the reported power-law claim for the Inter- 
net’s router topology did not withstand subsequent 
scrutiny and collapsed when tested with alternate mea- 
surements or examined by domain experts. This col- 
lapse ruled out the use of the popular scale-free 
graphs [IV.18 §3.1] as realistic models of the Internet’s 
router topology. In contrast to the traditional 
Erdős-Rényi random graphs [IV.18 §4.1], which cannot 
be used to obtain power-law node degrees, scale-free 
graphs can easily achieve this objective by following 
a simple stochastic rule whereby a new node 
connects with higher probability to an already highly 
connected node in the existing graph (i.e., preferential 
attachment). For the last decade, the study of scale-free 
graphs and variants thereof has fueled the emergence 
and popularity of network science, a new scientific dis- 
cipline dedicated to the study of large-scale man-made 
or naturally occurring networked systems. 
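
The preferential attachment rule can be sketched as follows. The code illustrates only the generative rule and its effect on node degrees; it is not a model of the Internet's router topology, and the graph sizes, seeds, and the fixed-edge-count variant of the Erdős-Rényi graph used for comparison are assumptions made here.

```python
import random


def preferential_attachment_degrees(n, m=2, seed=0):
    """Grow a graph in which each new node links to m existing nodes,
    chosen with probability proportional to their current degree."""
    rng = random.Random(seed)
    degree = [0] * n
    endpoints = []                      # node i appears degree[i] times
    for i in range(m + 1):              # seed graph: clique on m + 1 nodes
        for j in range(i + 1, m + 1):
            degree[i] += 1
            degree[j] += 1
            endpoints += [i, j]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        for old in chosen:
            degree[new] += 1
            degree[old] += 1
            endpoints += [new, old]
    return degree


def random_graph_degrees(n, n_edges, seed=0):
    """Degrees of a random graph with n_edges distinct edges placed uniformly."""
    rng = random.Random(seed)
    degree = [0] * n
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.randrange(n), rng.randrange(n)
        edge = (min(u, v), max(u, v))
        if u != v and edge not in edges:
            edges.add(edge)
            degree[u] += 1
            degree[v] += 1
    return degree


if __name__ == "__main__":
    n = 2000
    pa = preferential_attachment_degrees(n)
    er = random_graph_degrees(n, sum(pa) // 2)
    print("max degree, preferential attachment:", max(pa))
    print("max degree, uniform random graph   :", max(er))
```

With the same number of nodes and edges, the preferential attachment graph develops a few nodes whose degree is far larger than anything seen in the uniform random graph, which is the heavy right tail discussed above.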

The failure of scale-free graphs, in particular, and 
network science, in general, to cope with man-made 
physical structures such as the Internet’s router topol- 
ogy motivated a new approach to modeling highly engi- 
neered systems. Radically different from the inherently 
random connectivity that results from constructs such 
as Erdős-Rényi or scale-free graphs, this new approach 
is motivated by practical engineering considerations. It 
posits that the physical connectivity of the Internet is 
not the result of a series of (biased) coin tosses but is in 
fact designed; that is, it is based on decisions that are 
driven by objectives and reflect hard trade-offs between 
what is technologically feasible and what is economi- 
cally sensible. Importantly, randomness does not enter 
in the form of (biased) coin tosses but in the form of the 
uncertainty that exists about the “environment” (i.e., 
the traffic demands that the network is expected to 
carry), and “good” designs are expected to be robust 
with respect to changes in this environment. 

The mathematical modeling language that naturally 
reflects such a decision-making process under uncer- 
tainty is constrained optimization. As a result, this 
approach is typically not concerned with network 
designs that are “optimal” in a strictly mathematical
sense and are also likely to be NP-hard [I.4 §4.1].
Instead, it aims at solutions that are “heuristically opti- 
mal,” that is, solutions that achieve “good” perfor- 
mance subject to the hard constraints that technol- 
ogy imposes on the network’s physical entities (i.e., 
routers and links) and the economic considerations 
that influence network design (e.g., budget limits). Such 
models have been discussed in the context of highly 
organized/optimized tolerances/trade-offs (HOT), and 
they show that the contrast with random structures 
such as scale-free graph-based networks could not be 
more dramatic. In particular, they highlight the fact 
that substituting randomness for any architecture- and 
protocol-specific design choices leaves nothing but an 
abstract graph structure that is incapable of provid- 
ing any Internet-relevant insight and is, in addition, 
inconsistent with even the most basic types of available 
router topology-related measurements. 

3.3 Networks as Optimizers 

The ability to identify the root cause of self-similarity in 
Internet traffic and to explain the Internet’s router-level 
structure from first principles are two examples of an 
ongoing research effort aimed at solving a “grand” chal- 
lenge for communication networks. This challenge con- 
sists of developing a relevant mathematical language 
for systematically reasoning about network architec- 
ture and protocol design in large-scale communication 
networks, including the Internet. 

At this point, the most promising candidate for pro- 
viding such a common language is layering as opti- 
mization decomposition. This approach is based on 
two key ideas. First, it treats network protocol stacks 
holistically and views them as distributed solutions of 
some global optimization problems. The latter are typ- 
ically formulated as some generalized network utility- 
maximization problems (“networks as optimizers”) and 
are often the result of successful reverse engineering,
that is, the challenging task of starting from an
observed network design that is largely based on engi- 
neering heuristics and discovering and formulating 
the mathematical problems being solved by the imple- 
mented design. Second, by invoking the mathematical 
language of decomposition theory for constrained opti- 
mization, the approach enables the principled study 
of how optimal solutions of a given global optimiza- 
tion problem can be attained in a modularized (i.e., 
layered) and distributed way (“layering as decompo- 
sition”). In the process, it facilitates the creative task 
of forward engineering, that is, the systematic com- 
parison of alternate solutions and the informed selec- 
tion of new designs based on their provably superior 
performance, efficiency, or robustness. 
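
A toy instance may help to make “layering as decomposition” concrete. The sketch below is an illustration under stated assumptions, not a protocol from the literature discussed here: the three-flow, two-link network, the logarithmic utilities, the step size, and all variable names are hypothetical. It solves a basic network utility-maximization problem (maximize the sum of log rates subject to link capacities) by dual decomposition: each source chooses its rate from the prices along its own route, and each link updates its price from its own load.

```python
# Hypothetical toy network: flow -> list of links it crosses.
routes = {0: [0, 1], 1: [0], 2: [1]}
capacity = [1.0, 1.0]
step = 0.5                 # price step size (assumed; small enough for this example)
price = [1.0, 1.0]         # one price (Lagrange multiplier) per link

for _ in range(500):
    # Source layer: each flow maximizes log(x) - x * (price of its route),
    # which gives x = 1 / (sum of the link prices along the route).
    rate = {f: 1.0 / sum(price[l] for l in links) for f, links in routes.items()}
    # Link layer: each link raises its price when overloaded and lowers it otherwise,
    # using only its own load and capacity.
    for l in range(len(capacity)):
        load = sum(rate[f] for f, links in routes.items() if l in links)
        price[l] = max(1e-9, price[l] + step * (load - capacity[l]))

print({f: round(x, 4) for f, x in rate.items()})
print([round(p, 4) for p in price])
```

For this network the optimum can be found by hand: the two-link flow receives rate 1/3 and each one-link flow receives rate 2/3. The iteration approaches these values even though no component ever sees the global problem.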

Note that the HOT approach complements recent 
successful attempts at reverse engineering existing 
protocols within the TCP/IP stack (e.g., TCP and its 
variants, the Border Gateway Protocol for routing, and 
various contention-based medium-access control pro- 
tocols). It also illustrates that the basic idea of “net- 
works as optimizers” extends beyond protocols and 
is directly applicable to problems concerned with net- 
work design. On the other hand, since different opti- 
mal solutions typically correspond to different layer- 
ing architectures, “layering as decomposition” allows 
for a principled treatment of “how and how not to 
layer.” Importantly, it also provides a rigorous frame- 
work for comparing different protocol designs in terms 
of optimality, efficiency, or robustness with respect to 
perturbations of the original global optimization prob- 
lem. The uncertainty due to heavy-tailed user activity — 
the root cause of the self-similar nature of Internet 
traffic — is one such perturbation with respect to which 
the designed protocols ought to be robust. 

4 Outlook 

As our understanding of Internet-related communica- 
tion networks deepens, it becomes more and more 
apparent that in terms of architecture and protocol 
design, technological networks are strikingly similar 
to highly evolved systems that arise, for example, in 
genomics or molecular biology, despite having com- 
pletely different material substrates, evolution, and 
assembly. A constructive scientific discourse about 
architectural designs and protocols arising in the con- 
text of highly engineered or highly evolved systems 
therefore looms as a promising future research objective.
The examples and recent developments discussed
in this article offer hope that such a discussion can
and will be based on a rich and relevant mathematical 
theory that succeeds in making everything “as simple 
as possible but not simpler.” 
VII.24 Text Mining

Dian I. Martin and Michael W. Berry 


1 Introduction 

Text mining is the automated analysis of natural lan- 
guage text data to derive items of meaning contained in 
that data and find information of interest. Automation 
is required due to the volume of text involved. Further 
complicating matters is the fact that text collections are 
typically unstructured, or very minimally structured. 
Text mining activities involve searching the data collec- 
tion for specific information, classifying items within 
the collection to derive useful characteristics, and ana- 
lyzing the content of a collection to arrive at an overall 
understanding of the body of information as a whole. 
The ability to partition the data into understandable, 
meaningful units for a user is a challenge. 

While some methods for text mining attempt to 
utilize linguistic properties and grammars, most of 
the methods employed in text mining are based on 
mathematics. These methods include statistical and 
probabilistic approaches, simple vector space mod- 
els, latent semantic indexing, and nonnegative matrix 
factorization. 
2 Statistical and Probabilistic
Evaluation of Text 

Statistical methods for evaluating text collections have 
been in use since the 1950s. These methods look at 
term frequency and probability of term usage to assign 
significance to particular words or phrases. These 
methods may be further informed by the use of dictio- 
naries, grammars, stop lists, or other language-specific 
information. A stop list, for example, is a list of unim- 
portant or unwanted words that are discarded during 
document parsing and are not used as referents for any 
document. The end result of this analysis is to produce 
a significance value for particular terms, sentences, or 
passages within a collection. These approaches can be 
susceptible to changing language use and are difficult 
to generalize across widely separated collections of 
information and different languages. However, these 
techniques are still in use today in the areas of search 
engine optimization and the generation of tag clouds. 

3 The Vector Space Model 

The vector space model (VSM) was developed to handle 
text retrieval from a large information database with 
heterogeneous text and varied vocabulary. The under- 
lying formal mathematical model of the VSM defines 
unique vectors for each term and document by repre- 
senting terms and documents in a large sparse matrix. 
The columns are considered the document vectors and 
the rows are considered the term vectors. The matrix 
is populated with the term frequency of each term in 
each document. A weighting can then be applied to each 
entry of the matrix in order to increase or decrease the 
importance of terms within documents and across the 
entire document collection. Similarity or distance mea- 
surements can then be calculated within the context 
of the vector space. Two such measures, cosine and 
Euclidean distance, are easily calculated between any 
two vectors in the space and are frequently used. 

One of the first systems to use a traditional VSM was 
the system for the mechanical analysis and retrieval 
of text (SMART), developed by Gerard Salton at Cor- 
nell University in the 1960s. Among the notable char- 
acteristics of the VSM used by SMART is the premise 
that the meaning of a document can be derived from 
its components. The VSM is useful for lexical matching 
and exploiting term co-occurrence among documents. 
Searching the VSM for items similar to a given docu- 
ment, or query, from outside of the collection is pos- 
sible by constructing a query vector within the VSM. 


Given the nature of the term-by-document matrix, a 
query vector is formed in the same way a document is 
constructed in the matrix: by giving those rows corre- 
sponding to the terms from the query a frequency num- 
ber, followed by a weighting, and finding documents 
that are relevant to the query vector based on a sim- 
ilarity measure. Similarities between queries and doc- 
uments are then based on concepts or similar seman- 
tic content. Exploiting the mathematical foundation of 
a VSM for a document collection by first creating the 
term-by-document matrix and then calculating similarities
between queries and documents as just described
is beneficial for searching through a large amount of 
information efficiently. 
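
The construction just described can be sketched in a few lines. This is a minimal illustration, assuming a three-document toy collection, a tiny stop list, raw term frequencies with no weighting, and helper names (`tokens`, `query_vector`, `cosine`) that are all invented for the example.

```python
import math
from collections import Counter

docs = [
    "internet traffic is self similar",
    "router topology of the internet",
    "voting rules and election outcomes",
]
stop_list = {"is", "of", "the", "and"}   # words discarded during parsing


def tokens(text):
    return [w for w in text.lower().split() if w not in stop_list]


terms = sorted({w for d in docs for w in tokens(d)})
row = {t: i for i, t in enumerate(terms)}

# Term-by-document matrix: rows are terms, columns are documents,
# entries are raw term frequencies (no weighting in this sketch).
A = [[0] * len(docs) for _ in terms]
for j, d in enumerate(docs):
    for t, count in Counter(tokens(d)).items():
        A[row[t]][j] = count


def column(j):
    """Document vector j, i.e., column j of the matrix."""
    return [A[i][j] for i in range(len(terms))]


def query_vector(text):
    """Build a query the same way a document column is built."""
    q = [0] * len(terms)
    for t, count in Counter(tokens(text)).items():
        if t in row:               # terms absent from the collection are ignored
            q[row[t]] = count
    return q


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


q = query_vector("internet router")
scores = [cosine(q, column(j)) for j in range(len(docs))]
print(scores)
```

The second document shares both query terms and scores highest; the third shares none and scores exactly zero, which is the lexical-matching behavior of the traditional VSM.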

4 Latent Semantic Indexing 

Expanding on the term-by-document matrix of the VSM, 
latent semantic indexing (LSI) uses a matrix factoriza- 
tion to simultaneously map the contents of a doc- 
ument collection on a set of orthogonal axes, scale 
those mapping vectors across the collection accord- 
ing to the singular values, and reduce the dimension- 
ality of the representation to obtain the latent struc- 
ture of a document collection. The terms and docu- 
ments are organized into a single large matrix, just as 
in the vector space model, and diagonal row scaling is 
used to effect a particular term-weighting scheme. Prior 
to scaling, each matrix cell is simply the nonzero fre- 
quency of a term within a document. This matrix is then 
processed using the singular value decomposition 
[II.32] (SVD).

The SVD produces a dense multidimensional hyper- 
space representation of the information collection (an 
LSI space) containing vectors corresponding to the 
terms and documents of the collection. Within this 
semantic space, the meaning of a term is represented 
as the average effect that it has on the meaning of doc- 
uments in which it occurs. Similarly, the meaning of a 
document is represented as the sum of the effects of all 
the terms it contains. The position of terms and docu- 
ments in the vector space represents the semantic rela- 
tionships between those terms and documents. Terms 
that are close to one another in the LSI space are consid- 
ered to have similar meaning regardless of whether or 
not they appear in the same document. Likewise, doc- 
uments are identified as similar to each other if they 
have close proximity in the LSI space regardless of the 
specific terms they contain. 
4.1 Dimensionality

The SVD allows for the adjustment of the representa- 
tion of terms and documents in the vector space by 
choosing the number of dimensions: that is, the num- 
ber of singular values that are kept (the remainder 
being discarded or, equivalently, set to zero). This con- 
trols the number of parameters by which a word or 
document is described. The number of dimensions is 
usually much smaller than either the number of terms 
or documents, but it is still considered a large num- 
ber of dimensions (typically 200-500). This reduction 
of dimensionality is the key to sufficiently capturing 
the underlying semantic structure of a document col- 
lection. It reduces the noise associated with the vari- 
ability in word usage and causes minor differences in 
terminology to be ignored. 

Selecting the optimal dimensionality is an important 
factor in the performance of LSI. The conceptual space 
for a large document collection needs more than a few 
underlying independent concepts to define it. Using a 
low number of dimensions is undesirable as it does 
not produce enough differentiation between terms and 
documents, whereas full dimensionality provides lit- 
tle semantic grouping of terms and documents and in 
effect treats every term and document as being unique 
in meaning. 

Figure 1 depicts a simple three-dimensional repre- 
sentation of the LSI space composed of both term 
vectors and document vectors. This illustration is an 
extremely simplified representation using only three 
dimensions. In practice, the LSI hyperspace will typi- 
cally have anywhere from 300 to 500 dimensions or 
more. 

When this processing is completed, information 
items are left clustered together based on the latent 
semantic relationships between them. The result of this 
clustering is that terms that are similar in meaning are 
clustered close to each other in the space and dissim- 
ilar terms are distant from each other. In many ways, 
this is how a human brain organizes the information 
that an individual accumulates over a lifetime. 

4.2 The Difference between the Vector Space 
Model and Latent Semantic Indexing 

The difference between the traditional VSM and the 
reduced-dimensional VSM used by LSI is that terms 
form the dimensions or the axes of the vector space in 
the traditional VSM and documents are represented as 
vectors in the term space, whereas the axes or dimensions
in LSI are derived from the SVD. Therefore, since
the terms are the axes of the vector space in the tradi- 
tional VSM, they are orthogonal to each other, causing 
documents that do not contain a term in a user's query 
to have a similarity of zero with the query. The result is 
that terms have their own unique independent mean- 
ings. With LSI, the derived dimensions from the SVD 
are orthogonal, but terms (as well as documents) are 
vectors in the reduced-dimensional space, not the axes. 
Terms no longer have unique meaning on their own, nor 
are they independent. Terms get their meanings from 
their mappings in the semantic space. 

4.3 Application 

LSI can be used to search, compare, evaluate, and 
understand the information in a collection in an auto- 
mated, efficient way. In fact, LSI provides a compu- 
tational model that can be used to perform many of 
the cognitive tasks that humans do with information 
essentially as well as humans do them. The effective- 
ness and power of LSI lies in the mathematical calcu- 
lation of the needed part of the SVD: that is, the par- 
tial SVD of reduced dimension. Given a large term-by- 
document matrix A, where terms are in the rows and 
documents are in the columns, the SVD computation 
becomes a problem of finding the k largest eigenvalues 
and eigenvectors of the matrix $B = A^{T}A$.

Finding the eigenvectors and eigenvalues of B pro- 
duces the document vectors and the singular values 
(the nonnegative square roots of the eigenvalues of B).
The term vectors are then produced by back multiply- 
ing. Thus, the SVD computation is based on solving a 
large, sparse symmetric eigenproblem. The approach 
most effectively used to compute the SVD for this appli- 
cation is based on the Lanczos algorithm. The Lanczos 
algorithm, which is an iterative method, is proven to be 
accurate and efficient in solving large, sparse symmet- 
ric eigenproblems where only a modest number of the 
largest or smallest eigenvalues of a matrix are desired. 
While the computation of the reduced-dimensional vec- 
tor space for a term-by-document matrix is a nontriv- 
ial calculation, advanced implementations of LSI have 
been shown to be scalable to address large problem 
sizes. 
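
The route from the eigenproblem to the LSI vectors can be illustrated on a matrix small enough to check by hand. The sketch below is not the Lanczos algorithm: it substitutes plain power iteration with deflation, which is adequate only for tiny dense examples, and it forms B explicitly, which a large-scale implementation would avoid. The 4 × 3 matrix, the function names, and the iteration count are assumptions made for illustration.

```python
import math


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(r, x)) for r in M]


def top_eigenpairs(B, k, iters=500):
    """k largest eigenpairs of a symmetric positive semidefinite matrix,
    found by power iteration with deflation (a stand-in for Lanczos)."""
    n = len(B)
    B = [r[:] for r in B]
    pairs = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]   # fixed, non-symmetric start vector
        for _ in range(iters):
            w = matvec(B, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(B, v)))   # Rayleigh quotient
        pairs.append((lam, v))
        # Deflate: remove the component just found before seeking the next one.
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


# Toy term-by-document matrix: 4 terms (rows) by 3 documents (columns).
A = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
m, n = len(A), len(A[0])
B = [[sum(A[t][i] * A[t][j] for t in range(m)) for j in range(n)] for i in range(n)]

k = 2                                            # number of dimensions kept
pairs = top_eigenpairs(B, k)
singular_values = [math.sqrt(lam) for lam, _ in pairs]
doc_vectors = [v for _, v in pairs]              # eigenvectors of B
# "Back multiplying": each term vector is A v divided by the singular value.
term_vectors = [[x / s for x in matvec(A, v)]
                for s, v in zip(singular_values, doc_vectors)]

print([round(s, 4) for s in singular_values])
```

For this matrix B is tridiagonal with eigenvalues 2 + √2, 2, and 2 − √2, so the two singular values kept are the square roots of the first two.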

5 Nonnegative Matrix Factorization 

LSI can be quite robust in identifying which documents 
are related. While it produces a set of orthogonal axes 
that forms a mapping analogous to the cognitive rep- 
resentation of meaning, it does not produce a set of 
conveniently labeled features that can be examined 
intuitively. Nonnegative matrix factorization (NMF) is 
another approach that produces decompositions that 
can be readily interpreted. Lee and Seung (1999) intro- 
duced the NMF in the context of text retrieval. They 
demonstrated the application of NMF in both text min- 
ing and image analysis. In our context, NMF decom- 
poses and preserves the nonnegativity of the original 
term-by-document matrix whereby the resulting nonnegative
matrix factors produce interpretable features
of text that tend to represent usage patterns of words 
that are common across the given document corpus. 

5.1 Constrained Optimization 

To approximate the original (and possibly term-weight- 
ed) term-by-document matrix A, NMF derives two 
reduced-rank nonnegative matrix factors W and H such 
that $A \approx WH$. The sparse matrix W is commonly
referred to as the feature matrix containing feature 
(column) vectors representing usage patterns of promi- 
nent weighted terms, while H is referred to as the coef- 
ficient matrix because its columns describe how each 
document spans each feature and to what degree (see 
figure 2). 

In general, the NMF problem can be stated as follows:
given a nonnegative real-valued $m \times n$ matrix $A$
and an integer $k$ such that $0 < k < \min(m, n)$, find
two nonnegative matrices $W$ ($m \times k$) and $H$ ($k \times n$) that
minimize the cost function

$$f(W, H) = \|A - WH\|_F^2,$$

where the norm is the Frobenius norm [I.2 §20].

Figure 2 Nonnegative feature and coefficient matrices
from NMF; peaks or large components in individual feature
(weight) vectors are dominant terms (documents). The subgraphs
under each matrix factor (W and H) reflect the values
of the nonnegative components of each column.

5.2 Convergence Issues 

The minimization of f(W,H) can be challenging due
to the existence of local minima, owing to the fact that
f(W,H) is nonconvex in W and H jointly. Due to the
underlying iterative process, NMF-based methods do
not necessarily converge to a unique solution, so the
resulting matrix factors (W,H) depend on the initial
conditions. One approach to remedy the nonunique
solution problem is to avoid the use of randomization
for the initial W and H factors. A common
choice is to use the positive components of the truncated
SVD factors of the original term-by-document
matrix. The LSI factors described above can therefore
be used to seed the iterative NMF process with a fixed
starting point (initial W and H) that will converge to
the same minimum (final W and H) for repeated
factorizations. Having multiple NMF solutions does not
necessarily mean that any of the solutions must be 
erroneous. However, having a consistent ordering of 
interpretable features is definitely advantageous for 
knowledge-discovery applications. 
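
The iterative process and its dependence on the starting point can be sketched as follows. The text above does not name a particular iteration, so this example assumes the multiplicative update rule associated with Lee and Seung; the toy matrix, the fixed (nonrandom) starting factors, the iteration count, and the function names are likewise assumptions. The fixed start here is an arbitrary positive pattern standing in for the SVD-based seeding described above.

```python
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(c) for c in zip(*X)]


def cost(A, W, H):
    """Squared Frobenius norm of A - WH."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb))


def nmf(A, W, H, iters=200, eps=1e-12):
    """Multiplicative updates: entries are only ever multiplied by nonnegative
    ratios, so W and H stay nonnegative if they start nonnegative."""
    history = [cost(A, W, H)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + eps) for h, p, q in zip(rh, rp, rq)]
             for rh, rp, rq in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(rw, rp, rq)]
             for rw, rp, rq in zip(W, num, den)]
        history.append(cost(A, W, H))
    return W, H, history


# Toy nonnegative term-by-document matrix (4 terms, 3 documents) and rank k = 2.
A = [[2.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]]
m, n, k = 4, 3, 2
# Fixed positive starting factors; the two columns of W0 (and rows of H0) differ,
# so the two features are not forced to remain identical.
W0 = [[1.0 + ((i + 2 * a) % 3) for a in range(k)] for i in range(m)]
H0 = [[1.0 + ((2 * a + j) % 3) for j in range(n)] for a in range(k)]

W, H, history = nmf(A, W0, H0)
print("initial cost:", round(history[0], 4), "final cost:", round(history[-1], 4))
```

Because the starting factors are fixed, repeating the factorization reproduces the same W and H, including the order of the features. A different start may lead to a different local minimum.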

6 Automation and Scalability 

All of the methods discussed above leverage mathemat- 
ical decompositions to analyze the meaning of natural 
language text where the structure is not explicit. These 
methods provide a means of processing text in an automated
way that is capable of handling large volumes
of information. With the continued expansion of the 
amount of text to be analyzed, techniques such as LSI 
and NMF can be used for scalable yet robust document 
clustering and classification. 
VII.25 Voting Systems

Donald G. Saari 


1 Paradoxical Outcomes 

As only addition is needed to tally ballots, I once mis- 
takenly believed that “mathematical voting theory” was 
an oxymoron. To explore this thought, consider a sim- 
ple setting in which 19 voters are selecting a social club 
president from among Ann, Barb, and Connie. Suppose 
the profile (i.e., a list of how many voters prefer each 
ranking) is as in table 1. 

It just takes counting to prove that Ann wins; the plu- 
rality (vote-for-one) outcome is A > B > C with tally 
8:7:4. Counting even identifies the “vote-for-two” win- 
ner as . . . Connie. This is a surprise because Connie is the 
plurality loser! Not only does this voting rule change 
the “winner,” but its election ranking of C > B > A 
(with tally 14:13:11) reverses the plurality ranking. To 
add to the confusion, the Borda count winner (ascer- 
tained from assigning 2, 1, and 0 points, respectively, to 
a ballot’s top-, second-, and third-positioned candidate) 
is Barb. So Ann wins with one rule, Barb with another, 
and Connie with a third. Perhaps clarity comes from the 
majority-vote paired comparisons because Barb beats
Ann (11:8) and Ann beats Connie (11:8), indicating the
B > A > C outcome. But no, Connie beats Barb (10:9),
thus creating a cycle, so this rule’s outcome suggests
that no candidate is favored!

Table 1 Social club president voting profile.

Number   Ranking       Number   Ranking
2        A > B > C     4        C > B > A
6        A > C > B     4        B > C > A
0        C > A > B     3        B > A > C
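
The tallies quoted for this profile can be checked mechanically. The following short sketch encodes the table 1 profile and counts ballots under each rule; the dictionary layout and the function names are choices made for this illustration only.

```python
from itertools import combinations

# Table 1 profile: ranking (best to worst) -> number of voters.
profile = {
    "ABC": 2, "ACB": 6, "CAB": 0,
    "CBA": 4, "BCA": 4, "BAC": 3,
}


def positional_tally(profile, weights):
    """Give weights[j] points to the candidate in position j of each ballot."""
    tally = {c: 0 for c in "ABC"}
    for ranking, voters in profile.items():
        for position, candidate in enumerate(ranking):
            tally[candidate] += weights[position] * voters
    return tally


def pairwise_tally(profile, x, y):
    """Number of voters ranking x above y, and number ranking y above x."""
    x_over_y = sum(v for r, v in profile.items() if r.index(x) < r.index(y))
    return x_over_y, sum(profile.values()) - x_over_y


print("plurality   ", positional_tally(profile, (1, 0, 0)))
print("vote-for-two", positional_tally(profile, (1, 1, 0)))
print("Borda       ", positional_tally(profile, (2, 1, 0)))
for x, y in combinations("ABC", 2):
    print(x, "vs", y, pairwise_tally(profile, x, y))
```

Nothing beyond addition is involved, yet the three positional rules return three different winners and the paired comparisons form a cycle.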

This example captures a concern that should bother 
everyone: election outcomes can more accurately reflect 
the choice of a voting method rather than the intent of 
the voters. A crucial role played by mathematics is to 
explain why this can happen and to identify which rules 
have outcomes that most accurately reflect the voters’ 
views. More generally, because “voting” serves as a pro- 
totype for general aggregation methods, such as those 
used in statistics or in approaches that are central to 
the social sciences, expect the kinds of problems that 
arise with voting rules to identify difficulties that can 
arise in these other areas. 

To tackle these issues, mathematical structures must 
be found. To do so, assign each alternative to a vertex 
of an equilateral triangle, as in figure 1(a). The rank- 
ing assigned to a point in the triangle is determined by 
its proximity to each vertex: for example, points on the 
vertical perpendicular bisector represent indifference, 
or an A-B tie, denoted by A ~ B. The perpendicular
bisectors divide figure 1(a) into six regions, with each 
small triangle representing all of the points with a par- 
ticular ranking. For example, region 1 points are closest 
to A, and next closest to B, so they have the A > B > C
ranking; region 5 represents the B > C > A ranking. 

To exploit this geometry, place the introductory 
example’s profile entries in the associated region; this 
is illustrated in figure 1(b). Two voters have the region 1 
ranking, for instance, so place “2” in this triangle; three 
voters have the region 6 ranking, so place “3” in that 
triangle. An advantage gained by this geometry is that 
it separates the entries in a manner that simplifies the 
tallying process. With the {A, B} vote, for example, all
voters preferring A to B are in the shaded region to 
the left of the vertical line in figure 1(b). Thus, to com- 
pute this majority-vote tally, just add the numbers on 
each side of this line. Doing the same with each of
figure 1(b)’s perpendicular bisectors leads to the tallies
listed by each edge. 
Figure 1 Profile representations, and tallying ballots:
(a) ranking regions; (b) paired comparisons; and (c) positional
outcomes.


The plurality tally is equally simple. Regions sharing
a vertex share the same top-ranked candidate (e.g.,
both shaded regions in figure 1(c) have A as the top-ranked
candidate), so add these values. This summation
defines the values that are listed by the vertices
after selecting the s value (which is introduced next) to
be s = 0, or the A > B > C tally of 8:7:4.

“Positional rules” use specific weights $(w_1, w_2, w_3)$,
$w_j \ge w_{j+1}$, where $w_3 = 0$ and $w_1 > 0$. To tally a ballot,
assign $w_j$ points to the jth-positioned candidate. The
plurality vote, then, is (1,0,0), the vote-for-two rule
gives (1,1,0), and the Borda count gives (2,1,0). To
simplify the process, normalize the values by dividing
by $w_1$ to obtain $w_s = (1, s, 0)$, where $s = w_2/w_1$ is
the “second place” value; for example, the normalized
Borda count is $w_{1/2} = (1, \tfrac{1}{2}, 0)$.

With this normalization, a candidate’s $w_s$ tally becomes
“her plurality tally plus {s times her second place
votes}.” For instance, as A is second ranked in the two
regions with arrows in figure 1(c), her $w_s$ tally is 8 + 3s;
the B and C tallies are similarly computed and listed
by their vertices. To illustrate, the “vote-for-two” C >
B > A tally of 14:13:11 in the introductory paragraph is
recovered by using s = 1; the normalized Borda ($s = \tfrac{1}{2}$)
B > A > C tally is $10 : 9\tfrac{1}{2} : 9$, so the standard (2,1,0)
Borda tally is double that: 20:19:18.

This tallying approach allows questions to be answered
with simple algebra; for example, to find which
$w_s$ rule causes an A ~ B tie, the equation 8 + 3s = 7 + 6s
proves it is $s = \tfrac{1}{3}$, which is equivalent to a (3,1,0) rule.
Similar algebraic computations prove that the profile in
figure 1(b) admits seven different positional outcomes,
four of which are strict rankings, that is, rankings
without any ties.

In general and by using the geometric structures of
higher-dimensional simplexes, we now know that, for
n candidates, profiles exist that allow precisely k different
strict election rankings, where k is any integer
satisfying $1 \le k \le (n - 1)[(n - 1)!]$; for example, with
n = 10 candidates, a profile can be created that has
3 265 920 different positional election rankings. Among
these rankings, it is possible for each candidate to be
first, then second, then third, then ..., then last ranked
just by using different positional voting rules!

As is typically true in applied mathematics, results
are often discovered via experimentation. So, let me
invite the reader to use the above methods to solve
the following four problems. Answers (given later)
motivate the mathematical structures of voting rules.


(i) Not all voters want A to be the figure 1(c) plurality 
winner. Identify all voters who could vote “strate- 
gically” to force a personally preferred outcome. 

(ii) Create a profile with the A > B > C plurality outcome
and a 10:9:8 tally in which the paired comparison
(majority-vote) outcome is an A > B, B > C,
C > A cycle. 

(iii) Create a different profile with the same plurality tal- 
lies but where B now beats both A and C in majority 
votes; by beating everyone, Barb is the Condorcet 
winner. 

(iv) Create a third profile with these plurality tallies but 
the vote-for-two outcome is whatever you want: B > 
C > A, say. 
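
The normalized positional tallies for the table 1 profile can also be explored numerically. In the sketch below the plurality and second-place counts are read off from that profile (giving 8 + 3s for A, 7 + 6s for B, and 4 + 10s for C); exact fractions are used so that ties are detected reliably. The function names and the output notation are assumptions made for illustration.

```python
from fractions import Fraction

plurality = {"A": 8, "B": 7, "C": 4}     # first-place votes in the table 1 profile
second = {"A": 3, "B": 6, "C": 10}       # second-place votes in the same profile


def ws_tally(s):
    """Plurality tally plus s times the second-place votes."""
    return {c: plurality[c] + s * second[c] for c in plurality}


def ranking(s):
    """Election ranking under the normalized rule (1, s, 0); '~' marks a tie."""
    t = ws_tally(s)
    order = sorted(t, key=lambda c: (-t[c], c))
    out = order[0]
    for prev, cur in zip(order, order[1:]):
        out += (" ~ " if t[prev] == t[cur] else " > ") + cur
    return out


# Values of s at which two candidates tie, found by equating their tallies.
ties = sorted({Fraction(plurality[x] - plurality[y], second[y] - second[x])
               for x in "ABC" for y in "ABC" if x < y})

# One sample value of s strictly between consecutive tie points on [0, 1].
samples = [Fraction(0)] + ties + [Fraction(1)]
midpoints = [(a + b) / 2 for a, b in zip(samples, samples[1:])]
outcomes = {ranking(s) for s in ties + midpoints}

print("tie values of s:", [str(s) for s in ties])
for s in sorted(ties + midpoints):
    print(str(s).rjust(5), ranking(s))
```

Sweeping s from 0 to 1 passes through three tie values, the first of which is s = 1/3, and yields seven distinct outcomes, four of them strict rankings.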

2 Two Central Results 

Solving these questions makes it clear that the infor- 
mation used in paired comparisons differs from that 
used with plurality or any $w_s$ method. This assertion is
supported by the shaded regions in parts (b) and (c) of
figure 1: A’s plurality tally uses information from two
regions, while A’s majority vote uses three. Differences
become more pronounced with $w_s$ methods where A’s
outcome involves information from four regions! But 
if rules use different information, then different out- 
comes must be expected. This raises an interesting 
mathematical challenge: to invent a voting rule that is 
free from paradoxical behaviors. 

In examining this issue, Kenneth Arrow put forth the 
following ground rules. 
(i) Voters have complete, transitive preferences, i.e.,
they rank each pair of candidates, and if a voter 
prefers X > Y, and Y > Z, then the voter prefers 
X > Z. 

(ii) To ensure reasonable outcomes, the societal ranking
is also required to be complete and transitive.
(The “societal ranking” is the outcome of the group 
decision rule; if the rule is an election, then this is 
the “election ranking.”) 

(iii) The rule satisfies a unanimity condition (called 
the Pareto condition) where for any pair {X, Y}, if
everyone prefers X > Y, then X > Y is the societal 
outcome. 

(iv) Why just unanimity? When determining the {Ann, 
Barb} societal ranking, what voters think about 
Connie should not matter. Arrow’s independence of 
irrelevant alternatives (IIA) condition requires each 
pair’s societal ranking to depend only on how each 
voter ranks that particular pair; other information 
is irrelevant. 

While these conditions appear to be innocuous, the 
surprising fact is that only one rule satisfies them: a dic- 
tator (i.e., the rule is a function of one variable)! Namely, 
the rule can be treated as selecting a particular voter, 
say Mikko. Then, for all elections the rule’s outcome 
merely reports Mikko’s preferences as the societal out- 
come. From a mathematical perspective, Arrow's theo- 
rem proves that the information used by paired com- 
parisons (figure 1(b)) is incapable of determining tran- 
sitive rankings for three or more candidates. What kind 
of information is appropriate? (For a different interpre- 
tation of Arrow’s result, see the last listed reference in 
the further reading section below.) 

Another central result addresses whether strategic 
voting can be avoided. A “strategic vote” is one where a 
voter votes in a crafty manner to obtain a personally 
preferred outcome. This option is not available with 
two candidates, say Ann and Barb, where Ann will win a 
majority vote. Someone supporting Barb has precisely 
two choices, but neither is strategic. This is because 
voting for Barb is sincere, while voting for Ann is coun- 
terproductive, i.e., a “two-candidate” setting does not 
provide enough wiggle room to be strategic. 

Three candidates admit more possibilities. With fig- 
ure 1(c), a nonsincere vote is counterproductive for sup- 
porters of A or B. What remains are those C voters who 
prefer B to A; by strategically voting for B, rather than 
C, they achieve a personally preferred outcome (B over
A); for example, strategic voting moves the four votes
by the C vertex toward B. A similar analysis holds for
any positional rule; for example, the figure 1(c) vote-for-two
winner is C, while B is the Borda winner. In each
case, there are ways for certain voters to be strategic.

Figure 2 Creating “paradoxes”: (a) cycle; (b) Condorcet B;
and (c) vote-for-two B > C > A.
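
The plurality example of strategic voting can be replayed directly. This is a small sketch using the table 1 profile; the way the strategic ballots are recorded (moving the four voters to a ballot topped by B) is an assumption that affects only the plurality count.

```python
sincere = {"ABC": 2, "ACB": 6, "CAB": 0, "CBA": 4, "BCA": 4, "BAC": 3}


def plurality(ballots):
    """Count only the top-listed candidate on each ballot."""
    tally = {c: 0 for c in "ABC"}
    for ranking, voters in ballots.items():
        tally[ranking[0]] += voters
    return tally


# The four voters with sincere ranking C > B > A cast a ballot topped by B instead.
strategic = dict(sincere)
strategic["CBA"] -= 4
strategic["BCA"] += 4

print("sincere  ", plurality(sincere))
print("strategic", plurality(strategic))
```

With sincere ballots A wins; after the switch B wins, which these four voters prefer to a win by A.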

For essentially these geometric reasons, the Gibbard- 
Satterthwaite theorem asserts that, for any group deci- 
sion or election rule involving three or more candidates, 
there exist situations where some voter can be strate- 
gic. Technical conditions are added to ensure that, say, 
it is possible for each candidate to win with some pro- 
file (e.g., the rule is not a de facto comparison of one 
pair). While the proofs are combinatoric, the mathe- 
matical reasons are essentially as above; this theorem 
is a directional derivative result where three or more 
alternatives are needed to provide enough “directions.” 

3 The Mathematical Structure 
of Positional Rules 

Answers to questions (ii)-(iv) (figure 2) identify the 
basic mathematical structure of voting rules. The plu- 
rality tally constraint requires placing numbers in the 
triangle that will have the specified sum by each vertex, 
e.g., the two numbers by the A vertex must sum to 10. 
Combinatorics proves that there are 11 × 10 × 9 = 990 
ways to do so. To identify choices preferring A over B 
in a majority vote, insert a shaded bar in the A > B 
region to the left of the vertical line (figure 2(a)). Simi- 
larly, to highlight B over C, place a shaded bar on B’s 
side of the perpendicular bisection, with a similar bar 
for C over A. Next, select values to stress the shaded 
triangle; emphasizing its vertices creates the profile 
in figure 2(a), which will have a heavy cyclic vote. To 
answer (iii), place shaded bars (figure 2(b)) to identify 
where B dominates each of the other candidates. 

Indeed, it is possible to select any ranking for each 
pair, and profiles can be generated where the plural- 
ity tallies and paired outcomes are as specified. As 
the rules use different information, the mathemati- 
cal assertion is that consistent election outcomes for 
these rules cannot be expected, nor are they even very 
likely. This negative conclusion holds for any number 
of candidates and most positional rules. 

To solve (iv) with the figure 1(c) $w_s$-tallying method, 
emphasize “second-place votes.” To keep plurality win- 
ner A from receiving a larger tally, avoid placing num- 
bers in her two figure 2(c) “second-place” regions, 
which are indicated by the dashed double-headed 
arrow. To assist B, place values in her second-place 
regions indicated by the solid double-headed arrow. 
An extreme (figure 2(c)) profile has the plurality win- 
ner A but a vote-for-two ranking B > C > A with tally 
26:17:10. 

With these tools, the reader can create examples 
to illustrate almost all of the possible three-candidate 
paradoxical behaviors. Of more value is to use these 
tools to extract the mathematical structures that will 
explain these millennia-old puzzles. To do so, notice 
how paired comparison differences (figure 2(b)) involve 
the vertices of the shaded equilateral triangles; they 
create combinations of profiles of the form 

A > B > C,  B > C > A,  C > A > B,  (1) 

with regions 120° apart. Positional differences are cre- 
ated by 

A > B > C,  C > B > A,  (2) 

expressions with regions diametrically apart. 

In particular, the symmetry structures exposed by my 
triangle approach identify the source of voting com- 
plexities. To capture these symmetries, express them 
as orbits of symmetry groups. The simplest group is 
$Z_2 = \{I, R\}$ consisting of the identity map I and a reversal R, 
where $R^2 = R \circ R = I$; applying this group to any 
ranking X > Y > Z yields the structure in (2); I keeps 
the ranking and mapping R reverses it. 

It is easy to show that differences in paired comparisons 
are immune to these $Z_2$ structures, which cause 
differences in positional outcomes, i.e., this orbit is 
in the kernel of differences between paired compari- 
son tallies. This is the mathematical property that per- 
mits outcomes for positional methods to differ as radi- 
cally as desired from paired comparison rankings. The 


unique exception is the Borda count, which also has 
this $Z_2$ orbit structure in its kernel. Thus only Borda 
rankings must be related to paired comparison rank- 
ings. More precisely, for any other positional method, 
profiles exist for which the positional outcome can be 
whatever is wished, say A > B > C, but for which the 
paired comparison is C > B > A; this kind of behavior 
can never happen with Borda. Assertions of this form 
extend to any number of candidates. 

The next-simplest permutation group is defined by 
P = (1,3,2), where the first-placed entry of X > Y > Z 
is moved to third place, the third-placed entry is moved 
to second, and the second-placed entry is moved to 
first place, creating Y > Z > X; orbits of this $\{I, P, P^2\}$ 
group, where $P^3 = I$ (so, permutation P is applied three 
times to a ranking), generate the behavior in (1), which 
causes all possible paired comparison differences. As 
this $Z_3$ structure has each candidate in each position 
precisely once, the $Z_3$ symmetry affects paired comparisons, 
but it never affects differences in tallies for any 
positional rule. Thus, this symmetry structure in a pro- 
file creates differences between paired and positional 
outcomes; again, Borda is immune to its effects. 
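
The effect of the two orbit structures can be checked by direct tallying. The following is a minimal sketch and not part of the original argument: rankings are encoded as strings such as "ABC" for A > B > C, and the names positional_tally, pairwise_margins, Z2_ORBIT, and Z3_ORBIT are illustrative assumptions.

```python
from itertools import combinations

# A profile maps a ranking (e.g. "ABC" means A > B > C) to a voter count.
Z2_ORBIT = {"ABC": 1, "CBA": 1}            # structure (2): a ranking and its reversal
Z3_ORBIT = {"ABC": 1, "BCA": 1, "CAB": 1}  # structure (1): a cyclic orbit

# Positional rules as (first, second, third) place weights.
PLURALITY = (1, 0, 0)
VOTE_FOR_TWO = (1, 1, 0)
BORDA = (2, 1, 0)


def positional_tally(profile, weights):
    """Total points per candidate when position k earns weights[k]."""
    tally = {}
    for ranking, count in profile.items():
        for position, candidate in enumerate(ranking):
            tally[candidate] = tally.get(candidate, 0) + count * weights[position]
    return tally


def pairwise_margins(profile):
    """Map (x, y) to (#voters ranking x above y) - (#voters ranking y above x)."""
    candidates = sorted(next(iter(profile)))
    margins = {}
    for x, y in combinations(candidates, 2):
        margins[(x, y)] = sum(
            count if ranking.index(x) < ranking.index(y) else -count
            for ranking, count in profile.items()
        )
    return margins


for name, profile in (("Z2", Z2_ORBIT), ("Z3", Z3_ORBIT)):
    print(name, "pairwise margins:", pairwise_margins(profile))
    for rule in (PLURALITY, VOTE_FOR_TWO, BORDA):
        print(name, rule, positional_tally(profile, rule))
```

Under these assumptions, the reversal pair ties every paired comparison, yet the plurality and vote-for-two tallies are not tied while the Borda tally is; the cyclic orbit ties every positional tally but yields the paired cycle A over B, B over C, C over A.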

Surprisingly, for three candidates, these structures 
completely explain all paradoxical outcomes, and they 
answer all classical questions concerning differences 
in positional and paired comparisons (details can be 
found in the last two listed references in the further 
reading section). Indeed, to create the introductory 
example, I started with one voter preferring B > C > A, 
added appropriate $Z_2$ reversal structures to create 
desired positional outcomes (they never affect Borda or 
paired rankings), and then added a sufficiently strong 
$Z_3$ paired comparison component (they never affect 
positional or Borda rankings) to create a cyclic paired 
comparison outcome. 

Note that only the Borda count is immune to the $Z_2$ and 
$Z_3$ symmetry structures that create paradoxical conclusions. 
This is what makes it easy to construct arguments 
showing why the Borda count is the unique rule 
that most accurately represents voter interests. 

Elections involving more candidates require finding 
appropriate symmetry group structures that affect one 
kind of subset of alternatives but not others. As an 
example, orbits of the Klein four-group (capturing sym- 
metries of squares) do not affect rankings for paired 
comparisons or four-candidate positional rankings, but 
they change positional rankings for triplets. Again, only 
the Borda count (assign n - j points to a jth-positioned 
candidate), which is a linear function, places all of these 
orbit structures in its kernel. As an immediate consequence, 
and for any number of candidates, only Borda 
is not affected by the symmetries that cause paradoxi- 
cal outcomes. That is, the Borda count is the unique rule 
that minimizes the kinds and likelihoods of paradoxical 
outcomes. 

Another class of voting theory explores what can hap- 
pen by changing a profile. An obvious example is the 
above-mentioned Gibbard-Satterthwaite strategic voting 
result, where a sincere profile is converted into 
a strategic one. Because this theorem proves that all 
rules are susceptible to strategic efforts, the next nat- 
ural question is to find a positional rule that is least 
affected by this behavior, that is, find the rule that is 
most unlikely to allow successful strategic actions. (The 
answer is the Borda count.) Other issues include under- 
standing how a winning candidate can lose an election 
while receiving more support, or how, by not voting, 
a voter can be rewarded with a personally preferred 
outcome. 

These results involve the geometric structure of sets 
of profiles in profile space. The Borda assertion about 
strategic voting, for instance, reflects its symmetry 
structure, where the differences between successive 
(2, 1, 0) weights agree. To indicate how this symmetry 
plays a role, recall how geometric symmetries reduce 
boundary sizes, e.g., of all rectangles with area one, the 
rectangle with the smallest boundary (perimeter) is a 
square. Similarly, for the set of profiles defining a given 
election ranking, the positional rule with the smallest 
boundary is the Borda count; this boundary consists 
of tie votes. But for a strategic voter to successfully 
change an election outcome, the profile’s election out- 
come must be nearly a tie, i.e., the profile must be near 
a boundary. So, by having the smallest boundary, the 
Borda count admits the smallest number of strategic 
opportunities. In contrast, the plurality vote, with its 
larger boundary, offers the greatest number of strategic 


opportunities, which means, as one would expect, that 
it is highly susceptible to successful strategic actions. 
Notice the troubling conclusion: the voting rule that 
is most commonly used to make group decisions (the 
plurality vote) is the most likely to have questionable 
outcomes, and it is the most susceptible to strategic 
actions. 

Other results, such as the situation in which receiving 
more support hurts a candidate or the situation in 
which not voting helps a voter, involve the geometry 
of regions of profile space combined with directional 
derivative sorts of arguments. As described above, for 
instance, the two rules used in a “runoff election” (a 
positional rule for the first election, a majority vote for 
the runoff) involve different profile structures; this dif- 
ference forces the set of all profiles in which a particu- 
lar candidate is the “winner” to have dents in its struc- 
ture (it is nonconvex). The lack of convexity admits set- 
tings in which a winning candidate loses by receiving 
more votes; namely, the straight line created by adding 
supporting voters moves the new profile outside of the 
“winning region.” (This could happen, for example, if 
the added support for the previously winning candidate 
changed her runoff opponent.) 

Final Perspectives 


VIII.1 Mathematical Writing 

Timothy Gowers 


1 Introduction 

The purpose of this article is not to offer advice about 
how to write mathematics well. Such advice can be 
found in many places. However, I do have three pieces 
of very general advice, which inform the rest of the arti- 
cle. The first is to be clear about your intended read- 
ership; for example, if you want what you write to be 
understood by an undergraduate, then do not assume 
knowledge of any terminology that is not standardly 
taught in undergraduate mathematics courses. The sec- 
ond is to aim for this readership to be as wide as pos- 
sible. If, with a small amount of extra explanation, you 
can make what you write comprehensible to an expert 
in another field of mathematics, then put in that extra 
effort. Whatever you do, do not worry that experts will 
not need the explanation; if that is true (which it often 
is not), then they can easily skip it. The third, which 
is related to the second, is to set the scene before you 
start. Some people who read what you have written will 
do so because they want to understand all the techni- 
cal details and use them in their own work, but the vast 
majority will not. Most readers, including people who 
need to make quick judgments that will profoundly 
affect your career, will want to read the introduction 
quickly to see what you have done and assess how 
important it is. However much you might wish every- 
one to read what you have written in complete detail, 
you should be realistic and cater for readers who just 
want to skim it. 

For the rest of this article I shall discuss various 
choices that one must make when writing a mathemat- 
ical document. I will not advocate choosing one way 
rather than another, since the choices you should make 
depend on what you want to achieve; my final piece 
of general advice is merely that you should make the 
choices consciously rather than by accident. 


2 Formality versus Informality 

There are (at least) two goals that one might have when 
writing a mathematical document. One is to establish a 
mathematical result by whatever means are appropri- 
ate to the field; in pure mathematics the usual require- 
ment is unambiguous definitions and rigorous proofs, 
whereas in applied mathematics other forms of evi- 
dence, such as heuristic arguments and experimental 
backing, may be acceptable. The other is to convey 
mathematical ideas to the reader. 

These two goals are often in tension. If a pure math- 
ematician discovers a complicated proof of a theorem, 
then that proof will be hard to understand. However, 
sometimes the apparent complication of a proof is mis- 
leading; what is really going on is that the author had 
one or two key ideas, and the complication of the argu- 
ment is the natural working out of the details of those 
ideas. For the expert reader, an informal explanation of 
the ideas that drive the proof may well be more valuable 
than the proof itself. 

Nobody would advocate writing papers with just 
informal explanations of ideas, since plausible-looking 
ideas often turn out not to work. However, there is still 
a choice to make, since it is considered acceptable to 
display proofs and not explain the underlying ideas. 
There may sometimes be circumstances where this is 
appropriate; for example, perhaps the proof is short, 
and explaining the ideas that generate it will double 
the length of what you are writing and put off readers. 
But usually, the advice I gave earlier (to broaden your 
readership if it is not too difficult to do so) would 
dictate that technical arguments should be accompanied 
by informal explanations. 

3 Giving Full Detail versus 
Leaving Details to the Reader 

When you are writing you need to decide how much 
detail to give. If you give too little, then the reader 
you are aiming at will not be able to understand what 
you have written. But you also want to avoid making 
too many points that the reader will find completely 
obvious. 

Of these two potential problems, the first is undoubt- 
edly more serious. It is much easier for readers to skip 
details that they find too obvious to be worth saying 
than it is for them to fill in details that they do not find 
obvious at all. 

The question of how much detail to give is related 
to the question of how formal to be, but it is not the 
same question. It is true that there is a tendency in 
informal mathematical writing to leave out details, but 
with even the most formal writing a decision has to be 
made about how much detail to give; it is just that in 
formal writing one probably wants to signal more care- 
fully when details have been left out. This can be done 
in various ways. One can use expressions such as “It is 
an easy exercise to check that...,” or “The second case 
is similar,” which basically say to the reader, “I have 
decided not to spell out this part of the argument.” 
One can also give small hints, such as “By compact- 
ness,” or “An obvious inductive argument now shows 
that...,” or “Interchanging the order of summation and 
simplifying, we obtain....” 

If you do decide to leave out detail, it is a good idea to 
signal to the reader how difficult it would be to put that 
detail in. A mistake that some writers make is to give 
references to other papers for arguments that can eas- 
ily be worked out by the reader, without saying that the 
particular result that is needed is easy. This is straight- 
forwardly misleading; it suggests that the best thing to 
do is to go and look up the other paper when in fact the 
best thing to do is to work out the argument for oneself. 

4 Letters versus Words 

The following is problem 10 of book 1 of an English 
translation of Diophantus's Arithmetica. 

Given two numbers, to add to the lesser and to subtract 
from the greater the same (required) number so as to 
make the sum in the first case have to the difference in 
the second case a given ratio. 

A modern writer would express the same problem more 
like this. 

Given two numbers a and b and a ratio p, find x such 
that a + x = p(b - x). 

The main difference between these two ways of describ- 
ing the problem is that in the second formulation the 


numbers under discussion have been given names. 
These names take the form of letters, which allow us 
to replace wordy expressions such as “the second num- 
ber” or “the given ratio” by letters such as “b” and “p.” 

The advantage of modern notation is that it is much 
more concise. This is not just a matter of saving paper; 
the extra length of “to make the sum in the first case 
have to the difference in the second case a given ratio” 
over “such that a + x = p(b - x)” makes it significantly 
harder to understand because it is difficult to take in 
the entire phrase at once. 
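
Naming the quantities also makes the solution a short computation: rearranging a + x = p(b - x) gives x(1 + p) = pb - a. The sketch below illustrates this under stated assumptions; the function name diophantus_x and the sample values are hypothetical and are not taken from the Arithmetica.

```python
from fractions import Fraction


def diophantus_x(a, b, p):
    """Solve a + x = p * (b - x) for x, assuming p != -1.

    Rearranging gives x * (1 + p) = p * b - a.
    """
    a, b, p = Fraction(a), Fraction(b), Fraction(p)
    if p == -1:
        raise ValueError("no unique solution when p = -1")
    return (p * b - a) / (1 + p)


# Illustrative values only: lesser number 20, greater number 100, ratio 3.
x = diophantus_x(20, 100, 3)
print(x)
assert 20 + x == 3 * (100 - x)
```

Exact rational arithmetic is used so that the final check is an equality rather than a floating-point comparison.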

However, the concision that comes from naming 
mathematical objects comes at a cost: one has to learn 
the names. In the example above, that is very easy 
and the cost is negligible. However, sometimes it is far 
from negligible. The following proposition comes from 
a paper in Banach space theory. 

Proposition. Let $0 \le \alpha < \tfrac{1}{2}$ and $1/p = \tfrac{1}{2} - \alpha$. Then 
$\mathfrak{P}_2(E,F) \subset \mathcal{L}^{(a)}_{p,\infty}(E,F)$ 

for all Banach spaces F if and only if $E \in \Gamma_\alpha$. 

Just before the proposition, the reader has been told 
that $\mathfrak{P}_2$ is the ideal of 2-summing operators from E to 
F, which is a standard definition in the area. As for 
$\mathcal{L}^{(a)}_{p,\infty}(E,F)$, this has been defined early in the paper 
as follows. (It is not necessary to understand these 
definitions to understand the point I am making.) 

Given an operator T, the approximation number $a_n(T)$ 
is defined to be $\inf\{\|T - L\| : \operatorname{rank}(L) < n\}$. Then 
$\mathcal{L}^{(a)}_{s,\infty}(E,F)$ is the set of operators T such that the 
sequence $(a_n(T))_{n=1}^{\infty}$ belongs to the Marcinkiewicz 
space $\ell_{s,\infty}$. 

The definition of the Marcinkiewicz space is again standard 
in the area. Finally, the set $\Gamma_\alpha$ is defined to be the 
set of all Banach spaces of weak Hilbert type $\alpha$. That 
is not a standard definition, but it is given earlier on in 
the paper. 

Thus, another way of stating the proposition is as 
follows. 

Proposition. Let $0 \le \alpha < \tfrac{1}{2}$, let $1/p = \tfrac{1}{2} - \alpha$, and let E 
be a Banach space. Then the following two statements 
are equivalent. 

(1) For every Banach space F and every 2-summing 
operator $T\colon E \to F$, the sequence $(a_n(T))_{n=1}^{\infty}$ of 
approximation numbers belongs to the Marcinkiewicz 
space $\ell_{p,\infty}$. 

(2) E is a space of weak Hilbert type $\alpha$. 

This time, it is the wordier definition that places a 
smaller burden on the reader’s memory. If you know 
what 2-summing operators, approximation numbers, 
the Marcinkiewicz space, and weak Hilbert type are, 
most of which are standard definitions in the area, 
then you can understand without much effort what 
the proposition is claiming. With the first formulation, 
there is an extra step you have to perform to unpack 
the notation into those standard definitions. Another 
advantage of the second formulation is that a suffi- 
ciently expert reader who is skimming the paper will 
be able to understand it without having to look back in 
the paper to find out what everything means. The first 
formulation does not leave that option open. 

Thus, in more complicated mathematical writing, 
there is another source of tension. If you use too lit- 
tle notation, your sentences will become hopelessly 
clumsy and repetitive, but if you use too much, you 
are placing excessive demands on the memory of your 
readers. 

This may be a delicate balance to strike, but there is 
one principle that applies universally: if you do decide 
to use some nonstandard notation, then make sure that 
the reader can easily find where it is defined. This can 
be done by means of a section devoted to preliminary 
definitions, though it will often be kinder to give defi- 
nitions just before they are used. If that is not possible, 
one can give reminders of definitions, or at the very 
least pointers to where they can be found. 

5 Single Long Arguments versus 
Arguments Broken Up into Modules 

If you are trying to justify a mathematical statement 
and the justification is long and complicated, then 
what you write may well be hard to understand unless 
you can somehow break the argument up into smaller 
“modules” that fit together to give you what you want. 
In pure mathematics, these modules usually take the 
form of lemmas. If you are proving a theorem and 
you do not want the proof to become unwieldy, then 
you try to identify parts of the argument that can be 
extracted and proved separately. One can then simply 
quote these results in the main argument. Lemmas play 
a role in proofs that is similar to the role of subroutines 
in computer programs. 

For breaking up an argument to be a good idea, it 
greatly helps if the part of the argument you want to 
extract is not too context dependent. If the statement of 
a lemma requires a long piece of scene setting, then it is 


probably better to leave it in the main body of the argu- 
ment, where the scene has already been set. However, if 
it can be stated without reference to the particular con- 
text, which usually means that it is more general than 
the particular application needed of it in the main argu- 
ment, then it is more appropriate to extract it. Again, 
this is a matter of judgment. 

A disadvantage of more modular arguments is that 
extracting lemmas, or more general modules, forces 
you to put them somewhere where they do not arise 
naturally. If you put them before the main argument, 
so that they will be available when needed, then the 
reader is presented with statements of no obvious use 
and is expected to remember them. If they are particu- 
larly memorable, then that is not a problem, but often 
they are not; for instance, they may depend on two or 
three slightly odd conditions that just happen to be 
satisfied in the later argument. If you put them after 
the main argument, then the reader keeps being told, 
“We will prove this claim later” and reaches the end 
of the argument with the uneasy feeling that the proof 
is incomplete. A third possibility is to state and prove 
lemmas within an argument, but nested statements of 
this kind can be fairly ugly. 

With some complicated arguments, there may be 
no truly satisfactory solution to these problems. In 
that case, the best thing to do may well be to choose 
an unsatisfactory solution and mitigate the problems 
somehow. The default option is probably to state lem- 
mas before they are used. If you choose that option 
and the lemmas are somewhat complicated and hard 
to remember, then you can always add a few words of 
explanation about the role that the lemma will play. If 
even that is hard to do, then another option is to advise 
the reader to read the main argument first and return to 
the lemma only when the need for it has become clear. 
(An experienced reader may well do that anyway, but it 
is still helpful to be told by the author that it is a good 
approach to understanding the argument.) 

6 Logical Order versus Order of Discovery 

Suppose you wish to present the fact that a sequence 
of continuous functions that converges pointwise does 
not have to converge uniformly. Here is one way that 
you might do it. 

Theorem. There exists a sequence of continuous functions 
$f_n\colon [0,1] \to [0,1]$ that converges pointwise but 
not uniformly. 

Proof. For each positive integer n and each $x \in [0,1]$, 
let $f_n(x) = nxe^{-nx}$. Then, for x > 0 we have $e^{-x} < 1$, 
so $ne^{-nx} = n(e^{-x})^n \to 0$ as $n \to \infty$. It follows that 
$f_n(x) \to 0$ as $n \to \infty$. Also, when x = 0 we have $f_n(x) = 0$ 
for every n, so again $f_n(x) \to 0$ as $n \to \infty$. Therefore, 
$f_n(x) \to 0$ pointwise. 

However, the convergence is not uniform. To see this, 
observe that $f_n(n^{-1}) = e^{-1}$ for every n. Thus, for every 
n there exists x such that $|f_n(x) - 0| \ge e^{-1}$. □ 
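
The two claims in this proof can be checked numerically. The following is only an illustrative sketch; the function name f and the sample point 0.25 are assumptions made for the example.

```python
import math


def f(n, x):
    """f_n(x) = n * x * exp(-n * x) on [0, 1]."""
    return n * x * math.exp(-n * x)


fixed_x = 0.25  # any fixed point of (0, 1]; the choice is arbitrary
for n in (1, 10, 100, 1000):
    value_at_fixed_x = f(n, fixed_x)  # shrinks toward 0 as n grows
    value_at_peak = f(n, 1.0 / n)     # stays at exp(-1), up to rounding
    print(n, value_at_fixed_x, value_at_peak)
```

At a fixed x the values shrink as n grows, while the value at x = 1/n does not, which is exactly the gap between pointwise and uniform convergence.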

This proof has a feature that is common in mathe- 
matics: it is easier to follow the steps than it is to see 
where the steps came from. If you are told to try the 
functions $f_n(x) = nxe^{-nx}$, then checking that they 
satisfy the conditions is a straightforward exercise, but 
what made anybody think of that particular sequence 
of functions? 

Here is what we might write if we wanted to make the 
answer to that last question clearer. 

Proof. If $f_n \to f$ pointwise but not uniformly, then 
$f_n - f \to 0$ pointwise but not uniformly, so we may as 
well look for functions that converge to zero. In order 
to ensure that they do not converge uniformly to zero, 
we need a positive number $\theta$ such that for infinitely 
many n there exists $x \in [0,1]$ with $|f_n(x)| \ge \theta$. Since 
infinitely many of these $f_n(x)$ will have the same sign, 
and since we can multiply all functions by $\theta^{-1}$, we may 
as well look for a sequence of functions $f_n$ that converges 
pointwise to 0 such that for every n there exists 
$x_n$ with $f_n(x_n) \ge 1$. 

Now, if $f_n(x_n) \ge 1$ and $f_n$ is continuous, then there 
exists an open interval $I_n = (x_n - \delta_n, x_n + \delta_n)$ around 
$x_n$ such that $f_n(y) \ge \tfrac{1}{2}$ for every $y \in I_n$. We are going 
to have to make sure that we do not have infinitely 
many of these intervals overlapping in some point u, 
since then we would have $f_n(u) \ge \tfrac{1}{2}$ for infinitely many 
n, which would imply that $f_n(u)$ does not tend to zero. 

How can we find infinitely many open intervals with- 
out infinitely many of them overlapping? The sim- 
plest way of doing it is to take intervals of the form 
$(a, b_n)$ for a sequence $(b_n)$ that converges to a. So, for 
example, we could take $I_n$ to be the interval (0, 1/n). 

This suggests that we should let $f_n$ be a continuous 
function that takes the value 1 somewhere inside the 
interval (0, 1/n) and is small outside that interval. One 
way of defining a function that reaches 1 for a small 
value of x and then quickly drops back down again is to 
take a function that grows rapidly to 1, such as $g_n(x) = \lambda x$, 
and multiply it by a function that is roughly 1 for a 
little while and then decays rapidly, such as $e^{-\mu x}$. The 
rapid decay of $e^{-\mu x}$ starts when x is around $1/\mu$, which 
suggests that we should take $\mu$ to be around n. Since 
we want $g_n(x)$ to reach 1 in the interval (0, 1/n), we 
should probably take $\lambda$ to be around n as well. 

It is now easy to check that the functions $f_n(x) = 
nxe^{-nx}$ converge pointwise to zero but not uniformly. 

□ 

Of course, one might well give a detailed proof that 
the functions $nxe^{-nx}$ do the job. 

As with the other choices, there are advantages and 
disadvantages that need to be weighed up when deciding 
how much to explain the origin (or at least a possible 
origin) of the ideas one presents. If one’s main concern 
is verification of a result (that is, convincing the 
reader of its truth), then it may not matter too much 
where the ideas come from as long as they work. But if 
the aim is to teach the reader how to solve problems of 
a certain kind, then presenting solutions that appear 
out of nowhere as if by magic is not helpful. What is 
more, demonstrating where the ideas come from gives 
the reader a much clearer idea of which features are 
essential and which merely incidental. For example, in 
the argument above it is clear from the second presentation 
that there is nothing special about the functions 
$f_n(x) = nxe^{-nx}$: for $f_n(x)$ one could take any nonnegative 
function such that $f_n(0) = 0$, $f_n(1/n) \ge c$ 
(for some fixed constant c), and $f_n(x)$ is small for 
every $x \ge 2/n$. For instance, one could take a “witch’s 
hat” that equals $nx$ when $0 \le x \le 1/n$, $2 - nx$ when 
$1/n \le x \le 2/n$, and 0 when $2/n \le x \le 1$. 
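
The witch’s hat alternative can be written down and checked in a few lines. This is a sketch under stated assumptions: the name witch_hat and the closed form max(0, 1 - |nx - 1|), which agrees with the three cases above, are choices made for the illustration.

```python
def witch_hat(n, x):
    """Piecewise-linear hat: n*x on [0, 1/n], 2 - n*x on [1/n, 2/n], 0 beyond.

    Written as max(0, 1 - |n*x - 1|), which agrees with the three cases.
    """
    return max(0.0, 1.0 - abs(n * x - 1.0))


fixed_x = 0.3  # arbitrary fixed point of (0, 1]
for n in (1, 2, 5, 10, 100):
    # The value at fixed_x is 0 once 2/n <= fixed_x; the peak at 1/n stays at 1.
    print(n, witch_hat(n, fixed_x), witch_hat(n, 1.0 / n))
```

Here the pointwise limit is reached exactly, since the value at any fixed x > 0 is zero as soon as n is at least 2/x, while the peak value at x = 1/n never drops.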

That is not to say that a diligent reader cannot look 
at a presentation of the first kind and work out for 
him/herself where the idea might have come from. In 
this case, if one sketches the graph of $f_n(x)$, one sees 
that it grows and shrinks rapidly in a small interval near 
0 and is small thereafter, and then it becomes clear why 
these functions are suitable. However, one needs experience 
to be able to do this with an argument. So the 
extent to which you should explain where your arguments 
come from depends largely on the level of experience 
of your intended reader, both generally and in 
the specific area you are writing about. 

7 Definitions First versus Examples First 

Suppose that one wanted to write an explanation of 
what a topological manifold is. An obvious approach 
would be to start by giving the definition. That could 
be done as follows. 

Definition. A d-dimensional topological manifold is a 
topological space X such that every point x in X has 
a neighborhood that is homeomorphic to a connected 
open subset of $\mathbb{R}^d$. 

Having done that, one would give a few examples 
of topological manifolds, such as spheres and tori, to 
illuminate the definition. 

An alternative approach is to start with a brief discus- 
sion of the examples. One could point out, for instance, 
that it is easy to come up with a satisfactory coordinate 
system for any small region of the world but that it is 
not possible to find a good coordinate system for the 
world in its entirety; there will always be annoying prob- 
lems such as the poles not having well-defined longi- 
tudes. A discussion of that kind will give the reader the 
informal concept of a space that is “locally like $\mathbb{R}^d$” and 
after that the formal definition is motivated: it is the 
formal expression of an informal idea that the reader 
already has. 

The advantage of the second approach is that an 
abstract definition is often much easier to understand 
if one has a good idea of what it is abstracting. One will 
read the definition with strong expectations of what it 
will look like, and all one will have to commit to mem- 
ory is the ways in which the definition does not quite fit 
those expectations. If the definition is presented first, 
then one will be expected to hold the whole thing in 
one’s head, rather than what one might think of as 
the difference between the definition and one’s prior 
expectation of it. 

Whether or not this advantage makes it worth pre- 
senting examples before giving a definition depends on 
how difficult you expect it to be for your reader to grasp 
the definition. To give an example where it might not be 
worth giving examples first, suppose that you want to 
introduce the notion of a commutative ring for a reader 
who is already familiar with groups and fields. A natural 
way of doing it would be to list the axioms for a com- 
mutative ring and make the remark that what you have 
listed is very similar to the list of axioms for a field but 
you no longer assume that elements have multiplica- 
tive inverses, and sometimes you do not even assume 
that your rings have multiplicative identities. 

Once you have said that, it will still be a very good 
idea to give some important examples, such as the ring 
$\mathbb{Z}$ of all integers, the ring $\mathbb{Z}[x]$ of all polynomials with 
integer coefficients, and the ring $\mathbb{Z}[\sqrt{2}]$ of all numbers 
of the form $a + b\sqrt{2}$ where a and b are integers. However, 
the argument for presenting these examples first 


is weaker than it was for topological manifolds, for two 
reasons. 

The first reason is that the definition is easy to grasp: 
rings are like fields but without multiplicative inverses. 
Therefore, giving the definition straight away does not 
place a burden on the reader’s memory. Of course, the 
reader will want reassurance that there are interesting 
examples, but that can be given immediately after the 
definition. 

The second reason is that the necessity for this par- 
ticular abstraction is less obvious than it is for man- 
ifolds. Given examples such as spheres and tori, it is 
natural to think that they are all examples of the same 
basic “thing” and then to try to work out what that 
“thing” is. But the benefits of thinking of the integers 
and the polynomials with integer coefficients as exam- 
ples of the same underlying algebraic structure are not 
clear in advance; they become clear only after one has 
developed a considerable amount of theory. So it is 
more natural in this case to think of the abstraction 
as primary, at least in the first instance. 

As ever, the decision about how to present a new 
mathematical concept involves a judgment that is 
sometimes quite delicate. Broadly speaking, the harder 
a definition is to grasp, the more helpful it will be to 
the reader to have some examples in mind when read- 
ing it. But that depends both on the reader and on 
the intrinsic complexity of the definition. However, one 
general piece of advice is still possible here, which is at 
least to consider the possibility of starting with exam- 
ples. It may not always be appropriate to do so, but 
many mathematical writers like to start with definitions 
under all circumstances, and the result is that many 
expositions are harder to understand than they need 
to be. 

Let me close this section by pointing out that the 
examples-first device is quite a general one. Indeed, I
have used it in a number of places in this article; see the 
openings of sections 4 and 6 and of this very section. 

8 Traditional Methods of Dissemination
versus New Methods

A person who wishes to produce mathematical writing 
today faces a choice that did not exist twenty years ago. 
Until recently, almost all mathematical writing took the 
form of books or journal articles. But now the Internet 
has given us new methods of dissemination, which have
already had an impact and are likely to have a much 
bigger impact in the future.

In particular, the existence of the Internet affects
every single one of the considerations discussed in this 
article. Let me take them in turn. 

8.1 Level of Formality

The main task of each generation of mathematicians is 
to add to the body of mathematical knowledge. How- 
ever, there is a second task that is almost as important 
as the first, and not entirely separate from it, which is 
to digest this new knowledge and present it in a form 
that subsequent generations will find as easy as pos- 
sible to grasp. This process of digestion can of course 
happen many times to the same piece of mathematics. 

Sometimes, digesting a piece of mathematics is itself 
a significant advance in mathematical knowledge. For 
example, a theory may be developed that yields quite 
easily a number of already existing and seemingly dis- 
parate results. The traditional publication system is 
well suited to this situation; one can just write an arti- 
cle about the theory and get it published in the normal 
way. 

Sometimes, however, digesting a piece of mathemat- 
ics does not constitute a mathematical advance. It can 
be something more minor, such as thinking of a way 
of looking at an argument that makes it clearer where 
the ideas have come from or drawing an informal anal- 
ogy between one piece of mathematics and another 
that is simpler or better known. Insights of this kind 
can be hard earned and extremely valuable to other 
mathematicians, but they do not lead to publishable 
papers. 

With the Internet, there are many ways that more 
informal mathematical thoughts can be shared. An 
obvious one is to write a conventional mathematical 
text and make it available on one’s home page. Another 
option, which an increasing number of mathematicians 
have adopted, is to have a blog. The advantage of this 
is that one obtains feedback from one’s readers, and 
experience has shown that the quality of much of this 
feedback is very high. 

There are other forms of mathematical literature that 
would not be conventionally publishable but that could 
be extremely valuable. For example, an article about a 
serious but failed attempt to solve a problem would not 
be accepted by a journal, and the result is a great deal of 
duplication of work; if the problem is important and the 
attempt looks plausible to begin with, then many peo- 
ple will try it. A database of failed proof attempts would 
be very useful, and in principle the Internet makes it 
easy to set up, though so far nobody has done so. 


In general, the Internet allows us much greater free- 
dom in choosing the level of formality at which we wish 
to write and allows us to publish documents that do not 
fit the mould of a standard journal article. 

8.2 Level of Detail 

Suppose that you use a mathematical result or defini- 
tion that will be familiar to some readers but not to oth- 
ers. In a print document you have to decide whether to 
explain it and, if so, how elaborate an explanation to 
give. 

In a hyperlinked document on the web, one is no 
longer forced to make this choice. One can write a 
version for experts, but with certain key words and 
phrases underlined, so that readers who need these 
words and phrases explained further can click on them 
and read explanations. This kind of writing has become 
very common on Wikipedia and other wikis. 

It also introduces a new balance that needs to 
be struck. Sometimes wiki articles are hard to read 
because the writers use the existence of links to other 
wiki pages as a license not to explain terms that they 
might otherwise have explained. The result is that 
unless one is familiar with most of the definitions in 
the original article, one can get lost in a complicated 
graph of linked wiki pages as one finds that the page 
that explains an unfamiliar concept itself requires one 
to click through to several other pages. So if you are 
going to exploit hyperlinks, you need to think carefully 
about what the experience of following those links will 
be like for your intended readers. 

Another inconvenience of hyperlinks is that they 
require you to visit an entirely new page, which makes 
it easy to forget where you were before (especially if 
you backtrack and then follow some other sequence 
of links). However, there is plenty of software that gets 
round this problem. For example, on some sites one can 
incorporate “sliders,” pieces of text that insert them- 
selves into what you are reading when you click on 
an appropriate box and disappear when you click on 
it again. So if, for example, one wrote, “by the second 
isomorphism theorem,” one could have a box with the 
words “What does that say?” on it, so that readers who 
needed it could click on the box and have a short para- 
graph about the second isomorphism theorem inserted 
into the text. One can have sliders within sliders, so 
perhaps within that slider one could have the option of 
bringing up a proof of the theorem as well. 

The main point is that the Internet has made it pos- 
sible to write new kinds of documents where one is no 
longer forced to make choices such as how much
detail to give. One can leave that decision to the reader. 
Such documents have a huge potential to improve the 
way mathematics is presented, and this potential will 
only increase as technology improves. 

8.3 Letters versus Words 

I will not say much about this, since most of what I have 
to say is very similar to what I have already said about 
the level of detail in which a document is written. With 
the kinds of electronic documents that are now possi- 
ble, one can save the reader the trouble of searching 
through a paper to find out what a letter stands for by 
incorporating a reminder that appears when you click 
on the letter. Perhaps better still, it could appear in a 
little box when you hover over the letter. One could 
also have condensed statements involving lots of let- 
ters with the option of converting them into equivalent 
wordier statements. Again, the point is that there are 
many more options now. 

8.4 Modularity 

The kinds of electronic documents I have been dis- 
cussing make possible a form of top-down mathemat- 
ical writing that would be far less convenient in a 
print document. One could write a high-level account 
of some piece of mathematics, giving the reader the 
option of expanding any part of that high-level account 
into a lower-level account that justifies it in more detail. 
And there could be many levels of this, so that if 
you clicked on everything you would end up with a 
presentation of the entire argument in full gory detail. 
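
To make the idea concrete, here is a minimal sketch in Python of how such an expandable account might be represented. It is only an illustration under simplifying assumptions: the names `Step` and `render` are hypothetical, and expansion is controlled by a single depth limit rather than by clicks on individual parts.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """A statement together with the lower-level steps that justify it."""
    summary: str
    details: list = field(default_factory=list)  # list of Step objects


def render(step, max_depth, depth=0):
    """Return the account as indented lines, expanded down to max_depth levels."""
    lines = ["  " * depth + step.summary]
    if depth < max_depth:
        for sub in step.details:
            lines.extend(render(sub, max_depth, depth + 1))
    return lines


# A small hypothetical example: a familiar proof at three levels of detail.
proof = Step("sqrt(2) is irrational.", [
    Step("Assume sqrt(2) = p/q in lowest terms and derive a contradiction.", [
        Step("Squaring gives p*p = 2*q*q, so p is even; write p = 2r."),
        Step("Then q*q = 2*r*r, so q is even too, contradicting lowest terms."),
    ]),
])

if __name__ == "__main__":
    for level in range(3):
        print("\n".join(render(proof, level)))
        print()
```

With `max_depth=0` the reader sees only the top-level claim; raising the limit corresponds to clicking on everything at the next level down.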

A less ambitious possibility is one that solves the 
problem discussed earlier about where to place a 
lemma. The difficulty was that in a print document you 
will either put it before the proof where it is used, in 
which case it is not adequately motivated, or during the 
proof, in which case it looks ugly, or after the proof, 
in which case the proof itself leaves you with awkward 
promises to fill in gaps later. But with an electronic doc- 
ument, putting a lemma exactly where it is needed is no 
longer ugly. During the proof, one can say, “We are now 
going to make use of the following statement,” and give 
the reader a button to click on that will bring up a proof 
of that statement. 

8.5 Order of Presentation 

If you do not want to decide whether to give an abstract 
definition first or start with motivating examples, then
you can give the reader the choice. Just start with a page 
of headings and invite the reader to decide whether 
to click on “Motivating examples” first or “The formal 
definition” first. 

To some extent, the same goes for the decision about 
whether to present arguments in their logical order or 
in a way that brings out how they were discovered. If 
at some point the logical order requires you to draw a 
rabbit out of a hat, you could at the very least introduce 
a slider that explains where that rabbit actually came 
from. 


VIII.2 How to Read and Understand a 
Paper 

Nicholas J. Higham 


Whether you are a mathematician or work in another 
discipline and need to use mathematical results, you 
will need to read mathematics papers—perhaps lots
of them. The purpose of this article is to give advice 
on how to go about reading mathematics papers and 
gaining understanding from them. 

The advice is particularly aimed at inexperienced 
readers. A professional mathematician may read from 
tens to hundreds of papers every year, including pub- 
lished papers, manuscripts sent for refereeing by jour- 
nals, and draft papers written by students and col- 
leagues. To a large extent the suggestions I make 
here are ones that you naturally adopt after reading 
sufficiently many papers. 

Mathematics papers fall into two main types: primary 
research papers and review papers. Review papers give 
an overview of an area and usually contain a substantial 
amount of background material. By design they tend to 
be easier to read than papers presenting new research, 
although they are often longer. The suggestions in this 
article apply to both types of papers. 

1 The Anatomy of a Paper 

Mathematics papers are fairly rigid in format, having 
some or all of the following components. 

Title. The title should indicate what the paper is about 
and give a hint about the paper’s contributions. 
Abstract. The abstract describes the problem being 
tackled and summarizes the contributions of the 
paper. The length and the amount of detail both 
vary greatly. The abstract is meant to be able to 
stand alone. Often it is visible to everyone on a journal’s
Web site, while the paper is visible only to
subscribers. 

Introduction. The first section of the paper, almost 
always called “Introduction,” sets out the context 
and problem being addressed in more detail than 
the abstract. Depending on how the paper has been 
written, the introduction may or may not describe 
the results and conclusions. Some papers lend them- 
selves to a question being posed in the introduction 
but fully answered only in a conclusions section. 

Conclusions. Many, but not all, papers contain a final 
section with a title such as “Conclusion” or “Conclud- 
ing Remarks” that summarizes the main conclusions 
of the paper. Omission of such a section indicates 
that the conclusions have been stated in the intro- 
duction or perhaps at the end of a section describing 
experiments, or that no explicit summary has been 
provided. This section is often used to identify open 
questions and describe areas for future research, and 
such suggestions can be very useful if you are looking 
for problems to work on. 

Appendix. Some papers contain one or more appen- 
dices, which contain material deemed best sepa- 
rated from the main paper, perhaps because it would 
otherwise clutter up the development or because it 
contains tedious details. 

References. The references section contains a list of 
publications that are referred to in the text and that 
the reader might want to consult. 

Supplementary materials. A relatively new concept in 
mathematics is the notion of additional materials 
that are available on the publisher’s Web site along 
with the paper but are not actually part of the paper. 
These might include figures, computer programs, 
data, and other further material and might not have 
been refereed even if the paper itself has. It is not 
always easy to tell if a paper has supplementary 
materials, as different journals have different con- 
ventions for referring to them. They might be men- 
tioned at the end of the paper or in a footnote on 
the first page, and they may be referred to with “see 
the supplementary materials” or via an item in the 
reference list. 

2 Deciding Whether to Read a Paper 

A common scenario is that you come across a paper
that, based on the title, you think you might need to
read. For example, you may be signed up to receive
alerts from a journal or search engine and become 
aware of a new paper on a topic related to your inter- 
ests. How do you decide whether to read the paper? 
The abstract should contain enough information about 
the context of the work and the paper’s results for 
you to make a decision. However, abstracts are some- 
times very short and are not always well written, so it 
may be necessary to skim through the introduction and 
conclusions sections of the paper. 

The reference list is worth perusing. If few of the 
references are familiar, this may mean that the paper 
presents a rather different view on the topic than you 
expected, perhaps because the authors are from a dif- 
ferent field. If papers that you know are relevant are 
missing, this is a warning that the authors may not be 
fully aware of past work on the problem. 

If the main results of the paper are theorems, read 
those to see whether it is worth spending further time 
on the paper. Consider also the reputation of the jour- 
nal and the authors, and, unless the paper is very 
recent, check how often (and how, and by whom) it has 
been cited in order to get a feel for what other people 
think about it. (Citations can be checked using online 
tools, such as Google Scholar or one of several other 
services, most of which require a subscription.) 

3 Getting an Overview 

A paper does not have to be read linearly. You may 
want to make multiple passes, beginning by reading 
the abstract, introduction, and conclusions, as well as 
looking at the tables, figures, and references. 

Many authors end the introduction with a paragraph 
that gives an overview of what appears in each part of 
the paper. Sometimes, though, a glance at the paper’s 
section headings provides a more easily assimilated 
summary of the content and organization. 

Another way in which you might get an overview of 
the paper is by reading the main results first: the lem- 
mas, theorems, algorithms, and associated definitions, 
omitting proofs. The usefulness of this approach will
depend on the topic and your familiarity with it. 

4 Understanding 

It is often hard to understand what you are reading. 
After all, research papers are meant to contain original 
ideas, and ideas that you have not seen before can be 
hard to grasp. You may want to stop and ponder an 
argument, perhaps playing with examples.

I strongly recommend making notes, to help you
understand the text and avoid having to retrace your 
steps in grasping a tricky point if you come back to 
the paper in the future. It is also a good idea to write a 
summary of your overall thoughts on the paper; when 
you go back to the paper a few months or years later, 
your summary will be the first thing to look at. I rec- 
ommend dating your notes and summary, as in the 
future it can be useful to know when they were writ- 
ten. Indeed, I have papers that I have read several times, 
and the notes show how my understanding changed on 
each reading. (There exist papers for which multiple 
readings are needed to appreciate fully the contents, 
perhaps because the paper is deep, because it is badly 
written, or both!) 

As well as writing notes, it is a good idea to mark 
key sentences, theorems, and so on. I do this either by 
putting a vertical line in the margin that delineates the 
area of interest or by marking the relevant text with a 
highlighter pen. 

I write my notes on a hard copy of the paper. Many 
programs are available that will allow you to annotate 
PDF files on-screen, though using mathematical nota- 
tion may be problematic; one solution is to handwrite 
notes and then scan them in and append them to the 
PDF file. 

A good exercise, especially if you are inexperienced 
at writing papers, is to write your own abstract for the 
paper (100–200 words, say).

Writing while you read turns you from a passive 
reader into an active one, and being an active reader 
helps you to understand and remember the contents. 
One useful technique is to try out special cases of 
results. If a theorem is stated for analytic functions, 
see what it says for polynomials or for the exponential.
If a theorem is stated for n × n matrices, check
it for n = 1, 2, 3. Another approach is to ask yourself
what would happen if one of the conditions in a theo- 
rem were to be removed: where would the proof break 
down? 
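
As an illustration of trying special cases, the following Python sketch checks a standard identity, trace(AB) = trace(BA), on random integer matrices for n = 1, 2, 3. The identity is chosen here purely as an example, and the helper names (`matmul`, `trace`, `random_matrix`, `check_trace_identity`) are hypothetical; the same pattern could be adapted to whatever result you are reading about.

```python
import random


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))


def random_matrix(n, rng):
    """An n-by-n matrix with small random integer entries."""
    return [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]


def check_trace_identity(n, trials=100, seed=0):
    """Return True if trace(AB) == trace(BA) in every random trial of size n."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = random_matrix(n, rng), random_matrix(n, rng)
        if trace(matmul(a, b)) != trace(matmul(b, a)):
            return False
    return True


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, check_trace_identity(n))
```

A check of this kind proves nothing, of course, but a failure would suggest either a misreading on your part or an error in the paper. Small cases also show what happens when a claim is strengthened: for n = 1 the products AB and BA are always equal, whereas for n = 2 the matrices [[0, 1], [0, 0]] and [[0, 0], [1, 0]] already give different products.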

When you reach a point that you do not understand, 
it may be best to jump to the end of the argument and 
go back over the details later to avoid getting bogged 
down. Keep in mind that some ideas and techniques 
are so well known to researchers in the relevant field 
that they might not be spelled out. If you are new to 
the field you may at first need a bit of help from a more 
experienced colleague to fill in what appear to be gaps 
in arguments. 


It is important to keep in mind that what you 
are reading may be badly explained or just wrong. 
Typographical errors are quite common, especially in 
preprints and in papers that have not been copy edited. 
Mathematical errors also occur, and even the best jour- 
nals occasionally have to print corrections (“errata”) to 
previously published articles. 

In mathematical writing certain standard phrases are 
used that have particular meanings. “It follows that” or 
“it is easy to see that” mean that the next statement 
can be proved without using any new ideas and that 
giving the details would clutter the text. The detail may, 
however, be tedious. The shorter “hence,” “therefore,” 
or “so” imply a more straightforward conclusion. “It 
can be shown that” again implies that details are not 
felt to be worth including but is noncommittal about 
the difficulty of the proof. 

5 Documenting Your Reading 

I advise keeping a record of which papers you have 
read, even if you have read them only partially. If you 
are a beginning Ph.D. student this may seem unnec- 
essary, as at first you will be able to keep the papers 
in your mind. But at some point you will forget which 
papers you have read and having this information 
readily available will be very useful. 

A few decades ago papers existed only as hard copies, 
and one would file them by author or subject. Today, 
most papers are obtained as PDF downloads that can be 
stored on our computers. Various computer programs 
are available for managing collections of papers. One 
of those, or a BibTeX database, can serve to record what
you have read and provide links to the PDF files. 
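
If you prefer to keep such a record yourself, a few lines of code suffice. The following Python sketch is one possible layout, not a recommendation of any particular program: the class `ReadingRecord`, its field names, and the sample entry (a placeholder, not a real paper) are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ReadingRecord:
    """One entry in a personal reading log (all field names are illustrative)."""
    key: str                  # short citation key, for example a BibTeX key
    title: str
    pdf_path: str             # where the PDF is stored locally
    read_on: str              # ISO date of this reading, e.g. "2015-03-01"
    fully_read: bool = False  # partial readings are worth recording too
    notes: list = field(default_factory=list)


def add_record(log, record):
    """Store a record in the log (a dict keyed by citation key) and return the log."""
    log[record.key] = asdict(record)
    return log


def to_json(log):
    """Serialize the log so that it can be saved alongside the PDF files."""
    return json.dumps(log, indent=2, sort_keys=True)


def from_json(text):
    """Rebuild the log from its serialized form."""
    return json.loads(text)


if __name__ == "__main__":
    log = add_record({}, ReadingRecord(
        key="placeholder2015",
        title="A Hypothetical Paper",
        pdf_path="papers/placeholder2015.pdf",
        read_on="2015-03-01",
        notes=["Skimmed sections 1-2; proof of main theorem not yet checked."],
    ))
    print(to_json(log))
```

Recording the date and whether the reading was complete costs almost nothing and answers the questions you are most likely to ask later.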

6 Screen or Print? 

Should you read papers on a computer screen or in 
print form? This is a personal choice. People brought 
up in the digital publishing era may be happy read- 
ing on-screen, but others, such as me, may feel that 
they can properly read a paper only in hard copy form. 
There is no doubt that hard copy allows easier view- 
ing of multiple pages at the same time, while a PDF 
file makes it easier to search for a particular term 
and can be zoomed to whatever size is most comfort- 
able to read. It is important to try both and use what- 
ever combination of screen and print works best for 
you.

If you do read on-screen, keep in mind that most PDF
readers allow you to customize the colors. White or yel- 
low text on a black background may be less strain on 
the eyes than the default black on white. In Adobe Acro- 
bat the colors can be changed with the menu option 
Preferences > Accessibility > Document Colors Options.

7 Reading for Writing 

One of the reasons to read is to become a better writer. 
When you read an article that you think is particu- 
larly well written, analyze it to see what techniques, 
words, and phrases seemed to work so well. Reading 
also expands your knowledge and experience, and can 
improve your ability to do research. Donald Knuth put 
it well when he said: 

In general when I’m reading a technical paper . . . I’m 
trying to get into the author’s mind, trying to figure 
out what the concept is. The more you learn to read 
other people’s stuff, the more able you are to invent 
your own in the future. 

8 What Next? 

Having read the paper you should ask yourself not only 
what the authors have achieved but also what questions 
remain. Can you identify open questions that you could 
answer? Can you see how to combine ideas from this 
paper with other ideas in a new way? Can you obtain 
stronger or more general results? 


VIII.3 How to Write a General Interest 
Mathematics Book 

Ian Stewart 

I’ve always wanted to write a book. 

Then why don’t you?

—Common party conversation 

Popular science is a well-developed genre in its own 
right, and popular mathematics is an established sub- 
genre. Several hundred popular mathematics books 
now appear every year, ranging from elementary intro- 
ductions through school-level topics to substantial vol- 
umes about research breakthroughs. Writing about 
mathematics for the general public can be a rewarding 
experience for anyone who enjoys and values commu- 
nication. Established authors include journalists, teach- 
ers, and research mathematicians; subjects are limited 
only by the imagination of authors and publishers’ 
assessments of what booksellers are willing to stock.
Even that is changing as the growth of e-books opens 
the way for less orthodox offerings. The style may be 
serious or lighthearted, preferably avoiding extremes 
of solemnity or frivolity. On the whole, most academic 
institutions no longer look down on “outreach” activi- 
ties of this kind, and many place great value on them, 
both as publicity exercises and for their educational 
aspects. So do government funding bodies. 

1 What Is Popular Mathematics? 

For many people the phrase is an oxymoron. To them, 
mathematics is not popular. Never mind: populariza- 
tion is the art of making things popular when they were 
not originally. It is also the art of presenting advanced 
material to people who are genuinely interested but 
do not have the technical background required to read 
professional journals. Generally speaking, most popu- 
lar mathematics books address this second audience. It 
would be wonderful to write a book that would open up 
the beauty, power, and utility of mathematics to people 
who swore off the subject when they were five, hate it, 
and never want to see it or hear about it again — but, by 
definition, very few of them would read such a book, so 
you would be wasting your time. 

Already we see a creative tension between the wishes 
of the author and the practicalities of publishing. As e- 
books start to take off, the whole publication model is 
changing. One beneficial aspect is that new kinds of 
book start to become publishable. If an e-book fails 
commercially, the main thing wasted is the author’s 
time and energy. That may or may not be an issue—an
author with a track record can use his/her time to better
effect by avoiding things that are likely to fail—but
it will not bankrupt the publisher. 

Popular mathematics is a genre, a specific class of 
books with common features, attractive to fans and 
often repellent to everybody else. In this respect it is 
on a par with science fiction, detective novels, roman- 
tic fantasy, and bodice rippers. Genres have their own 
rules, and although these rules may not be explicit, fans 
notice if you break them. If you want to write a popu- 
lar mathematics book, it is good preparation to read a 
few of them first. Many writers started that way; they 
began as fans and ended up as authors, motivated by 
the books they enjoyed reading. 

Most popular mathematics books fall into a relatively 
small number of types. Many fall into several simulta- 
neously. The rest are as diverse as human imagination 
can make them. The main classifiable types are:

(1) Children’s books.

(a) Basic school topics. 

(b) More exciting things. 

(2) History. 

(3) Biography. 

(4) Fun and games. 

(5) Big problems. 

(6) Major areas. 

(a) Classical. 

(b) Modern. 

(7) Applications. 

(8) Cultural links. 

(9) Philosophy. 

Children’s books are a special case. They involve many 
considerations that are irrelevant to books for adults, 
such as deciding which words are simple enough to 
include—I will say no more about them, lacking experience.
Histories and biographies have a natural advantage
over more technical types of books: there is generally
plenty of human interest. (If not, you chose the
wrong topic or the wrong person.) Books about fun 
and games are lighthearted, even if they have a more 
serious side. Martin Gardner was the great exponent 
of this form of writing. Authors have written in depth 
about individual games, while others have compiled 
miscellanies of “fun” material. 

Big problems, and major areas of mathematical 
research, are the core of popular mathematics. Examples
are Fermat’s last theorem, the Poincaré conjecture,
chaos, and fractals. To write about such a topic
you need to understand it in more depth than you 
will reveal to your readers. That will give you confi- 
dence, help you find illuminating analogies, and gen- 
erally grease the expository wheels. You should choose 
a topic that is timely, has not been exhausted by oth- 
ers, and stands some chance of being explained to a 
nonexpert. Within both of these subgenres you can 
present significant topics from the past—Fourier analysis,
say—or you can go for the latest hot research area—wavelets,
maybe. It is possible to combine both if there
is a strong historical thread from past to present. 

It is always easier to explain a mathematical idea 
if it has concrete applications. People can relate to 
the applications when the mathematics alone starts 
to become impenetrable. The same goes for cultural 
links, such as perspective in Renaissance art and the 
construction of musical scales. 


Finally, there are deep conceptual issues, “philosoph- 
ical” aspects of mathematics: infinity, many dimen- 
sions, chance, proof, undecidability, computability. 
Even simple ideas like zero or the empty set could form 
the basis of a really fascinating book, and have done 
so. The main thing is to have something to say that is 
worth saying. That is true for all books, but it is espe- 
cially vital for philosophical ones, which can otherwise 
seem woolly and vague. 

2 Why Write a Popular Mathematics Book? 

Authors write for many reasons. When Frederik Pohl, 
a leading science fiction writer, was being inducted 
into the U.S. military he was asked his profession and 
replied “writer.” This was received with a degree of con- 
cern; writers are often impractical idealists who criti- 
cize everything and cause trouble. So Pohl was asked 
why he wrote. “To make money,” he replied. This was 
received with relief as an entirely sensible and com- 
prehensible reason. Another leading American writer, 
Isaac Asimov, produced more than 300 books of sci- 
ence fiction, popular science, and other genres. When 
asked why he wrote so many books, he replied that he 
found it impossible to stop. He also went out of his 
way not to recommend his prolific approach to anyone 
else. Some authors write one book and are satisfied, 
even if it becomes a best seller; others keep writing 
whether or not they receive commercial success. Some 
write one book and vow never to repeat the experience. 
Some want to put a message across that strikes them 
as being of vital importance — a new area of mathemat- 
ics, a social revolution, a political innovation. Some just 
like writing. There are no clear archetypes, no hard and 
fast rules; everything is diverse and fluid. 

The best reason for writing a popular mathematics 
book, in my opinion, is that you desperately want to tell 
the world about something you find inspiring and inter- 
esting. Books work better when the author is excited 
and enthusiastic about the topic. The excitement and 
enthusiasm will shine through of their own accord, and 
it is best not to be too explicit about them. Far too many 
television presenters seem to imagine that if they keep 
telling the viewers how excited they are, viewers will 
also become excited. This is a mistake. Do not tell them 
you are excited; show them you are. It is the same with 
a book. Tell the story, bring out its inherent interest, 
and you are well on the way. Popularization is not about 
making mathematics fun (interesting, useful, beautiful, 
...); it is about showing people that it already is fun 
(interesting, useful, beautiful, ...).

In the past I have called mathematics the Cinderella
science. It does all the hard work but never gets to go to 
the ball. Our subject definitely suffers, compared with 
many, because of a negative image among a large sec- 
tor of the public. One reason for writing mathematics 
books for general readers is to combat this image. It 
is mostly undeserved, but as a profession, mathemati- 
cians do not always help their own cause. Even nowa- 
days, when some area of mathematics attracts atten- 
tion from the media, some mathematicians immedi- 
ately go into demolition mode and complain loudly 
about “hype,” exaggeration, and lack of precision. 

When James Gleick’s Chaos became a best seller, 
and the U.S. government noticed and began consid- 
ering increasing the funding for nonlinear dynamics, 
a few distinguished mathematicians went out of their 
way to inform the government that it was all nonsense 
and the subject should be ignored. This might have 
been a good idea if they had been right—not about the
popular image of “chaos theory,” which was at best a
vague approximation to the reality, but about the
reality itself—but they were wrong. Nonlinear dynamics, of
which chaos is one key component, is one of the great 
success stories of the late twentieth century, and it is 
powering ahead into the twenty-first. I remember one 
letter to a leading mathematics journal claiming that 
chaos and fractals had no applications whatsoever at 
a time when you could not open the pages of Science 
or Nature without finding papers that made excellent 
scientific use of these topics. Conclusion: many math- 
ematicians have no idea what is going on in the rest of 
the scientific world, perhaps because they do not read 
Science or Nature. 

To ensure that our subject is valued and supported, 
we mathematicians need to explain to ordinary people 
that mathematics is vital to their society, to their eco- 
nomic and social welfare, to their health, and to their 
children’s future. No one else is going to do it for us. But 
we will not succeed if no one is allowed to mention a 
manifold without explaining that it has to be Hausdorff 
and paracompact as well as locally Euclidean. We have 
to grab our audience’s attention with things they can 
understand; that necessarily implies using imprecise 
language, making broad-brush claims, and selecting 
areas that can be explained simply rather than others 
of equal or greater academic merit that cannot. 

I am not suggesting that we should mislead the pub- 
lic about the importance of mathematics. But whenever 
a scientific topic attracts public attention, its media
image is seldom a true reflection of the technical real- 
ity. Provided the technical reality is useful and impor- 
tant, a bit of overexcitement does no serious harm. By 
all means try to calm it down but not at the expense of 
ruining the entire enterprise. Grabbing public attention 
and then leveraging that (as the bankers would say) into 
a more informed understanding is fine. Grabbing public 
attention and then self-destructing because a few fine 
points have not quite been understood is silly. 

3 Choosing a Topic 

An article on popular writing is especially appropri- 
ate in a companion to applied mathematics because 
most people understand mathematics better if they 
can see what it is good for. This is one reason why 
chaos and fractals have grabbed public attention but 
algebraic K-theory has not. This is not a value judgment;
algebraic K-theory is core mathematics, hugely important
(it may even be more important than nonlinear
dynamics). But there is little point in arguing their
relative merits because each enriches mathematics. We 
are not obliged to choose one and reject the other. As 
research mathematicians, we probably do not want to 
work in both, but it is not terribly sensible to insist that 
your own area of mathematics is the only one that mat- 
ters. Think how much competition there would be if 
everyone moved into your area. This happens quite a 
lot in physics, and at times it turns the subject into a 
fashion parade. 

The key to most popular mathematics books is sim- 
ple: tell a story. 

In refined literary circles, the role of narrative is 
often downplayed. Whatever you think of Finnegans 
Wake, few would consider it a rip-roaring yarn. But 
popular science, like all genre writing, does not move 
in refined literary circles. What readers want, what 
authors must supply, is a story. Well, usually: all rules 
in this area have exceptions. In genre writing, humans 
are not Homo sapiens, wise men. They are Pan narrans, 
storytelling apes. Look at the runaway success of The 
Da Vinci Code— all story and little real sense. 

A story has structure. It has a beginning, a middle, 
and an end. It often involves a conflict and its eventual 
resolution. If there are people in it, that is a plus; it is 
what made Simon Singh's Fermat’s Last Theorem a best 
seller. But the protagonist of your book could well be 
the monster simple group or the four-color theorem. 
Human interest helps, in some subgenres, but it is not 
essential. 

4 How to Write for the General Public

I wish I knew. 

The standard advice to would-be authors is to ask, 
Who am I writing for? In principle, that is sound advice, 
but in practice there is a snag: it is often impossible 
to know. You may be convinced you are writing for 
conscientious parents who want to help their teenage 
children pass mathematics exams. The buying pub- 
lic may decide, by voting with their wallets, that the 
correct audience for your book is retired lawyers and 
bank managers who always regretted not knowing a 
bit more mathematics and have spotted an opportu- 
nity to bone up on the subject now that they have got 
the time. 

Some books are aimed at specific age groups— young 
children, teenagers, adults — and of course you need to 
bear the age of your readers in mind if you are writing 
that kind of book. But for the majority of popular math- 
ematics books, the audience turns out to be very broad, 
not concentrated in any very obvious demographic, and 
difficult to characterize. “The sort of people who buy 
popular math books” is about as close as you can get, 
so I do not think you should worry too much about your 
audience. 

The things that matter most are audience indepen- 
dent. Write at a consistent level. If the first chapter of 
your book assumes the reader does not know what a 
fraction is and chapter two is about p-adic cohomology, 
you may be in trouble. The traditional advice to col- 
loquium lecturers — start with something easy so that 
everyone can follow; then plough into the technicali- 
ties for the experts— was always bad advice even for 
colloquia because it lost most of the audience after five 
minutes of trivia. It is a complete disaster for a popular 
mathematics book. 

It is common for the level of difficulty to ramp up 
gradually as the book progresses. After all, your reader 
is gaining insight into the topic as your limpid prose 
passes through their eyeballs to their brain. Chapter 10 
ought to be a bit more challenging than chapter 1; if it 
is not, you are not doing your job properly. 

One useful technique— when writing anything, be it 
for the public or for the editors and readers of the 
Annals of Mathematics — is self-editing. You need to 
develop an editor's instincts and apply them to your 
own work. You can do it as you go, rejecting poor sentences
before your fingers touch the keyboard, but I
find that slows me down and can easily lead to “writer’s
block.” I hardly ever suffer from that affliction because
I leave the Maoist self-criticism sessions for later. The 
great mathematical expositor Paul Halmos always said 
that the key to writing a book was to write it — however 
scruffily, however badly organized. When you have got 
most of it down on paper, or in the computer, you can 
go through the text systematically and decide what is 
good, what is bad, what is in the wrong place, what is 
missing, or what is superfluous. It is much easier to 
sort out these structural issues if you have something 
concrete to look at. 

Word processors have made this process much easier.
I generally write 10-15% more words than the book
needs and then throw the excess away. I find it is 
quicker if I do that than if I agonize over each sen- 
tence as I type it. As George Bernard Shaw wrote: “I’m 
sorry this letter is so long, I didn’t have time to make it 
shorter.” When editing your work, here are a few things 
to watch out for. 

• If you are using a term that you have not explained 
already, and it seems likely to puzzle readers, find 
a place to set it up. It might be a few chapters back; 
it might be just before you use it. Whatever you do, 
do not put it in the middle of the thing you are using 
it for; that can be distracting: “Poincaré conjectured
that if a three-dimensional manifold (that is,
a space . . . [several sentences] . . . three coordinates, 
that is, numbers that ... [several sentences] ... a 
kind of generalized surface) such that every closed 
loop can be shrunk to a point...” is unreadable. 

• If some side issue starts to expand too much, as you 
add layer upon layer of explanation, ask whether 
you really need it. David Tall and I spent ten years 
struggling to explain the basics of homology in a 
complex analysis book for undergraduates without 
getting into algebraic topology as such. Eventually 
we realized we could omit that chapter altogether, 
whereupon we finished the book in two weeks. 

• If you are really proud of the classy writing in some 
section, worry that it might be overwritten and dis- 
tracting. Cut it out (saving it in case you decide to 
put it back later) and reread the result. Did you 
need it? “Fine writing” can be the enemy of effective 
communication. 

• Above all, try to keep everything simple. Put your- 
self in the shoes of your reader. What would they 
want you to explain? Do you need the level of detail 
you have supplied? Would something less specific 
do the same job? Do not start telling them about 
the domain and range of a function if all they need 
to know is that a function is a rule for turning one
number into another. They are not going to take an 
exam. 

• Do not forget that many ideas that are bread and 
butter to you (increment cliché count) are things
most people have never heard of. You know what a 
factorial is; they may not. You know what an equiv- 
alence relation is; most of your readers do not. 
They can grasp how to compose two loops, or even 
homotopy classes, if you tell them to run along one 
and then the other but not if you give them the 
formula. 

• It is said that the publishers of Stephen Hawking’s 
A Brief History of Time told him to avoid equa- 
tions, every one of which would allegedly “halve his 
sales.” To some extent this advice was based on the 
publisher’s hang-ups rather than on what readers 
could handle. Look at Roger Penrose’s The Road to 
Reality, a huge commercial success with equations 
all over most pages. However, Hawking’s publish- 
ers had a point: do not use an equation if you can 
say the same thing in words or pictures. 

Your main aims, to which any budding author of pop- 
ular science should aspire, are to keep your read- 
ers interested, entertain them, inform them, and— the 
ultimate — make them feel like geniuses. Do that, and 
they will think that you are a genius.

5 Technique 

Every author develops his or her own characteristic 
style. The style may be different for different kinds of 
books, but there is not some kind of universal house 
style that is perfect for a book of a given kind. Some 
people write in formal prose, some are more conversa- 
tional in style (my own preference), some go for excite- 
ment, some like to keep the story smooth and calm. 
On the whole, it is best to write in a natural style— one 
that you feel comfortable with— otherwise you spend 
most of the time forcing words into what for you are 
unnatural patterns when you should be concentrating 
on telling the story. 

There are, however, some useful guidelines. You do 
not need to follow them slavishly; the main point here 
is to illustrate the kinds of issues that an author should 
be aware of. 

Use correct grammar. If in doubt, consult a stan- 
dard reference such as Fowler’s Modern English Usage. 
Bear in mind that some aspects of English usage have 
changed since his day. Be aware that informal language
is often grammatically impure, but even when writing 
informally, be a little conservative in that respect. For 
example, it is impossible nowadays to escape phrases 
like “the team are playing well.” Technically, “team” is 
singular, and the correct phrase is “the team is playing 
well.” Sometimes the technically correct usage sounds 
so pedantic and awkward that it might be better not 
to use it, but on the whole it is better to be correct. At 
all costs avoid being inconsistent: “the team is playing 
well and they have won nine of its last ten games.” 

I have some pet hates. “Hopefully” is one. It can be 
used correctly, meaning “with hope,” but much more 
often it is used to mean “I hope that,” which is wrong. 
“The fact that” is another: it is almost always a sign 
of sloppy sentence construction, and most of the time 
it is verbose and unnecessary. Replace “in view of the 
fact that” and similar phrases by the simple English 
word “because.” Try deleting “the fact” and leaving just 
“that.” If that fails to work, you can usually see an easy 
way to fix things up. Another unnecessarily convoluted 
phrase is “the way in which.” Plain “how” usually does 
the same job, better. 

Avoid Latin abbreviations: e.g., i.e., etc. They are obso- 
lete even in technical mathematical writing— Latin is 
no longer the language of science— and they certainly 
have no place in popular writing. Replace with plain 
English. The writing will be easier to understand and 
less clumsy. Avoid clichés (Wikipedia has links to lists),
but bear in mind that it is impossible to avoid them
altogether. Attentive readers will find some in this article:
“bear in mind” for example! Stay away from crass
ones like “run it up the flagpole,” and keep your cliché
quotient small.

Metaphors and analogies are great . . . provided they 
work. “DNA is a double helix, like two spiral staircases 
winding around each other” conveys a vivid image— 
although it is perilously close to being a cliché. Do not
mix metaphors: I once wrote “the Galois group is a vital 
weapon in the mathematician’s toolkit” and a helpful 
editor explained that weapons belong in an armory and 
“vital” they are not. The human mind is a metaphor 
machine; it grasps analogies intuitively, and its demand 
for understanding can often be satisfied by finding an 
analogy that goes to the heart of the matter. A well- 
chosen analogy can make an entire book. A poor one 
can break it. 

Those of us with an academic background need to 
work very hard to avoid standard academic reflexes. 
“First tell them what you are going to tell them, then 
tell them, then tell them what you have told them” 
is fine for teaching, and a gesture in that direction
can help readers make sense of a chapter, or even
an entire book, but the reflex can easily become formulaic.
Worse, it can destroy the suspense. Imagine
if Romeo and Juliet opened with “Behold! Here comes 
Romeo! He will take a sleeping potion and fair Juliet 
will think him dead and kill herself.” 

Some sort of road map often helps readers under- 
stand what you are doing. My recent Mathematics of 
Life opens with this: 

Biology used to be about plants, animals, and insects, 
but five great revolutions have changed the way scien- 
tists think about life. 

A sixth is on its way. 

The first five revolutions were the invention of 
the microscope, the systematic classification of the 
planet’s living creatures, evolution, the discovery of 
the gene, and the structure of DNA. Let’s take them in 
turn, before moving on to my sixth, more contentious, 
revolution. 

The reader, thus informed, understands why the next 
few chapters, in a book ostensibly about mathematics, 
are about historical high points of biology. Notice that I
did not tell them what revolution six is. It is mathemat- 
ics, of course, and if they have read the blurb on the 
back of the book they will know that, but you do not 
need to rub it in. 

Writing a popular science book is not like writing a 
textbook; it is closer to fiction. You are telling a story, 
not teaching a course. You need to think about pacing 
the story and about what to reveal up front and what to 
keep up your sleeve. Your reader may need to know that 
a manifold is a multidimensional analogue of a surface, 
and you may have to simplify that to “curved many- 
dimensional space.” Do not start talking about charts 
and atlases and C^∞ overlap maps; avoid all mention
of “paracompact” and “Hausdorff.” To some extent, be 
willing to tell lies: lies of omission, white lies that slide 
past technical considerations that would get in the way 
if they were mentioned. Jack Cohen and I call this tech- 
nique “lies to children.” It is an educational necessity: 
what experts need to know is different from what the 
public needs to know. 

Above all, remember that you are not trying to teach 
a class. You are trying to give an intelligent but unin- 
formed person some idea of what is going on. 

6 How Do I Get My Book Published? 

Experienced writers know how to do this. If you are 
a new writer, it is probably better at present to work
with a recognized publisher. However, this advice could 
well become obsolete as e-books grow in popularity. 
A publisher will bear the cost of printing the book, 
organize publicity and distribution, and deal with the 
publication process. In return it will keep most of the 
income, passing on about 10% to the author in the 
form of royalties. An alternative is to publish the book 
yourself through Web sites that print small quanti- 
ties of books at competitive prices. The stigma that 
used to be attached to “vanity publishing” is fast dis- 
appearing as more and more authors cut out the mid- 
dle man. You will have to handle the marketing, prob- 
ably using a Web site, but distribution is no longer 
a great problem thanks to the Internet. An even sim- 
pler method is to publish your work as an e-book. 
Amazon offers a simple publishing service; the author 
need do little more than register, upload a Word file, 
proofread the result, and click the “publish” button. 
At the moment the author gets 70% of all revenue. 
Many authors' societies are starting to recommend this 
method of publication. 

Assuming that you follow the traditional route, you 
will need to make contact with one or more publish- 
ers. An agent will know which publishers to approach 
and will generally make this process quicker and 
more effective. The agency fee (between 10% and 20%) 
is usually outweighed by improved royalty rates or 
advances, where the publisher pays an agreed sum that 
is set against subsequent royalties. The advance is not 
refundable provided the book appears in print, but you 
will not receive further payment until the book “earns 
out.” Advice about finding an agent (and much else) can 
be obtained from authors’ societies. In the absence of 
an agent, find out which publishers have recently pro- 
duced similar types of books. Send an outline and per- 
haps a sample chapter, with a short covering letter. If 
the publisher is interested, an editor will reply, typically 
asking for further information. This may take a while; 
if so, be patient, but not if it is taking months. 

If a publisher accepts your book, it will send a con- 
tract. It always looks official, but you should not hes- 
itate to tear it to pieces and scribble all over it. By 
all means negotiate with the publisher about these 
changes, but if you do not like it, do not sign it. Read the 
contract carefully, even if your agent is supposed to do 
this for you. Delete any clauses that tie your hands on 
future books, especially “this will be the author’s next 
book.” A few publishers routinely demand an option on 
your next work: authors should equally routinely cross 
out that clause. Try to keep as many subsidiary rights 
(translations, electronic, serializations, and so on) as
you can, but be aware that you may have to grant them 
all to the publisher before it agrees to accept your book. 
Some compromises are necessary. 

7 Organization 

Different authors write in different ways. Some set 
themselves a specific target of, say, 2000 words per day. 
Some write when they are in the mood and keep going 
until the feeling disappears. Some start at the front and 
work their way through the book page by page. Some 
jump in at random, writing whichever section appeals 
to them at the time, filling in the gaps later. 

A plan, even if it is a page of headings, is almost indis- 
pensable. Planning a book in outline helps you decide 
what it is about and what should go in it. Popular sci- 
ence books usually tell a story, so it is a good idea to 
sort out what the main points of the story are going to 
be. However, most books evolve as they are being writ- 
ten, so you should think of your plan as a loose guide, 
not as a rigid constraint. Be willing to redesign the plan 
as you proceed. Your book will talk to you; listen to 
what it says. 

To avoid writer’s block— a common malady in which 
the same page is rewritten over and over again, getting
worse every time, until the author grinds to a
halt, depressed— avoid rewriting material until you 
have completed a rough draft of the entire book. You 
will have a far better idea of how to rewrite chap- 
ter 1 when you have finished chapter 20. When you 
are about halfway through, the rest is basically down- 
hill and the writing often gets easier. Do not put off 
that feeling by being needlessly finicky early on. Once 
you know you have a book, tidying it up and improv- 
ing it becomes a pleasure. Terry Pratchett called this 
“scattering fairydust.” 

8 What Else You Will Have to Do 

Your work is not finished when you submit the final 
manuscript (typescript, Word file, LaTeX file, whatever).
It will be read by an editor, who may suggest broad 
changes: “move chapter 4 after chapter 6,” “cut chap- 
ter 14 in half,” “add two pages about the origin of 
topology,” and so on. There will also be a copy edi- 
tor, whose main job is to prepare the manuscript for 
the printer and to correct typographical, factual, and 
grammatical errors. Some of them may suggest minor 
changes: “Why not put in a paragraph telling readers
what a Möbius band is?” Many will change your punctuation,
paragraphing, and choice of words. If they seem
to be overdoing this — this is happening when you feel 
they are imposing their own style, instead of prepar- 
ing yours for the printer— ask them to stop and change 
everything back the way it was. It is your book. 

After several months proofs will arrive, to be read 
and corrected. Often the time allowed will be short. The 
publisher will issue guidelines about this process, such 
as which symbols to use and which color of ink. (Many 
still work with paper copy. Others will send a PDF file, 
or a Word file with “track changes” enabled, and explain 
how they want you to respond.) Avoid making further 
changes to the text, unless absolutely necessary, even 
if you have just thought of a far better way to explain 
what cohomology is. You may be charged for excessive 
alterations that are not typesetting errors — see your 
contract, about 25 sections in. 

With the proofs corrected, you might think that you 
have done your bit and can relax: you would be mis- 
taken. Nowadays, books are released alongside a bar- 
rage of publicity material: podcasts, webcasts, blogs, 
tweets, and articles for printed media like New Scientist 
or newspapers. Your publisher’s publicity department 
will try to get you on radio or television, and it will 
expect you to turn up for interviews if anyone bites. 
You may be asked to give public lectures, especially 
at literary festivals and science festivals. Your contract 
may oblige you to take part in such activities unless you 
have good reason not to or the demands become exces- 
sive. It is part of the job, so do not be surprised. If it 
seems to be getting out of hand, talk to your publisher. 
It won’t be in the publisher’s best interest to wear you 
out or take away too much time from writing another 
book it can publish. 

Always wanted to write a book? Then do so. Just be 
aware of what you are letting yourself in for. 


VIII.4 Workflow 

Nicholas J. Higham 


Workflow refers to everything involved in producing a 
mathematical paper other than the actual research. It is 
about the practicalities of how to do things, including 
how best to use different kinds of software for differ- 
ent tasks. The characteristics of a good workflow are 
that it allows the end result to be achieved efficiently, 
repeatably, and in a way that allows easy recovery from 
mistakes. 





1 Typesetting: TeX and LaTeX

In the days before personal computers, articles would 
be handwritten, then typed on a typewriter by a secre- 
tary, and ultimately typeset by a publisher. Nowadays, 
almost every author prepares the article herself or him- 
self on a computer, and the publisher works from the 
author’s files. In many areas of academia it is the cus- 
tom to use Microsoft Word, or an open-source equiv- 
alent. In mathematics, computer science, and physics 
LaTeX has become the de facto standard.

TeX is a typesetting system invented by Donald Knuth
in the late 1970s that has a particular strength in handling
mathematics. LaTeX is a macro package, written
originally by Leslie Lamport, that sits on top of TeX. A
TeX or LaTeX file is an ASCII (plain text) file that contains
commands that specify how the output is to be formatted,
and it must be compiled to produce the final output
(nowadays usually a Portable Document Format (PDF)
file). This contrasts with a WYSIWYG (“what you see is
what you get”) word processor, such as Microsoft Word,
that displays on the screen a representation of what
the output will look like. TeX allows finer control than
word processors (the latter are sometimes described as
“what you see is all you get”), and the ability of LaTeX to
use style files that set various typesetting parameters
makes it very easy to adjust the format of an article
to match a particular journal. LaTeX is also well suited
to large projects such as books. Indeed, this volume is
typeset in LaTeX, and the editors and production editor
find it hard to imagine having produced the volume in
any other way.

Figure 1 shows some LaTeX source code. Although how
the code is formatted makes no difference to the out- 
put, it is good practice to make the source as readable 
as possible, with liberal use of spaces. I like to start 
new sentences on new lines, which makes it easier to 
cut and paste them during editing. 

TeX and LaTeX are open-source software and are available
in various distributions. In particular, the TeX Live
distribution is available for Windows, Linux, and Mac
systems (and as the augmented MacTeX for the latter).

How does one go about using LaTeX? There are two
approaches. The first is to edit the LaTeX source in a
general-purpose text editor such as Emacs, Vim, or a
system-specific text editor. Ideally, the editor is customized
so that it syntax-highlights the LaTeX source,
can directly compile the document, can pinpoint the
location of compilation errors in the source, and can
invoke a preview of the compiled document with two-way
synchronization between the location of the cursor
in the source and the page of the preview. I use
Emacs together with the AUCTeX and RefTeX packages,
which provides an extremely powerful LaTeX environment;
indeed I use Emacs for all my editing tasks, ranging
from programming to writing emails. A popular
alternative is to use a program designed specifically for
editing LaTeX documents, which typically comes with an
integrated previewer. Such programs tend to be system
specific.

Polynomials are one of the simplest and most
familiar classes of functions and they find
wide use in applied mathematics.
A degree $n$ \py\
$$
p_n(x) = a_0 + a_1 x + \cdots + a_n x^n
$$
is defined by its $n+1$ coefficients
$a_0,\dots,a_n \in \C$ (with $a_n \ne 0$).

Figure 1 LaTeX source for part of the language of applied
mathematics [I.2 §14]. \py and \C are user-defined macros.
\cdots and \dots are built-in TeX macros. Dollar signs
delimit mathematics mode.
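
The fragment in figure 1 cannot be compiled by itself, because it relies on the user-defined macros \py and \C and has no preamble. A minimal sketch of a complete file follows. The two macro definitions are assumptions made for illustration (the source does not show them), as is the choice of the amssymb package for the blackboard-bold C.

```latex
\documentclass{article}
\usepackage{amssymb}             % provides \mathbb

% Assumed definitions: the original macros are not shown in figure 1.
\newcommand{\py}{polynomial}     % shorthand for a frequently typed word
\newcommand{\C}{\mathbb{C}}      % the set of complex numbers

\begin{document}

Polynomials are one of the simplest and most
familiar classes of functions and they find
wide use in applied mathematics.
% "\py\" at the end of a line is the macro followed by a control space,
% so that a space is typeset after the expanded word.
A degree $n$ \py\
$$
p_n(x) = a_0 + a_1 x + \cdots + a_n x^n
$$
is defined by its $n+1$ coefficients
$a_0,\dots,a_n \in \C$ (with $a_n \ne 0$).

\end{document}
```

Keeping each sentence on its own line, as here, changes nothing in the output but makes sentences easy to move during editing.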

TeX compiles to its own DVI (device independent) file
format, which can then be translated into PostScript,
a file format commonly used for printing. The standard
format for distributing documents is now PDF.
While PostScript can be converted to PDF, versions of
TeX and LaTeX that compile directly to PDF are available
(typically invoked as pdftex and pdflatex). Whether
one is using the DVI-based or PDF-based versions of
LaTeX affects how one generates graphics files for inclusion
in figures. For DVI, included figures are typically
in encapsulated PostScript format, whereas for PDF
graphics files are typically in PDF or JPEG format and
PostScript files are not allowed. I use a PDF workflow,
though the Companion itself uses DVI and PostScript,
because many of the figures in the book needed fine-tuning
and this is more easily done in PostScript than
in PDF. It is important to note that the Adobe Acrobat
program is not suitable for use as a PDF previewer in
the edit-compile-preview cycle as it does not refresh
the view when a PDF file is updated on disk. Various
open-source alternatives are available that do not have
this limitation.
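
As an illustration of how the choice of workflow affects included graphics, here is a minimal sketch that uses the standard graphicx package. The graphic name example-image is a placeholder: it assumes that the mwe package, which ships sample images in several formats, is installed. In practice one would give the name of one's own figure file.

```latex
\documentclass{article}
\usepackage{graphicx}            % provides \includegraphics

\begin{document}

\begin{figure}
  \centering
  % No file extension is given, so graphicx looks for a format that the
  % current workflow accepts: PDF, PNG, or JPEG under pdflatex, and
  % encapsulated PostScript under the DVI route.
  \includegraphics[width=0.6\textwidth]{example-image}
  \caption{A figure included from an external graphics file.}
\end{figure}

\end{document}
```

Omitting the extension is one way to keep the same source usable in both workflows, provided that a copy of the figure exists in each format.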

2 Preparing a Bibliography 

A potentially time-consuming and error-prone part of
writing a paper is preparing the bibliography, which
contains the bibliographic details of the articles that
are cited. In a LaTeX document the bibliography entries
are cited with a command of the form \cite{smit65},
where smit65 is a key that uniquely specifies the
entry in a bibliography environment that contains
the item being cited. LaTeX has a companion program
called BibTeX that extracts bibliography entries from
a database contained in a bib file (an ASCII file of a
special structure with a .bib extension) and automatically
creates the bibliography environment. Figure 2
shows an example of a bib file entry. Using BibTeX is
a great time-saver and ensures accurate bibliographies,
assuming that the bib file is accurate and kept up to
date. Most journal Web sites allow BibTeX entries for
papers to be downloaded, so it is easy to build up a
personal bib file. There exist open-source BibTeX reference
managers (such as JabRef) that facilitate creating
and maintaining bib files.

@book{knut86,
  author    = "Donald E. Knuth",
  title     = "The {\TeX\ book}",
  publisher = "Addison-Wesley",
  address   = "Reading, MA, USA",
  year      = 1986,
  pages     = "ix+483",
  isbn      = "0-201-13448-9"}

Figure 2 An example of a BibTeX bibliography entry.
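
To show how the pieces fit together, here is a minimal sketch of a document that cites the entry in figure 2. The file name refs.bib and the choice of the standard plain bibliography style are assumptions made for illustration. The filecontents environment is used only to keep the example in one file; normally the bib file is maintained separately.

```latex
% Write the database to refs.bib (a hypothetical file name) on the first run.
\begin{filecontents}{refs.bib}
@book{knut86,
  author    = "Donald E. Knuth",
  title     = "The {\TeX\ book}",
  publisher = "Addison-Wesley",
  address   = "Reading, MA, USA",
  year      = 1986,
  pages     = "ix+483",
  isbn      = "0-201-13448-9"}
\end{filecontents}

\documentclass{article}
\begin{document}

The typesetting of mathematics is described by Knuth~\cite{knut86}.

\bibliographystyle{plain}   % a standard BibTeX style; journals supply their own
\bibliography{refs}         % take the cited entries from refs.bib

\end{document}
```

The usual cycle is to run LaTeX, then BibTeX, then LaTeX twice more so that the citation labels are resolved.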

A digital object identifier (DOI) is a character string 
that uniquely identifies an electronic document. It can 
be resolved into a uniform resource locator (URL) by 
preceding it with the string http://dx.doi.org/. Nowa- 
days most papers (and many books) have DOIs and 
many older papers have been assigned DOIs. A DOI 
remains valid even if the location of the document 
changes, provided that the publisher updates the metadata.
It is recommended to record DOIs in BibTeX
databases, and it is then possible with the use of a suitable
BibTeX style file and the LaTeX hyperref package
to include clickable links in a paper’s bibliography (for 
example, from a paper’s title). 
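
The mechanism behind such links can be sketched directly with the hyperref package, without a bibliography style. The DOI used below is only a stand-in, to be replaced by that of the paper being cited, and the link text is hypothetical.

```latex
\documentclass{article}
\usepackage{hyperref}            % provides \href for clickable links

\begin{document}

% The URL is the resolver prefix followed by the DOI.
% 10.1000/182 is a stand-in DOI used purely for illustration.
See the \href{http://dx.doi.org/10.1000/182}{title of the cited work}
for further details.

\end{document}
```

A bibliography style that supports DOIs does the same thing automatically for every entry in the bib file that records one.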

3 Graphics 

Mathematics papers often contain figures that plot
functions, depict physical setups, or graph experimental
results. These can be produced in many different
ways. In a LaTeX workflow one can generate a graphic
outside LaTeX and then include it as an external JPEG,
PostScript, or PDF file or generate it from within LaTeX.
The most popular LaTeX packages for graphics are TikZ
and PGFPlots, which are built on top of the low-level
primitives provided by the PGF (portable graphics format)
package. Most of the figures in part I of this volume
were generated using these packages. A major
benefit of them is that they can incorporate LaTeX commands
and fonts, thus providing consistency with the
main text. These powerful packages are not easy to
use, but one can usually find an example online that
provides a starting point for modification.
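
As a starting point of the kind just mentioned, here is a minimal sketch of a function plot generated from within LaTeX using PGFPlots. The cubic being plotted, the axis labels, and the compatibility level are all choices made for illustration.

```latex
\documentclass{article}
\usepackage{pgfplots}            % also loads TikZ and PGF
\pgfplotsset{compat=1.9}         % assumed compatibility level

\begin{document}

\begin{tikzpicture}
  \begin{axis}[
      xlabel={$x$},              % labels are set in the document's own fonts
      ylabel={$p(x)$},
      domain=-2:2,               % range of x over which to sample
      samples=100]
    \addplot[thick] {x^3 - 2*x}; % an arbitrary cubic, chosen for illustration
  \end{axis}
\end{tikzpicture}

\end{document}
```

Because the labels are ordinary LaTeX, any macros defined in the paper can be used inside the figure as well.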

4 Version Control and Backups 

Every good workflow contains procedures for making 
regular backups of files and recording a history of dif- 
ferent versions of the files. Backups store one or more 
copies of the current version of key files on a separate 
disk or machine, so that a hard disk failure or the loss 
of a complete machine does not result in loss of files. 
Version control serves a different purpose, which is to 
record in a repository intermediate states of files so 
that authors can revert to an earlier version of a file or 
reinstate part of one. The use of the plural in “authors” 
refers to the fact that version control systems allow 
more than one user to contribute to a repository and 
allow any user to check out the latest versions of files. 
Although version control originated in software devel- 
opment, it is equally useful for documents. As long as 
the repository is kept on a different disk or machine, 
version control also provides a form of backup. 

Of course a simple version control system is to regu- 
larly copy a file to another directory, renaming it with a 
version number (paper1.tex, paper2.tex, ...). However,
this is tedious and error prone. A proper version con- 
trol system keeps files in a database and stores only the 
lines that have changed between one version and the 
next. Popular version control systems include Git and 
Subversion (SVN). Although these are command based 
and can be difficult to learn, graphical user interfaces 
(GUIs) are available that simplify their usage. 
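
The idea of recording only the lines that differ between two versions can be illustrated with a short sketch. The Python fragment below uses the standard difflib module to compute a line-based difference; the file names and contents are invented for illustration, and real systems such as Git and Subversion use their own storage formats rather than this one.

```python
import difflib

# Two hypothetical versions of a small LaTeX file (contents invented for illustration).
version1 = [
    "\\section{Introduction}\n",
    "Every good workflow contains backups.\n",
    "Version control is useful.\n",
]
version2 = [
    "\\section{Introduction}\n",
    "Every good workflow contains regular backups.\n",
    "Version control is useful.\n",
]

# A unified diff records the changed lines plus a little surrounding context.
diff = list(difflib.unified_diff(version1, version2,
                                 fromfile="paper1.tex", tofile="paper2.tex"))

# Keep only the removed (-) and added (+) lines, skipping the two header lines.
changed = [line for line in diff
           if line[0] in "+-" and not line.startswith(("+++", "---"))]

print("".join(diff))
```

Only one line differs between the two versions, so the difference consists of one removed line and one added line, however long the file is.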

Microsoft Word’s “track changes” feature provides 
annotations of who made what changes to a document 
(and is a primitive and widely used form of version control).
In LaTeX a similar effect can be achieved by using
the latexdiff command-line program provided with
some LaTeX distributions, which takes as input two different
versions of a LaTeX file and produces a third LaTeX
file that marks up the differences between them; see 
figure 3 for an example.

Figure 3 Example of output from latexdiff.

5 Computational Experiments 

In papers that involve computational experiments, one 
needs to include figures or tables summarizing the 
results. It is typical that as a paper is developed the 
experiments are refined and repeated. One therefore 
needs an efficient way to regenerate tables and figures. 
Cutting and pasting the output of a program into the 
paper is not a good approach. It is much better to make 
the program output the results in a form that can be 
directly included in the paper (e.g., via an \input command
in LaTeX). Literate programming [VII.11] techniques
allow program code to be included within the
source code for a paper, and they automate the run- 
ning of the programs and the insertion of the results 
back into the paper source code. 
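
As a minimal sketch of the first approach, the following Python script runs a small experiment and writes the results as a LaTeX tabular environment to a file that the paper could pull in with \input. The experiment, the function name make_table, and the file name results_table.tex are all hypothetical, chosen only for illustration.

```python
import math
import os
import tempfile

def make_table(rows):
    """Return a LaTeX tabular environment for (n, error) pairs."""
    lines = ["\\begin{tabular}{rl}", "$n$ & Error \\\\ \\hline"]
    for n, err in rows:
        lines.append(f"{n} & {err:.2e} \\\\")
    lines.append("\\end{tabular}")
    return "\n".join(lines) + "\n"

# Hypothetical experiment: error in approximating e by partial sums of its series.
rows = []
for n in (2, 4, 8):
    approx = sum(1 / math.factorial(k) for k in range(n + 1))
    rows.append((n, abs(math.e - approx)))

table = make_table(rows)

# Write the table to a file; a temporary directory stands in for the project directory.
outdir = tempfile.mkdtemp()
path = os.path.join(outdir, "results_table.tex")
with open(path, "w") as f:
    f.write(table)

print(table)
```

Rerunning the script after the experiment is refined regenerates the file, so the table in the paper never has to be edited by hand.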

6 Putting It All Together 

Here are the things I do when I start to write a paper. 
I create a directory (folder) with a name that denotes 
the project in question. In that directory I copy a file 
paper.tex from a recent paper that I have written and
use it as a template. I delete most of the content of
paper.tex but keep the macros and some of the basic
structural commands. I set up a repository in my version
control system and commit paper.tex to it. I create
a subdirectory for the computer programs I will
write and a subdirectory named figs into which the 
PDF figures will be placed. 

7 Presentations 

As well as writing a paper about a piece of research, one 
may want to give a presentation about it in a seminar 
or at a conference. This will normally involve preparing 
slides or a poster, although it is still sometimes possible 
to give a blackboard talk. LaTeX has excellent tools for
preparing slides and posters. 

The Beamer class is the most widely used way to prepare
slides in LaTeX. It can create overlays, allowing a
slide to change dynamically (perhaps as an equation
is built up, a piece at a time). Slide color and background,
and elements such as a header (which may contain
a mini-table of contents) and a footer, are all readily
customized. Figure 4 shows an example slide.

Fréchet Derivative

Fréchet derivative of $f : \mathbb{C}^{n\times n} \to \mathbb{C}^{n\times n}$ at $X \in \mathbb{C}^{n\times n}$:
a linear mapping $L : \mathbb{C}^{n\times n} \to \mathbb{C}^{n\times n}$ such that for all $E \in \mathbb{C}^{n\times n}$

$f(X + E) - f(X) - L(X, E) = o(\|E\|)$.

Example: for $f(X) = X^2$ we have

$f(X + E) - f(X) = XE + EX + E^2$,

so $L(X, E) = XE + EX$.

Nick Higham, Matrix Functions, 17 / 28

Figure 4 A Beamer slide.
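
The example on the slide in figure 4 can be checked directly from the definition. The short derivation below assumes a submultiplicative matrix norm, an assumption that is not stated on the slide; note that XE and EX are kept separate because matrices need not commute.

```latex
\begin{align*}
f(X+E) - f(X) &= (X+E)^2 - X^2 \\
              &= X^2 + XE + EX + E^2 - X^2 \\
              &= \underbrace{XE + EX}_{L(X,E)} + E^2, \\
\|f(X+E) - f(X) - L(X,E)\| &= \|E^2\| \le \|E\|^2 = o(\|E\|)
  \quad \text{as } \|E\| \to 0 .
\end{align*}
```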

Various LaTeX packages are available for producing
posters, at up to A0 paper size. A popular one is the
beamerposter package built on Beamer. 

8 Collaboration 

In the early days of the Internet the most common way 
for authors to collaborate was to email documents back 
and forth. A regularly encountered problem was that 
Unix mailers would insert a greater than sign in front of 
any word “from” that appeared at the start of a line of a 
plain text message, so LaTeX files would often have stray
> characters. For many people, email still serves as a 
useful mechanism for collaborative writing, but more 
sophisticated approaches are available. A file-hosting 
service such as Dropbox enables a group of users to 
share and synchronize a folder on their disks via the 
cloud. Version control based on a shared repository 
hosted on the Internet is the most powerful approach; 
it is widely used by programmers (e.g., on sites such as 
GitHub and SourceForge) and is increasingly popular 
with authors of papers. 

9 Workflow for This Book 

I wrote my articles using Emacs and TeX Live, with all
files under version control with Git. I edited some of
my figures in Adobe Photoshop. The production editor/
typesetter, Sam Clark, used WinEdt and TeX Live with a
PostScript-based workflow, editing PostScript figures in
Adobe Illustrator.

I produced a draft index for the articles that I
authored using LaTeX indexing commands and the MakeIndex
program. A professional indexer then expanded
the index to cover the whole book. 

The font used for this book is Lucida Bright, which 
has a full set of mathematical symbols that work well 
in TeX. It is from the same family as the Lucida Grande
sans serif font that was used throughout the Mac OS X 
user interface up until version 10.9. 

Further Reading 

Of the many good references on LaTeX I recommend Griffiths
and D. J. Higham (1997) for a brief introduction
and Kopka and Daly (2004) for a more comprehensive
treatment. Knuth (1986) continues to be worth reading,
even for those who use only LaTeX. Various aspects
of workflow are covered in Higham (1998). Version control
is best explored with the many freely available Web
resources.

A good place to start looking for information about
TeX and LaTeX is the Web site of the TeX Users Group,
http://tug.org. A large collection of LaTeX packages is
available at the Comprehensive TeX Archive Network
(CTAN), http://www.ctan.org.

Griffiths, D. F., and D. J. Higham. 1997. Learning LaTeX.
Philadelphia, PA: SIAM.

Higham, N. J. 1998. Handbook of Writing for the Mathematical
Sciences, 2nd edn. Philadelphia, PA: SIAM.

Knuth, D. E. 1986. The TeXbook. Reading, MA: Addison-Wesley.

Kopka, H., and P. W. Daly. 2004. Guide to LaTeX, 4th edn.
Boston, MA: Addison-Wesley.


VIII. 5 Reproducible Research in the 
Mathematical Sciences 

David L. Donoho and Victoria Stodden 


1 Introduction 

Traditionally, mathematical research was conducted 
via mental abstraction and manual symbolic manipu- 
lation. Mathematical journals published theorems and 
completed proofs, while other sorts of evidence were 
gathered privately and remained in the shadows. For 
example, long after Riemann had passed away, his- 
torians discovered that he had developed advanced 
techniques for calculating the Riemann zeta function
and that his formulation of the Riemann hypothesis
(often depicted as a triumph of pure thought) was actually
based on painstaking numerical work. In fact, Riemann’s
computational methods remained far ahead of
what was available to others for decades after his death. 
This example shows that mathematical researchers 
have been “covering their (computational) tracks” for 
a long time. 

Times have been changing. On the one hand, mathe- 
matics has grown into the so-called mathematical sci- 
ences, and in this larger endeavor, proposing new com- 
putational methods has taken center stage and doc- 
umenting the behavior of proposed methods in test 
cases has become an important part of research activ- 
ity (witness current publications throughout the mathe- 
matical sciences, including statistics, optimization, and 
computer science). On the other hand, even pure math- 
ematics has been affected by the trend toward compu- 
tational evidence; Tom Hales’s brilliant article “Math- 
ematics in the age of the Turing machine” points to 
several examples of important mathematical regulari- 
ties that were discovered empirically and have driven 
much subsequent mathematical research, the Birch and 
Swinnerton-Dyer conjecture being his lead example. 
This conjecture posits deep relationships between the 
zeta function of elliptic curves and the rank of elliptic 
curves, and it was discovered by counting the number 
of rational points on individual elliptic curves in the 
early 1960s. 

We can expect that, over time, an ever-increasing 
fraction of what we know about mathematical struc- 
tures will be based on computational experiments, 
either because our work (in applied areas) is explicitly 
about the behavior of computations or because (in pure 
mathematics) the leading questions of the day concern 
empirical regularities uncovered computationally. 

Indeed, with the advent of cluster computing, cloud 
computing, graphics processing unit boards, and other 
computing innovations, it is now possible for a re- 
searcher to direct overwhelming amounts of computa- 
tional power at specific problems. With mathematical 
programming environments like Mathematica, MAT- 
LAB, and Sage, it is possible to easily prototype algo- 
rithms that can then be quickly scaled up using the 
cloud. Such direct access to computational power is an 
irresistible force. Reflect for a moment on the fact that 
the Birch and Swinnerton-Dyer conjecture was discov- 
ered using the rudimentary computational resources of 
the early 1960s. Research in the mathematical sciences 
can now be dramatically more ambitious in scale and
scope. This opens up very exciting possibilities for discovery
and exploration, as explained in EXPERIMENTAL
APPLIED MATHEMATICS [VIII.6].

Figure 1 The number of lines of code published in ACM
Transactions on Mathematical Software, 1960-2012, on a
log scale. The proportion of articles that published code
remained roughly constant at about a third, with standard
error of about 0.12, and the journal consistently published
around thirty-five articles each year.

The expected scaling up of experimental and com- 
putational mathematics is, at the same time, problem- 
atic. Much of the knowledge currently being generated 
using computers is not of the same quality as tra- 
ditional mathematical knowledge. Mathematicians are 
very strict and demanding when it comes to under- 
standing the basis of a theorem, the assumptions used, 
the prior theorems on which it depends, and the chain 
of inference that establishes the theorem. As it stands, 
the way in which evidence based on computations is 
typically published leaves “a great deal to the imagi- 
nation,” and computational evidence therefore simply 
does not have the same epistemological status as a 
rigorously proved theorem. 

Algorithms are becoming ever more complicated. Fig- 
ure 1 shows the number of lines of code published in 
the journal ACM Transactions on Mathematical Soft- 
ware from 1960 to 2012. The number of lines has 
increased exponentially, from 875 in 1960 to nearly 
5 million in 2012, including libraries. The number 
of articles in the journal that contain code has been 
roughly constant; individual algorithms are requiring 
ever more code, even though modern languages are 
ever more expressive. 
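
To get a feel for what this exponential increase means, one can estimate the implied growth rate from the two endpoints quoted above. The sketch below assumes a pure exponential model between 1960 and 2012 and rounds "nearly 5 million" up to 5 million, so the result is only a rough estimate.

```python
import math

lines_1960 = 875          # figure quoted in the text
lines_2012 = 5_000_000    # "nearly 5 million", rounded up for this estimate
years = 2012 - 1960

# Assume lines(t) = lines_1960 * exp(rate * t) and solve for the rate.
rate = math.log(lines_2012 / lines_1960) / years
annual_factor = math.exp(rate)          # multiplicative growth per year
doubling_time = math.log(2) / rate      # years needed for the count to double

print(f"annual growth factor: {annual_factor:.3f}")
print(f"doubling time: {doubling_time:.1f} years")
```

Under these assumptions the amount of published code grows by roughly 18% per year, which corresponds to doubling about every four years.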

Algorithms are also being combined in increas- 
ingly complicated processing pipelines. Individual algo- 
rithms of the kind that have traditionally been docu- 
mented in journal articles increasingly represent only 
a small fraction of the code making up a computational
science project. Scaling up projects to fully exploit
the potential of modern computing resources requires 
complex workflows to pipeline together numerous 
algorithms, with problems broken into pieces and 
farmed out to be run on numerous processors and 
the results harvested and combined in project-specific 
ways. As a result, a given computational project may 
involve much infrastructure not explicitly described 
in journal articles. In that environment, journal articles
become simply advertisements: pointers to a complex
body of software development, experimental outcomes,
and analyses, in which there is really no hope
that “outsiders” can understand the full meaning of 
those summaries. 

The computational era seems to be thrusting the 
mathematical sciences into a situation in which math- 
ematical knowledge in the wide sense, also includ- 
ing solidly based empirical discoveries, is broader and 
more penetrating but far less transparent and far less 
“common property” than ever. Individual researchers 
report that over time they are becoming increasingly 
uncertain about what other researchers have done and 
about the strength of evidence underlying the results 
those other researchers have published. 

The phrase mathematical sciences contains a key to 
improving the situation. The traditional laboratory sci- 
ences evolved, over hundreds of years, a set of pro- 
cedures for enabling the reproducibility of findings in 
one laboratory by other laboratories. As the mathe- 
matical sciences evolve toward ever-heavier reliance on 
computation, they should likewise develop a discipline 
for documenting and sharing algorithms and empirical 
mathematical findings. Such a disciplined approach to 
scholarly communication in the mathematical sciences 
offers two advantages: it promotes scientific progress, 
and it resolves uncertainties and controversies that 
spread a “fog of uncertainty.” 

2 Reproducible Research 

We fully expect that in two decades there will be 
widely accepted standards for communication of find- 
ings in computational mathematics. Such standards are 
needed so that computational mathematics research 
can be used and believed by others. 

The raw ingredients that could enable such standards 
seem to already be in place today. Problem solving envi- 
ronments (PSEs) like MATLAB, R, IPython, Sage, and 
Mathematica, as well as open-source operating systems 
and software, now enable researchers to share their
code and data with others. While such sharing is not
nearly as common as it should be, we expect that it 
soon will be. 

In a 2006 lecture, Randall J. LeVeque described well 
the moment we are living through. On the one hand, 
many computational mathematicians and computa- 
tional scientists do not work reproducibly: 

Even brilliant and well-intentioned computational sci- 
entists often do a poor job of presenting their work 
in a reproducible manner. The methods are often very 
vaguely defined, and even if they are carefully defined 
they would normally have to be implemented from 
scratch by the reader in order to test them. Most mod- 
ern algorithms are so complicated that there is little 
hope of doing this properly. 

On the other hand, LeVeque continues, the ingredients 
exist: 

The idea of “reproducible research” in scientific com- 
puting is to archive and make publicly available all of 
the codes used to create the figures or tables in a paper 
in such a way that the reader can download the codes 
and run them to reproduce the results. The program 
can then be examined to see exactly what has been 
done. The development of very high level programming 
languages has made it easier to share codes and gen- 
erate reproducible research. . . . These days many algo- 
rithms can be written in languages such as MATLAB in 
a way that is both easy for the reader to comprehend 
and also executable, with all details intact. 

While the technology needed for reproducible re- 
search exists today, mathematical scientists do not yet 
agree on exactly how to use this technology in a dis- 
ciplined way. At the time of writing, there is a great 
deal of activity to define and promote standards for 
reproducible research in computational mathematics. 

A number of publications address reproducibility 
and verification in computational mathematics; top- 
ics covered include computational scale and proof 
checking, probabilistic model checking, verification of 
numerical solutions, standard methods in uncertainty 
quantification, and reproducibility in computational 
research. This is not an exhaustive account of the liter- 
ature in these areas, of course, merely a starting point 
for further investigation. 

In this article we review some of the available tools 
that can enable reproducible research and conclude 
with a series of “best-practice” recommendations based 
on modern examples and research methods. 


3 Script Sharing Based on PSEs 

3.1 PSEs Offer Power and Simplicity 

A key precondition for reproducible computational 
research is the ability for researchers to run the 
code that generated results in some published paper 
of interest. Traditionally, this has been problematic. 
Researchers were often unprepared or unwilling to 
share code, and even if they did share it, the impact 
was minimal as the code depended on a specific com- 
putational environment (hardware, operating system, 
compiler, etc.) that others could not access. 

PSEs like R, Mathematica, and MATLAB have, over the 
last decade, dramatically simplified and uniformized 
much computational science. 

Each PSE offers a high-level language for describing 
computations, often a language that is very compat- 
ible with standard mathematical notation. PSEs also 
offer graphics capabilities that make it easy to pro- 
duce often quite sophisticated figures for inclusion in 
research papers. The researcher is gaining extreme ease 
of access to fundamental capabilities like matrix alge- 
bra, symbolic integration and optimization, and sta- 
tistical model fitting; in many cases, a whole research 
project, involving a complex series of variations on 
some basic computation, can be encoded in a few 
compact command scripts. 

The popularity of this approach to computing is 
impressive. Figure 2 shows that the PSEs with the 
most impact on research (by number of citations) are 
the commercial closed-source packages Mathematica 
and MATLAB, which revolutionized technical comput- 
ing in the 1980s and 1990s. However, these systems are 
no longer rapidly growing in impact, while the recent 
growth in popularity of R and Python is dramatic. 

3.2 PSEs Facilitate Reproducibility 

As LeVeque pointed out in the quote above, a side effect 
of the power and compactness of coding in PSEs is that 
reproducible research becomes particularly straight- 
forward, as the original researcher can supply some 
simple command scripts to interested researchers, who 
can then rerun the experiment or variations of it pri- 
vately in their own local instances of the relevant 
PSE. 

Figure 2 The total number of hits on Google Scholar for
each of the four search terms: (a) “R software”, (b) MATLAB,
(c) Python, and (d) Mathematica. The search was carried out
for each year in the decade 2004-13. Note that the y-axes
are on different scales to show the increase or decrease
in software use over time. R and Python are open source,
whereas MATLAB and Mathematica are not.

In some fields, authors of research papers are already
heavily committed to a standard of reproducing results
in published papers by sharing PSE scripts. In statistics,
for example, papers often seek to introduce new tools
that scientists can apply to their data. Many authors
would like to increase the visibility and impact of such 
methodological papers and are persuaded that a good 
way to do this is to make it as easy as possible for users 
to try the newly proposed tools. Traditional theoretical 
statistics journal papers might be able to expect cita- 
tions in the single or low double digits; there are numer- 
ous recent examples of articles that were supplemented 
by easy access to code and that obtained hundreds
of readers and citations. It became very standard for
authors in statistics to offer access to code using pack- 
ages in one specific PSE, R. To build such a package, 
authors document their work in a standard LaTeX format
and bundle up the R code and documentation in
a defined package structure. They post their package 
on CRAN, the Comprehensive R Archive Network. All 
R users can access the code from within R by simple 
invocations (require("package_name")) that direct
R to locate, download, and install the package from 
CRAN. This process takes only seconds. Consequently, 
all that a user needs to know today to begin applying 
a new methodology is the name of the package. CRAN 
offered 5519 packages as of May 8, 2014. A side effect 
of authors making their methodology available in order
to attract readers is, of course, that results in their 
original articles may become easily reproducible. 1 

3.3 Notebooks for Sharing Results 

A notebook interface to a PSE stores computer instruc- 
tions alongside accompanying narrative, which can 
include mathematical expressions, and allows the user 
to execute the code and store the output, including fig- 
ures, all in one document. Because all the steps leading 
to the results are saved in a single file, notebooks can 
be shared online, which provides a way to communicate 
reproducible computational results. 

The Jupyter Notebook (formerly known as the IPython
Notebook) provides an interface to back-end computations,
for example in Python or R, that displays
code and output, including figures, with LaTeX used to
typeset mathematical notation (see figure 3). A Jupyter
Notebook permits the researcher to track and document
the computational steps that generate results and
can be shared with others online using nbviewer (see
http://nbviewer.ipython.org).
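
The way a notebook keeps narrative and code together in one document can be sketched by building a tiny notebook file by hand. The fragment below assumes the version 4 notebook JSON layout (a list of cells, each with a cell type and source text); the cell contents are invented for illustration, and the outputs list is left empty because nothing is executed here.

```python
import json

# Minimal notebook in the (assumed) version 4 JSON layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Narrative cell: text with LaTeX-style mathematics.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["The partial sums of $\\sum_{k=1}^{n} 1/k^2$ approach $\\pi^2/6$."],
        },
        {
            # Code cell: instructions stored next to the narrative.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["sum(1 / k**2 for k in range(1, 1001))"],
        },
    ],
}

# Serialize to the single text document that would be saved and shared.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Because the whole notebook is one text file, it can be placed under version control or posted online like any other source file.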

4 Open-Source Software: A Key Enabler 

PSEs and notebook interfaces are having a very substantial
effect in promoting reproducibility, but they have
their limits. They make many research computations
convenient and easy to share with others, but ambitious
computations often demand more capability than
they can offer. Historically, this would have meant that
ambitious projects had to be idiosyncratically coded
and were difficult to export to new computing environments.

1. In fields like statistics, code alone is not sufficient to reproduce
published results. Computations are performed on data sets from specific
scientific projects; the data may result from experiments, surveys,
or costly measurements. Increasingly, data repositories are being used
by researchers to share such data across the Internet. Since 2010, arXiv
has partnered with Data Conservancy to facilitate external hosting of
data associated with publications uploaded to arXiv (see, for example,
http://arxiv.org/abs/1110.3649v1, where the data files are accessible
from the paper’s arXiv page). Such practices are not yet widespread,
but they are occurring with increasing frequency.

Figure 3 A snapshot of the interactive Jupyter Notebook.

The open-source revolution has largely changed this. 
Today, it is often possible to develop all of an ambitious 
computational project using code that is freely avail- 
able to others. Moreover, this code can be hosted on an 
open-source operating system (Linux) and run within a 
standard virtual machine that hides hardware details. 
The open-source “spirit” also makes researchers more 
open to sharing code; attribution-only open-source 
licenses may also allow them to do this while retain- 
ing some assurance that the shared code will not be 
misappropriated. 

Several broad classes of software are now being 
shared in ways that we describe in this section. These 
various classes of software are becoming, or have 
already become, part of the standard approaches to 
reproducible research. 

4.1 Fundamental Algorithms and Packages 

In table 1 we consider some of the fundamental problems
that underlie modern computational mathematics,
such as FAST FOURIER TRANSFORMS [II.10], LINEAR
EQUATIONS [IV.10], and NONLINEAR OPTIMIZATION
[IV.11], and we give examples of some of the many
families of open-source codes that have become available
for enabling high-quality mathematical computation.
The table includes the packages’ inception dates,
their current release numbers, and the total numbers of
citations that the packages have garnered since inception. 2
The different packages within each section of the
table may offer very different approaches to the same 
underlying problem. As the reader can see, a stagger- 
ing amount of basic functionality is being developed 
worldwide by many teams and authors in particular 
subdomains, and it is being made available for broad 
use. The citation figures in the table testify to the sig- 
nificant impact these enablers are having on published 
research. 

4.2 Specialized Systems 

The packages tabulated in table 1 are broadly useful 
in computational mathematics; it is perhaps not sur- 
prising that developers would rise to the challenge of 
creating such broadly useful tools. We have been sur- 
prised to see the rise of systems that attack very spe- 
cific problem areas and offer extremely powerful envi- 
ronments to formulate and solve problems in those 
narrow domains. We give three examples. 

4.2.1 Hyperbolic Partial Differential Equations (PDEs) 

Clawpack is an open-source software package designed 
to compute numerical solutions to hyperbolic PDEs 
using a wave propagation approach. According to the 
system’s lead author, Randall J. LeVeque, “the devel- 
opment and use of the Clawpack software implement- 
ing [high-resolution finite-volume methods for solving 
hyperbolic PDEs] serves as a case study for a more gen- 
eral discussion of mathematical aspects of software 
development and the need for more reproducibility in 
computational research.” 

The package has been used in the creation of repro- 
ducible mathematical research. For example, the fig- 
ures for LeVeque's book Finite Volume Methods for 
Hyperbolic Problems were generated using Clawpack; 
instructions are provided for recreating those figures. 

Clawpack is now a framework that offers numerous 
extensions including PyClaw (with a Python interface to 
a number of advanced capabilities) and GeoClaw (devel- 
oped for tsunami modeling [V.19] and the modeling 
of other geophysical flows). Open-source software prac- 
tices have apparently enabled not only reproducibility 
but also code extension and expansion into new areas. 


2. The data for citation counts was collected via Google Scholar in
August 2013. Note that widely used packages such as LAPACK, FFTW,
ARPACK, and SuiteSparse are built into other software (e.g., MATLAB),
which does not generate citations for them directly.

Table 1 Software for some fundamental problems
underlying modern computational mathematics.
(A hyphen marks an entry that is missing from the table.)

Package           Year of    Current
                  inception  release   Citations

Dense linear algebra
  LAPACK          1992       3.4.2     7600
  JAMA            1998       1.0.3     129
  IT++            2006       4.2       14
  Armadillo       2010       3.900.7   105
  EJML            2010       0.23      22
  Elemental       2010       0.81      51

Sparse-direct solvers
  SuperLU         1997       4.3       317
  MUMPS           1999       4.10.0    2029
  Amesos          2004       11.4      104
  PaStiX          2006       5.2.1     114
  Clique          2010       0.81      12

Krylov-subspace eigensolvers
  ARPACK          1998       3.1.3     2624
  SLEPc           2002       3.4.1     293
  Anasazi         2004       11.4      2422
  PRIMME          2006       1.1       61

Fourier-like transforms
  FFTW            1997       3.3.3     1478
  P3DFFT          2007       2.6.1     14
  DIGPUFFT        2011       2.4       17
  DistButterfly   2013       -         27
  PNFFT           2013       -         215

Fast multipole methods
  KIFMM3d         2003       -         1780
  Puma-EM         2007       0.5.7     32
  PetFMM          2009       -         29
  GemsFMM         2010       -         16
  ExaFMM          2011       -         28

PDE frameworks
  PETSc           1997       3.4       2695
  Cactus          1998       4.2.0     669
  deal.II         1999       8.0       576
  Clawpack        2001       4.6.3     131
  Hypre           2001       2.9.0     384
  libMesh         2003       0.9.2.1   260
  Trilinos        2003       11.4      3483
  Feel++          2005       0.93.0    405
  Lis             2005       1.4.11    29

Finite-element analysis
  Code Aster      -          11.4.03   48
  CalculiX        1998       2.6       69
  deal.II         1999       8.0       576
  DUNE            2002       2.3       325
  Elmer           2005       6.2       97
  FEniCS Project  2009       1.2.0     418
  FEBio           2010       1.6.0     32

Optimization
  MINUIT/MINUIT2  2001       94.1      2336
  CUTEr           2002       r152      1368
  IPOPT           2002       3.11.2    1517
  CONDOR          2005       1.11      1019
  OpenOpt         2007       0.50.0    24
  ADMB            2009       11.1      175

Graph partitioning
  Scotch          1992       6.0.0     435
  ParMeTIS        1997       4.0.3     4349
  kMeTIS          1998       1.5.3     3449
  Zoltan-HG       2008       r362      125
  KaHIP           2011       0.52      71

Adaptive mesh refinement
  AMRClaw         1994       4.6.3     4800
  PARAMESH        1999       4.1       409
  SAMRAI          1998       -         185
  Carpet          2001       4         579
  BoxLib          2000       -         155
  Chombo          2000       3.1       198
  AMROC           2003       1.1       342
  p4est           2007       0.3.4.1   227


4.2.2 Parabolic and Elliptic PDEs: DUNE 

The Distributed and Unified Numerics Environment 
(DUNE) is an open-source modular software toolbox 
for solving PDEs using grid-based methods. It was 
developed by Mario Ohlberger and other contributors 
and supports the implementation of methods such as 
finite elements, finite volumes, finite differences, and 
discontinuous Galerkin methods. 

DUNE was envisioned to permit the integrated use 
of both legacy libraries and new ones. The software 
uses modern C++ programming techniques to enable 
very different implementations of the same concepts 
(i.e., grids, solvers, linear algebra, etc.) using a com- 
mon interface with low overhead, meaning that DUNE 
prioritizes efficiency in scientific computations and 
supports high-performance computing applications. 
DUNE has a variety of downloadable modules including 
various grid implementations, linear algebra solvers, 
quadrature formulas, shape functions, and discretiza- 
tion modules. 

DUNE is based on several main principles: the separa- 
tion of data structures and algorithms by abstract inter- 
faces, the efficient implementation of these interfaces 
using generic programming techniques, and reuse of
existing finite-element packages with a large body of
functionality. The finite-element codes UG, ALBERTA, 
and ALUGrid have been adapted to the DUNE frame- 
work, showing the value of open-source development 
not only for reproducibility but for acceleration of 
discovery through code reuse. 3 
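
The first of these principles, separating algorithms from data structures through an abstract interface, can be illustrated in miniature. The sketch below is written in Python rather than DUNE's C++, and the names Grid, UniformGrid, GradedGrid, and integrate are hypothetical; it shows only the general idea and is not DUNE's API.

```python
from typing import Callable, List, Protocol, Tuple

class Grid(Protocol):
    """Abstract interface: a grid only has to list its cells as (left, right)."""
    def cells(self) -> List[Tuple[float, float]]: ...

class UniformGrid:
    """Equally sized cells on [a, b]."""
    def __init__(self, a: float, b: float, n: int):
        self.a, self.b, self.n = a, b, n

    def cells(self) -> List[Tuple[float, float]]:
        h = (self.b - self.a) / self.n
        return [(self.a + i * h, self.a + (i + 1) * h) for i in range(self.n)]

class GradedGrid:
    """Cells clustered near the left endpoint (nodes at a + (b - a) * (i/n)**2)."""
    def __init__(self, a: float, b: float, n: int):
        self.nodes = [a + (b - a) * (i / n) ** 2 for i in range(n + 1)]

    def cells(self) -> List[Tuple[float, float]]:
        return list(zip(self.nodes[:-1], self.nodes[1:]))

def integrate(f: Callable[[float], float], grid: Grid) -> float:
    """Midpoint rule written only against the Grid interface."""
    return sum(f(0.5 * (left + right)) * (right - left)
               for left, right in grid.cells())

# The same algorithm runs unchanged on two different grid implementations.
uniform = integrate(lambda x: x * x, UniformGrid(0.0, 1.0, 200))
graded = integrate(lambda x: x * x, GradedGrid(0.0, 1.0, 200))
print(uniform, graded)
```

The integrate function never needs to know how a grid stores its cells, which is what allows existing grid codes to be wrapped and reused behind a common interface.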

4.2.3 Computer-Aided Theorem Proving 

Computer-aided theorem proving [VII.3] has made
extremely impressive strides in the last decade. This 
progress ultimately rests on the underlying computa- 
tional tools that are openly available and that a whole 
community of researchers is contributing to and using. 
Indeed, one can only have justified belief in a compu- 
tationally enabled proof with transparent access to the 
underlying technology and broad discussion. 

There are, broadly speaking, two approaches to 
computer-aided theorem-proving tools in experimen- 
tal mathematics. The first type encompasses machine- 
human collaborative proof assistants and interactive 
theorem-proving systems to verify mathematics and 
computation, while the second type includes automatic 
proof checking, which occurs when the machine verifies 
previously completed human proofs or conjectures. 

Interactive theorem-proving systems include Coq,
Mizar, HOL4, HOL Light, Isabelle, LEGO, ACL2, Veritas, 
NuPRL, and PVS. Such systems have been used to verify 
the four-color theorem and to reprove important clas- 
sical mathematical results. Thomas Hales’s Flyspeck 
project is currently producing a formal proof of the 
Kepler conjecture, using HOL Light and Isabelle. The 
software produces machine-readable code that can be 
reused and repurposed into other proof efforts. Exam- 
ples of open-source software for automatic theorem 
proving include E and Prover9/Mace4.
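
To give a flavor of what a machine-checked statement looks like, here is a minimal sketch written for Lean 4, a proof assistant that is not among the systems named above and is used here purely as an illustration. The theorem name is hypothetical; `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- A machine-checked statement: addition of natural numbers is commutative.
-- The proof term simply appeals to the core-library lemma Nat.add_comm.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

The proof checker accepts the file only if the stated term really has the stated type, which is the sense in which such a proof can be verified and reused by others.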

5 Scientific Workflows 

Highly ambitious computations today often go beyond 
single algorithms to combine different pieces of soft- 
ware in complex pipelines. Moreover, modern research 
often considers a whole pipeline as a single object 
of study and makes experiments varying the pipeline 
itself. Experiments involving many moving parts that 
must be combined to produce a complete result are 
often called workflows. 

Kepler is an open-source project structured around 
scientific workflows: “an executable representation of


3. See also FEniCS (http://fenicsproject.org) for another example of
an open-source finite-element package. 



Figure 4 An example of the Kepler interface, showing a 
workflow solving the classic Lotka-Volterra predator-prey 
dynamics model. 

the steps required to generate results,” or the capture 
of experimental details that permit others to repro- 
duce computational findings. Kepler provides a graphi- 
cal interface that allows users to create and share these 
workflows. An example of a Kepler workflow is given 
in figure 4, solving a model of two coupled differen- 
tial equations and plotting the output. Kepler main- 
tains a component repository where workflows can 
be uploaded, downloaded, searched, and shared with 
the community or designated users, and it contains a 
searchable library with more than 350 processing com- 
ponents. Kepler operates on data stored in a variety of 
formats, locally and over the Internet, and can merge 
software from different sources such as R scripts and 
compiled C code by linking in their inputs and outputs 
to perform the desired overall task. 
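
The computation inside a workflow like the one in figure 4 can be sketched in a few lines. The following is not Kepler code; it is a plain Python illustration of the same task (integrating the two coupled Lotka-Volterra equations and keeping the trajectory for plotting). The parameter values, initial state, step size, and the choice of a classical fourth-order Runge-Kutta integrator are all assumptions made for the example.

```python
import math

# Hypothetical parameter values chosen only for illustration.
ALPHA, BETA, DELTA, GAMMA = 1.0, 0.1, 0.075, 1.5


def lotka_volterra(state):
    """Right-hand side: x' = alpha*x - beta*x*y, y' = delta*x*y - gamma*y."""
    x, y = state
    return (ALPHA * x - BETA * x * y, DELTA * x * y - GAMMA * y)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def run_workflow(initial=(10.0, 5.0), dt=0.01, steps=1500):
    """Integrate the model and return the whole trajectory (prey, predator)."""
    trajectory = [initial]
    for _ in range(steps):
        trajectory.append(rk4_step(lotka_volterra, trajectory[-1], dt))
    return trajectory


def invariant(state):
    """Quantity conserved by the exact flow; useful as a sanity check."""
    x, y = state
    return DELTA * x - GAMMA * math.log(x) + BETA * y - ALPHA * math.log(y)


if __name__ == "__main__":
    traj = run_workflow()
    print("final state:", traj[-1])
    print("drift in invariant:", invariant(traj[-1]) - invariant(traj[0]))
```

Recording the parameters and the integrator alongside the output is the kind of detail a workflow system captures automatically.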

6 Dissemination Platforms

Dissemination platforms are Web sites that serve spe- 
cialized content to interested visitors. They offer an 
interesting method for facilitating reproducibility; we 
describe here the Image Processing OnLine (IPOL) 
project and ResearchCompendia.org. 

Figure 5 An example IPOL publication. The three panels from left to right include the
manuscript, the cloud-executable demo, and the archive of all previous executions.

IPOL is an open-source journal infrastructure developed
in Python that publishes relevant image-processing
and image-analysis algorithms. The journal
peer reviews article contributions, including code, and
publishes accepted papers in a standardized format
that includes

• a manuscript containing a detailed description of 
the algorithm, its bibliography, and documented 
examples; 

• a downloadable software implementation of the 
algorithm; 

• an online demo, where the algorithm can be tested 
on data sets, for example images, uploaded by the 
users; and 

• an archive containing a history of the online exper- 
iments. 

Figure 5 displays these components for a sample IPOL 
publication. 

ResearchCompendia, which one of the authors is 
developing, is an open-source platform designed to link 
the published article with the code and data that gen- 
erated the results. The idea is based on the notion of 
a “research compendium”: a bundle including the arti- 
cle and the code and data needed to recreate the find- 
ings. For a published paper, a Web page is created that 
links to the article and provides access to code and data 
as well as metadata, descriptions, and documentation, 
and code and data citation suggestions. Figure 6 shows 
an example compendium page. 

ResearchCompendia assigns a Digital Object Iden- 
tifier (DOI) to all citable objects (code, data, com- 
pendium page) in such a way as to enable bidirectional 
Figure 6 An example compendium page on ResearchCompendia.org.
The page links to a published article and
provides access to the code and data that generated the 
published results. 

linking between related digital scholarly objects, such 
as the publication and the data and code that gen- 
erated its results (see www.stm-assoc.org/2012_06_ 
14_STM_DataCite_Joint_Statement.pdf). DOIs are well- 
established and widely used unique persistent identi- 
fiers for digital scholarly objects. There are other PSE- 
independent methods of sharing such as via GitHub 
(which can now assign DOIs to code: https://guides.git 
hub.com/activities/citable-code) and via supplemen- 
tary materials on journal Web sites. A DOI is affixed to a 
certain version of software or data that generates a cer-
tain set of results. For this reason, among others, ver- 
sion control [VIII.4 §4] for scientific codes and data 
is important for reproducibility. 4 

7 Best Practices for Reproducible 
Computational Mathematics 

Best practices for communicating computational math- 
ematics have not yet become standardized. The work- 
shop “Reproducibility in Computational and Experi- 
mental Mathematics”, held at the Institute for Com- 
putational and Experimental Research in Mathematics 
(ICERM) at Brown University in 2012, recommended the
following for every paper in computational mathemat- 
ics. 

• A precise statement of assertions made in the 
paper. 

• A statement of the computational approach and 
why it constitutes a rigorous test of the hypothe- 
sized assertions. 

• Complete statements of, or references to, every 
algorithm employed. 

• Salient details of auxiliary software (both research 
and commercial software) used in the computation. 

• Salient details of the test environment, including 
hardware, system software, and the number of 
processors utilized. 

• Salient details of data-reduction and statistical- 
analysis methods. 

• Discussion of the adequacy of parameters such as 
precision level and grid resolution. 

• A full statement (or at least a valid summary) of 
experimental results. 

• Verification and validation tests performed by the 
author(s). 

• Availability of computer code, input data, and out- 
put data, with some reasonable level of documen- 
tation. 

• Curation. Where are code and data available? With 
what expected persistence and longevity? Is there 
a site for future updates, e.g., a version control 
repository of the code base? 

• Instructions for repeating computational experi- 
ments described in the paper. 


4. Other reasons include good coding practices enabling reuse, 
assigning explicit credit for bug fixing and code extensions or applica- 
tions, efficiency in code organization and development, and the ability 
to join collaborative coding communities such as GitHub. 


• Terms of use and licensing. Ideally code and data 
“default to open,” i.e., a permissive reuse license, if 
nothing opposes it. 

• Avenues of exploration examined throughout de- 
velopment, including information about negative 
findings. 

• Proper citation of all code and data used, including 
that generated by the authors. 

These guidelines can, and should, be adapted to dif- 
ferent research contexts, but the goal is to provide 
readers with the information (such as metadata includ- 
ing parameter settings and workflow documentation), 
data, and code they require to independently verify 
computational findings. 
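
As one small illustration of the "test environment" item in the list above, the following sketch collects a few environment details using only the Python standard library. The function name and the selection of fields are assumptions made for this example; a real paper would typically also record library versions, compiler flags, and the number of processors actually used by the run (note that `os.cpu_count()` reports logical CPUs present, not CPUs utilized).

```python
import json
import os
import platform
import sys


def environment_record():
    """Return a small dictionary describing the machine and interpreter."""
    return {
        "python_version": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "operating_system": platform.platform(),
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),  # CPUs present, not CPUs used
    }


if __name__ == "__main__":
    # Store this next to the results so readers can see how they were produced.
    print(json.dumps(environment_record(), indent=2, sort_keys=True))
```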

8 The Outlook 

The recommendations of the ICERM workshop listed in 
the previous section are the least we would hope for 
today. They commendably propose that authors give 
enough information for readers to understand at some 
high level what was done. 

They do not actually require sharing of all code and 
data in a form that allows precise reexecution and 
reproduction of results, and as such, the recommenda- 
tions are very far from where we hope to be in twenty 
years. 

One can envision a day when every published re- 
search document will be truly reproducible in a deep 
sense, where others can repeat published computations 
utterly mechanically. The reader of such a reproducible 
research article would be able to deeply study any spe- 
cific figure, for example, viewing the source code and 
data that underlie a figure, recreating the original figure 
from scratch, examining input parameters that define 
this particular figure, and even changing their settings 
in order to study the effect on the resulting figure. 

Reproducibility at this ambitious level would enable 
more than just individual understanding; it would 
enable metaresearch. Consider the “dream applica- 
tions” mentioned in Gavish and Donoho (2012), where 
robots automatically crawl through, reproduce, and 
vary research results. Reproducible work can be auto- 
matically extended and generalized; it can be opti- 
mized, differentiated, extrapolated, and interpolated. 
A reproducible data analysis can be statistically boot- 
strapped to automatically place confidence statements 
on the whole analysis. 

Coming back down to earth, what is likely to happen 
in the near future? We confidently predict increasing 
computational transparency and increasing computational
reproducibility in coming years. We imagine that
PSEs will continue to be very popular and that authors
will increasingly share their scripts and data, if only
to attract readership. Specialized platforms like Clawpack
and DUNE will come to be seen as standard platforms
for whole research communities, who will naturally
then be able to reproduce work in those areas.
We expect that as the use of cloud computing grows 
and workflows become more complex, researchers will 
increasingly document and share the workflows that 
produce their most ambitious results. We expect that 
code will be developed on common platforms and will 
be stored in the cloud, enabling the code to run for 
many years after publication. 

We expect that over the next two decades such practices
will become standard and will be based on tools
of the kind discussed in this article. The direction of
increasing transparency and increasing sharing seems
clear, but it is still unclear which combinations of tools
and approaches will come to be standard.
VIII.6 Experimental Applied
Mathematics 

David H. Bailey and 
Jonathan M. Borwein 

1 Introduction 

“Experimental applied mathematics” is the name given 
to the use of modern computer technology as an active 
agent of research. It is used for gaining insight and intu- 
ition, for discovering new patterns and relationships, 
for testing conjectures, and for confirming analytically 
derived results, in much the same spirit that labora- 
tory experimentation is employed in the physical sci- 
ences. It is closely related to what is known as “exper- 
imental mathematics” in pure mathematics, as has 
been described elsewhere, including in The Princeton 
Companion to Mathematics. 

In one sense, most applied mathematicians have for 
decades aggressively integrated computer technology 
into their research. What is meant here is computa- 
tionally assisted applied mathematical research that 
features one or more of the following characteristics: 

(i) computation for exploration and discovery; 

(ii) symbolic computing; 

(iii) high-precision arithmetic; 

(iv) integer relation algorithms; 

(v) graphics and visualization; 

(vi) connections with nontraditional mathematics. 

Depending on the context, the role of rigorous proof 
in experimental applied mathematics may be either 
much reduced or unchanged from that of its pure 
sister. There are many complex applied problems for 
which there is little point in proving the validity of a 
minor component rather than finding strong evidence 
for the appropriateness of the general method. 

High-Precision Arithmetic 

Most work in scientific or engineering computing relies
on either 32-bit IEEE floating-point arithmetic
[II.13] (roughly 7-decimal-digit precision) or 64-bit IEEE
floating-point arithmetic (roughly 16-decimal-digit pre- 
cision). But for an increasing body of applied mathe- 
matical studies, even 16-digit arithmetic is not suffi- 
cient. The most common form of high-precision arith- 
metic is “double-double” or “quad” precision, which is 
equivalent to roughly 31-digit precision. Other studies 
require hundreds or thousands of digits. 

Algorithms for performing arithmetic and evalu- 
ating common transcendental functions with high- 
precision data structures have been known for some 
time, although challenges remain. Mathematical soft- 
ware packages such as Maple and Mathematica typ- 
ically include facilities for arbitrarily high precision, 
but for some applications researchers rely on Internet- 
available software, such as the GNU multiprecision 
package. 
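
Double-double arithmetic is commonly built from "error-free transformations" that recover the rounding error of a single floating-point operation. The sketch below shows one such building block, Knuth's TwoSum, in plain Python. It is offered as background on how such formats can work, not as a description of any particular package mentioned above; the function name is an assumption.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly.

    This is Knuth's TwoSum; it assumes IEEE round-to-nearest arithmetic
    and no overflow. A double-double number stores such a (high, low) pair.
    """
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    err = (a - a_virtual) + (b - b_virtual)
    return s, err


if __name__ == "__main__":
    # In 64-bit arithmetic alone the small term is lost entirely ...
    print(1.0 + 1e-20)           # prints 1.0
    # ... but the pair (s, err) still carries it.
    print(two_sum(1.0, 1e-20))   # prints (1.0, 1e-20)
```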

Integer Relation Detection 

Given a vector of real or complex numbers $x_i$, an integer
relation algorithm attempts to find a nontrivial set of
integers $a_i$ such that $a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = 0$.
One common application of such an algorithm is to find 
new identities involving computed numeric constants. 

For example, suppose one suspects that an integral
(or any other numerical value) $x_1$ might be a linear sum
of a list of terms $x_2, x_3, \ldots, x_n$. One can compute the
integral and all the terms to high precision (typically
several hundred digits) and then provide the vector
$(x_1, x_2, \ldots, x_n)$ to an integer relation algorithm. It will
either determine that there is an integer-linear relation
among these values, or it will provide a lower bound
on the Euclidean norm of any integer relation vector
$(a_i)$ that the input vector might satisfy. If the algorithm
does produce a relation, then solving it for $x_1$ produces
an experimental identity for the original integral. The
most commonly employed integer relation algorithm is
the “PSLQ” algorithm of mathematician-sculptor Helaman
Ferguson, although the Lenstra-Lenstra-Lovász
algorithm can also be adapted for this purpose.
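
The idea of an integer relation can be illustrated on a toy problem. The sketch below is emphatically not PSLQ: it is a brute-force search over small coefficients, feasible only for very short vectors, and is included just to show what "finding a relation among high-precision values" means. The function name, the 50-digit working precision, the coefficient bound, and the tolerance are all assumptions chosen for the example, which recovers the relation $\alpha^4 - 10\alpha^2 + 1 = 0$ satisfied by $\alpha = \sqrt{2} + \sqrt{3}$.

```python
from decimal import Decimal, getcontext
from itertools import product

getcontext().prec = 50  # work with 50 significant digits


def find_relation(xs, bound, tol=Decimal("1e-40")):
    """Brute-force stand-in for PSLQ: search integer vectors with entries in
    [-bound, bound] whose dot product with xs vanishes to within tol.
    Only vectors whose first nonzero entry is positive are tried, which
    removes the trivial sign ambiguity and skips the zero vector."""
    for coeffs in product(range(-bound, bound + 1), repeat=len(xs)):
        first = next((c for c in coeffs if c != 0), 0)
        if first <= 0:
            continue
        total = sum((c * x for c, x in zip(coeffs, xs)), Decimal(0))
        if abs(total) < tol:
            return coeffs
    return None


alpha = Decimal(2).sqrt() + Decimal(3).sqrt()
xs = [Decimal(1), alpha ** 2, alpha ** 4]
relation = find_relation(xs, bound=10)

if __name__ == "__main__":
    print(relation)  # coefficients (a1, a2, a3) of 1, alpha^2, alpha^4
```

The search cost grows exponentially with the vector length, which is exactly why algorithms such as PSLQ matter in practice.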

2 Historical Examples 

The best way to clarify what is meant by experimental 
applied mathematics is to show some examples of the 
paradigm in action. 

Gravitational Boosting 

One interesting space-age example is the unexpect- 
ed discovery of gravitational boosting by Michael Min- 
ovitch at NASA’s Jet Propulsion Laboratory in 1961. 


Minovitch described how he discovered that Hohmann 
transfer ellipses were not, as was then believed, the 
minimum-energy way to reach the outer planets. In- 
stead, he discovered computationally that spacecraft 
orbits that pass close to other planets could gain a sub- 
stantial boost in speed (compensated by an extremely 
small change in the orbital velocity of the planet) 
on their way to a distant location via a “slingshot 
effect.” Until this demonstration, “most planetary mis- 
sion designers considered the gravity field of a target 
planet to be somewhat of a nuisance, to be cancelled 
out, usually by onboard Rocket thrust.” 

Without such a boost from Jupiter, Saturn, and 
Uranus, the Voyager mission would have taken more 
than 30 years to reach Neptune; instead, Voyager 
reached Neptune in only 10 years. Indeed, without grav- 
itational boosting we would still be waiting! We would 
have to wait much longer still for Voyager to leave the 
solar system, as it now appears to be doing. 

Fractals and Chaos 

One prime example of twentieth-century applied exper- 
imental mathematics is the development of fractal 
theory, as exemplified by the works of Benoit Mandel- 
brot. Mandelbrot studied numerous examples of fractal 
sets, many of them with direct connections to nature. 
Applications include analyses of the shapes of coast- 
lines, mountains, biological structures, blood vessels, 
galaxies, even music, art, and the stock market. For 
example, Mandelbrot found that the coast of Australia, 
the west coast of Britain, and the land frontier of Por- 
tugal all satisfy shapes given by a fractal dimension of 
approximately 1.75. 

In the 1960s and early 1970s, applied mathemati- 
cians began to computationally explore features of 
chaotic iterations that had previously been studied 
by analytic methods. May, Lorenz, Mandelbrot, Feigen- 
baum, Ruelle, York, and others led the way in utiliz- 
ing computers and graphics to explore this realm, as 
chronicled for example in Gleick’s Chaos: Making a New 
Science. 

The Uncertainty Principle 

We finish this section with a principle that, while dis- 
covered early in the twentieth century by conventional 
formal reasoning, could have been discovered much 
more easily with computational tools. 

Most readers have heard of the uncertainty prin- 
ciple [IV.23] from quantum mechanics, which is often 
expressed as the fact that the position and momen-
tum of a subatomic particle cannot simultaneously be 
prescribed or measured to arbitrary accuracy. Others 
may be familiar with the uncertainty principle from sig- 
nal processing theory, which is often expressed as the 
fact that a signal cannot simultaneously be both “time- 
limited” and “frequency-limited.” Remarkably, the pre- 
cise mathematical formulations of these two principles 
are identical. 

Consider a real, continuously differentiable, $L^2$ function
$f(t)$ that satisfies $|t|^{3/2+\epsilon} f(t) \to 0$ as $|t| \to \infty$
for some $\epsilon > 0$. (This ensures convergence of the integrals
below.) For convenience, we assume $f(-t) = f(t)$,
so the Fourier transform $\hat{f}(x)$ of $f(t)$ is real,
although this is not necessary. Define

$$E(f) = \int_{-\infty}^{\infty} f^2(t)\,dt, \qquad V(f) = \int_{-\infty}^{\infty} t^2 f^2(t)\,dt,$$

$$\hat{f}(x) = \int_{-\infty}^{\infty} f(t)\,e^{-itx}\,dt,$$

$$Q(f) = \frac{V(f)}{E(f)} \cdot \frac{V(\hat{f})}{E(\hat{f})}. \tag{1}$$

Then the uncertainty principle is the assertion that
$Q(f) \ge \tfrac{1}{4}$, with equality if and only if $f(t) = a e^{-(bt)^2/2}$
for real constants $a$ and $b$. The proof of this fact is
not terribly difficult but is hardly enlightening (see, for
example, Borwein and Bailey 2008, pp. 183-88).

Let us approach this problem as an experimen- 
tal mathematician might. It is natural when studying 
Fourier transforms (particularly in the context of signal 
processing) to consider the “dispersion” of a function 
and to compare this with the dispersion of its Fourier 
transform. Noting what appears to be an inverse rela- 
tionship between these two quantities, we are led to 
consider $Q(f)$ in (1). With the assistance of Maple or
Mathematica, one can explore examples, as shown in
table 1. Note that each of the entries in the last column
is in the range $(\tfrac{1}{4}, \tfrac{1}{2}]$. Can one get any lower?

To further study this problem experimentally, note
that the Fourier transform $\hat{f}$ of $f(t)$ can be closely
approximated with a fast Fourier transform, after suitable
discretization. The integrals $V$ and $E$ can also be
evaluated numerically.
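
A numerical evaluation of $Q(f)$ along these lines can be sketched as follows. To stay within the Python standard library, this example replaces the fast Fourier transform with a direct midpoint-rule quadrature of the transform (a slower, swapped-in technique), and it assumes that $f$ is even and decays quickly, so that the truncation of both integrals to a finite window is harmless. The function name, window sizes, and grid size are assumptions.

```python
import math


def q_value(f, t_max=10.0, x_max=10.0, n=400):
    """Estimate Q(f) = (V(f)/E(f)) * (V(fhat)/E(fhat)) for an even,
    rapidly decaying f, using the midpoint rule on [-t_max, t_max]
    and [-x_max, x_max]."""
    dt = 2.0 * t_max / n
    ts = [-t_max + (k + 0.5) * dt for k in range(n)]
    fs = [f(t) for t in ts]

    dx = 2.0 * x_max / n
    xs = [-x_max + (k + 0.5) * dx for k in range(n)]
    # For even f the transform is real: fhat(x) = integral f(t) cos(t x) dt.
    fhat = [sum(fv * math.cos(t * x) for fv, t in zip(fs, ts)) * dt for x in xs]

    def dispersion(grid, values, h):
        energy = sum(v * v for v in values) * h                        # E
        variance = sum(g * g * v * v for g, v in zip(grid, values)) * h  # V
        return variance / energy

    return dispersion(ts, fs, dt) * dispersion(xs, fhat, dx)


if __name__ == "__main__":
    gaussian = lambda t: math.exp(-t * t / 2.0)
    print(q_value(gaussian))  # should be close to the lower bound 1/4
```

Slowly decaying transforms, such as that of the tent function, would need a much wider frequency window than the default used here.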

One can then adopt a search strategy to minimize
$Q(f)$, starting, say, with a “tent function,” then perturbing
it up or down by some $\epsilon$ on a regular grid with
spacing $\delta$, thus creating a continuous, piecewise-linear
function. When, for a given $\delta$, a minimizing function
$f(t)$ has been found, reduce $\epsilon$ and $\delta$ and repeat. Terminate
when $\delta$ is sufficiently small: say, $10^{-6}$ or so. (For
details, see Borwein and Bailey (2008).)


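Table 1 Q values for various functions.

$f(t)$                      Interval               $\hat{f}(x)$                      $Q(f)$

$1 - t\,\mathrm{sgn}\,t$    $[-1, 1]$              $2(1 - \cos x)/x^2$               $3/10$

$1 - t^2$                   $[-1, 1]$              $4(\sin x - x\cos x)/x^3$         $5/14$

$1/(1 + t^2)$               $[-\infty, \infty]$    $\pi \exp(-x\,\mathrm{sgn}\,x)$   $1/2$

$e^{-|t|}$                  $[-\infty, \infty]$    $2/(1 + x^2)$                     $1/2$

$1 + \cos t$                $[-\pi, \pi]$          $2\sin(\pi x)/(x - x^3)$          $\frac{1}{9}(\pi^2 - \frac{15}{2})$

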
Figure 1 Q-minimizer and matching Gaussian. 


The resulting function $f(t)$ is shown in figure 1.
Needless to say, its shape strongly suggests a Gaussian
probability curve. Figure 1 shows both $f(t)$ and
the function $e^{-(bt)^2/2}$, where $b = 0.45446177$; to the
precision of the plot, they are identical!

In short, it is a relatively simple matter, using twenty- 
first-century computational tools, to numerically “dis- 
cover” the uncertainty principle. Doubtless the same 
is true of many other historical principles of physics, 
chemistry, and other fields. 

3 Twenty-First-Century Studies 

It is fair to say that the computational-experimental 
approach in applied mathematics has greatly accel- 
erated in the twenty-first century. We present a few 
specific illustrative examples here. These include sev- 
eral by the present authors because we are familiar 
with them. There are doubtless many others that we 
are not aware of that are similarly exemplary of the
experimental paradigm. 

3.1 Chimera States in Oscillator Arrays 

One interesting example of experimental applied math- 
ematics was the 2002 discovery by Kuramoto, Bat- 
togtokh, and Sima of “chimera” states, which arise in 
arrays of identical oscillators, where individual oscilla- 
tors are correlated with oscillators some distance away 
in the array. These systems can arise in a wide range of 
physical systems, including Josephson junction arrays, 
oscillating chemical systems, epidemiological models, 
neural networks underlying snail shell patterns, and 
“ocular dominance stripes” observed in the visual cortex
of cats and monkeys. In chimera states (named
for the mythological beast that incongruously combines
features of lions, goats, and serpents) the oscillator
array bifurcates into two relatively stable groups:
the first composed of coherent, phase-locked oscillators,
and the second composed of incoherent, drifting
oscillators.

According to Abrams and Strogatz, who subse- 
quently studied these states in detail, most arrays of 
oscillators quickly converge into one of four typical 
patterns: 

(i) synchrony, with all oscillators moving in unison; 

(ii) solitary waves in one dimension or spiral waves 
in two dimensions, with all oscillators locked in 
frequency; 

(iii) incoherence, where phases of the oscillators vary 
quasiperiodically, with no global spatial structure; 
and 

(iv) more complex patterns, such as spatiotemporal 
chaos and intermittency. 

In chimera states, however, phase locking and incoher- 
ence are simultaneously present in the same system. 

The simplest governing equation for a continuous
one-dimensional chimera array is

$$\frac{\partial \phi}{\partial t} = \omega - \int_0^1 G(x - x') \sin[\phi(x,t) - \phi(x',t) + \alpha]\,dx', \tag{2}$$

where $\phi(x,t)$ specifies the phase of the oscillator given
by $x \in [0,1)$ at time $t$, and $G(x - x')$ specifies
the degree of nonlocal coupling between the oscillators
$x$ and $x'$. A discrete, computable version of (2)
can be obtained by replacing the integral with a sum
over a one-dimensional array $(x_k,\ 0 \le k < N)$, where
$x_k = k/N$. Kuramoto and Battogtokh took $G(x - x') =
C \exp(-\kappa |x - x'|)$ for constant $C$ and parameter $\kappa$.

Figure 2 Phase of oscillations for a chimera system.
The x-axis runs from 0 to 1 with periodic boundaries.

Specifying $\kappa = 4$, $\alpha = 1.457$, array size $N = 256$, and
time step size $\Delta t = 0.025$, and starting from $\phi(x) =
6\exp[-30(x - 1/2)^2]\,r(x)$, where $r$ is a uniform random
variable on $[-\tfrac{1}{2}, \tfrac{1}{2})$, gives rise to the phase patterns
shown in figure 2. Note that the oscillators near
$x = 0$ and $x = 1$ appear to be phase-locked, moving
in near-perfect synchrony with their neighbors, but
those oscillators in the center drift wildly in phase,
with respect to both their neighbors and the locked
oscillators.
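
The discrete version of (2) with the parameters just quoted can be sketched in plain Python. Several choices below are assumptions not stated in the text: forward Euler time stepping, a rotating frame with natural frequency zero, a periodic (ring) distance in the coupling kernel, and a normalization of the constant $C$ so that the discrete kernel sums to one. The function name is hypothetical, and a pure-Python loop of this kind is slow for long runs.

```python
import math
import random


def simulate_chimera(n=256, kappa=4.0, alpha=1.457, dt=0.025, steps=100,
                     omega=0.0, phases=None, seed=0):
    """Forward-Euler integration of the discretized phase equation.

    Assumptions: periodic distance on [0, 1), kernel normalized to unit sum
    (this plays the role of C/N), omega = 0 by default."""
    # kernel[k] couples oscillators that are k grid points apart on the ring.
    kernel = [math.exp(-kappa * min(k / n, 1.0 - k / n)) for k in range(n)]
    norm = sum(kernel)
    kernel = [g / norm for g in kernel]

    if phases is None:
        rng = random.Random(seed)
        # phi(x) = 6 exp[-30 (x - 1/2)^2] r(x), r uniform on [-1/2, 1/2)
        phases = [6.0 * math.exp(-30.0 * (k / n - 0.5) ** 2) * (rng.random() - 0.5)
                  for k in range(n)]
    phases = list(phases)

    for _ in range(steps):
        updated = []
        for i in range(n):
            coupling = sum(kernel[(i - j) % n] * math.sin(phases[i] - phases[j] + alpha)
                           for j in range(n))
            updated.append(phases[i] + dt * (omega - coupling))
        phases = updated
    return phases


if __name__ == "__main__":
    final = simulate_chimera(steps=200)
    print(min(final), max(final))
```

A useful sanity check is that a perfectly synchronized initial state stays synchronized, since every oscillator then feels the same coupling term.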

Numerous researchers have studied this phenome- 
non since its initial numerical discovery. Abrams and 
Strogatz studied the coupling function given by $G(x) =
(1 + A\cos x)/(2\pi)$, where $0 \le A \le 1$, for which they
were able to solve the system analytically, and then 
extended their methods to more general systems. They 
found that chimera systems have a characteristic life 
cycle: a uniform phase-locked state, followed by a spa- 
tially uniform drift state, then a modulated drift state, 
then the birth of a chimera state, followed by a period 
of stable chimera, then a saddle-node bifurcation, and 
finally an unstable chimera state. 

3.2 Winfree Oscillators 

One development closely related to chimera states
is the resolution of the Quinn-Rand-Strogatz (QRS)
constant. Quinn, Rand, and Strogatz had studied the
Winfree model of coupled nonlinear oscillators, namely

$$\dot{\theta}_i = \omega_i + \frac{\kappa}{N} \sum_{j=1}^{N} -(1 + \cos\theta_j)\sin\theta_i \tag{3}$$

for $1 \le i \le N$, where $\theta_i(t)$ is the phase of oscillator $i$ at
time $t$, the parameter $\kappa$ is the coupling strength, and the
frequencies $\omega_i$ are drawn from a symmetric unimodal
density $g(\omega)$. In their analyses, they were led to the
formula

$$0 = \sum_{i=1}^{N} \left( 2\sqrt{1 - s^2\bigl(1 - 2(i-1)/(N-1)\bigr)^2} - \frac{1}{\sqrt{1 - s^2\bigl(1 - 2(i-1)/(N-1)\bigr)^2}} \right),$$

implicitly defining a phase offset angle $\phi = \sin^{-1} s$ due
to bifurcation. The authors conjectured, on the basis
of numerical evidence, the asymptotic behavior of the
$N$-dependent solution $s$ to be

$$1 - s_N \sim \frac{c_1}{N} + \frac{c_2}{N^2} + \frac{c_3}{N^3} + \cdots,$$

where $c_1 = 0.60544365\ldots$ is now known as the QRS
constant.
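
The implicit equation for $s$ is easy to explore numerically in ordinary double precision. The sketch below solves it by bisection for a given $N$ and reports $N(1 - s_N)$, which by the asymptotic expansion above should approach $c_1$ as $N$ grows. The function names and the choice of bisection are assumptions for illustration; this is not the high-precision method used to obtain the digits quoted in the text.

```python
import math


def qrs_residual(s, n):
    """Right-hand side of the implicit equation for s at array size n."""
    total = 0.0
    for i in range(1, n + 1):
        u = 1.0 - 2.0 * (i - 1) / (n - 1)
        root = math.sqrt(1.0 - (s * u) ** 2)
        total += 2.0 * root - 1.0 / root
    return total


def solve_s(n, iterations=80):
    """Bisection on (0, 1): the residual is positive at s = 0 and decreases
    to minus infinity as s approaches 1, so there is a single root."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:
            break  # the bracket has shrunk to floating-point resolution
        if qrs_residual(mid, n) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for n in (100, 1000, 2000):
        print(n, n * (1.0 - solve_s(n)))
```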

In 2008, the present authors together with Richard
Crandall computed the numerical value of this constant
to 42 decimal digits, obtaining

$$c_1 \approx 0.60544365719673274947892284244\ldots.$$

With this numerical value in hand, it was possible to
demonstrate that $c_1$ is the unique zero of the Hurwitz
zeta function $\zeta(\tfrac{1}{2}, \tfrac{1}{2}z)$ on the interval $0 \le z \le 2$. What
is more, they found that $c_2 = -0.104685459\ldots$ is given
analytically by

$$c_2 = c_1 - c_1^2 - 30\,\frac{\zeta(-\tfrac{1}{2}, \tfrac{1}{2}c_1)}{\zeta(\tfrac{3}{2}, \tfrac{1}{2}c_1)}.$$


3.3 High-Precision Dynamics 

Periodic orbits form the “skeleton” of a dynamical sys- 
tem and provide much useful information, but when 
the orbits are unstable, high-precision numerical inte- 
grators are often required to obtain numerically mean- 
ingful results. 

For instance, figure 3 shows computed symmetric
periodic orbits for the (7 + 2)-ring problem using double
and quadruple precision. The (n + 2)-body ring
problem describes the motion of an infinitesimal particle
attracted by the gravitational field of n + 1 primary
bodies, n of which are in the vertices of a regular
polygon rotating in its own plane about its center with
constant angular velocity. Each point corresponds to
the initial conditions of one symmetric periodic orbit,
and the gray areas correspond to regions of forbidden
motion (delimited by the limit curve). To avoid “false”
initial conditions it is useful to check if the initial conditions
generate a periodic orbit up to a given tolerance
level; but for highly unstable periodic orbits, double
precision is not enough, resulting in gaps in the
figure that are not present in the more accurate quad
precision run.

Figure 3 Symmetric periodic orbits (m denotes the multiplicity
of the periodic orbit) in the most chaotic zone of
the (7 + 2)-ring problem using (a) double and (b) quadruple
precision. The horizontal axis is the coordinate x.
Note the “gaps” in the double precision plot.
(Reproduced by permission of Roberto Barrio.)

Hundred-digit precision arithmetic plays a funda- 
mental role in a 2010 study of the fractal properties of 
the Lorenz attractor [III.20]; see figure 4. The first
plot in the figure shows the intersection of an arbi- 
trary trajectory on the Lorenz attractor with the section 
z = 27, in a rectangle in the xy-plane. All subsequent 
plots zoom in on a tiny region (too small to be seen by 
the unaided eye) at the center of the red rectangle of 
the preceding plot to show that what appears to be a 
line is in fact many lines. 

The Lindstedt-Poincaré method for computing periodic
orbits is based on Lindstedt-Poincaré perturbation
theory, Newton’s method for solving nonlinear systems,
and Fourier interpolation. Viswanath has used
this in combination with high-precision libraries to 
obtain periodic orbits for the Lorenz model at the clas- 
sical Saltzman parameter values. This procedure per- 
mits one to compute, to high accuracy, highly unsta- 
ble periodic orbits more efficiently than with conven- 
tional schemes, in spite of the additional cost of high- 
precision arithmetic. For these reasons, high-precision 
arithmetic plays a fundamental role in the study of 
the fractal properties of the Lorenz attractor (see fig- 
ures 4 and 5) and in the consistent formal development 
of complex singularities of the Lorenz system using 
infinite series. (For additional details and references,
see Bailey et al. (2012).) 


3.4 Ising Integrals 


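The previously mentioned study employed 100-digit
arithmetic. Much higher precision has proven essential
in studies with Richard Crandall (see Borwein and
Bailey 2008; Bailey and Borwein 2011) of the following
integrals that arise in the Ising theory of mathematical
physics and in quantum field theory:

$$C_n = \frac{4}{n!} \int_0^\infty \cdots \int_0^\infty \frac{1}{\bigl(\sum_{j=1}^{n} (u_j + 1/u_j)\bigr)^2}\,dU,$$

$$D_n = \frac{4}{n!} \int_0^\infty \cdots \int_0^\infty \frac{\prod_{i<j} \bigl((u_i - u_j)/(u_i + u_j)\bigr)^2}{\bigl(\sum_{j=1}^{n} (u_j + 1/u_j)\bigr)^2}\,dU,$$

$$E_n = 2 \int_0^1 \cdots \int_0^1 \Bigl( \prod_{1 \le j < k \le n} \frac{u_k - u_j}{u_k + u_j} \Bigr)^2\,dT,$$

where

$$dU = \frac{du_1}{u_1} \cdots \frac{du_n}{u_n}, \qquad dT = dt_2 \cdots dt_n, \qquad u_k = \prod_{i=1}^{k} t_i.$$

Note that $E_n \le D_n \le C_n$.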

Figure 5 Computational relative error versus (a) CPU time
and (b) number of iterations in a 1000-digit computation
of the periodic orbits LR and LLRLR of the Lorenz model.
(Reproduced by permission of Roberto Barrio.)

Direct computation of these integrals from their
defining formulas is very difficult, but for $C_n$ it can be
shown that

$$C_n = \frac{2^n}{n!} \int_0^\infty p\,K_0^n(p)\,dp,$$

where $K_0$ is the modified Bessel function [IV.7 §9].
Thousand-digit numerical values so computed were
used with the PSLQ algorithm to deduce results such
as $C_4 = \tfrac{7}{12}\zeta(3)$ and furthermore to discover that

$$\lim_{n \to \infty} C_n = 0.63047350\ldots = 2e^{-2\gamma},$$

with additional higher-order terms in an asymptotic
expansion. One intriguing experimental result (that has
not yet been proven) is the following:

$$E_5 = 42 - 1984\,\mathrm{Li}_4(\tfrac{1}{2}) + \tfrac{189}{10}\pi^4 - 74\zeta(3) - 1272\zeta(3)\log 2 + 40\pi^2\log^2 2 - \tfrac{62}{3}\pi^2 + \tfrac{40}{3}\pi^2\log 2 + 88\log^4 2 + 464\log^2 2 - 40\log 2.$$

This was found by a multi-hour computation on a
highly parallel computer system and confirmed to 250-digit
precision. Here, $\mathrm{Li}_4(z) = \sum_{k \ge 1} z^k/k^4$ is the standard
order-4 polylogarithm.
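
The one-dimensional Bessel-function representation of $C_n$ can be checked in ordinary double precision. The sketch below uses only the Python standard library, so $K_0$ is computed from the integral representation $K_0(x) = \int_0^\infty e^{-x\cosh t}\,dt$ by the trapezoid rule, and the outer integral is evaluated after the substitution $p = e^y$. These numerical choices, the function names, and the step sizes are assumptions for illustration and have nothing to do with the thousand-digit computations described in the text.

```python
import math


def bessel_k0(x, h=0.1, t_max=40.0):
    """Modified Bessel function K_0(x) for x > 0, via the trapezoid rule
    applied to K_0(x) = integral_0^inf exp(-x cosh t) dt."""
    steps = int(round(t_max / h))
    total = 0.5 * math.exp(-x)  # endpoint t = 0 carries half weight
    for k in range(1, steps + 1):
        total += math.exp(-x * math.cosh(k * h))
    return h * total


def ising_c(n, h=0.1, y_min=-30.0, y_max=5.0):
    """C_n = (2^n / n!) * integral_0^inf p K_0(p)^n dp, computed with the
    substitution p = exp(y), so the integrand becomes exp(2y) K_0(exp(y))^n."""
    steps = int(round((y_max - y_min) / h))
    total = 0.0
    for k in range(steps + 1):
        p = math.exp(y_min + k * h)
        total += p * p * bessel_k0(p) ** n
    return (2.0 ** n / math.factorial(n)) * h * total


if __name__ == "__main__":
    zeta3 = 1.2020569031595942
    print(ising_c(4), 7.0 * zeta3 / 12.0)  # the two values should agree closely
```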

3.5 Ramble Integrals and Short Walks 



Consider, for complex $s$, the $n$-dimensional ramble
integrals (Bailey and Borwein 2012)

$$W_n(s) = \int_{[0,1]^n} \Bigl| \sum_{k=1}^{n} e^{2\pi i x_k} \Bigr|^s \, dx, \tag{4}$$

which occur in the theory of uniform random walk
integrals in the plane, where at each step a unit step is
taken in a random direction, as first studied by Pearson,
Rayleigh, and others 100 years ago. Integrals such
as (4) are the $s$th moment of the distance to the origin
after $n$ steps. As is well known, various types of random
walks arise in fields as diverse as aviation, ecology,
economics, psychology, computer science, physics,
chemistry, and biology.
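
To make the definition concrete, here is a minimal sketch that estimates $W_n(s)$ for small $n$ with a midpoint rule. It assumes the standard reduction in which, by rotational symmetry, the first step is fixed along the real axis, leaving an $(n-1)$-dimensional integral; the function name and grid size are hypothetical. Two simple checks are available: $W_n(2) = n$ for every $n$, and $W_2(1) = 4/\pi$. This brute-force grid is nowhere near the high-precision methods used in the studies cited above.

```python
import cmath
import itertools
import math


def ramble_moment(n, s, m=200):
    """Midpoint-rule estimate of W_n(s), the s-th moment of the distance
    from the origin after n unit steps in uniformly random directions.

    The first step is fixed at angle 0 (rotational symmetry), so only
    n - 1 angles are integrated, each on a grid of m midpoints.
    """
    if n == 1:
        return 1.0
    phases = [cmath.exp(2j * math.pi * (j + 0.5) / m) for j in range(m)]
    total = 0.0
    for combo in itertools.product(phases, repeat=n - 1):
        total += abs(1 + sum(combo)) ** s
    return total / m ** (n - 1)


if __name__ == "__main__":
    print(ramble_moment(3, 2))           # close to 3
    print(ramble_moment(2, 1, m=2000))   # close to 4/pi
```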


Walks and Measures 

In work from 2010 by Borwein, Straub, Wan, and
Zudilin, using a combination of analysis and
high-precision numerical computation, results such as

$$W_n'(0) = \log 2 - \gamma - n \int_0^\infty \log(x) \, J_0^{n-1}(x) \, J_1(x) \, dx$$

were obtained, where $J_n(x)$ denotes the Bessel function
of the first kind and $\gamma$ denotes Euler’s constant.
These results, in turn, lead to various closed forms
and have been used to confirm, to 600-digit precision,
the following Mahler measure conjecture adapted from
Villegas:

$$W_5'(0) = \Bigl(\frac{15}{4\pi^2}\Bigr)^{5/2} \int_0^\infty \bigl\{ \eta^3(e^{-3t})\,\eta^3(e^{-5t}) + \eta^3(e^{-t})\,\eta^3(e^{-15t}) \bigr\} \, t^3 \, dt,$$

where the Dedekind eta-function can be computed from

$$\eta(q) = q^{1/24} \prod_{n \ge 1} (1 - q^n) = q^{1/24} \sum_{n=-\infty}^{\infty} (-1)^n q^{n(3n+1)/2}.$$
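
The two expressions for the eta-function can be compared numerically. The sketch below (function names and truncation lengths are hypothetical choices) evaluates both the infinite product and the pentagonal-number series in double precision for real $0 < q < 1$. The series needs far fewer terms, which is why it is the natural route when many digits are wanted.

```python
def eta_product(q, terms=400):
    """q^(1/24) * prod_{n>=1} (1 - q^n), truncated after `terms` factors."""
    value = q ** (1.0 / 24.0)
    for n in range(1, terms + 1):
        value *= 1.0 - q ** n
    return value


def eta_series(q, terms=30):
    """q^(1/24) * sum_{n=-terms..terms} (-1)^n q^(n(3n+1)/2)."""
    total = 0.0
    for n in range(-terms, terms + 1):
        sign = -1.0 if n % 2 else 1.0
        total += sign * q ** (n * (3 * n + 1) // 2)
    return q ** (1.0 / 24.0) * total


if __name__ == "__main__":
    for q in (0.1, 0.5, 0.9):
        print(q, eta_product(q, terms=2000), eta_series(q))
```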


There are remarkable connections between diverse 
parts of pure, applied, and computational mathemat- 
ics lying behind these results. As is often the case, 
there is a fine interplay between developing better com- 
putational tools — especially for special functions and 
polylogarithms — and discovering new structure. 


Densities of Short Walks 


One of the deepest related discoveries is the following
closed form for the radial density of a four-step
uniform random walk in the plane: for $2 \le \alpha \le 4$ one has
the real hypergeometric form

$$p_4(\alpha) = \frac{2}{\pi^2} \frac{\sqrt{16 - \alpha^2}}{\alpha} \; {}_3F_2\!\left( \begin{matrix} \tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2} \\ \tfrac{5}{6}, \tfrac{7}{6} \end{matrix} \;\middle|\; \frac{(16 - \alpha^2)^3}{108\,\alpha^4} \right).$$

Remarkably, the real part of the right-hand side of
this identity is valid everywhere on [0, 4], as plotted in
figure 6. This was an entirely experimental discovery
(involving at least one fortunate error) but is now fully
proven.


3.6 Moments of Elliptic Integrals 

The study on ramble integrals that was discussed in 
the previous subsection also led to a comprehensive 
analysis of moments of elliptic integral functions of the 
form 

$$\int_0^1 x^{n_0} K^{n_1}(x) \, K'^{\,n_2}(x) \, E^{n_3}(x) \, E'^{\,n_4}(x) \, dx,$$

where the elliptic functions $K$ and $E$ and their
complementary versions are given by

$$K(x) = \int_0^1 \frac{dt}{\sqrt{(1 - t^2)(1 - x^2 t^2)}},$$

$$E(x) = \int_0^1 \frac{\sqrt{1 - x^2 t^2}}{\sqrt{1 - t^2}} \, dt,$$

$$K'(x) = K(\sqrt{1 - x^2}),$$

$$E'(x) = E(\sqrt{1 - x^2}).$$

Computations of these integrals to 3200-digit precision,
combined with searches for relations using the
PSLQ algorithm, yielded thousands of unexpected
relations among these integrals (see Bailey and Borwein
2012). The scale of the computation was required due
to the number of integrals under investigation.

3.7 Snow Crystals 

Computational experimentation has even been use- 
ful in the study of snowflakes. In a 2007 study, 
Janko Gravner and David Griffeath used a sophisticated 
computer-based simulator to study the process of for- 
mation of these structures, known in the literature as 
snow crystals and informally as snowfakes. Their model 
simulated each of the key steps, including diffusion, 
freezing, and attachment, and thus enabled researchers 
to study dependence on melting parameters. Snow 
crystals produced by their simulator vary from simple 
stars, to six-sided crystals with plate-ends, to crystals 
with dendritic ends, and they look remarkably similar 
to natural snow crystals. Among the findings uncovered
by their simulator is the fact that these crystals
exhibit remarkable overall symmetry, even in the process
of dynamically changing parameters. Their simulator
is publicly available at
http://psoup.math.wisc.edu/Snowfakes.htm.

4 Limits of Computation 

Developments such as the above have led to reexam- 
ination of the role of computation in formal math- 
ematical work. To begin with, a legitimate question 
is whether one can truly trust— in the mathematical 
sense — the result of a computation, since there are 
many possible sources of errors: unreliable numeri- 
cal algorithms; bug-ridden computer programs imple- 
menting these algorithms; system software or compiler 
errors; hardware errors, either in processing or storage; 
insufficient numerical precision; and obscure errors of 
hardware, software, or programming that surface only 
in particularly large or difficult computations. 

As a single example of the sorts of difficulties that
can arise, the present authors found that neither Maple
nor Mathematica was able to numerically evaluate
constants of the form

$$\frac{1}{2\pi} \int_0^{2\pi} f(e^{i\theta}) \, d\theta,$$

where

$$f(\theta) = \mathrm{Li}_1(\theta)^m \, \mathrm{Li}_1^{(1)}(\theta)^p \, \mathrm{Li}_1(\theta + \pi)^n \, \mathrm{Li}_1^{(1)}(\theta - \pi)^q$$

(for $m, n, p, q \ge 0$ integers), to high precision in
reasonable run time. In part this was because of the challenge
of computing polylogs and polylog derivatives (with
respect to order) for complex arguments. The version
of Mathematica that we were using was able to
numerically compute $\partial \mathrm{Li}_s(z)/\partial s$ to high precision, which is
required here, but such evaluations were not only many
times slower than computation of $\mathrm{Li}_s(z)$ itself but in
some cases did not even return a tenth of the requested
number of digits correctly.

For such reasons, experienced programmers of math- 
ematical or scientific computations routinely insert 
validity checks into their code. Typically, such checks 
take advantage of known high-level mathematical facts, 
such as the fact that the product of two matrices used 
in the calculation should always give the identity or 
that the results of a convolution of integer data, done 
using a fast Fourier transform, should all be very close 
to integers. 
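
The second kind of check can be sketched in a few lines. The code below, an illustration under stated assumptions and not any particular production routine, convolves two integer sequences with a simple recursive radix-2 FFT and refuses to return a result unless every output value lies within a tolerance of an integer. The function names and the tolerance `tol` are hypothetical.

```python
import cmath


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def convolve_checked(x, y, tol=1e-6):
    """Integer convolution via FFT, with a built-in validity check.

    Returns (result, worst), where worst is the largest distance of any
    raw output from the nearest integer. Raises if worst exceeds tol.
    """
    size = 1
    while size < len(x) + len(y) - 1:
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    raw = [v.real / size
           for v in fft([a * b for a, b in zip(fx, fy)], invert=True)]
    worst = max(abs(v - round(v)) for v in raw)
    if worst > tol:
        raise ArithmeticError("FFT convolution failed the integer check")
    return [round(v) for v in raw[:len(x) + len(y) - 1]], worst


if __name__ == "__main__":
    print(convolve_checked([1, 2, 3], [4, 5, 6]))
```

In a very large computation the quantity `worst` grows with the transform length, so monitoring it gives early warning that the working precision is no longer adequate.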

For instance, Kanada’s 2002 computation of π to
1.3 trillion decimal digits involved first computing
slightly over one trillion hexadecimal (base-16) digits.
He found that the 20 hex digits of π beginning at
position $10^{12} + 1$ are

B4466E8D21 5388C4E014.

Kanada then calculated these hex digits using the
“Bailey-Borwein-Plouffe” algorithm. The result was

B4466E8D21 5388C4E014,

dramatically confirming that both results are almost
certainly correct. While one cannot rigorously assign
a “probability” to this event, the chance that two
random strings of 20 hex digits perfectly agree is one in
$16^{20} \approx 1.2089 \times 10^{24}$.
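
The idea behind this check can be tried on a small scale. The following is a minimal double-precision sketch of digit extraction with the Bailey-Borwein-Plouffe formula, which expresses π as the sum over $k \ge 0$ of $16^{-k}\,[4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)]$. The function names are hypothetical, and the floating-point arithmetic limits it to a handful of digits at modest positions; a check at position $10^{12}$ requires much more careful arithmetic than is shown here.

```python
def _frac_series(j, d):
    """Fractional part of sum_{k>=0} 16^(d-k) / (8k + j)."""
    total = 0.0
    for k in range(d + 1):              # terms with non-negative exponent
        denom = 8 * k + j
        total = (total + pow(16, d - k, denom) / denom) % 1.0
    k = d + 1
    while True:                         # rapidly decaying tail
        term = 16.0 ** (d - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digits(d, count=8):
    """Hex digits of pi starting at position d + 1 after the point."""
    x = (4 * _frac_series(1, d) - 2 * _frac_series(4, d)
         - _frac_series(5, d) - _frac_series(6, d)) % 1.0
    out = []
    for _ in range(count):
        x *= 16
        digit = int(x)
        out.append("0123456789ABCDEF"[digit])
        x -= digit
    return "".join(out)


if __name__ == "__main__":
    print(pi_hex_digits(0))   # leading hex digits of the fractional part of pi
    print(16 ** 20)           # number of distinct 20-digit hex strings
```

Because the digits at position $d + 1$ are obtained without computing any earlier digits, agreement with a conventional full computation is strong independent evidence that both are correct.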

Even so, researchers are well advised to be cautious
with experimentation. Consider

$$\int_0^\infty \cos(2x) \prod_{n=1}^{\infty} \cos(x/n) \, dx = 0.3926990816987241548078304229\,0993786052464\,5434187231595926\ldots. \tag{5}$$

At first glance, this appears to be $\pi/8$, but upon
comparison with the numerical value,

$$\pi/8 = 0.3926990816987241548078304229\,0993786052464\,6174921888227621\ldots,$$

the two values disagree after the 42nd digit! Richard
Crandall later explained this mystery via a physically
motivated analysis of “running out of fuel” random walks.





He found the following very rapidly convergent series 
expansion, of which formula (5) is the first term: 


$$\frac{\pi}{8} = \sum_{m=0}^{\infty} \int_0^\infty \cos[2(2m+1)x] \prod_{n=1}^{\infty} \cos(x/n) \, dx.$$


Two series terms suffice for 500-digit agreement. 
As a final sobering example, consider 


$$\sigma_p = \sum_{n=-\infty}^{\infty} \operatorname{sinc}(n/2) \operatorname{sinc}(n/3) \cdots \operatorname{sinc}(n/p) \overset{?}{=} \int_{-\infty}^{\infty} \operatorname{sinc}(x/2) \operatorname{sinc}(x/3) \cdots \operatorname{sinc}(x/p) \, dx,$$

where in each line the divisors range over all primes up
to $p$. Provably, the following is true. The “sum equals
integral” identity for $\sigma_p$ remains valid at least for $p$
among roughly the first $10^{176}$ primes; but it stops holding
after some larger prime, and thereafter the “sum
less the integral” is strictly positive, but they always differ
by much less than one part in a googolplex $= 10^{10^{100}}$.
An even stronger estimate is possible assuming the
generalized Riemann hypothesis.

VIII.7 Teaching Applied Mathematics


How can we enthuse the next generation of students 
about applied mathematics? The four contributors to 
this group of articles, who have all thought deeply 
about this question, were asked to give their personal 
views. The resulting articles provide a variety of per- 
spectives and will be of interest to anyone who wishes 
to inspire their students to pursue the subject. 


I. David Acheson: What’s the Big Picture? 

Let A and B be two teachers of applied mathematics (at 
any level) and suppose that, generally speaking, A is a 
much better teacher than B. 

Why is A's teaching so much better? Even without any 
further information, can we at least hazard a guess? 

I wonder, for instance, if you might be prepared to
bet that A is more trained in “communication skills”? Or 
perhaps A knows more mathematics than B or is nearer 
to the cutting edge of research? Then again, maybe A 
just has a more lively personality? 

All these things can be advantageous, of course, but 
I would not actually bet on any of them. 

In fact, in the absence of any further information, 
there is only one thing that I would be prepared to bet 
good money on. I would be prepared to bet that A’s 
teaching is so much better— so inspirational, at best— 
mainly because A wants it to be that way, for reasons 
that we will probably never learn and that A may not 
even know. 

This is only an opinion, of course, but it comes from 
thinking back to my own inspirational teachers when 
I was young. Some were notable for their scholarship, 
some for their eccentricity, but— so far as I can see— 
they only really had one thing in common: they had a
great story to tell, and they really wanted to tell it.

“Removing Some of the Rubbish...”

It is simple common sense with applied mathematics
teaching, and possibly with mathematics teaching of
any kind, to start with the basics and work up. In other
words, “Don’t try to run before you can walk.”

But I believe it is a terrible mistake not to also bear 
in mind a very different piece of advice: namely, “If 
you have no idea where you are going, do not be too 
surprised if you never get there.” 

This is, I suspect, what the author John Ward meant 
a long time ago, in his Plain and Easie Introduction to 
the Mathematicks (1729), when he wrote: 

Tis Honour enough for me to be accounted as one of
the under Labourers in Clearing the Ground a little, and
Removing some of the Rubbish that lies in the way to
Knowledge.

In any event, I believe that a major difficulty with 
mathematics, at all levels, is that people can easily get 
bogged down in things of little consequence instead of 
engaging with things that really matter. 

And what could help them, more than anything else, 
is some kind of “big picture.” 

My own big picture of mathematics starts with won- 
derful theorems, by which I mean major results, usually 
with considerable generality and often an element of 
surprise. Secondly, beautiful proofs, i.e., concise deductive
arguments, possibly containing a truly “lightbulb”
moment when all suddenly becomes clear. And finally, 
great applications, particularly to physics, and hence to 
our understanding of how the world really works. 

I would argue, in fact, that mathematics is at its very 
best when you get all three things at once. That, in my 
view, is when you should really open the champagne. 

More controversially, perhaps, I believe that we can, 
and should, offer some such big picture to virtually any- 
one, including very young children and the rest of the 
general public. 

Nonetheless, the majority of my teaching experience 
has been with university students, and that is where I
would like to turn next. 

Lectures and Classes 

One way of bringing a student lecture to life is through 
a picture or video, but best of all, perhaps, is a live 
experiment, and my own subject, fluid dynamics, lends 
itself particularly well to this. 

As an example, take two glass plates and put a blob
of dishwashing liquid on one of them. (I dye the blob
red, with food coloring, for dramatic effect.) Now press
down with the other plate, so that the narrowing gap
causes the blob to spread out in a nice, symmetric
fashion, with a more or less circular boundary.

Figure 1 Viscous fingering.

But if we now gradually pull the plates apart again, 
the reverse motion is hopelessly unstable; tiny ripples 
appear in the boundary, for no apparent reason, and 
grow rapidly into long viscous fingers (see figure 1). 

If performed on an overhead projector or visualizer, 
this experiment can often make an audience gasp with 
astonishment. 

But my real point here is a little more subtle. For 
why do a demonstration like this only in an advanced 
course on fluid mechanics, along with all the associated 
theory? Why not stimulate interest by first showing it 
much earlier, perhaps even in an elementary course on 
particle dynamics, as soon as the whole idea of stability 
and instability first arises? 

Another way of helping people see the “big picture” 
is through the history of the subject, provided that 
the history in question has some real scholarship and 
depth to it. 

In a first course on particle dynamics, for example, 
my experience is that students find it genuinely inter- 
esting to actually see, with their own eyes, that their 
textbook treatment of planetary motion is spectacu- 
larly different from the one in Newton’s Principia and 
that it was not until about sixty years later, in the sub- 
sequent works of Euler and others, that dynamics came 
to be done in more or less the way it is done today. 

To take a more lightweight example, my own research 
in fluid mechanics once gave a new twist to a hundred- 
year-old problem in vortex motion, first studied by 
Augustus Love in 1894. And whenever I present this 
(as a short diversion) in student lectures, I am convinced
that it is enlivened by snippets from Love’s
original paper, to say nothing of an early photograph of
Love and his striking Victorian moustache (which was
apparently much admired at the time).

But there is another, possibly more unusual, way in 
which it is possible to bring student lectures to life. 

Imagine, if you will, that you have just arrived at 
what you perceive to be the high point of the lecture, 
where the next line in the mathematical argument is 
very clever or inventive in some way. (I would even 
include here taking the curl of the momentum equation 
in fluid mechanics, to eliminate the pressure.) 

For many years now, whenever this happens I tend to 
ask the audience whether they have any idea what the 
next, clever step might be. 

Now, conventional wisdom is, I think, that if you can 
do this sort of thing at all, you can do it only with very 
small audiences. In my experience, however, even with 
audiences of 200 or more, once they realize that no one 
can possibly be expected to know the answer and that 
they are being invited— just for a moment— to more or 
less put themselves in the shoes of some genius like 
Newton or Euler, the suggestions will start coming if 
you hold your nerve. 

Like so many things of this kind, it all depends on 
just how much you want to do it. 

Books 

A well-known publisher once said: 

Everybody has a book inside them. And it should
usually stay there.

However true that may be, it could be argued that the 
sheer impact and reach of a sufficiently original book 
can completely dwarf what its author might ever hope 
to achieve through direct, face-to-face teaching, and I 
am a great optimist about the future of books in the 
teaching of applied mathematics. 

And while, as far as I can tell, it takes considerable 
imagination and skill to write either an outstanding 
textbook or a successful popular mathematics book, 
I have long wondered if a real breakthrough in the 
future may instead come from some thoroughly origi- 
nal approach that combines the best elements of both. 

Public Engagement 

One of the most striking developments in recent years 
has been the rapid increase in popular mathematics 
lectures for either school students or the wider public. 


I count myself fortunate to have been involved in 
several mathematical shows of this kind, mainly for 
teenagers. They are often held in mainstream city cen- 
ter theaters, with all the paraphernalia of stage lighting, 
sound technicians, etc., and the pressure to be enter- 
taining as well as informative is therefore intense. So, 
to illustrate applied mathematics, I often use the for- 
mula for the frequency of a vibrating string, thereby 
smuggling in a practical demonstration of harmonics 
(and a self-composed tune!) on my electric guitar. 

But so-called community lectures (which are usu- 
ally held in the evening) can be even more rewarding 
because the age range at them can be enormous: from 
grandparents to very young children indeed. All you 
can really assume on these occasions is that each fam- 
ily group includes at least one person who is good at 
sums. 

It was at one of these events, at a school in North 
London, that I was midway through a “proof by pizza” 
(for the sum of an infinite series) when I happened to 
notice a particular little boy, age about ten, in the audi- 
ence. A split-second after delivering the punch line of 
my proof— at the moment when a deep idea suddenly 
becomes almost obvious— I practically saw the “light- 
bulb” go on in his head, and he got so excited that he 
fell off his chair. 

And, in a sense, that fleeting moment says it all. 

For mathematics at its best, at any level, lifts the 
human spirit, by showing us that the world — whether 
the world of the mind or the actual physical world in 
which we live— is an even more weird and wonderful 
place than we thought. 

II. Peter R. Turner: Computation, 
Modeling, and Projects 

Introduction 

This article presents a personal philosophy for teach- 
ing applied, and particularly computational, mathemat- 
ics at the undergraduate level. It is largely drawn from 
my own experience over more than 40 years, mostly 
at three institutions in the United Kingdom and the 
United States. 

That experience has been enhanced by my activities 
on the Society for Industrial and Applied Mathematics 
(SIAM) Education Committee, including four years as 
vice president for education. During this time, I have 
gained awareness of broader aspects of the role of 
applied mathematics education at the undergraduate
level. It is important here to note that I am not a
mathematics education specialist but a mathematician who
is interested in education.

Important among these broader aspects was the 
February 2012 report from the President’s Council of 
Advisors on Science and Technology entitled Engage to 
Excel: Producing One Million Additional College Gradu- 
ates with Degrees in Science, Technology, Engineering, 
and Mathematics, which emphasized the role of a good 
and relevant applied mathematics education within the 
framework of STEM (science, technology, engineering, 
and mathematics) education. The national emphasis in 
the United States on STEM has been a hallmark of recent 
education policy development. 

One of the key points raised was the “math gap.” 
This is a term used to highlight the difficulty in transi- 
tion from high school to undergraduate study in STEM 
disciplines — a problem that is exacerbated by what the 
colleges perceive as a lack of mathematical preparation 
in high school. There is a gap between colleges' expec- 
tations and reality. The fundamental thesis is that this 
can be addressed through a stronger founding in math- 
ematical modeling of real-world situations, and the 
solution, analysis, and validation of these models using 
computational and theoretical applied mathematics. 

The issue of college mathematics, or broader STEM, 
readiness has been an area of interest for many 
people — some at a local and highly detailed level and 
others at a broader big-picture level, such as the studies 
carried out by the Mathematical Association of Amer- 
ica. The rest of this article addresses a few ideas about 
how applied and computational mathematics might 
improve the situation. My basic thesis is that the use of 
projects that in turn require some modeling and (com- 
putational) problem solving enhances almost all (not 
just applied) mathematics teaching in all mathemat- 
ics classes. For example, calculus can be taught with 
applied projects replacing endless drills once basic 
skills are acquired. 

Use of Projects in Numerical Methods Classes 

Well over a decade ago, the basic structure of my un- 
dergraduate numerical methods/scientific computing 
courses changed to being entirely based on projects. 
The topical syllabus remained essentially unaltered, 
covering the fundamentals of nonlinear equations, 
linear systems, polynomial and spline interpolation, 
quadrature and numerical solution of ordinary dif- 
ferential equations, with each major theme being 
approached through an extended project. 


The scheme was modified to incorporate more homework
assignments to avoid the issue of procrastination.
The homework assignments included some of
the theoretical background and some of the preliminary
steps in addressing the projects. The motivation
for the changes was an (oversimplistic but illustrative) 
model that can explain why good students often found 
their first computational course difficult. To a faculty 
member, the class had the beauty of bringing together 
much of the students’ prior experience in calculus, lin- 
ear algebra, differential equations, and perhaps model- 
ing, too— together with drawing on their programming 
skills, or even learning some algorithmic programming 
for the first time. 

This same set of properties was the primary source 
of difficulty for the students. Suddenly, and for the 
first time, students were required to synthesize meth- 
ods and solutions from multiple courses. Furthermore, 
their mathematical and programming experience had 
typically been totally disjoint up to that point. Thus we 
had an audience who may have been “good B students” 
in both their mathematical and programming ability, 
but simplistically multiplying these independent 0.85 
probabilities (the middle of the common B grade range), 
we had a success rate of only 61%. In other words, these 
good students were struggling to get a D in numerical 
methods. 

More importantly, students’ attitudes to the course
were colored by a failure to see the wood (or forest)
for the trees. The perception was of a required
course they had to endure rather than an exciting
culmination of all that had gone before. The use of projects
was broadly successful in countering this.

The particular choice of projects is one that is impor- 
tant to the success of the course but also one that can 
be tailored to the individual instructor and audience. 
The initial set of topics I chose (and which have been 
modified successively over the years, both by me and 
by others who have taken over the immediate teaching 
responsibility) is described briefly here and illustrates 
the linkage to the main syllabus topics. 

The Length of a Telephone Cable 

A cable above even ground, and with physical param- 
eters in the model simplified to a specified sag, intro- 
duces iterative solution of a single nonlinear equation. 
The full project referred to a multi-loop cable above
undulating ground with a profile determined by geographic
data, connected with a simple cubic spline. The
problem for each loop is then a nonlinear system of two
equations, and the solutions over different pieces have
to be matched to ensure continuity of the cable.
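
For the simplified single-span version, one plausible formulation (an assumption made here for illustration, since the model equations are not given above) treats the cable as a catenary $y = a\cosh(x/a)$ between supports of equal height. A specified sag $s$ over a span $d$ then requires solving $a(\cosh(d/(2a)) - 1) = s$ for $a$, after which the length is $2a\sinh(d/(2a))$. The sketch uses Newton's method; the function names and the sample numbers are hypothetical.

```python
import math


def catenary_parameter(span, sag, tol=1e-12, max_iter=50):
    """Solve a * (cosh(span / (2a)) - 1) = sag for a by Newton's method."""
    a = span ** 2 / (8.0 * sag)          # parabolic approximation as a start
    for _ in range(max_iter):
        u = span / (2.0 * a)
        f = a * (math.cosh(u) - 1.0) - sag
        df = math.cosh(u) - 1.0 - u * math.sinh(u)   # derivative of f in a
        step = f / df
        a -= step
        if abs(step) < tol * a:
            return a
    raise RuntimeError("Newton iteration did not converge")


def cable_length(span, sag):
    """Length of a catenary cable with the given span and mid-span sag."""
    a = catenary_parameter(span, sag)
    return 2.0 * a * math.sinh(span / (2.0 * a))


if __name__ == "__main__":
    # Hypothetical numbers: supports 50 units apart, sag of 1 unit.
    print(cable_length(50.0, 1.0))
```

For small sag the answer should be close to the parabolic estimate $d + 8s^2/(3d)$, which gives students an independent check on their iteration.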

Rats in a Maze 

Based on simple psychology experiments on rats’ learning
abilities, this was an open-ended project that introduced
iterative solution of linear systems. Even a simple
rectangular, say 6 × 5, maze results in a 30 × 30 linear
system. This is an eye-opener for many students,
who rarely see systems much larger than 3 × 3 in
introductory linear algebra. The basic problem is to compute
the probability of a rat successfully finding food
at some set of exits of the maze from an arbitrary starting
point. These results provide the baseline against
which to measure the rats’ success at learning the maze
by comparing actual performance with the simulated
random decisions.
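
A minimal sketch of the baseline computation follows, under assumptions that are mine and not part of the original project: the rat moves to each of the four neighboring positions with equal probability, every move off the grid is an exit, and only the exits listed in `food_exits` contain food. The probability at each cell is then the average of the values at its neighbors, which gives one linear equation per cell. The system is solved by Gauss-Seidel sweeps; all names are hypothetical.

```python
FOOD, EMPTY = "food", "empty"


def grid_maze(rows, cols, food_exits):
    """Map each cell to its four possible moves.

    A move off the grid is an exit: FOOD if its (row, col) position is
    in food_exits, otherwise EMPTY.
    """
    maze = {}
    for r in range(rows):
        for c in range(cols):
            moves = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                target = (r + dr, c + dc)
                if 0 <= target[0] < rows and 0 <= target[1] < cols:
                    moves.append(target)
                elif target in food_exits:
                    moves.append(FOOD)
                else:
                    moves.append(EMPTY)
            maze[(r, c)] = moves
    return maze


def food_probabilities(maze, tol=1e-12, max_sweeps=100000):
    """Gauss-Seidel iteration for the probability of reaching food."""
    p = {cell: 0.0 for cell in maze}
    for _ in range(max_sweeps):
        change = 0.0
        for cell, moves in maze.items():
            total = 0.0
            for mv in moves:
                if mv == FOOD:
                    total += 1.0
                elif mv != EMPTY:
                    total += p[mv]      # most recent value of a neighbor
            new = total / len(moves)
            change = max(change, abs(new - p[cell]))
            p[cell] = new
        if change < tol:
            return p
    raise RuntimeError("Gauss-Seidel iteration did not converge")


if __name__ == "__main__":
    # 6 x 5 maze (30 unknowns) with a single food exit above the top row.
    probs = food_probabilities(grid_maze(6, 5, {(-1, 2)}))
    print(len(probs), probs[(0, 2)], probs[(5, 2)])
```

Because the maze is just a dictionary of moves, the student variations described next (diagonal passages, removed links, a second level) amount to editing that dictionary.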

Students are then required to modify the maze 
in ways of their choosing: adding diagonal passages, 
removing certain links, adding a second level, and mod- 
ifying the decision model from purely random to some 
bias (perhaps to go straight ahead) are all variations 
that students came up with and solved. One even tried 
to apply some artificial intelligence to simulate the rat 
learning. 

Reproduce a Picture 

Although the concept of splines had been mentioned in 
the telephone cable problem, this project was the real 
introduction to interpolation. The objective was simply 
to reproduce a chosen picture or line drawing using 
interpolation. Polynomial interpolation was explored 
and usually quickly discarded for all but the simplest of 
shapes. Splines and other functions were introduced. A 
more modern treatment would probably extend this to 
using subdivision surfaces. 
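
One way to see why polynomial interpolation is soon discarded is the classical Runge example, used here as an assumed stand-in for a picture outline: interpolating $1/(1 + 25x^2)$ at equally spaced points on $[-1, 1]$ gives errors that grow as points are added. The sketch below measures that growth; the function names are hypothetical.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, xi in enumerate(xs):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += ys[i] * weight
    return total


def max_interp_error(f, num_nodes, samples=1001):
    """Largest sampled error on [-1, 1] using equally spaced nodes."""
    xs = [-1.0 + 2.0 * i / (num_nodes - 1) for i in range(num_nodes)]
    ys = [f(x) for x in xs]
    grid = [-1.0 + 2.0 * k / (samples - 1) for k in range(samples)]
    return max(abs(lagrange_eval(xs, ys, x) - f(x)) for x in grid)


def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)


if __name__ == "__main__":
    for nodes in (6, 11, 21):
        print(nodes, max_interp_error(runge, nodes))
```

Piecewise cubic splines through the same points do not show this growth, which is the practical motivation for moving on to them.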

The Gamma Function 

When I started this project-based course, most stu- 
dents were concurrently enrolled in an applied statis- 
tics course — hence the choice of computing the gamma 
function as the quadrature-based project. One benefit 
is that this necessarily requires modification of con- 
ventional quadrature routines to handle both singu- 
larities and an infinite range of integration. The basic 
idea was simply to find appropriate bounds for the infinite
tail and a region close to the singularity, and then
to compute the major contribution from the resulting
bounded integral. Using the recurrence
$\Gamma(a + 1) = a\,\Gamma(a)$ to reduce the need to compute for all
values of the argument $a$ improved computational efficiency
but necessitated introducing careful, though fairly simple,
error analysis to control the required accuracy.
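
A simplified sketch of the idea, which is a variant of my own and not the course's actual solution, uses the recurrence to move the argument into $[4, 5)$. There the integrand $t^{a-1}e^{-t}$ is well behaved at the origin, so the singularity is sidestepped instead of bounded, and the infinite tail is simply cut off at a point where it is negligible. The cutoff, panel count, and function names are hypothetical choices.

```python
import math


def _simpson(f, a, b, n):
    """Composite Simpson rule with an even number n of panels."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def gamma_quadrature(a, cutoff=80.0, panels=8000):
    """Gamma(a) for a > 0 via the recurrence plus truncated quadrature."""
    if a <= 0:
        raise ValueError("this sketch handles positive arguments only")
    factor = 1.0
    while a < 4.0:            # Gamma(a) = Gamma(a + 1) / a
        factor /= a
        a += 1.0
    while a >= 5.0:           # Gamma(a) = (a - 1) * Gamma(a - 1)
        a -= 1.0
        factor *= a
    integral = _simpson(lambda t: t ** (a - 1.0) * math.exp(-t),
                        0.0, cutoff, panels)
    return factor * integral


if __name__ == "__main__":
    for arg in (0.5, 2.7, 5.0):
        print(arg, gamma_quadrature(arg), math.gamma(arg))
```

The error analysis mentioned above enters through `factor`: any quadrature error in the reduced integral is multiplied or divided by the accumulated product, so the tolerance on the integral has to be chosen with that product in mind.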

Human Cannonball 

A shooting problem for a projectile with nonlinear 
air resistance was the vehicle for the introduction of 
numerical solution of differential equations. The set- 
ting was finding the appropriate launch angle in order 
to hit a specified target (described as an escape window 
to escape from the course). 
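
A shooting calculation of this kind can be sketched as follows, assuming a drag deceleration proportional to the square of the speed (the original drag law is not stated), a classical fourth-order Runge-Kutta integrator, and bisection on the launch angle. The function names, the drag coefficient, and the bracket of angles are all hypothetical.

```python
import math


def height_at_target(angle, speed, distance, drag, g=9.81, dt=0.002):
    """Height of the projectile when it reaches horizontal `distance`.

    The drag deceleration is drag * |v| * v, integrated with RK4; the
    final step is resolved by linear interpolation.
    """
    def deriv(state):
        x, y, vx, vy = state
        v = math.hypot(vx, vy)
        return (vx, vy, -drag * v * vx, -g - drag * v * vy)

    state = (0.0, 0.0, speed * math.cos(angle), speed * math.sin(angle))
    for _ in range(1000000):
        k1 = deriv(state)
        k2 = deriv(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
        k3 = deriv(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
        k4 = deriv(tuple(s + dt * k for s, k in zip(state, k3)))
        new = tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                    for s, a, b, c, d in zip(state, k1, k2, k3, k4))
        if new[0] >= distance:
            w = (distance - state[0]) / (new[0] - state[0])
            return state[1] + w * (new[1] - state[1])
        state = new
    raise RuntimeError("projectile never reached the target distance")


def launch_angle(speed, distance, height, drag, lo=0.1, hi=0.7, tol=1e-8):
    """Bisection on the launch angle (radians) to pass through the target."""
    f_lo = height_at_target(lo, speed, distance, drag) - height
    f_hi = height_at_target(hi, speed, distance, drag) - height
    if f_lo * f_hi > 0:
        raise ValueError("target is not bracketed by the given angles")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = height_at_target(mid, speed, distance, drag) - height
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical data: launch at 30 m/s toward a window 40 m away, 5 m up.
    print(launch_angle(30.0, 40.0, 5.0, drag=0.002))
```

With `drag=0` the trajectory has a closed form, which provides a useful test of the integrator before the nonlinear resistance is switched on.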

The particular list of projects above is certainly 
not intended to be prescriptive. Many improvements 
have been made, while other changes have, of course, 
proved less successful! The list here is only intended 
to illustrate the feasibility of such an approach and 
advance my thesis that project-based learning can sig- 
nificantly enhance the success of introductory scientific 
computing courses. 

Modeling across the Curriculum 

The emphases on projects, computation, and model- 
ing have combined more recently in a general “model- 
ing across the curriculum” philosophy. This started to 
take shape in the work of a SIAM Education Committee 
working group, which led to a 2011 SIAM Review paper
on undergraduate computational science and engineer- 
ing (CSE) programs. The report emphasized that cur- 
riculum design needs to fit local conditions, but it 
also stressed student experiences such as internships. 
Other key points in the paper concerned the role that 
undergraduate CSE education plays in regard to both 
industrial expectations for graduates and feeding the 
educational pipeline. 

Undergraduate CSE programs can take many forms. 
A few are full-fledged undergraduate majors in Com- 
putational X, where X could be physics, biology, or 
finance, for example. More common is some form of 
minor that accompanies an undergraduate major in 
either (applied) mathematics or some field of science 
or engineering. The latter model seems better suited to 
ensuring some depth in a core discipline while main- 
taining the breadth that such a minor introduces to the 
program. 

Undergraduate education in applied and computational
mathematics feeds the K-12 (preschool through
completion of high school) education system, industrial
appointments, and, of course, graduate schools in
all areas of science and engineering as well as in applied
mathematics itself.

One of the main obstacles here is that teacher educa- 
tion in mathematics in the United States is often very 
light in applied content (including statistics), and good 
programs of continuing education and professional 
development for teachers are therefore a necessary pre- 
cursor to any real change. With colleagues, I have been 
involved in a very successful design-based summer 
activity for middle and high school students, includ- 
ing some professional development for their teach- 
ers. Students, and their teachers, are introduced to 
the mathematics and physics of designing a physi- 
cal roller coaster and to modeling software to simu- 
late their design. This has been successful in bringing 
appropriately adapted real-world applications and rele- 
vance to mathematics education. An increasing propor- 
tion of these students, mostly from economically dis- 
advantaged backgrounds or other under-represented 
groups, have subsequently entered STEM college pro- 
grams, demonstrating the benefit of exposure to such 
applied content. The benefit is realized not just in 
their mathematics but also in the science that accom- 
panies it. 

Final Thoughts 

The main point I am making is that students learn bet- 
ter when they perceive their studies as being relevant 
to their lives and future careers. In the case of math- 
ematics, this provides a strong motivation to increase 
the applied and computational content at all stages of 
a student’s development. 

Early emphasis on problem solving leads to more 
advanced projects and full-scale modeling experiences 
as the students’ abilities and background knowledge 
develop. This essay addresses some of those issues at 
a “big-picture” level rather than in detail because the 
details have to be right for the combination of institu- 
tional philosophy, instructor, and students where they 
are to be applied. 

Understanding applied mathematics is inherently 
difficult because of the combined demands of the 
theoretical basis, the modeling and understanding of 
the application field, and the computational abilities 
that are needed to solve the problems. Determining that 
a “solution” really addresses the original issue, and if 
necessary refining it and solving again, are important 
aspects that only add to the inherent difficulty of the 
subject. 


In my opinion, these difficulties oblige the educational
community to address them throughout the curriculum.
For example, I believe that the call for more
emphasis on modeling and applications in the K-12
curriculum needs to be heard and that this move should be
implemented soon. This must continue into the
undergraduate program to help to address the “math gap”
identified in the President’s Council of Advisors on
Science and Technology report. In summary, “early and
often” is perhaps not sufficient to achieve the desired
improvements. I advocate, instead, “early and always”
for modeling and applications throughout the STEM
curriculum.

III. Gilbert Strang: What to Teach and How?

In expressing these brief but strongly held thoughts 
about teaching, I would like to distinguish between two 
separate questions. 

• What should we teach? 

• How should we teach it?

What to Teach?

In my experience, a good decision on what to teach tells 
students that you care about them: you are thinking 
about them and their needs. This sincerity of effort— 
what you are contributing and what you want for 
them— is the most important message you could send 
to students. We all respond favorably to sincerity. 

The teacher may think that he or she has no freedom. 
In calculus that may be partly true (and partly untrue). 
In applied mathematics, the syllabus is seldom so rigid. 
Our subject is extremely large! There is no hope of pre- 
senting or comprehending all the important directions, 
so there is an opening for independent thought. And a 
teacher who thinks independently will pass on an all- 
important message to students: that they can begin to 
work by themselves and think for themselves. 

I will compare and contrast my thoughts about lin- 
ear algebra and about computational science. Linear 
algebra in 1970 was in a very abstract and unsatisfac- 
tory state as a basic undergraduate course. Its enor- 
mous importance in practice was quite unappreciated 
by many of the standard textbooks. Finite-Dimensional 
Vector Spaces by Paul Halmos, for example, was written 
carefully and concisely, but it was written for mathe- 
matics students— stronger ones or weaker ones— and 
not for the much larger number who needed to use the 
subject. 

Outside the classroom, every year brought new appli- 
cations of linear algebra. Matrices became part of the 
core language of science and engineering and eco- 
nomics. To solve differential equations or to study large 
networks, linearity is the first step. At the same time, 
a flood of data began to arrive from much better sen- 
sors, and it often came in matrix form. The challenge 
was (and still is) to interpret that data and extract what 
is useful. 

Along with these big external changes came, inev- 
itably, a new set of options for our teaching. Exam- 
ples became more important. Instead of inventing sub- 
spaces to study, the subspaces came from the matrices 
themselves: the row space and the column space of $A$
and $A^{T}$. The ideas of basis and dimension and orthogonality
apply to those concrete subspaces. The abstraction
of a linear transformation need not come first!

Instead of starting with the abstract case, under- 
standing emerged from the examples themselves. This 
is how most mathematicians think and learn. Why 
should the minds of our students not work in the same 
way? 


There is certainly a danger that the new approach 
could also become too rigid. There seems to be general 
acceptance of an overall syllabus by textbook authors 
and textbook choosers. Algebraic ideas like indepen- 
dence, basis, and dimension combine with algorithms 
like LU and Gram-Schmidt. Eigenvalues are introduced 
for a purpose: to decouple the variables and make the 
matrix diagonal. Computing $A^k$ and $e^{At}$ then becomes a
one-dimensional problem. Factorizations of $A$ express
the essential facts: $A = LU$, $A = QR$, $A = S \Lambda S^{-1}$,
$A = U \Sigma V^{T}$. Matrices that are important in practice
(reflections, rotations, differences) provide genuine examples.
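
The remark about decoupling can be made concrete with a small sketch. The 2 by 2 matrix, its eigenvectors, and the function names below are illustrative assumptions, not taken from the essay; the point is only that, once the matrix is diagonalized, the power k is applied to each eigenvalue separately.

```python
from fractions import Fraction

def matmul(X, Y):
    """Multiply two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][m] * Y[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

def power_by_multiplication(A, k):
    """Compute A^k directly, by k repeated matrix products."""
    result = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
    for _ in range(k):
        result = matmul(result, A)
    return result

def power_by_diagonalization(S, eigenvalues, S_inv, k):
    """Compute A^k = S Lambda^k S^{-1}; only scalars are raised to the power k."""
    lam_k = [[eigenvalues[0] ** k, Fraction(0)],
             [Fraction(0), eigenvalues[1] ** k]]
    return matmul(matmul(S, lam_k), S_inv)

# Hypothetical example matrix with eigenvalues 3 and 1.
A = [[Fraction(2), Fraction(1)], [Fraction(1), Fraction(2)]]
# Its eigenvectors (1, 1) and (1, -1) form the columns of S.
S = [[Fraction(1), Fraction(1)], [Fraction(1), Fraction(-1)]]
S_inv = [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 2), Fraction(-1, 2)]]
eigenvalues = [Fraction(3), Fraction(1)]

k = 5
same = power_by_diagonalization(S, eigenvalues, S_inv, k) == power_by_multiplication(A, k)
print("both routes agree:", same)
```

Exact fractions are used so that the two routes can be compared with a plain equality test rather than a floating-point tolerance.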

This subject is still partly driven by what happens 
outside the classroom. Further changes in the syllabus 
will come. In all of this renaissance I want to empha- 
size that the beautiful ideas of this subject are not sup- 
pressed! The very opposite, in fact; they become better 
understood and appreciated. 

Now for the comparison and contrast with applied 
mathematics and computational science as a whole. 
This is a vast subject and our classroom time is so lim- 
ited. I do not see rapid convergence to one fully devel- 
oped core curriculum. My own course did converge over 
a twenty-five-year period to focus on key ideas, and 
those ideas went into a textbook. But more examples 
keep emerging, and new codes. Courses on applied and 
engineering mathematics are still in a (healthy) state of 
flux. 

How to Teach? 

The heading here is a question and not a statement! 
Teaching is far too difficult, and success is too uncer- 
tain and ill-conditioned, to give an algorithm for suc- 
cess. I will give suggestions below, not rules. 

A key point: the subject is to be uncovered, not 
covered. 

It is natural to prepare for a class by deciding on 
a plan. Start with a question that it is important to 
answer. What is the inverse of a matrix and which 
matrices have inverses? 

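The requirements $A^{-1}A = I$ and $AA^{-1} = I$ are
straightforward. But those are only letters! Examples
are needed right away. Write down

$$
\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad
\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad
\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad
\begin{bmatrix} a & b \\ c & d \end{bmatrix}.
$$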

Invert the first and third of these and describe $A^{-1}$ in
words. Show that the second matrix is not invertible.
(The best way to do this is to see a specific vector $x$ in
the nullspace.) The fourth case above can reduce to a
test for parallel rows or parallel columns. Determinants
can be mentioned, as can nonzero pivots. Multiple ways 
of recognizing invertibility or its opposite are highly 
valuable. 
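
As a hedged illustration of these multiple tests, the sketch below works through three small matrices of the kind discussed above: an exchange matrix, an all-ones matrix, and a shear. The specific entries and the helper names are assumptions made for the example. It uses the determinant test for a general 2 by 2 matrix and exhibits a nullspace vector for the singular case.

```python
from fractions import Fraction

def inverse_2x2(M):
    """Return the inverse of a 2x2 matrix, or None when ad - bc is zero."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        # Parallel rows (and parallel columns): no inverse exists.
        return None
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def apply(M, x):
    """Multiply a 2x2 matrix by a vector with two components."""
    return [M[0][0] * x[0] + M[0][1] * x[1],
            M[1][0] * x[0] + M[1][1] * x[1]]

exchange = [[0, 1], [1, 0]]   # swaps the two components
ones = [[1, 1], [1, 1]]       # rows are parallel
shear = [[1, 1], [0, 1]]      # adds the second component to the first

# The exchange matrix is its own inverse: swapping twice restores the vector.
print(inverse_2x2(exchange) == exchange)
# The shear is undone by subtracting instead of adding.
print(inverse_2x2(shear) == [[1, -1], [0, 1]])
# The all-ones matrix has no inverse, and x = (1, -1) lies in its nullspace.
print(inverse_2x2(ones) is None, apply(ones, [1, -1]) == [0, 0])
```

The nullspace vector is the more convincing argument in class: if the matrix sends a nonzero x to zero, no inverse can bring x back.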

The important point is that in working with simple 
examples you are giving the students a chance. The key 
is to build their confidence as active users of mathemat- 
ics. The teacher has to be saying, in many ways but not 
in so many words, you can do it. 

In an applied mathematics course, a correspondingly 
simple question might be the following. 

What are the solutions to $d^2y/dx^2 = \lambda y(x)$?

Here we are looking for eigenfunctions. The matrix in
$Ay = \lambda y$ is replaced by the second derivative. No
boundary conditions have been imposed so far.

Now add the conditions $y(0) = 0$ and $y(1) = 0$.
That should leave only the eigenfunctions $y = \sin k\pi x$
with their eigenvalues $\lambda = -k^2\pi^2$. The goal is to make
the idea of an “eigenfunction” familiar. The answer was 
already known, and it illuminates the question. 
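
A quick numerical check can make the eigenfunction idea tangible. The following is a minimal sketch under stated assumptions: the second derivative is approximated by a centered difference, and the function names, step size, and tolerances are choices made for this illustration only.

```python
import math

def second_difference(f, x, h=1e-4):
    """Centered finite-difference approximation to the second derivative of f at x."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

def is_dirichlet_eigenfunction(k, h=1e-4, tol=1e-3):
    """Check numerically that y = sin(k*pi*x) satisfies y'' = -(k*pi)^2 y
    on (0, 1) together with the conditions y(0) = 0 and y(1) = 0."""
    y = lambda x: math.sin(k * math.pi * x)
    lam = -(k * math.pi) ** 2
    # Boundary conditions: both end values must vanish (up to rounding).
    if abs(y(0.0)) > 1e-9 or abs(y(1.0)) > 1e-9:
        return False
    # Differential equation: compare y'' with lam * y at interior sample points.
    xs = [i / 10 for i in range(1, 10)]
    return all(abs(second_difference(y, x, h) - lam * y(x)) < tol for x in xs)

for k in (1, 2, 3, 1.5):
    print(k, is_dirichlet_eigenfunction(k))
```

Integer values of k pass both tests. The value k = 1.5 still satisfies the differential equation, but it fails the condition at x = 1, which is exactly how the boundary conditions select the eigenvalues.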

I would like to emphasize the importance of the
teacher’s voice. All of us watch for verbal signals from 
a speaker: “This is exciting.” “This is very ordinary.” 
“Pay attention to this!” Boredom or enthusiasm come 
through so clearly. We are virtually announcing low 
expectations or raised expectations. If we are not inter- 
ested ourselves, that message overrides our words. And 
fortunately, if our own curiosity about where a particu- 
lar example leads is aroused, students understand what 
applied mathematics mostly is: following an example to 
the end. 

Instead of consulting published references, I recom- 
mend a severe critique of lecture videos. You can find 
the author’s own courses on the OpenCourseWare site 
(ocw.mit.edu): 18.06 (linear algebra) and 18.085 (com- 
putational science). The version 18.06SC of the for- 
mer involves brief lectures on problem solving by six 
teaching assistants. What is it that makes each of them 
succeed or fail? 

If only we knew more about teaching, we could define 
success. 

IV. Rachel Levy: Industrial Mathematics 
Inspires Mathematical Modeling Tasks 

Motivation 

A common refrain from high school mathematics students
goes something like this: “We have to learn this
stuff for twelve years! Nobody ever even uses it!” As
a mathematics professor and former middle and high 
school mathematics teacher, I am sadly unsurprised. 
Much of the mathematics written in textbooks, even 
when it is framed as practical, can strike students as 
overly academic and divorced from reality. Real prob- 
lems require skills that most students never have the 
opportunity to practice. In this article I outline a set 
of skills that we can incorporate into assignments — 
initially one at a time, and eventually all together— to 
prepare students for the types of problems they will
face in the real world, especially in science, technol- 
ogy, engineering, and mathematics fields. While I have 
compiled the list below with undergraduate students 
in mind, students would ideally have opportunities 
to practice simple versions of these skills throughout 
their mathematics education. 

I used to think modeling meant story problems. And 
why not? After all, story problems come from real sit- 
uations. But they do not ring true because they are 
too neat. Traditional story problems provide the task, 
the methodology, and the exact information needed to 
complete the task. Real problems are messy. Someone 
actually cares whether we solve them or not. We do not 
usually know how to solve them a priori. To prepare 
students to solve ill-defined, messy problems, we need 
to increase the cognitive demand we place on students 
by incorporating genuine and engaging modeling prob- 
lems into our curricula. By cognitive demand I mean 
the complexity of the tasks and the degree of decision 
making required. 

What types of mathematical modeling tasks can we 
provide? Of course, as modelers we might start with a 
model of modeling itself. Modeling can be viewed as 
an iterative process in which assumptions are made 
and then questioned, models are proposed and then 
refined, and data from real situations help to test the 
validity of the model through multiple versions of the 
solution. Models generally balance simplicity with real- 
ism, just as computations must often balance efficiency 
with accuracy. Every step of the modeling process may 
not be necessary and the process may not always pro- 
ceed in the same order, but the spirit is one of iteration 
and balance. 

With this iterative process in mind, we can begin to 
design tasks that enable students to practice model- 
ing in meaningful ways. My ideas about mathematical 
modeling stem from my experience in modeling camps, 
British-style industrial mathematics study groups, and 
the Harvey Mudd College Mathematics Clinic. In these
intensive experiences, groups of faculty and students
gather to solve problems posed by companies, gov- 
ernmental organizations, and sometimes individuals. 
Rather than create a hierarchical taxonomy, I will sug- 
gest elements that can be incorporated into model- 
ing tasks to increase their cognitive demand and make 
problems richer and more realistic. 

The Modeling Tasks 

The order of the tasks below roughly follows one itera- 
tion of the modeling process. However, individual ele- 
ments can be incorporated into assignments, to pro- 
vide practice with various challenges from real-world 
mathematics. 

Tackle realistic problems. Problems that can be de- 
scribed as “real,” of course, are ones that have not yet 
been solved. The task of identifying interesting prob- 
lems with an appropriate level of challenge that are still 
tractable could be an interesting assignment for stu- 
dents, or it could fall to the faculty. Many businesses 
have problems that they need to solve or processes they 
would like to better understand. 

Interact with a vested party. Ideally, the modeling 
problem will have been posed by a party who cares 
about the outcome. When students interact with the 
problem sponsors (e.g., from a company) and end users 
of a project, they can practice professional communica- 
tion and negotiation. If the students communicate with 
the sponsor on a regular basis, the solution will be more 
likely to satisfy the sponsor. The sponsor can provide 
realistic constraints— such as time, money, hardware, 
and software limitations— as well as information about 
company/sponsor culture and policy. Communication 
with the sponsor can motivate the students, since the 
problem they must solve matters to someone outside 
the classroom. 

Define the problem statement. When a problem is 
first proposed in writing, the ideas put down on paper 
may not reflect exactly what the sponsor wants. Some- 
times the sponsor will have omitted critical informa- 
tion or constraints that necessitate a specific approach. 
The suggested method of solution may not be the best 
approach for the proposed problem. After researching 
background on the problem, students can propose rea- 
sonable goals and their preferred approach. This often 
involves negotiation among the students about how to 
proceed as well as with the sponsor about the approach 
and final deliverables. 


Disregard extraneous information. Modeling proj- 
ects often begin with a literature search, in which stu- 
dents seek relevant approaches, data, or parameter val- 
ues. The variety and scope of digitally available infor- 
mation can both facilitate and overwhelm problem- 
solving efforts. Sophisticated search strategies, such 
as those taught by librarians, can help students nav- 
igate the information overload. Students need to be 
given an opportunity to decide which of the avail- 
able information about a problem should be used to 
develop their solution. Few textbook problems con- 
tain extraneous information, whereas with a real prob- 
lem any available information (data, mathematical tech- 
niques, approaches) can be brought to bear. Students 
must determine which ideas and information are most 
salient to the problem at hand. 

Cope with messy data. Textbooks (as well as many 
models) assume that data are normally distributed or 
that they follow some other distinguishable distribu- 
tion, pattern, or trend. Real data often do not fall into a 
nice category and furthermore contain outliers, faulty 
entries, and missing values that need to be identified. In 
addition, the governing principles underlying the data 
may not be known. 

Define and justify assumptions. Models simplify 
reality via assumptions. Even when students are pre- 
sented with a model, they can discern and question the 
model’s assumptions. They can also predict what might 
happen if the various assumptions are relaxed. When 
students do have the opportunity to make modeling 
assumptions, they can learn to identify ways to simplify 
problems and to balance generality with specificity. 

Choose an approach. When we present mathematical 
models in lectures, we often have a particular approach 
in mind and actively steer the discussion in that direc- 
tion. For example, when I introduce the mass-spring 
system in a differential equations class, I know which 
equation I want the students to use to model the sys- 
tem. Ideally, students will tackle modeling problems for 
which a variety of approaches could succeed. When that 
is not possible, students who debate the motivations 
for a particular approach can begin to see modeling as 
a set of choices based on underlying principles rather 
than an application of absolute and obvious laws. 

Combine mathematical, statistical, and computation- 
al skills. In academic classes, students generally learn 
to apply a specific set of skills designated by the current 
text/topic, chapter, and problem set. In many texts,
the title of the section will indicate which technique
to use. Real problems leave more to the imagination. 
In addition, for most industrial problems, mathematics 
students must bring their programming skills to bear, 
including collaborative coding, version control, com- 
menting, testing, validating, and debugging. Projects 
sometimes also require proficiency in and use of a par- 
ticular piece of software, even if the students have a 
different preferred platform or solution. 

Validate and test results. Once students have run 
their models, how can they conduct a sanity check on 
their results? If they have used data to create a model or 
algorithm, did they reserve some data for testing pur- 
poses? Under what circumstances should the results be 
most accurate? What metrics best define a good solu- 
tion? Can the model run in a reasonable amount of 
time? Does the solution meet the requirements of the 
sponsor? Students must report the accuracy and relia- 
bility of their data, assumptions, and model as well as 
estimate the error introduced by the parameter values 
and data and methods they have used. 

Iterate to refine the solution. Iteration can begin at 
many stages, as rethinking becomes necessary. Prelim- 
inary results may uncover unnecessary assumptions or 
new model requirements. While iteration is always a 
potentially useful step in the process, it is less likely to 
be feasible for short-term projects. 

Draw conclusions. Once students have designed, 
refined, and run their model, they must decide what 
conclusions they can draw from the results. The con- 
clusions will of course depend on the assumptions and 
choices that were made in the previous parts of the 
modeling process, and on the results. 

Communicate results to both general and techni- 
cal audiences. Deliverables can provide opportunities 
for students to practice mathematical writing, software 
documentation, peer review, and public speaking, as 
well as visual communication of key ideas and relation- 
ships. People interested in the results of their model 
will likely have various levels of understanding of the 
problem and its solution. Students will need to be able 
to communicate effectively with managers and officers 
as well as with technical teams. Regular communication 
with a sponsor, mid-project oral updates, and periodic 
written reports can all provide opportunities to practice 
these skills before deliverables are due. 
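
Two of the tasks above, coping with messy data and reserving data for testing, can be practiced together in a very small exercise. The sketch below is one possible classroom starting point, not a recommended method: the data set is invented, the outlier rule (distance from the median measured in median absolute deviations) is deliberately crude, and all function names are hypothetical.

```python
import statistics

def clean(records, threshold=5.0):
    """Drop missing values, then drop points far from the median (a crude MAD rule)."""
    present = [(x, y) for x, y in records if y is not None]
    ys = [y for _, y in present]
    med = statistics.median(ys)
    mad = statistics.median([abs(y - med) for y in ys])
    if mad == 0:
        return present
    return [(x, y) for x, y in present if abs(y - med) <= threshold * mad]

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(points, slope, intercept):
    """Root-mean-square error of the fitted line on the given points."""
    total = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    return (total / len(points)) ** 0.5

# Invented raw data: roughly y = 2x + 1, with one missing value and one faulty entry.
raw = [(0, 1.1), (1, 2.9), (2, 5.2), (3, None), (4, 8.8), (5, 11.1),
       (6, 13.0), (7, 500.0), (8, 17.1), (9, 18.9), (10, 21.2), (11, 22.8)]

cleaned = clean(raw)
test_set = cleaned[::3]                                        # reserved for testing
train_set = [p for i, p in enumerate(cleaned) if i % 3 != 0]   # used to fit the model

slope, intercept = fit_line(train_set)
print("points kept:", len(cleaned))
print("held-out RMSE:", round(rmse(test_set, slope, intercept), 3))
```

Students can then be asked to question each choice: why this outlier rule, why this split, and what changes if the faulty entry is left in.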


Practical Matters 

Typical textbook story problems rarely require any of 
the skills discussed above, even though many of the 
tasks may be required in jobs that involve mathematical
modeling. Modeling camps, study groups, and indus- 
trial mathematics workshops provide opportunities to 
practice combinations of the skills, but not every stu- 
dent will have those opportunities or be prepared to 
fully participate. As a way to provide practice, one or 
more of the skills can be incorporated into a mathemat- 
ics course through class activities, projects, and home- 
work problems. In this way students can benefit from 
practicing each skill in isolation before attempting to 
combine them. 

A simple place to start might be to incorporate some 
extraneous information into a problem set. Another 
possibility would be to give students a raw data set and 
ask them how they would approach the information to 
draw a conclusion. Governmental organizations, such 
as the U.S. Environmental Protection Agency, make 
data and models available online. A third option could 
require students to analyze a problem and suggest a 
course of action. The Mathematical Contest in Model- 
ing provides 10 years of online example problems and 
winning papers. 

Long-term projects, such as the year-long Mathemat- 
ics Clinic at Harvey Mudd College, require students to 
practice all of the skills above. With long-term projects 
students have the opportunity to work in teams and 
practice project management. With some guidance, stu- 
dents can learn to prepare reasonable time lines, plan 
for failure and other contingencies, and assign tasks so 
that each team member contributes to the solution. 

As we design modeling activities for students, we 
should consider how much iteration is feasible. Ide- 
ally, we would discuss the rationale for iteration with 
students even when they do not have time to imple- 
ment the iteration themselves. As we teach mathemat- 
ical modeling and provide mentoring, we also need to 
decide how much scaffolding we will provide to lead 
their modeling efforts in a desirable direction. Cogni- 
tive research in mathematics education supports the 
idea that in the early stages of learning, worked exam- 
ples can promote skill acquisition, but problem solv- 
ing is superior during later stages. While we may want 
to provide students with a framework for approach- 
ing particular problems, scaffolding can unintention- 
ally reduce the cognitive demand of the task. Students 
need to struggle in order to develop strategies for real,
messy problems. Therefore, as students advance, we
must gradually take away the scaffolding and make 
sure that the problems are tackled by the students 
rather than being demonstrated to them. 

As we teach mathematical modeling, the tasks out- 
lined above can provide students with opportunities to 
develop individual skills they can apply to real prob- 
lems. When they can combine the skills, work well indi- 
vidually or in teams, and produce relevant results for 
real unsolved problems, they will be ready to make 
valuable contributions to modeling problems in sci- 
ence, technology, engineering, and mathematics fields. 
We can provide these challenges through problems 
with high cognitive demand, inspired by industrial 
mathematics.

VIII.8 Mediated Mathematics:
Representations of Mathematics
in Popular Culture and
Why These Matter

Heather Mendick 


1 Introduction 

As geek becomes increasingly chic, media representa- 
tions of mathematics and mathematicians proliferate. 
Writing this from England, I can currently tune into 
regular television episodes of U.S. mathematics-solves-crime
drama Numb3rs and U.S. sitcom The Big Bang
Theory, which features three physicists and an engineer
among its five central characters. In 2012 U.K. comedy 
channel Dave launched a new show called Dara Ó Briain:
School of Hard Sums, in which host and mathematical
physics graduate Ó Briain competes against fellow
comedians at tasks set by professor of mathematics 
(and occasional television presenter) Marcus du Sautoy, 
with two mathematics undergraduates in the studio as 
backup problem solvers. In this article I will be explor- 
ing such popular cultural texts, analyzing the ways in 
which they portray mathematics and mathematicians. 

First, something about me. It feels strange to be writ- 
ing for the Princeton Companion to Applied Mathemat- 
ics. Although my first degree was in mathematics and 
I taught it for more than seven years, as an academic I 
identify and work as a sociologist of education. The editors
kindly agreed that I be excused from the requirement
to use LaTeX, a program of which I was previously
ignorant. My move from mathematics to sociology has 
been accompanied by many changes. For example, my 
hair is no longer waist length and unstyled but short 
and dyed a vivid shade of red. My epistemological view 
has also shifted. In their book The Mathematical Expe- 
rience, Philip Davis and Reuben Hersh joke that “the 
typical working mathematician is a Platonist on week- 
days and a formalist on Sundays.” Indeed, when I stud- 
ied and taught mathematics I took a Platonic position: 
I felt that I was experiencing and relating to absolute, 
objective knowledge. Now I see mathematics, like all 
knowledge, as a social construct arising from human 
action in the world rather than being external to it. 

So, following this approach, mathematics comes into 
being through the constellations of meanings that cir- 
culate about it, the stories that we (choose to) tell about 
it, and the ways in which people take up positions 
within those stories and are positioned by them. Look- 
ing at it this way, the idea that mathematical knowledge 
is absolute is a story about mathematics, albeit one that 
is very powerful and that I used to make sense of what 
I was doing for about the first thirty years of my life. 
This story undoubtedly had the social function of bind- 
ing me to an “imagined community” of mathematicians, 
who had the ability to access this unique form of know- 
ledge. This story about mathematics is thus also one 
about who I am (and who I am not); it is part of the 
way that I constructed my self (and the story that I am 
telling here, of moving away from this, is part of how I 
construct myself now). Looking back, I wonder how this 
impacted on my views of other people and other ways 
of knowing. Perhaps this story had a psychic function:
what role did this conception of certain knowledge, and
the promise of control that it offers, play for me? 

Thus, my intellectual project now is to identify the 
stories we tell about mathematics — the meanings that 
we give to mathematics— and to understand their pat- 
terning and their effects. We can, of course, find these 
meanings in workplaces, school classrooms, and uni- 
versity lecture halls. But, in contemporary society, the 
media are a primary site for the circulation of mean- 
ings, so this too is a good place to look for mathemat- 
ics. My purpose in identifying media stories is to open 
up different ways of thinking about, engaging with, 
and teaching mathematics. This article is split into four 
main parts. The first two focus on two types of mathe- 
matics (everyday and esoteric) playing out in media rep- 
resentations and the identities associated with these. 
And in the next two I focus on tracking popular repre- 
sentations of mathematics via key emotions: pleasure 
and excitement, fear and boredom. I show that, while 
all are held responsible for learning mathematics, not 
all are perceived to have mathematical ability. I end by 
offering some brief reflections on using popular culture 
within university mathematics education. 

2 The Paradox of Relevance— or Do 
We All Use Mathematics Every Day? 

In a reflective mood, Abraham Lincoln, played by Daniel
Day-Lewis in Lincoln, Hollywood’s 2013 film account
of the abolition of slavery in the United States, recalls 
reading Euclid’s Elements: 

Euclid’s first common notion is this: “things which are 
equal to the same thing are equal to each other.” That’s 
a rule of mathematical reasoning. It’s true because it 
works. Has done and always will do. In this book, Euclid 
says this is “self evident.” You see, there it is, even in 
that 2000-year-old book of mechanical law. It is a self- 
evident truth that things which are equal to the same 
thing are equal to each other. 

Here, Lincoln is presented as using Euclid to support 
the obviousness of race equality. While there is no evi- 
dence that he actually did this, what is interesting is 
that a Hollywood movie associates mathematics with 
empathy and justice. This is a common association in 
popular mathematics. My favorite example comes from 
the classic teen comedy Mean Girls from 2004. In an 
interschool competition for teams of “mathletes,” the 
film's heroine, Cady Heron (played by Lindsay Lohan), is 
in the sudden-death round of the final. Standing at the
podium she has a revelation. Her internal monologue
during this runs: 

Calling somebody else fat won't make you any skinnier.
Calling someone stupid doesn't make you any smarter.
And ruining [frenemy] Regina George's life definitely
didn't make me any happier. All you can do in life is
try to solve the problem in front of you.

This insight is part of the reasoning process through 
which Cady secures the right answer to the limit prob- 
lem in front of her, thereby securing victory for her 
team. 

In these examples, mathematics is constructed as 
part of our common humanity, intrinsic to our everyday 
being and reasoning. There are many popular examples 
that do this in other related ways. Game shows (such as 
Countdown, Deal or No Deal, and Who Wants to Be a Mil- 
lionaire) present mathematics as part of “general know- 
ledge” rather than being the province of only a lucky 
(or unlucky) few. The success of puzzles like sudokus 
and kakuros, and of computer games like Tetris and 
Dr Kawashima’s Brain Training, bring mathematics into 
people’s lives, as does interacting with mathematics 
in magazine quizzes and through sporting events (try 
working out finishes in a game of darts, for example). 
Via the media, we are continually shown or told that 
mathematics is useful or even necessary for familiar 
everyday practices such as calculating profits (in children's
fantasy Matilda) and counting calories (in Mean
Girls) or for unfamiliar but important ones such as 
winning wars (in World War II films Enigma and Dam- 
busters and Cold War movie A Beautiful Mind), crimi- 
nal activity (in glamorous film Ocean’s 11), and crime 
fighting (in television series Numb3rs). 

Given the ubiquity and utility of mathematics in 
popular culture, we may expect this to have resolved 
what Ole Skovsmose refers to as the paradox of rele- 
vance: that “on the one hand, mathematics has a perva- 
sive social influence and, on the other hand, students 
and children are unable to recognise this relevance.” 
However, recent research shows that this paradox per- 
sists. I think this is because of the many contradic- 
tions in how popular mathematics is woven into our 
“life-worlds.” Some of these are evident when news- 
papers reassure people that sudokus are not mathe- 
matics or when the normally erudite presenter Melvyn 
Bragg describes himself as “blinking in the face of this 
[mathematical] assault” and “completely intrigued by 
this out of space thought that you mathematicians go 
in for.” Mathematics is both everyday activity and “out
of space thinking”; it is both for all and the specialized
province of “you mathematicians.” Now I want to
explore these contradictions by looking in detail at the
television series Numb3rs.

Figure 1 The Eppes brothers, Charlie and Don, show
off their contrasting approaches to fighting crime.

Numb3rs (2005-10) ran for six seasons (118 episodes 
in all) in the United States on the CBS network and con- 
tinues to be shown in the United Kingdom on the sta- 
tion 5USA. The narrative centers on two brothers. Older 
brother Don Eppes (Rob Morrow) is the lead Federal 
Bureau of Investigation (FBI) agent at the Los Angeles
violent crime squad, while his younger brother Charlie 
(David Krumholtz) works as a professor of mathemat- 
ics at the local university, the California Institute of Sci- 
ence (CalSci). Each episode shows Don and Charlie join- 
ing forces to solve a new crime. They work alongside 
Alan (their self-proclaimed “FBI dad”), Don's FBI team, 
and two of Charlie's CalSci colleagues: Amita Ramanu- 
jan, initially his doctoral student and later his colleague 
and girlfriend, and quirky physicist Larry Fleinhardt. 
Numb3rs contains perhaps the most blatant and ped- 
agogic instance of popular culture presenting mathe- 
matics as an everyday activity undertaken by all: the 
voice-over during the credits sequence for the first two 
seasons (after this, the show followed the increasing 
trend of eliminating the opening credits). The season 
one voice-over insists: 

We all use math every day: to forecast weather; to 
handle money. We also use math to analyze crime: 
reveal patterns; predict behaviors. Using numbers we 
can solve the biggest mysteries we know. 

This plays out in the show, as mathematical “genius” 
Charlie seems to spend as much time in the FBI offices 
as at his university, appearing largely unencumbered 
by the seminars and meetings that fill my working 
life as an academic. He is therefore always on hand 
to provide an application of mathematics. His uses of 


mathematics go beyond analyzing, and inevitably solv- 
ing, crime after crime. In a notable example, in season 
four, Charlie writes a self-help book on the mathematics
of friendship (The Attraction Equation) that quickly
becomes a best-seller. Later, in season five, he, Larry, 
and Amita apply mathematics and physics to basket- 
ball in an attempt to halt CalSci’s entrenched losing 
streak. 

Throughout the seasons we see Charlie — and to a 
lesser extent, Amita, Larry, and other characters — 
talking about mathematics. But each episode also con- 
tains some distinct “math-bits.” These focus on the use 
of analogies to convey a mathematical idea. For exam- 
ple, in an episode where a supercomputer is suspected 
of murder, a math-bit is used to explain the Turing test 
via an analogy between computers and roses. Roses 
may be real or artificial or they may be genetically mod- 
ified so as to be artificial but indistinguishable from 
real, hence passing the Turing test for roses (if one 
existed). While the rest of the series has a common real- 
istic visual style looking much like any mainstream U.S. 
crime series, these math-bits look very different. They 
feature selective use of vivid colors against a black and 
white three-dimensional grid-like background; images 
of objects and formulas move around rapidly as if 
choreographed in time, with a voice-over from Charlie 
(or occasionally, in later seasons, another character). In 
an interesting move, the transitions into and out of the 
ads, and some other scenes, use the three-dimensional 
grid-like graphic from these math-bits. This has the 
effect of encouraging us to frame the evolving events as 
manipulable mathematically, which can be understood 
as an instance of what Skovsmose calls the formatting 
power of mathematics. This names the way “that math- 
ematics produces new inventions in reality, not only in 
the sense that new insights may change interpretations, 
but also in the sense that mathematics colonises part 
of reality and reorders it.” 

Watching the math-bits, I find that even when I follow the
analogies, and even though they help me understand what
mathematics can do, they give me no access to the mathematics
itself. Charlie is repeatedly referred to as a genius, and 
these math-bits seem to be intended to provide us with 
insights into mathematical minds like his rather than 
to indicate something that all minds could do. So rendering
the everyday mathematical requires both a process
of transformation and a person to conduct that
transformation. In this way even the idea of mathemat- 
ics as within the everyday relies on a construction of 
mathematics as an esoteric skill carried out by an elite,
something I now discuss. 

3 The Mathematical Mystery Tour 

In the last section I showed that a tension runs through 
popular culture and, more widely, between stories of 
mathematics as everyday and accessible to all and as 
esoteric and accessible only to an elite. By looking at 
Numb3rs I showed that the same text can tell both sto- 
ries simultaneously by suggesting that mathematics is 
useful to all but inaccessible to most. In Numb3rs, while 
Charlie and the others who do mathematics may occa- 
sionally express uncertainty and need help, the math- 
ematics itself has no similar vulnerabilities; it is hard, 
absolute, certain. This idea of mathematics has become 
common sense. It is perhaps most explicitly stated by 
G. H. Hardy in his book A Mathematician’s Apology. 

A chair or a star is not the least like what it seems 
to be; the more we think of it, the fuzzier its outlines 
become in the haze of sensation which surrounds it; 
but “2” or “317” has nothing to do with sensation, 
and its properties stand out the more closely we scru- 
tinize it... 317 is a prime, not because we think so, 
or because our minds are shaped in one way rather 
than another, but because it is so [original emphasis], 
because mathematical reality is built in that way. 

As I said above, this is not a position I share. But I am 
not concerned here with arguing against it; instead, I
want to look at how this story (of mathematics as abso- 
lute and objective) is told within popular culture and 
at what effects it has. In particular, I want to track its 
relationship with mathematical elitism. 

In popular culture, the absoluteness of mathemat- 
ics is often supported by associations with spirituality, 
mystery, or magic. For example, in Dan Brown's 2003 
novel The Da Vinci Code, cryptographer Sophie Neveu
(played in the film version by Audrey Tautou) is revealed
to be a direct descendant of Jesus and Mary Magda- 
lene, an embodiment of the “sacred feminine,” some- 
thing that must remain in balance with the masculine. 
In the Darren Aronofsky film Pi (from 1998), a Jewish 
follower of Kabbalah believes that the lost name of God 
can be found within the decimal expansion of π. While
(as I discuss further below) in Michael Crichton’s 1991 
novel (and Crichton and Steven Spielberg’s 1993 film) 
Jurassic Park, the predictive power of the mathematics 
of chaos takes on a magical quality. 

This leads to a representation of those who can 
access the secrets of mathematics as gifted and special.
For example, Charlie Eppes (of Numb3rs) was five years 
ahead of his age at school, entered Princeton at thir- 
teen, and got his first journal article published at the 
tender age of fourteen. Indeed, the mathematicians 
we see on big and small screens are nearly always 
“geniuses,” from the Nobel laureate John Nash (played 
by Russell Crowe) in the 2001 biopic A Beautiful Mind to 
Pi’s Max Cohen (played by Sean Gullette), who searches 
for patterns in π. This specialness is reinforced by their
presentation as unstable, with Nash suffering from life- 
long schizophrenia and Max pursuing his work beyond 
obsession. These mental-health problems are often 
directly linked to their mathematical “gifts.” Scenes 
showing people doing mathematics usually involve fre- 
netic scribbling on any available surface, including mir- 
rors and windows when the more usual whiteboards 
and blackboards are unavailable. For me, the most 
striking example of a direct link between madness and 
mathematics is a fantasy sequence in which Max is 
depicted applying an electric drill to his own head, 
metaphorically excising the mathematical “gift” from 
his brain before he can go on to a happier and more 
relational future, but one in which he has lost both the 
desire and the ability to calculate. For these and the 
other big- and small-screen geniuses, mathematics is 
viewed as defining their whole personality and infusing 
every aspect of their lives. In an extreme but not atyp- 
ical example, the married pedophile mathematician in 
Kate Atkinson’s 2005 novel Case Histories:

didn't really feel the need for another person in his life, 
in fact he found the concept of “sharing” a life bizarre. 
He had mathematics, which filled up his time almost 
completely, so he wasn’t entirely sure what he wanted 
with a wife. Women seemed to him to be in possession
of all kinds of undesirable properties, chiefly madness,
but also a multiplicity of physical drawbacks—blood,
sex, children—which were unsettling and other
[original emphasis]. 

This also makes explicit how, despite mathematics 
being represented as absolute and therefore outside 
of society (or perhaps even because of it), the default 
mathematician is a (white, middle-class, heterosexual) 
man. 

This gendering of mathematics is apparent in the 
only scene in A Beautiful Mind where we see some 
actual mathematics: Nash’s work on game theory. Here 
is a brief description. 

The scene is a bar with upbeat music playing. The camera
focuses on a tall blonde (among a group of women)
and then on a group of male mathematics graduate stu- 
dents staring at her. The exception is John Nash, who is 
working, surrounded by papers and books, piled hap- 
hazardly and with a pint of beer. His fellow students 
draw his attention to “the blonde.” They look at the 
group of young women, who then look back at them. 
Nash looks uncomfortable. One student, Martin, makes 
reference to Adam Smith’s theory that “in competition, 
individual ambition serves the common good.” “The 
blonde” looks at Nash. His fellow students joke about 
Nash’s lack of success with women. There is a change in 
Nash’s posture and a change from upbeat jazzy music 
to softer piano music. He smiles and says “Adam Smith 
needs revision.” He explains that if they all go for “the 
blonde” they will block each other and upset the other 
women; however, if they cooperate, and none of them 
go for “the blonde,” they can all be successful. Dur- 
ing this exposition the images become surreal and blur 
slightly, as if the characters are puppets illustrating 
Nash’s conjectures. We get an aerial view where, in a 
geometrical pattern, we see all the men going for “the 
blonde,” then going for the other women. The cam- 
era pans from a close up of Nash to his mathematical 
“visions.” This happens alongside changes in Nash’s 
tone of voice, from nervous to authoritative, and the 
loss of his bodily twitches. This sequence ends with 
Nash saying, “that’s the only way we win, that’s the 
only way we all get laid,” as the music returns to jazzy. 
Hastily, Nash gathers his papers and leaves. He pauses 
by “the blonde,” says “thank you,” and rushes out. She 
looks puzzled. 

There are close links between this and the math-bits
in Numb3rs, as we leave the realistic space of the
drama and enter the figurative space of the mathemat- 
ical model. Paying attention to how women are posi- 
tioned within this narrative raises questions about who 
is represented as being able to do mathematics. “The 
blonde” acts as the silent muse for the creativity of the 
great male genius. This positioning of women as hand- 
maidens to, and inspiration for, creativity, but not as 
creative agents in their own right, is common in texts; it 
speaks to the ways in which our very notions of creative 
(mathematical) thought are gendered, a theme that is 
explored in detail in the work of Valerie Walkerdine.

There are women doing mathematics in popular
drama but they mostly exist, like Amita in Numb3rs, 
as daughters, students, and/or love interests of more 
central/established male mathematicians. Even Amita’s 
surname, Ramanujan, suggests her heritage from a 
male mathematician. I have already briefly mentioned 
Cady Heron from Mean Girls, one of an emerging generation
of screen “smart girls.” Yet she spends most
of the film pretending to be bad at mathematics in
order to attract the attention of Aaron Samuels, the 
best-looking boy in her calculus class. In the mathlete- 
sudden-death scene discussed earlier, Cady has to lit- 
erally and metaphorically see past Aaron, the object of 
her desire, in order to get her question right. He is a dis- 
traction from mathematics rather than an inspiration 
and support for it; he also gets both a name and a voice, 
unlike “the blonde” in A Beautiful Mind. Numb3rs does 
briefly introduce a senior woman, Mildred Finch, but 
she operates largely as a manager rather than a mathe- 
matician and is a love interest for the more established 
male character of Alan Eppes. After nine episodes she 
disappears from the series without explanation. 

Turning to social class, the 1997 film Good Will Hunting
contains a rare example of a mathematical genius,
Will Hunting (played by Matt Damon), from a working-class
Irish-American background. Will, entirely self-taught,
works as a janitor at the Massachusetts Institute
of Technology (MIT), and the film tells the consequences
of his “discovery” by MIT mathematics professor
and Fields medallist Gerald Lambeau (played by
Stellan Skarsgård). Will is depicted differently from the
middle-class mathematicians. As Marie-Pierre Moreau, 
Debbie Epstein, and I noted in 2009: 

There is a scene when he loses control reacting with 
severe violence when he meets a man who abused him 
as a child. The element of physical violence in the way 
Will expresses his emotions contrasts with [Numb3rs] 
Charlie’s (more middle-class) ways. This incident hap- 
pens prior to Will entering the mathematical commu- 
nity, embarking on a course of therapy, and falling in 
love with a wealthy Harvard student, thus suggesting 
that the story of Will is also one of redemption through 
incorporation of middle-class practices. 

So Good Will Hunting is very much the story of Will’s 
“middle-classification,” as, in becoming a mathemati- 
cian, he is required to embrace the values of the mid- 
dle class and to leave his (working-class) neighborhood, 
friends, and job behind. Will, like Charlie Eppes, Max
Cohen, and John Nash, reinforces the association of
mathematics with whiteness. We know that this white 
male middle-class image is easily called up when young 
people are asked to imagine a mathematician. In the 
next two sections I want to continue tracking who is 
positioned as “un/able” to do mathematics in popular 
culture by turning to the dominant emotions evoked in 
relation to mathematics in that space. I have crudely 
divided them into the positive emotions of pleasure 
and excitement and the negative emotions of boredom
and fear, although, as I hope to show, things are more 
mixed up than that taxonomy suggests. 

4 Pleasure, Excitement, and the 
“Ability” to Do Mathematics 

If you are reading this, you are probably more likely 
than most to find pleasure and excitement in math- 
ematics. However, such feelings are often laced with 
fear. In my own research into advanced mathemat- 
ics in England, I vividly remember one young woman, 
who had chosen to continue with the subject and was 
doing well at it, saying “sometimes I dread going into 
[mathematics]”; while Robert Early (1992, p. 15), in his
research into students’ feelings about mathematical 
challenge, quotes one powerful and terrifying account: 
“I felt as though I was jumping rope on a razor blade, 
and with each jump blood trickled onto the blank 
paper below me.” In some ways, the greater the per- 
sonal investment in the subject, the greater the fear, 
for there is more to be gained, or lost, by success or 
failure. I look at how these and other fears are mobi- 
lized in popular culture in the next section. Here, I 
focus on representations of mathematical pleasure and 
excitement. 

As illustrated above, while lots of people do math- 
ematics in popular culture, there is very little actual 
mathematics. Sound and vision indicate mathematics 
as beautiful and pattern oriented but avoid details; for- 
mulas abound but without any indication of how to 
read them. Indeed, I have written elsewhere about the 
problems that people have in seeing mathematics as 
enjoyable, showing how they usually construct it as the 
opposite of music and other forms of popular enter- 
tainment. This polarization is evident in how those who 
do enjoy mathematics are positioned as being a partic- 
ular type of person. As I have shown, they are usually 
geniuses, and they are also usually nerds or geeks. 

These terms are difficult to define precisely and 
to distinguish from one another; they intersect with 
genius. Essentially, they capture a figure who is male 
but often physically weak and/or overweight, hetero- 
sexual but awkward with women, white (or occasionally 
East or South Asian), and academically intelligent but 
socially incompetent. The Urban Dictionary Web site 
offers a range of definitions, the pithiest and most 
highly rated of which is: “The people you pick on in 
high school and wind up working for as an adult.” This 
captures the tension between awe and derision within 
the cultural gaze on geeks, a tension that reaches its
apotheosis in “geek chic.” Here are three definitions of 
“geek chic,” also taken from Urban Dictionary: 

An obvious oxymoron, “geek chic” emerges from oxy- 
gen deprived hallucination in which geeks evolve into 
actual existence as a sort of technocracy radiating the 
holy aura of cool. You will find no more crushing an 
argument against geek chic than Bill Gates, who despite 
being the richest man in the world sports an apparent 
$5 coiffure and birth control glasses. 

Clothing or accessories that are very geeky/nerdy and 
yet, at the same time, says [sic] “I’m cool because I’m 
proud of the fact that I'm a nerd, and am not afraid to 
dress the part.” 

Geek Chic within it’s self [sic] is an oxymoron and ironic 
twist of events by people who are generally not geeks, 
attractive men and women who want to act like geeks 
but don't actually have the passion for geeky hobbies 
or the intellect. Real geeks got picked on, and were 
proud to call themselves geeks because they could con- 
struct their own PCs or were very good at math or 
something, not because they spend massive abouts 
[sic] of times [sic] on Xbox play [sic] Call Of Duty or 
in [sic] some consumer electronic [sic] like an iPod or 
iPhone. Real geeks consider iPods and iPhones basic 
shit. 

In the first definition we see the intensity of the hatred 
that can be directed at geeks via a strong objection 
to the claim that geeks could ever be chic. Contrast- 
ing with this, the next two definitions are written from 
geek positions; they both locate value in geeks for
being proud of who they are despite others teasing 
and bullying. The first and third definitions assert that 
“geek chic” is oxymoronic but for different reasons: one 
rejects geek chic, asserting geeks’ inbuilt inferiority; 
the other constructs an authentic geekness, asserting 
geeks’ superiority over those who have the desire to 
be “real geeks” but lack the passion and intelligence. 
All three definitions support the construction of mathematics
and those who do it as “special.” They differ
on whether it is good to be a geek or not, but they agree 
that you either are one or you are not. There is no sense 
of engaging with and enjoying mathematics or technol- 
ogy as one might other school subjects, like history or 
Spanish, as an interested, informed amateur. 

Figure 2 Howard, Sheldon, Leonard, Penny, and Raj (left to
right) pictured in the apartment that Sheldon and Leonard
share.

The most successful series to capitalize on the rise
of geek chic is the sitcom The Big Bang Theory (2007-).
The program centers on four male geeks and their
unlikely friendship with the blonde, stylish, sociable,
would-be actress Penny. Although they are physicists
and engineers, they are often shown doing mathematics.
Three are white (one of whom is Jewish) and one
is Indian. Leonard, the most conventional and sexually
successful of the four, is the most ashamed of his geek- 
iness. For example, he conceals the fact that he plays 
word games in the science-fiction language Klingon. He 
suffers from a range of health problems and nervous 
disorders including lactose intolerance, sleep apnoea, 
migraines, carsickness, nose bleeds, and asthma. He 
tilts his head when he talks and has a tendency to 
whine; he regularly applies excessive hair gel and wears 
mismatched clothing. At the other extreme is Sheldon, 
who is unashamedly geeky. He interprets everything lit- 
erally (needing to be taught to recognize irony), feels 
compelled to order his surroundings (even when this 
involves breaking into his new neighbor’s flat), and wor- 
ries that massaging his own shoulders involves exces- 
sive physical contact. In between these two on the geek- 
iness scale are Howard, who lives with his overbearing 
mother, and Raj, who cannot talk to women while sober. 
Women have a strange position in the show, as Moreau 
and I noted in Debates in Mathematics Education: 

Most of the female characters, like Penny, exist to con- 
trast with the male geeks, and implicitly emphasise 
their heterosexuality. The two female geeks are periph- 
eral characters introduced as potential love interests 
for the men. They stand out like the one female geek 
and one gay geek included in a total of 42 geeks featur- 
ing across five seasons of US Reality TV show Beauty 
and the Geek. 

While mathematics may be depicted as neutral knowledge,
geek stories, like those of geniuses, reinforce the
idea that the “ability” to do mathematics belongs in particular
bodies. In these and other shows, popular culture
makes jokes with and about geeks but also values
them for being intelligent and unafraid of being dif- 
ferent. In The Big Bang Theory, Penny clearly enjoys 
hanging out with the geek gang and she dates Leonard 
and has a one-night stand with Raj. Their ability to find 
fun in mathematics, science, and technology seems to 
extend into other areas. 

From Howard’s dependence on his mother to Leon- 
ard’s preponderance of childhood illnesses, there is a 
boyishness to the way geeks are depicted. All of The Big 
Bang Theory’s characters own a large number of comic, 
fantasy, and science-fiction themed toys and memo- 
rabilia and are regularly shown playing games. How- 
ever, their skills are also depicted as being immensely 
valuable, securing significant government funding; this 
is exemplified when Howard’s technical expertise wins 
him a place on a space mission despite his physi- 
cal inadequacies when compared with the other astro- 
nauts. Within popular culture, this power of mathemat- 
ical knowledge is inseparable from excitement. 

I end this section with one last example of power and 
excitement: Jurassic Park, which also contains perhaps 
the only screen example of a “cool” mathematician. In 
both the book and film versions, genetic engineering
has been used to manufacture live dinosaurs in order 
to create a prehistoric amusement park. The narrative 
takes place on a weekend inspection of this project 
prior to opening by two paleontologists, a lawyer, and 
a mathematician, Ian Malcolm, accompanied by the 
park’s driving force, John Hammond, his two grand- 
children, and other park employees. Malcolm, played 
by Jeff Goldblum in the film, wears black throughout
(including shades and a leather jacket), flirts outrageously,
and “suffers from a deplorable excess of personality,
especially for a mathematician.” He is clearly
able to use chaos theory to predict the ultimate failure 
of Jurassic Park and its end in disaster and death. Even 
when he is told that all the dinosaurs are female, he correctly
predicts that they will sexually reproduce: “life
will find a way.” This predictive power is most apparent
in the book, where each section is introduced by a
quotation from Malcolm and by a growing image of a 
fractal: 

At the earliest drawings of the fractal curve, few clues 
to the underlying mathematical structure will be seen. 

With subsequent drawings of the fractal curve, sudden 
changes may appear. 

Details emerge more clearly as the fractal curve is 
redrawn. 
Inevitably, underlying instabilities begin to appear.

Flaws in the system will now become severe. 

System recovery may prove impossible. 

Increasingly, the mathematics will demand the courage 
to face its implications. 

In the book, and to a lesser extent in the film, Malcolm 
articulates a critique of scientific progress. He com- 
pares the actions of Hammond at Jurassic Park to those 
of a child with his father’s gun: since he and his sci- 
entists did not work for the knowledge they are using, 
they exercise no responsibility in relation to it. He ques- 
tions the value of discovery, calling it a violent and 
penetrative act and comparing it to rape. But even as 
“progress” is radically challenged, this is undermined 
by the role that mathematics plays in understanding 
and ultimately controlling events. The unknowing audi- 
ence look on in wonder and relief at both the dinosaurs 
and at the power of mathematics and of the “mathema- 
gician” who wields it. In the final section of this article 
I consider the way in which popular culture portrays 
those of us in the audience looking on. 

5 Boredom, Fear, and the 
Responsibility to Do Mathematics 

In the “Brotherhood” episode of television drama series 
Six Feet Under, teenager Claire Fisher is shown in a 
high school algebra class, unable to explain the for- 
mula on the board. Her teacher reprimands her, “well 
maybe if you paid attention in class instead of reading.” 
Claire responds, “well maybe if you talked about some- 
thing that was actually gonna be useful to me I would,” 
and returns to reading her book. Her teacher persists: 
“Oh algebra is useless. Hmm, know a lot of physicists 
who’d beg to differ.” Claire, sullenly looking up, replies, 
“well I don't want to be a physicist.” Her teacher, arms 
unfolded, speaks lyrically: “algebra forces your mind to 
solve problems logically, it’s one of the only perfect sci- 
ences.” Claire interrupts: “Do you think the world runs 
on logic? Open your eyes.” Her teacher’s final sally is 
“ok, I’ll see you after class Miss Fisher.” At this point 
Claire stares very intensely at her teacher from whose 
head steam is beginning to rise; her head then explodes, 
and Claire laughs. 

Here we see a familiar trope in popular culture: 
teenage alienation. Mathematics stands for all that is 
most boring and pointless about education. The scene 
offers a vivid dramatization of the failure of mathematics
to resolve the paradox of relevance: the juxtaposition
of a personal feeling of the futility of mathematics
with a generalized belief in its utility. Here and else- 
where mathematics figures as a counterpoint to imagi- 
nation and creativity. When asked to draw a teacher, the 
most common response of primary-school children is 
to draw someone smartly dressed, standing behind a 
desk and/or in front of a blackboard, and teaching for- 
mal mathematics. In another example taken from the 
“Bart the Genius” episode of The Simpsons, ten-year-old 
Bart, faced with a mathematics test, spends the whole 
of the allocated time engaged in a daydream provoked 
by the first test question: a word problem about two 
trains headed in opposite directions. Both these exam- 
ples suggest that the boredom response is not simply 
about the dullness of mathematics but also indicates 
a fear of engaging with the subject. Perhaps boredom 
partly serves as a defense against failure in mathemat- 
ics. To understand why such a defense is so impor- 
tant to so many people, we need to look at the status 
conferred upon mathematics. 

In the example from Six Feet Under, we can also see 
the obligation placed upon Claire, by her teacher and 
by society more widely, to learn mathematics. Policy 
and public discourses commonly present mathematics 
as a key to both national progress (economic and scien- 
tific/technological) and individual progress (empower- 
ment, employment, and success). For example, in 2004 
Adrian Smith was commissioned by the U.K. govern- 
ment to undertake a major enquiry into mathematics 
education after the age of fourteen. Near the start of his 
report (published under the title “Making mathematics 
count”) he states: 

It has been widely recognised that mathematics occu- 
pies a rather special position. It is a major intellec- 
tual discipline in its own right, as well as providing the 
underpinning language for the rest of science and engi- 
neering and, increasingly, for other disciplines in the 
social and medical sciences. It underpins major sec- 
tors of modern business and industry, in particular, 
financial services and ICT. It also provides the individ- 
ual citizen with empowering skills for the conduct of 
private and social life and with key skills required at 
virtually all levels of employment. 

Here we can see how individual fears associated with 
mathematics (of individual failure, social exclusion, 
being judged) are brought together with national fears 
(of national failure, economic exclusion, being uncom- 
petitive). Mathematics can bring together these fears 
because it serves as a signifier of intelligence and tech- 
nological progress. The books Do You Panic about 
Mathematics? (1981) and Overcoming Math Anxiety 
(1978) by Laurie Buxton and Sheila Tobias, respectively,
tracked how mathematics anxiety and panic derive 
from people's feelings of being judged via mathemat- 
ics and found wanting. Research has shown that the 
introduction of national testing into primary schools 
in England has meant that ever-younger children are 
exposed to high-stakes assessment and to grouping by 
“ability” in mathematics, thereby increasing the anxiety 
and fear attached to the subject. 

Government advertising to promote take-up of adult 
mathematics provision has played on these fears. In 
these advertising campaigns, we find people who are 
unable to do mathematics and who are urged to take up 
the opportunity to “get rid of your mathematics grem- 
lin” and “become mathematics confident.” Thus I end 
this overview of popular culture with a discussion of 
advertisements from the two most prominent recent 
U.K. adult numeracy campaigns. The first is called “Bad 
Dad” (it can be viewed online at
www.youtube.com/watch?v=CzQvD9oBx-0) and comes from a series of ads
that use gremlins as a metaphor for people’s gaps in 
mathematical and English skills. It is filmed with very 
little color, perhaps suggesting the darkness of igno- 
rance, in which its central character resides, the epony- 
mous “bad dad.” It opens with the camera on him: 
white, aged 35-40, overweight, casually dressed, sit- 
ting on a drab sofa, watching television. This figure 
draws on cliches of the “feckless working class,” whose 
unhealthy bodies are imagined to be permanently stuck 
in front of the television. 

The camera turns to his daughter, about ten years 
old, with neatly combed hair, wearing a school uni- 
form, who calls from the next room: “Can you help me 
with my maths?” At this her dad’s face shows panic 
and we see his ugly grey gremlin lounging on the sofa 
with pointy ears and nose. The gremlin speaks with 
obvious disdain: “You. Maths. That’s a good one.” The
camera focuses on dad’s worried face as he suggests: 
“Ask your mother.” When his daughter tells him “She’s 
out,” the gremlin taunts, “Oooooh! She’s gone out. 
What are you going to do now?” Dad, becoming more
panicked and seeming out of breath (again suggesting his
unhealthy body), asks: “Why don’t you use your calculator?”
Daughter: “That’s cheating!” The gremlin scolds:
“Bad dad. Very bad dad!” The daughter enters the room 
and looks at the television screen with confusion: “Dad.
I need to do it now!” Her dad looks ashamed.

The contrast between the hard-working, neat girl and 
her lazy, untidy dad is striking. It directly invokes fears 
of being a bad parent, who cannot support their child’s
education and fails to inspire them to “high” aspirations.
The solution offered is to take advantage of
the available adult education classes and so transform
oneself.

Figure 3 Beryl as a school girl “too scared” to put
up her hand and ask about mathematics.

The second ad centers on a woman called Beryl, a 
name associated in England with elderly working-class 
women (this can be viewed online at
www.youtube.com/watch?v=-xrQoU6qZLU). Although her name is
not revealed in the ad itself, her age and social class 
are indicated visually via her hair curlers, cup of tea, 
and other signs of “ordinariness.” All the characters 
are created using hand puppetry and a small number 
of props. 

Beryl describes her difficult and troubling experiences 
with mathematics when she was at school: “Well you 
know looking back I know the exact time and place 
where I lost my confidence with mathematics. It was 
back in class 4C and I lost my ways with times tables.” 

A male teacher, carrying glasses and wearing a mor- 
tarboard and bow tie (both symbols of middle-class 
teacherly authority), appears in order to chant times 
tables: “Seven eights are fifty six, eight eights are sixty
four, . . .” Beryl goes on to explain how taking the adult
mathematics course has helped to improve her confi- 
dence: “I was too scared to put my hand up and say 
it’s all gobbledygook to me. But I've just taken this 
free adult mathematics course at my local college. They 
start you at the point where you got lost and now my 
fear of mathematics has just gone away.” 

The contrasts between the teacher and Beryl emphasize 
her social class and gender. 

In both these ads an obligation is placed on the 
individual to repair the damage done by school or by 
their own earlier irresponsibility, otherwise they are 
failing: failing their child, failing themselves, failing 
their nation. Wider fears of mathematics are deliber- 
ately invoked to motivate action. The gender and class 
positions of the characters in these ads are almost 
opposites of those of the geniuses and geeks we met
earlier. In this way some people are presented as always 
and already failures in relation to mathematics, in need 
of remediation, but also as having a personal respon- 
sibility to take up the offered support. Others are pre- 
sented as self-taught, ahead of their age: our hope for 
the future. 

6 Conclusion 

In this article I have surveyed some of the many ways 
in which mathematics and people doing mathemat- 
ics are represented in contemporary popular culture. 
I have tried to draw out the contradictions, notably 
those between, on the one hand, mathematics as acces- 
sible, open, useful, and exciting and, on the other hand, 
mathematics as hard, closed, useless, and boring. In 
2004 Sarah Greenwald and Andrew Nestler offered 
some helpful examples of how to use the popular 
when teaching mathematics within universities, show- 
ing how this can provide more points of engagement 
and identification for students as they interact math- 
ematically with characters and storylines. In the inter- 
vening decade there has been an increase in the possi- 
bilities for such an approach, with the proliferation of 
mathematics within the popular, including much mate- 
rial on YouTube (from Web series “Maths Warriors,” 
through quirky educational channel “Numberphile,” to 
another channel (from missionastar) that attempts to 
teach quadratics via a Bollywood spoof). However, pop- 
ular culture can include some and exclude others. For, 
as I have shown, while society confers on all a respon- 
sibility to become mathematically literate, it suggests 
that only a special few possess mathematical “ability.” 
It overwhelmingly depicts this ability as belonging in 
white, male, middle-class, heterosexual bodies. Thus, 
its use requires care and thought. I return to Numb3rs 
briefly to suggest what such care and thought entail. 

Texas Instruments have produced materials for high 
school teachers to use to explore the mathematics in 
Numb3rs, and the Wolfram Web site has produced 
something similar for teachers of undergraduate math- 
ematics. But ethically, can we talk about the mathe- 
matics and not about the politics of how mathemati- 
cians are represented in the series? Can we use these 
materials without endorsing the excessive and glamor- 
ized violence on the program? Or without taking into 
account the difficult topics it tackles (sexual violence, 
war, terrorism, stalking), which students or members of 
their families may have experienced? In one episode,
Charlie persuades the FBI to buy up all supplies of a
new drug and so manipulate the market in such a way 
as to reduce demand to zero. Charlie intends this as 
a way to cut off crime at its source rather than sim- 
ply dealing with the symptoms of crime. At the end of 
the episode he briefly reflects that even this success 
is meaningless since another drug will always be wait- 
ing in the wings to fill the gap. However, this discus- 
sion quickly leads to a play fight between Charlie and 
Don, and there’s no space to think more broadly about 
the underlying causes of drug crime (racism, poverty, 
inequality) or to discuss alternative approaches to tack- 
ling drug addiction, such as decriminalization. What 
is lost, and gained, by extracting and abstracting the 
mathematics used in the show without addressing the 
wider questions raised by its use? Whether and in what 
ways these questions are part of mathematics goes to 
the heart of why we teach mathematics and what we 
want people to learn as a result. 

VIII.9 Mathematics and Policy

The continued success of mathematics as a subject of 
study and research depends on its importance being 
appreciated by those who control the funding that sup- 
ports it, as well as those who are in a position to make 
use of mathematics-based advice in formulating policy. 
Mathematicians arguably find it more difficult to make 
the case for their subject than colleagues in other “more 
practical” disciplines, such as biology, chemistry, and 
engineering, yet the case for mathematics is continually 
being made, and with much success. The four contribu- 
tors to this group of articles, from four different coun- 
tries, were asked to give their perspective on how to 
influence government as a mathematician. Their anec- 
dotes and advice should be of interest to all, and in par- 
ticular to those who wish to be involved in promoting 
mathematics to politicians. 

I. Ya-xiang Yuan: How Chinese 
Mathematicians Influence Government 

The longest-running television program in the United 
Kingdom, The Sky at Night (first broadcast in 1957), 
is on astronomy. However, the public’s appetite for 
astronomy was not so great prior to the second half 
of the twentieth century. In the preface to the third vol-
ume of his classic Science and Civilisation in China pub-
lished in 1959, Joseph Needham (1900-95) included the
following quote from Franz Kühnert (Vienna, 1888):

Probably another reason why many Europeans con- 
sider the Chinese such barbarians is on account of 
the support they give to their Astronomers— people 
regarded by our cultivated Western mortals as com- 
pletely useless. Yet there they rank with Heads of 
Departments and Secretaries of State. What frightful 
barbarism! 

In ancient times almost all astronomers were also 
mathematicians. China therefore has a long tradition 
of mathematicians being well respected, and famous 
mathematicians are often put into high-ranking posi- 
tions in government. For example, Hua Loo Geng (1910- 
85) was a vice chairman of the Chinese People’s Political 
Consultative Conference (China’s top advisory body), 
and Ding Shi Sun (1927-), former president of Peking 
University, was a vice president of the National Peo- 
ple’s Congress of China (the Chinese parliament). An 
interesting phenomenon is that many university presi-
dents in China are mathematicians, but this is unusual
in the West. Another example of the popularity of math-
ematicians in China is that two of their number, Hua
Loo Geng and Chen Jing Run (1933-96), were chosen 
by China Central Television (the official Chinese gov- 
ernment television station) when it ran a competition to 
select the 100 most deserving winners of the “Touching 
China” award between 1949 and 2009. 

During the great Cultural Revolution in the 1960s, 
Hua Loo Geng was able to convince Chairman Mao 
Zedong (1893-1976) that mathematics could help the 
country to modernize. Hua traveled all over China 
teaching the golden section search method— a method 
for maximizing a unimodal function in an interval by 
successively reducing the length of the interval by 
the golden ratio ($\frac{1}{2}(\sqrt{5} - 1)$)— and other operational
research techniques in factories, coal mines, and oil 
fields. The Chinese Academy of Sciences sent many 
research teams to the countryside to solve practical 
problems in the areas of transportation and produc- 
tion planning. A few years earlier, during the Great Leap 
Forward (1958-60), Chinese scientists were encouraged 
to solve real-world problems to help Chairman Mao’s 
ambitious campaign to rapidly transform the country 
from an agrarian economy into a modern communist 
society. In 1960, twenty-six-year-old Mei-Ko Kwan, a 
young lecturer at Shandong Normal University, pub-
lished the famous “Chinese postman problem” in graph
theory. 
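
The golden section search mentioned above can be sketched in a few
lines of code. The Python below is an illustrative sketch only and is
not taken from Hua's own teaching materials: the function name
golden_section_max, the tolerance parameter tol, and the example
function are all assumptions introduced here. At each step the
interval is shortened by the factor $\frac{1}{2}(\sqrt{5} - 1)$, and
one of the two interior function values is reused, so only one new
evaluation is needed per step.

```python
import math

# Inverse golden ratio, (sqrt(5) - 1) / 2, roughly 0.618.
INV_PHI = (math.sqrt(5) - 1) / 2


def golden_section_max(f, a, b, tol=1e-8):
    """Approximate the maximizer of a unimodal function f on [a, b].

    Hypothetical helper for illustration; assumes f has a single
    interior peak on the interval.
    """
    # Two interior points placed symmetrically using the golden ratio.
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:
            # The maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


# Example: f(x) = -(x - 2)^2 is unimodal on [0, 5] with its peak at x = 2.
x_star = golden_section_max(lambda x: -(x - 2) ** 2, 0.0, 5.0)
print(f"maximizer is approximately {x_star:.4f}")
```

Because each step needs only one comparison and one new measurement,
the procedure can in principle be carried out by hand, with f standing
for any quantity measured by experiment (for example, a yield as a
function of a single process setting).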

In 2002 the Chinese Mathematical Society hosted the 
International Congress of Mathematicians (ICM) in Bei- 
jing. As the general secretary of that congress, along 
with other mathematicians in China I had the chance 
to lobby government officials to support the meet- 
ing. We emphasized that sciences are vitally impor- 
tant for China's economic development and that math- 
ematics is the foundation of all other sciences. The 
president of the People’s Republic of China at that 
time, Jiang Zemin, attended the opening ceremony and 
handed the medals to the Fields Medal winners Lau- 
rent Lafforgue and Vladimir Voevodsky. The success of 
ICM 2002 promoted mathematics research, increased 
public awareness of mathematics, and attracted more 
young talented people to study mathematics in China. 

Chinese mathematicians are very influential in sci- 
ence and technology policy making in the country. 
For example, in the 1980s the late Shiing S. Chern 
(1911-2004) was able to persuade the Chinese gov- 
ernment to increase its support of mathematics, and 
this resulted in the establishment of the Tian Yuan 
Foundation within the National Natural Science Foun-
dation of China (NSFC). Following recommendations
from many famous Chinese mathematicians, such as 
Shiing S. Chern, Shing Tung Yau, Lo Yang, and Gang
Tian, the Chinese government has spent an enormous 
amount of money in the past twenty years setting up 
a number of mathematical centers in China. These 
include the Shiing S. Chern Institute of Mathematics 
at Nankai University, the Morningside Center of Math- 
ematics at the Chinese Academy of Sciences, the Bei- 
jing International Center for Mathematical Research 
at Peking University, and the Shanghai Mathematics 
Center at Fudan University. 

The main source of funding for mathematicians in 
China is the NSFC, which is very similar to the National 
Science Foundation in the United States. The NSFC 
splits its programs into three categories: research pro- 
motion, talent fostering, and infrastructure construc- 
tion for basic research. The category of research pro- 
motion is further subdivided into programs called the 
General Program, the Key Program, the Major Program, 
the Major Research Plan, and the Joint Funds to Interna- 
tional Joint Research Program. General Program grants 
are open to any application, while Key Program, Major 
Program, and Major Research Plan grants are for appli- 
cations relating to specific topics. Mathematicians have 
to lobby the NSFC if they want a particular topic to be 
listed as part of the Key Program or the Major Program 
or if they want to set up a Major Research Plan. 

Since my return to China from the University of Cam- 
bridge in 1988 I have been involved in events that have 
given me the honor of witnessing firsthand the ways in 
which mathematicians are able to influence the Chinese 
government. In the late 1980s and the early 1990s, the 
late Feng Kang (1920-93) was able to convince the gov- 
ernment to develop scientific computing. Due to his ini- 
tiative and the proposals he put forward, the Ministry of 
Science and Technology of China founded the State Key 
Laboratory of Scientific and Engineering Computing in 
1991 and, in the same year, launched the National Key 
Research Project “Large Scale Scientific and Engineering 
Computing.” The latter project eventually developed 
into the huge National Basic Research Project of China 
(also known as the 973 Program) and was approved by 
Chinese leader Deng Xiaoping in March 1997. 

Mathematicians also played an important role in 
stimulating China’s success in developing high-perfor- 
mance computing (HPC) hardware. In December 2002 
China’s supercomputer DeepComp 1800 ranked forty-
third in the worldwide TOP500 list (www.top500.org).
This breakthrough marked the beginning of China’s
ascent in the field of HPC hardware development, and 
in 2010 China built what was for a time the world’s
most powerful supercomputer: Tianhe-1A. The rapid
development of HPC hardware in China in turn stim- 
ulated HPC research in the country, and in 2012 the 
NSFC launched a four-year Major Research Plan called 
“Algorithms for high performance scientific computing 
and computable modeling” with a budget of 180 million
Chinese yuan (about US$30 million).

In recent years the economic systems of developing 
countries such as China have been rapidly and dras- 
tically reformed in response to economic globaliza- 
tion, and new and challenging problems have there- 
fore been raised in energy, transportation, telecommu- 
nication, financial engineering, urban planning, health 
care, environmental pollution, natural resource con-
sumption, and transnational logistics. I am sure that
mathematics will play an increasingly important role 
in modeling and solving these practical problems. 

II. Maria Esteban: A Personal Experience 
in France and Europe of How to 
Influence Government as a 
Mathematician 

I am a Basque-French mathematician currently working 
in France, the president-elect of the International Coun-
cil for Industrial and Applied Mathematics, past pres-
ident of the Société de Mathématiques Appliquées et
Industrielles, and past chair of the Applied Mathemat- 
ics Committee of the European Mathematical Society. In 
recent years I have been involved with a European Sci- 
ence Foundation Forward Look project on “Mathemat- 
ics and Industry.” These three activities have helped 
to shape my views about how mathematicians can (or 
cannot) influence government science policy. 

In this article I will not only relate my personal expe- 
rience but will also describe how European colleagues 
have fared in their interactions with officials from their 
own governments and from European institutions. To 
avoid overgeneralization, it is important to bear in 
mind that the opportunities that a scientist has to influ- 
ence government policies and initiatives depend very 
much on which country they operate in: its size; its his- 
tory; and how knowledgeable, on average, its politicians 
are about science. 

In France, scientists have a long history of close rela- 
tionships with politicians. It is usually straightforward 
for us to gain access to members of the government or
to members of parliament and regional politicians. This 
is not the case for every mathematician, of course, but it 
is reasonably easy for those in leadership positions who 
invest the necessary time and energy in building con- 
tacts. Prominent scientists (Fields Medalists and recip- 
ients of Nobel Prizes, for example) may be contacted 
directly by politicians. Ministers of research or higher 
education and other top government officials may ask 
these scientists for meetings to exchange ideas or to 
receive advice on important decisions concerning sci- 
ence. Additionally, there are mathematicians who have 
been appointed to important positions: as managers of 
institutes or big projects, and as chairs or members of 
committees created to advise the government on par- 
ticular issues of scientific or technological importance, 
such as high-performance computing, the treatment of 
nuclear waste, food safety, and so on. Mathematicians 
who have been invited to participate in the crafting of 
advisory documents have influence on the government 
through these documents but also through personal 
meetings and discussions with government officials. 

The presidents of the mathematical societies, to- 
gether with mathematicians who are involved with the 
management of institutions relevant to the mathemat- 
ics community at a high or national level, can gain 
access to policy makers if they need to pass on infor- 
mation or intervene in an important issue. Contact does 
not usually start at the highest level; one often starts 
by talking to ministerial advisors. On the other hand, if 
the matter is important, it is relatively easy to arrange 
a meeting with someone at a higher level. It needs to be 
said, though, that talking to politicians does not nec- 
essarily translate into success in convincing them of 
what (in our opinions) is good for mathematics or for 
science; but it is a start. 

All this may make it sound as if France is a par- 
adise for mathematicians, especially when it comes to 
their relationships with politicians and decision mak- 
ers. While it is true that the situation in France is much 
better than in many other European countries, it is not 
always easy to convey our achievements or what our 
community needs or is looking for. By talking to pol- 
icy makers, however, we build relationships that make 
further contact and exchanges of views easier. 

A recent example of how mathematicians in France 
have managed to change something of importance 
to them is their contribution to the call, initiated 
three years ago, for the creation of centers of excel-
lence. When the project was about to be launched, a
senior official in the Ministry of Higher Education and
Research learned about two projects dear to math- 
ematicians that would not be covered by the origi- 
nal wording of the initiative. One of the projects per- 
tained to the French Institutes of Mathematics: the 
Institut Henri Poincaré, which hosts thematic three-
month programs; the Institut des Hautes Études Sci-
entifiques, a high-level research institution south of
Paris; the Centre International de Rencontres Mathé-
matiques, an international center near Marseille that
hosts mathematics conferences; and the Centre Inter-
national de Mathématiques Pures et Appliquées, which
promotes mathematical research in developing coun- 
tries. The other project was concerned with the cre- 
ation of an agency designed to facilitate relationships 
between mathematicians and industry. When the offi- 
cial learned about these two projects he thought them 
interesting enough to ask for a slight change in the 
wording of the policy document. The result of this was 
that these two projects were covered in the initial delib- 
erations and were indeed both selected and funded. 
This would not have happened without the interven- 
tion of the mathematicians. Of course, for this to have 
happened, information had to be available and (some) 
mathematicians needed to keep up to date with news 
about official projects. This is a big task and requires a 
lot of time and effort. Fortunately, in France most math- 
ematicians in positions of power are willing to play 
the collaborative game, sharing information and jointly 
developing strategies. Collaboration is very important 
for achieving mathematicians’ desired outcomes. 

I do not have firsthand knowledge about other Euro- 
pean countries, only impressions gleaned from con- 
versations with colleagues and observations made dur- 
ing joint projects when colleagues from various coun- 
tries tried to reach their politicians for input or advice. 
The situation that mathematicians face clearly varies by 
country; as with many other issues in Europe, there is 
unfortunately no unified approach to scientists’ inter- 
actions with their governments. There are countries 
like France where mathematicians have managed to 
influence high-level politicians and decision makers, 
and where scientists are often consulted when new 
decisions concerning science and scientific programs 
need to be made. On the other hand, there are other 
countries where mathematicians do not seem to have 
any access to politicians and decision makers, at least 
at an institutional level. 

Whether mathematicians can influence the European 
decision makers is a very different story. I have had 
several negative experiences in this regard, and other
mathematicians I know have echoed this sentiment. 
Mathematics is almost entirely absent from European 
scientific policy decision making. In the last European 
framework program, mathematics was invisible. When 
asked about this, European officials would say, “But 
what are you saying? Mathematics is everywhere; math- 
ematics is linked to so many scientific fields that are 
explicitly part of the European programs!” But because 
mathematics is “everywhere,” when it comes to fund- 
ing it is nowhere. Mathematics does not seem to be a 
priority for the European Union, which chooses not to 
fund general scientific fields, funding only applied, con- 
crete fields, like life sciences or nanotechnology. The 
only exception to this is the European Research Coun- 
cil, which funds excellence through various programs 
such as “Starting Grants” and “Advanced Grants.” Math- 
ematics is clearly present there, as are other scientific 
fields. 

The European Mathematical Society has tried to open 
a dialogue with Brussels for many years, and it has 
tried more intensely still in the last six years or so. 
Several mathematicians have had meetings with high- 
level officials in Brussels and in the European Parlia- 
ment and have tried to explain our community’s sit- 
uation and its needs. Sometimes it has seemed like 
the officials were really listening to them, but the final 
outcome has never been positive. For instance, it is 
clear that in order to build important infrastructure 
at the European level (concerning digital mathematics, 
publishing, mathematics and industry, etc.), European 
funding is very important, but up to this point very lit- 
tle (almost nothing) has been forthcoming, despite the 
best efforts of the mathematicians and many promises 
from Brussels. 

Up to now I have discussed only direct interactions 
between mathematicians and politicians, but there 
could be, indeed must be, other ways for mathemati- 
cians to influence government. For instance, we should 
not forget the important role that the media can play 
in educating the public about the contributions of 
mathematics to science and to society, about impor- 
tant mathematical results and their implications and 
possible applications. A case in point is the broad 
attention French media have devoted in recent years 
to French mathematicians when they receive impor-
tant international prizes. Politicians pay close atten-
tion to the media and the ideas it promotes. The pres-
ence of mathematics in the media can therefore help
mathematicians reach beyond their own community to
the spheres of power. 

Mathematicians may also influence public opinion 
and policy makers by taking action when news cov- 
erage misrepresents mathematics and its applications. 
For example, statistics is often used to justify decisions 
concerning health and drug design, but the statistical 
methods that are used are often incorrect and unrea- 
sonable. I know of several recent cases where statisti- 
cians have strongly attacked such misuse of statistics. 
This is an excellent way of defending the quality label
that mathematics can provide and of refusing to let the
subject be misused. In drug and food safety, the inter-
vention of statisticians has been widely acknowledged,
and statisticians have consequently been appointed to 
supervisory committees that give advice to decision 
makers. 

Of course, you need to be good at what you do to have 
influence. Only a well-organized mathematics commu- 
nity with high scientific standards will have a societal 
impact. The commitment of this community to scien- 
tific excellence also gives mathematicians the author- 
ity to exert their influence at the educational level. The 
mathematics community has to participate in educa- 
tion programs, and it needs to shape the training of 
mathematics teachers as well as the way mathemat- 
ics is being taught in schools. Education and research 
are the pillars on which the mathematics community is 
built, and mathematicians should therefore be involved 
with all committees and at all levels where decisions on 
these subjects are made. 

In closing, let me stress that all these activities can 
be time-consuming for the individuals involved. Build- 
ing networks, being well informed, and keeping infor- 
mation channels open take time and energy. But if we 
mathematicians want to be able to influence govern- 
ments in terms of both the projects they promote that 
could affect the organization and funding of our com- 
munity and their perception of the ways in which math- 
ematics can advance society, there is no way around the 
fact that some of us will have to be ready to spend the 
necessary time trying to promote our goals. 

III. James M. Crowley: SIAM and Science 
Policy in the United States 

The Society for Industrial and Applied Mathematics 
(SIAM), founded in 1952, is an international organi- 
zation based in Philadelphia, Pennsylvania. Since the 
mid-1990s, SIAM has actively engaged in science pol-
icy and advocacy on behalf of the mathematical sci-
ences and computing from the perspective of applied
mathematics and computational science. 

Science Policy and Funding in the United States 

Perhaps more so than in Europe and other parts of the 
world, in the United States science policy and funding 
is distributed across many agencies and involves many 
players. This leads to a system that is robust but that 
is sometimes difficult to understand in all its facets. 

Many different agencies affect science policy and 
funding in the United States in the areas of the math- 
ematical and computational sciences. For traditional 
areas of mathematics, the National Science Founda- 
tion (NSF) certainly plays the lead role, but in applied 
and computational mathematics, the Department of 
Energy’s Office of Science, the various Department of 
Defense funding agencies (the Air Force Office of Scien- 
tific Research, the Office of Naval Research, the Army 
Research Office, and the Defense Advanced Research 
Projects Agency), and parts of the National Institutes 
of Health are also important players. More recently, pri- 
vately funded foundations, like the Simons Foundation, 
have also played an increasing role. 

“Mission agencies” like the Department of Energy and 
the Department of Defense generally place a greater 
emphasis on research that contributes to solving spe- 
cific application problems of interest to that agency; 
therefore, they make grants to the areas within the 
mathematical sciences that they deem to be most rel- 
evant. NSF grants, on the other hand, tend to cover 
the whole spectrum of the mathematical sciences, from 
research deep within a core area of the discipline 
to multidisciplinary research taking in other areas of 
science or engineering. However, in both cases there 
is a spectrum of grants in each agency’s portfolio: 
from those that include very basic research within a 
given field to those that are research motivated and 
driven by application goals. In the case of the NSF, for 
example, multidisciplinary research can team applied 
mathematicians with scientists in other disciplines to 
simultaneously advance mathematical/computational 
methods as well as applications. 

The funding agencies are staffed by program man- 
agers: a mix of permanent employees and “rotators”— 
people from the research community who come to the 
agency for a two- or three-year period. Normally, both 
groups (permanent staff and rotators) are experts in the
fields they manage (typically they have a related doctor-
ate). The permanent staff provide stability and mem-
ory; rotators provide new ideas and immediate contact
with the research community from which they came. In 
mathematical sciences at the NSF, for example, the mix 
of permanent staff to rotators is roughly 50:50. 

Program managers in mission agencies may have 
more discretion over funding decisions because they 
are able to factor in the relevance of the proposal to 
the agency’s mission and also to consider the likelihood 
of application. The NSF relies most heavily on external 
peer review from the scientific community, but most 
funding agencies rely on peer advice to some degree. 

To get a sense of the scale of funding by the various 
agencies, consider the fiscal year 2011 (the fiscal year 
for the federal government in the United States runs 
from October 1 to September 30). The budget for the 
NSF Division of Mathematical Sciences that year was 
$240 million, of which approximately a third supports 
applied and computational mathematics. The Depart- 
ment of Energy contributed another $99 million to the 
mathematical sciences, much of this in applied and 
computational mathematics. The various agencies of 
the Department of Defense also supported $118 mil- 
lion in basic research in the mathematical sciences. Esti- 
mates were not available for the National Institutes of 
Health. 

What distinguishes funding in the United States from 
that in Canada and in many parts of Europe is the 
importance of individual grants to single researchers or 
to small groups, as opposed to block grants to depart- 
ments or universities. Such grants can provide funding 
for a researcher’s summer salary (typically one month, 
but possibly two) and/or for graduate student support. 
Such grants also provide funds for travel. Research 
grants to individuals and small groups account for 
the majority of funding in the mathematical sciences. 
Individual/small-group grants are also a major part 
of the applied mathematics program at the Depart- 
ment of Energy’s Office of Science, although recent pol- 
icy seems to indicate a shift toward more large-scale 
grants. Block grants to academic departments or to 
institutes account for only a small portion of the NSF 
Division of Mathematical Sciences portfolio, and they 
play little or no role in the portfolios of other agencies. 

Within the mathematics portfolio at the NSF, insti- 
tutes play a significant role. While accounting for only 
10% or so of the mathematical sciences budget at the 
NSF, the eight NSF Mathematical Sciences Institutes 
involve a large number of postdocs and visitors from 
within the community in their programs each year
(6000 in fiscal year 2011). 

The Budget Process 

In the United States, Congress has the power to appro- 
priate funds while the administration implements the 
spending of the funds through individual agencies. 
The federal budget is determined on an annual cycle 
through presidential and congressional action. 

Each year, the president issues a budget proposal to 
fund all the agencies of the federal government for the 
following year. The administration arrives at this bud- 
get request after a lengthy process during which each 
agency is charged with developing its budget with guid- 
ance from the Office of Management and Budget, which 
reviews each agency’s request and assembles all of the 
agency plans into the president’s budget request. 

Congress then reviews the president's budget (or the 
“request”) and develops legislation to fund the agen- 
cies called appropriations bills. These bills sometimes 
provide money at the agency or subagency level and at 
other times determine the specific amount available to 
individual programs or spell out other directives about 
the budget. Once appropriations bills are passed, agen- 
cies can then spend the funding to support research. 
In many years Congress fails to pass appropriations 
bills and instead passes a “continuing resolution” that 
enables an agency to continue to spend a designated 
amount of funds but generally prevents the beginning 
of new programs or initiatives. 

Advocacy 

Due to the very distributed system of budget devel- 
opment and appropriations, there are many points of 
access for scientific societies to interact with the pro- 
cess. Societies, including SIAM, play an important role 
in providing information, advice, and feedback from 
the communities they represent to those with respon- 
sibility for the budget and policy in the administration, 
in Congress, and across agencies. 

SIAM has a Committee on Science Policy (CSP) that 
regularly meets with key leaders both to gather information 
for our community and to provide feedback 
from the community to officials. Key points of contact 
include leaders within the funding agencies, congressional 
staffers and members of Congress, and representatives 
of the White House Office of Science and Technology 
Policy and Office of Management and Budget. 


SIAM presidents and the executive director also regu- 
larly provide testimony and meet with staff from rel- 
evant congressional committees, such as the House 
Appropriations Subcommittee on Commerce, Justice, 
and Science. 

CSP discussions with government officials often 
focus on the budget in applied and computational 
mathematics but may from time to time also take 
in specific policy issues. The House Science, Space, 
and Technology Committee, for example, has in the 
past asked the CSP to discuss issues related to high- 
performance computing and the role of computational 
science in developing new models and computational 
methods for solving grand challenge problems on 
emerging computer architectures. 

Another example of policy discussions was SIAM 
participation in 2012 debates on undergraduate education, 
which led to the President's Council of Advisors on 
Science and Technology issuing the report “Engage to 
Excel: Producing One Million Additional College Gradu- 
ates with Degrees in Science, Technology, Engineering, 
and Mathematics,” which called for a national initia- 
tive to promote science, technology, engineering, and 
mathematics (STEM) education in the first two years of 
college. A key finding of the report is that mathemat- 
ics education is a critical component of all undergrad- 
uate STEM degrees and that it is the current deficien- 
cies in mathematics learning that are partly to blame 
for the loss of STEM majors in the early college years. 
However, many SIAM members expressed reservations 
about a suggestion in the report that nonmathemati- 
cians should be engaged in the teaching of undergrad- 
uate mathematics for nonmajors. In response to the 
report, and to defend the role of mathematicians in 
mathematics education, the SIAM Education Commit- 
tee in partnership with the CSP prepared a formal white 
paper. The white paper highlights the importance of 
collaboration for effective mathematics education and 
makes recommendations on ways to strengthen K-16 
(kindergarten to four-year degree) mathematics education. 
The white paper was sent to the Office of Science 
and Technology Policy and the NSF. 

SIAM’s History in Advocacy 

In the late 1970s, the major mathematics societies in 
the United States (the American Mathematical Society 
(AMS), the Mathematical Association of America (MAA), 
and SIAM) jointly supported a congressional fellow, an 
individual from the community who served in a con- 
gressional office (or related position) for a year in order 
to obtain experience in science policy and funding 
issues. 

However, it was not until the publication of the David 
Report in May 1984 that mathematics in general, and 
SIAM in particular, became seriously involved in the 
arena of science policy. The report, titled “Renewing 
U.S. Mathematics: Critical Resource for the Future,” 
proved pivotal in energizing funding support for the 
mathematical sciences and for propelling mathematics 
societies into engaging in science policy discussions. 
The report was produced by the Ad Hoc Committee 
on Resources for the Mathematical Sciences, chaired by 
Edward David of Exxon Research and Engineering Com- 
pany and with Kenneth Hoffman as executive director 
of the committee; it was the result of a project commis- 
sioned by the National Research Council and supported 
by SIAM and the AMS, along with six companies and five 
government funding agencies. 

Following completion of the David Report in the sum- 
mer of 1983, the Ad Hoc Committee on Resources for 
the Mathematical Sciences was terminated, but Kenneth 
Hoffman agreed to continue activities in Washington 
and submitted a budget to the three major societies (the 
AMS, the MAA, and SIAM) to support those activities. 
The SIAM board approved their share of this budget at 
their June 1983 meeting. 

Hoffman became the Executive Secretary for National 
Affairs under the aegis of the Joint Policy Board for 
Mathematics (JPBM) in 1984. In March 1985 the JPBM 
contracted Kathleen Holmay and Associates to do 
public relations for the public understanding of sci- 
ence (specifically mathematical sciences). This led to a 
decade or more of discussion among the members of 
the SIAM board about the role of SIAM in science policy. 

Seeking more oversight of the activities of the JPBM 
that it helped fund, the SIAM board asked the SIAM 
Committee on Relations with the Federal Government 
to play this role and changed the name of the commit- 
tee to the SIAM Committee on Science Policy in October 
1985. 

In the late 1980s the name of the office within the 
JPBM was changed to the Office of Governmental and 
Public Affairs, and by 1989 this had grown to become 
an office employing a full-time director, two administrative 
assistants, a legislative liaison, and a public relations 
consultant. SIAM’s degree of support wavered at times 
during the late 1980s and early 1990s over various 
issues, but the JPBM remained SIAM’s sole presence 
in Washington. 


SIAM used the JPBM as its forum for discussions 
with funding agencies and legislative staffers on issues 
related to science policy and funding in areas of interest 
to SIAM. 

When the AMS created its own Washington office, 
support for the JPBM waned, and in 1999 it was restruc- 
tured to become a coordinating body among mem- 
ber societies (eliminating its paid staff), a role that it 
continues to play today. 

With the elimination of the JPBM’s legislative liai- 
son (which had been a full-time staff position) and its 
public affairs consultant, SIAM sought to grow its own 
voice in science policy through its CSP. The CSP meets 
twice a year in Washington and coordinates the writ- 
ing of white papers and other policy activities through- 
out the year. To enable the CSP to carry out its role, 
in 2001 SIAM contracted Lewis-Burke Associates LLC, 
a government relations firm that specializes in science 
and technology policy, to provide support for its policy 
activities. 

SIAM, through the CSP, now regularly engages in dis- 
cussions with various federal agencies on issues related 
to applied mathematics and computational science. 

IV. Alistair D. Fitt: Making the Case for 
U.K. Mathematics Research in a 
Rapidly Changing Environment 

The Need to Make the Case 
(and Not Just for Funding) 

First, we have to be clear: in the U.K. system there is no 
alternative to “making the case for mathematics.” We all 
need to show in as many ways as possible that math- 
ematics research is a key contributor to the nation’s 
overall success. As always, perhaps the most impor- 
tant issue is research funding. It is a long time since 
public money was handed out to universities and they 
were simply trusted to do a good job with it (a system 
that still remains in surprisingly many nations), and 
there seems no chance of those days (or the minute 
academic salaries that went with them) ever returning. 
Public funding for science (and universities in general) 
increased significantly during the long tenure of the 
1997-2010 Labour government, but it has been more 
uncertain since. This period of uncertainty has neces- 
sitated much more involvement from the community to 
ensure that funding streams are protected. In addition, 
2011 saw the biggest change for decades in the way in 
which the U.K. higher-education sector is funded, and 
by many financial definitions, universities in the United 
Kingdom are now close to being private, rather than 
public, ventures. 

Virtually all that is left of the public funding that used 
to allow us to charge relatively low student fees is the 
research money, currently about £5 billion per annum. 
Though by all measures our “dual research funding sys- 
tem” (where about two-thirds of the money is given out 
by the funding councils and the other third is allot- 
ted as a result of a research assessment exercise of 
some sort) has served us well since its introduction in 
1986, funding for mathematics research has certainly 
not increased. In fact, there is a case (depending on who 
you believe) for saying that the share of the research 
funding that finds its way to mathematics has shrunk 
considerably. This makes it all the more crucial to be 
sure that our subject is seen as being worth supporting. 

Furthermore, a new overriding political mission has 
evolved in the United Kingdom: transparency. The pub- 
lic should be able to hold the spending of public money 
to account and readily observe that their taxes have 
been well spent. This is a potential problem for our 
discipline, for mathematics is one of the very few sub- 
jects where many members of the public take proud 
delight in proclaiming the depths of their ignorance. 
Even those whose school experience was more positive 
are unlikely to have much real understanding of what 
mathematics research consists of. This means that it is 
absolutely crucial for us to have a portfolio of stories, 
arguments, citations, and reasons why we contribute 
so much to what is so horribly called U.K. PLC. Though 
it will prove hard to make the case to the public, we 
can influence politicians and governments, and these 
are the people who make the important decisions. 

A Single Point of Contact 

Now that we have agreed that “making the case” is 
unavoidable if mathematics is to thrive, we have to con- 
sider the best way of influencing government (by which 
I mean politicians, civil servants, funding agencies, and 
anybody else who might be in a position of power). 
Undoubtedly we need individuals who are skilled influ- 
encers, but this alone is not enough; we also need the 
people who matter in government to want to hear us 
and to actively seek our views. One of the hardest prob- 
lems for ministers to deal with is knowing who to turn 
to when they want information or wish to solicit views. 
Experience has shown that politicians and civil servants 
like to have a small number of trusted contacts who 
they regard as their key “go-to” people. Professional 
and learned societies are important, but having too 
many of them to canvass is distinctly unhelpful. We cur- 
rently have four learned societies in the mathematical 
sciences in England (the Institute of Mathematics and 
its Applications, the London Mathematical Society, the 
Royal Statistical Society, and the Operational Research 
Society), and there is also the Edinburgh Mathematical 
Society and possibly others too. The Council for Math- 
ematical Sciences does a valiant job in trying to synthe- 
size consensus from its member societies to present a 
single face to influencers in the outside world. Though 
the Council for Mathematical Sciences has undoubtedly 
had successes, a structure that involves so many differ- 
ent societies is hard for outsiders to understand. The 
result of the fragmentation of mathematical learned 
societies (and their unwillingness to merge and become 
a single voice) is that mathematics as a discipline is 
unable to exert the traction that, for example, the Royal 
Society of Chemistry or the Institute of Physics can, as 
single and well-known voices in their disciplines. 

Others will no doubt disagree that we need a single 
society, but there is ample evidence to suggest that, 
where the ability to influence government is key, a 
single point of contact is extremely important. 

What Works Best 

The rules for how to influence government have not 
changed: they are timeless and constant. Nevertheless, 
they do not tend to sit easily with the nature of mathe- 
matics as a discipline. Bluntly put, the most important 
detailed points are as follows. 

Limit your number of messages. Ministers are del- 
uged with information. You can probably afford to 
get only one or two points across, so do not over- 
complicate and risk confusing the minister. 

Be clear and concise in your points. Mathematicians 
tend to worry about detail, and detail rarely matters 
when it comes to influence. Do not obfuscate, and do 
not dilute the main points that you are making. 

Use a few “killer” statistics and stories again and 
again. Ministers like simple one-line facts to remem- 
ber, and they will have to be told them again and 
again. Prepare your ammunition well in advance, and 
decide carefully what the minister needs to know and 
will want to use again. 

Offer help. There is no future in complaining or being 
grudging. Government officials spend much of their 
lives under attack, so offer them assistance rather 
than criticism. 

Concentrate on what is good first; then point out the 
threats. Even if you are vehemently against every 
one of the government’s policies, you need to begin 
with some positives. You can mention the possible 
difficulties and threats later. 

Understand the political difficulties. The government 
may want to help you but be unable to for political 
reasons. You need to understand these before you 
can have a good idea of what might reasonably be 
achieved. 

Speak for mathematics, not your own university or 
department. The government wants a broad view. 
If you are seen to be acting from any kind of self- 
interest, you will rapidly be discarded. 

Use the media to make your case. Politicians see the 
news media as a vital barometer of public opin- 
ion and are much more likely to be influenced by 
something that attracts headlines. 

Get included on missions abroad. Traveling with gov- 
ernment officials on delegations abroad can provide 
a considerable opportunity to talk in detail to the 
people that matter. 

Hire people. Ministers and “their people” are very 
used to dealing with publicists and professional lobbyists. 
There is no reason why we cannot spend hard 
cash hiring professional influencers who know how 
to deal with the government and how to get results. 

Fact: We Do Not Do It Very Well 

This short article gives plenty of suggestions and solu- 
tions for how to influence government, so how success- 
ful have we been in this area in the United Kingdom? 
Unfortunately, the answer in my view is simple: not 
very. If we are brutally honest, we have to admit that, 
as a profession, too often we do not take a strategic 
view, we fight among ourselves, and we give the overall 
appearance of a bunch of whingers who do not under- 
stand the wider agenda. Is this lack of success confined 
to the United Kingdom? Other articles in this volume 
will allow the reader to judge whether mathematicians 
in the rest of the world are better influencers than we 
are, but notable successes seem to me to be well hidden. 

Of course, influencing the government is very hard 
work. Politics is a transitory business, and it is frus- 
trating to build good connections only to see them van- 
ish as a new election sweeps new faces into power. 
However, though politicians change, the civil servants 
that are so important to them normally remain. We 
can influence these people, and we should be think- 
ing about this aspect of our subject and doing a much 
better job. 
gated channels, 618 
Boltzmann equation, 430, 434-42 
Boltzmann operator, 430, 433, 
435-37, 440 

Boltzmann’s H theorem, 431-32, 441 


Bond number, 150 
bond pricing, 642-43 
boosts, 375 
Borda count, 891-95 
borehole electrical tomography, 505 
Borel summation, 637-38 
Borel transform, 639 
Born approximation, 861-62, 864 
Born-Oppenheimer approximation, 
456, 847 

Bose-Einstein condensates, 151, 164, 
416, 442 

Bose-Einstein distribution, 416 
bosons, 416 

boundary conditions: in kinetic 
theory, open problems of, 443-44; 
for ODEs, 15; for PDEs, 16, 192; 
in solid mechanics, 509-10 
boundary element methods, 203 
boundary layer, 82, 147, 467, 469; in 
aerodynamics, 472-74; asymptotic 
matching at, 193-94, 217-18; in 
magnetohydrodynamics, 480; of 
sailing yacht, 599; separation of, 
474, 748; turbulence and, 731-32 
boundary-value problems for ODEs, 
16, 185; numerical solution of, 
304-5 

boundary-value problems for PDEs, 
16, 192; finite-difference methods 
for, 307-10; finite-element 
methods for, 311-13; for Laplace’s 
equation, 155-56, 179; singular 
behavior in, 124-25 
boundary-value problems in solid 
mechanics, 509-10, 512-13 
bounded function, 11 
bounded linear operator, 25, 238 
bounded mean oscillation, 100 
bounded set, 12 
bounded variation, 100 
bound states, 241, 414-15, 848, 851 
Bourbaki, 57, 70-72, 76 
Boussinesq approximation, 497 
brachistochrone curve problem, 219 
brain: network models of, 373; visu- 
alization of electric current in, 846. 
See also neuroscience, 
mathematical 
branch and bound, 569 
branch-and-cut method, 780 
branch cuts, 173-75, 179 
branch-point singularities, 177, 179 
bras, 412 

breadth-first search, 757-59 
bridges in networks, 369-71 
Brillouin zone, 849, 851 
broadband beamforming, 543-44 
Bromwich contour, 179 
Brownian motion, 319; filtering 
theory and, 325; financial markets 
and, 641-42, 644-45; moral hazard 
and, 873; optimal control and, 322; 
option pricing and, 321; random- 
matrix eigenvalue distributions 
and, 426 

Bruggeman formula, 503 
Brunn-Minkowski theorem (inequality), 
90, 220, 224, 226, 552 
BSSNOK formulation, 682-83, 685-86 
bubbles, 735-37; in foams, 737-40; 

pressure difference and, 520 
Buchberger’s algorithm, 35, 574; 
inverse kinematic problem and, 769 

Buckingham’s Π theorem, 93 
bulk modulus: of isotropic elastic 
material, 511; seismic exploration 
and, 331-34 

Burgers equation, 138, 307; nonlinear 
hyperbolic systems and, 86-87; numerical 
relativity and, 686; rarefaction wave and, 123; 
shocks associated with, 138, 196 
Burgers vortex, 470 
BZ (Belousov-Zhabotinskii) chemical 
reaction, 409, 459 

C, 829f, 830-32, 834-36, 838-39 
C++, 829f, 831-32, 835, 837-39 
Cahn-Hilliard equation, 17, 138-39 
calculus, 12-14 

calculus of variations, 147, 197, 
218-26; adverse selection and, 

873; contemporary applications of, 
224-26; direct method in, 221-22; 
in image processing, 225, 815; 
kinetic theory and, 438; relaxed 
(effective) energy in, 222-23 
Calderón problem, 334-35 
cancellation, in floating-point 
arithmetic, 97 
Canny edge detector, 353 
canonical momentum, 108 
Cantor set, 390 

capital asset pricing model (CAPM), 
651-54, 656 

cardiac modeling, 623-27; visual- 
ization of electric fields in, 846 
Carnahan-Starling formula, 517 
carrying capacity, 157, 183, 188, 325 
Cartesian coordinates, 9 
Cartesian tensors, 130 
Case Histories (Atkinson, novel), 946 


catenary, 147, 183 
Cauchy, Augustin-Louis, 65, 89, 175 
Cauchy conditions, 16 
Cauchy integral formula, 176; 

generalization to matrices, 98 
Cauchy integrals, 180 
Cauchy interlace theorem, 269 
Cauchy principal-value integral, 180 
Cauchy problem: in general relativity, 
436; gradient flow and, 226; in 
kinetic theory, 434-37, 439-42; 
for Laplace’s equation, 156; for 
quasilinear hyperbolic PDEs, 86 
Cauchy-Riemann equations, 139, 156, 174 

Cauchy-Riemann operator, 306 
Cauchy-Schwarz inequality, 23 
Cauchy sequence, 24 
Cauchy's residue theorem, 177-78; 

for evaluating integrals, 178-79 
Cauchy's theorem, 176 
Cauchy stress, 509-10 
Cauchy stress tensor, 130, 671 
causal horizons, 589-90 
cavitation, 736-37 
Cayley, Arthur, 51, 66, 182, 263, 575 
Cayley-Hamilton theorem, 112, 269 
Cayley mapping, 85 
Cayley’s theorem, 558, 565 
Céa’s lemma, 312-13, 317 
ceiling function, 42, 830 
cellular physiology, 616-20 
centered second difference, 95 
center manifold reduction, 410; 
pattern formation and, 461-62, 464 

center manifold theorem, 387-88, 
395-96 

center of mass, 375-76 
central configurations, in N-body 
problem, 773-74 
central forces, 376 
centrality of network, 364, 373, 801 
centrifugal acceleration, 490-91 
centrifugal force, 380-81 
centrifugal instability, 474-76 
Cercignani’s conjecture, 437-38, 441 
CFD. See computational fluid 
dynamics (CFD) 

CFL (Courant-Friedrichs-Lewy) 
criterion, 708, 719 
CG method. See conjugate gradient 
(CG) method 
chain rule, 14, 749 
channel, 546 
channel capacity, 549-52 
channel coding, 547-48 


channel coding theorem, 549; 

quantum version of, 552 
channels in membranes. See ion 
channels 

chaos, 82-83, 389; absent in two 
dimensions, 189; cardiac arrhyth- 
mias and, 624, 627; in communica- 
tion networks, 803; computational 
exploration of, 926; fluid dynamics 
and, 468, 472; homoclinic tangles 
and, 398; intermittency and, 
400-401; logistic map and, 157; 
Lorenz equations and, 158-59, 
391-92; molecular, 439-40; in 
neural networks, 876-77; in oscil- 
lator arrays, 928; period-doubling 
route to, 399; in piecewise-smooth 
dynamical systems, 769-70; popu- 
larization of, 908; quantum chaos, 
422, 426-28; replicator mapping 
and, 594; Smale’s horseshoe and, 
190, 390-91; symmetric, 410; 
weather prediction and, 711-12 
chaotic mixing, 392 
characteristic equation, of linear 
constant-coefficient ODE, 183-84 
characteristic polynomial, 25, 

268-69; of random matrix, 425 
cheap gradient principle, 751 
Chebfun, 258 

Chebyshev, Pafnuty Lvovich, 65 
Chebyshev polynomials, 22-23, 122; 
interpolation with, 30, 250, 257-58; 
least-squares approximation with, 257-59; numerical 
solution of PDEs and, 316, 318 
Chebyshev series, 258-59 
chemical potential: in Cahn-Hilliard 
equation, 138-39; in statistical 
mechanics, 416 
chemical reactions, 627-34; 
biochemical, 616-17; patterns in, 
409, 459; in turbulent flows, 
348-50. See also flames 
chemistry, computational, 237 
chimera states in oscillator arrays, 
928 

Chinese mathematicians and 
government policy, 953-54 
chip design, 804-8 
χ² distribution, noncentral, 231, 233 
Cholesky factorization, 264-65; in 
definite generalized eigenvalue 
problem, 272; for solving normal 
equations, 274; for sparse 
matrices, 273 
cholesteric phase of liquid crystal, 523 

Christoffel relations, 581 
Christoffel symbols, 129, 682 
chromaticity diagram, 811 
chromatic number of graph, 561 
chromatography, 88 
circle group S¹, 405 
circuit: Eulerian, 562-63; Hamilton, 
564, 779. See also electrical circuits 
circulant matrices, 21, 244; eigen- 
values of, 268 

circular ensembles of random matrices 
(COE and CUE), 420, 422-25 
Clairaut’s equation, 181-82 
classical mechanics, 107-10, 374-83; 
Hamiltonian, 109, 382; Lagrangian, 
108-10, 379-82; Newton-Cartan 
formalism and, 111; Newtonian, 108, 374-78. 
See also Euler-Lagrange equations 
classification, 356-58 
Clausius-Duhem inequality, 451 
Clausius-Mossotti formula, 501 
Clebsch-Gordan coefficients, 417 
climate: definition of, 485; ocean 
circulation and, 492 
climate models, 491; sea ice and, 
695-98, 704 
climate sensitivity, 488 
cliques, 101-2, 361, 363, 365, 801-2 
cloaking, 733-35 
clock skew scheduling, 808 
Clojure, 829f, 831, 838, 838t 
cloning, in image retouching, 5-6. 

See also image inpainting 
closed form solution, 19 
closed-loop system, 88 
closed set, 12 
closeness centrality, 364 
clustering coefficient, 362-64, 366-68 

clustering in a network, 361-63; 

reduction of spreading by, 369 
clustering techniques, 353, 358-59 
CMYK color space, 810 
cnoidal wave solution, 150 
codimension, 394 
codimension-two bifurcations, 399-400 

cognitive radio, 826 
coherence of measurement matrix, 
824-26 

Cole-Hopf transformation, 138 
collaboration by authors, 915 
collateralized debt obligations, 643-44 


collective choice, 870-71. See also 
voting systems 

Collins formula, for optical wave 
propagation, 677 
collisional relaxation, 431-34, 443 
collocation methods, 300-301, 305; 

spectral, 318 
colloids, 516-18 
color maps, 811-12 
color spaces, 808-13 
combinations, 553 
combinatorial optimization, 564-70 
combinatorics, 552-57 
combustion. See flames 
commodity markets, 646 
communication-avoiding algorithms, 
841 

communication graph, eigenvalue of 
matrix for, 798-99 
communication networks: efficient, 
557-58; evolving, 800-803; 

Internet architecture as, 883-87 
community structure of network, 
364-65, 370-71, 373 
commutator, 21, 413 
compact set, 12 

companion matrices, 36, 268-69 
compartmental models: of infectious 
diseases, 688, 694; of neurons, 874 
compiler, 828 
complementary colors, 811 
complementary error function, 81, 230 
complete graph, 361 
complete inner product space, 24 
completely integrable nonlinear 
PDEs, 151 

complete normed vector space. 

See Banach spaces 
complex analysis, 173-81 
complex arithmetic in programming 
languages, 834-35 
complex conjugate, 173 
complexes of reaction network, 631, 
634 

complex functions, 173-74; limits 
and continuity for, 11 
complexity classes, 45-46. See also 
computational cost of algorithms; 
NP-hard problems 
complexity of computer code, 836 
complexity theory, 83-84; evolving 
social networks and, 800 
complex numbers, 8-9 
complex plane, 8-9, 173 
complex step approximation, 40 
complex systems, 83-84 
complex variable, 173 


complex vectors of reaction network, 
631 

composite materials: analytic con- 
tinuation and, 698-703; effective 
medium theories and, 500-505; 
homogenization and, 103, 120, 
225, 501, 697-98; percolation 
theory and, 698, 700-701. See also 
homogenization 

composition of functions, derivative 
and, 14, 749 

composition vector, chemical, 628, 
632-33 

compressed sensing, 329, 814-15, 
823-27 

compressible flow, 87-88, 307 
compression of data. See data 
compression; image compression 
computational aeroacoustics, 786 
computational cost of algorithms: 
ideal basis and, 341; for matrix 
calculations, 44-45, 267, 272-73. 
See also complexity classes 
computational experiments, 2, 54; 
reproducible, 916-25. See also 
experimental applied mathematics 
computational fluid dynamics (CFD), 
598-99, 603; graphs in, 101; 
historical development of, 56, 336, 
338. See also fluid dynamics; 
numerical solution of PDEs 
computational science, 335-50; 
algorithmic paradigms in, 341; 
algorithms in, 341-42, 344-45, 
349-50; choice of basis in, 341; 
continuum-discrete duality in, 
342-43; curriculum design for, 
937; definitions related to, 335-36, 
350; development paradigm of, 
343, 344f; historical trends in, 
336-38; large-scale example of, 
348-50; multiphysics modeling in, 
53-54, 345-50; numerical condi- 
tioning in, 343; policy for, 956-59; 
polyalgorithms in, 341-42; prom- 
ise and limitations of, 338-40; 
remaining challenge of, 350; 
software packages for, 344-45; 
space-time trade-offs in, 342; 
themes in, 340-45; of turbulent 
reacting flows, 348-50; uncertainty 
quantification in, 340-41 
computed tomography. See X-ray 
computed tomography (CT) 
computer: high-performance com- 
puting, 839-43, 840f; historical 
impact of, 56-59, 73, 75-77 
computer-aided proofs, 790-95, 922 
computer arithmetic, 6-7. See also 
floating-point arithmetic 
computer graphics. See digital 
imaging; image processing; 
visualization 
computer vision, 225 
concave function, 12, 90 
condensed matter physics, random 
matrices in, 427 

condition number, 26, 50, 263-64; of 
discretized operator, 343; distance 
to singularity and, 266; error 
analysis and, 27, 53; for matrix 
eigenvalues, 268; of random 
matrices, 426 

conductivity, effective: electrical, 
500-501, 504; percolation theory 
and, 698, 701 

conductivity, electrical: from bound- 
ary measurements, 334-35; effect- 
ive, 500-501, 504; in impedance 
imaging, 733-34; in impedance 
tomography, 334-35; in Ohm’s 
law, 477 

conductivity, thermal, rapidly 
oscillating, 103, 120 
cone: convex, 90; definition of, 90 
cones, retinal, 808 
configuration model of random 
graphs, 366-69 
configuration space, 379, 382 
confluent hypergeometric functions, 
232 

conformal invariance, 85 
conformal mapping, 84-86, 156, 179; 
in analysis of insect wing move- 
ment, 745 

congruence transformation, 271 
conical functions, 234 
conic optimization, 90, 283, 285, 290 
conjugate gradient (CG) method, 
276-77; for inversion of X-ray 
tomographic data, 867; for 
unconstrained optimization, 288 
conjugate subgroups, 408 
conjugate transpose of matrix, 21, 
263 

connected graph, 361, 557 
connection, 581, 583 
connection coefficients, 129, 144 
connectivity: algebraic, 370; of node, 
361-62 

conservation laws, 86-88, 122-24; in 
continuum mechanics, 449-51; 
finite-volume methods and, 
314-16; invariants and, 106-12; 
Noether’s theorem and, 107-9, 
381-82, 405; pattern formation 
and, 466-67; scalar, 191-92, 196, 
198-99, 316; symmetries and, 
107-11, 381-82, 405. See also 
angular momentum: conservation 
of; energy conservation; momen- 
tum conservation 
conservative forces, 376 
conservative vector fields, 156 
consistency, in Tikhonov theory, 329 
consistency + stability = conver- 
gence, 75, 298, 309, 462 
constitutive equations, 150, 451-53, 
458; for composite medium, 699; 
for granular materials, 666-67, 
671-72 

constitutive modeling of biological 
materials, 610-11 
constrained frontier, 650 
constrained materials, 452 
constraint qualification, 39 
constraints, 281-82, 285-86; in 
Internet design decisions, 886 
continuation, 37; numerical, for 
polynomial system, 574-75, 769 
continued fractions, algorithms for 
evaluating, 47 
continuity, 11 

continuity equation: for incompress- 
ible fluid, 156; in weather 
prediction, 706 

continuous-flow stirred-tank reactor, 
630 

continuous groups of transform- 
ations, 107 

continuous optimization, 281-93; 
basic principles of, 285-86; conic, 
283, 285, 290; nonlinear program- 
ming, 283, 290-92; in portfolio 
theory, 649-51; with uncertain 
objective or constraints, 292-93; 
unconstrained, 283, 285, 288-90. 
See also linear programming; 
optimization 

continuum mechanics, 446-58; 
biological materials and, 610-11; 
current research in, 455-58; 
essential structure of, 448-53; 
granular materials and, 665-66; 
history of, 3, 62-64; introduction 
to, 446; localization arguments in, 
450; phenomena in, 453-55; solid 
mechanics and, 506; tensor analy- 
sis for, 447-48. See also elasticity; 
fluid dynamics; granular flows 
contour integrals, 175-76, 178-80 


contravariant components, 128-29 
control systems, 88-89, 523-33; 
dimension reduction of, 117-19; 
fundamental limitations of, 
526-28; general structure for, 
524-26; historical background of, 
523-24; linear quadratic Gaussian, 
529-30; Lyapunov equation and, 
168; Lyapunov functions and, 
531-33; modeled as hybrid sys- 
tems, 104; nonlinear, 530-32; 
ongoing research in, 532-33; 
optimizing multivariable control- 
lers of, 528-29; Riccati equation 
and, 165-66, 530; simple example 
of, 524. See also optimal control 
convection: atmospheric, 488-89; 
bioconvection, 615; Lorenz 
equations and, 158-59; Rayleigh-Bénard, 
384, 458-59, 463, 476. 
See also advection 
convective derivative, 162, 697, 786, 
852 

convergence: consistency + stability 
and, 75, 298, 309, 462; of function 
or series to limit, 11; of iteration, 
34, 50-51, 346; of mathematical 
model, 248; of numerical solution 
of ODE, 298-99; of numerical 
solution of PDE, 309-10, 317; of 
powers of matrix, 113; in vector 
spaces, 24 

convergent series, 11, 174-75; vs. 

asymptotic expansion, 211-12 
conversion to a different problem, 
35-37 

convex analysis, duality in, 224 
convex cone, 90 

convex curve, Grayson theorem and, 
116 

convex entropy, 123 
convex function, 12, 90; in calculus 
of variations, 222 
convex hull, 90; in algebraic geom- 
etry, 572, 576; in color spaces, 812; 
in combinatorial optimization, 

570; in matrix field of values, 268 
convex inequalities, 552 
convexity, 89-90; voting profile 
structures and, 895 
convexity constraint, 873 
convex optimization, 283, 285, 524; 
affine functions in, 10; of utility 
function, 869 

convex quadratic programming, 282, 
288 

convex risk measures, 322, 326 
convex set, 12, 89-90 
convolution, 105, 244; fast Fourier 
transform and, 95; of vectors, 18 
Cooley, James, 76-77, 94, 94n 
coordinate systems, 9; Lagrangian 
mechanics and, 380 
coordinate transformations, in 
tensor calculus, 580 
Coriolis effect, 490-95, 498-99; in 
climate models, 697; in weather 
prediction, 706-7 
Coriolis force, 381 
correlation matrix: nearest, 654; in 
portfolio theory, 653-54; from 
time series, 426 

cosmological constant, 582, 588 
cosmology, 125, 587-90 
cost function, in variational 
assimilation, 708-9 
Couette flow, 470, 475 
Coulomb force, 377-78, 415; 

in plasmas, 433, 435 
Coulomb friction, 667-68, 670 
Coulomb gas, 426 
Coulomb gauge, 161 
countercurrent multiplication, 623 
counting problems, 553 
coupling of models, 350 
Courant, Richard, 1, 71-74, 306, 
310-11, 337, 719 

Courant-Fischer theorem, 134, 269 
Courant-Friedrichs-Lewy (CFL) 
criterion, 708, 719 

covariance matrix: AᵀCA framework 
and, 665; estimation of, 652-55; in 
mean-variance portfolio analysis, 
648-51; in signal processing, 
538-39, 542 

covariance principle, 107, 109-10; 

tensors and, 128-30 
covariant components, 128-29 
covariant derivative, 129, 581 
Cramer’s rule, 44 
Crank-Nicolson method, 310 
credit default swaps, 643 
critical points: of conformal mapping, 
179; of nonlinear system of ODEs, 
37-38 

cross product, 27 
crystal lattice: defects in, 850; 
electronic structure of, 848-51; 
nonlinear, solitons in, 150-51; 
symmetries of, 404 
crystalline surface structure, 220 
CT. See X-ray computed tomography 
(CT) 
curl, 27 


curse of dimensionality, 28-29, 261, 
339-40, 642 

curvature, 129; in foams, 737; 
minimal surfaces and, 198; 
space-time, 579, 581-82 
curvature scalar, 129 
cutting-plane method, 570; for 
traveling salesman problem, 780 
cycle, in graph, 557 
cyclic groups, 405 
cyclomatic complexity, 836 
cyclonic flow, 494, 497 
cylinder functions. See Bessel 
functions 

cylindrical coordinates, 9 

Dahlquist, Germund, 75, 294, 
298-300 

d’Alembert, Jean, 55f, 62-63, 448; 

wave equation and, 171, 193 
d’Alembert paradox, 146-47 
d’Alembert’s equation, 683 
d’Alembert’s formula, 171, 193-94 
Dale’s principle, 876-77 
damped Duffing oscillator, 186, 186n 
damped harmonic oscillator, 378 
Dantzig, George, 73, 89, 283, 287, 

780 

Dara Ó Briain: School of Hard Sums 
(TV series), 943 

Darboux, Jean-Gaston, 182, 575, 
637-38, 871 
Darcy’s law, 697 

dark adaptation, Lambert W function 
and, 153-54 
dark matter, 771-74 
Darrieus-Landau instability, 854, 856 
data analysis, 350-60; data repre- 
sentation in, 351-54; dimension 
reduction in, 28, 351, 354-55; as 
fourth paradigm of science, 336; 
future of, 359-60; history of, 
351-52; pattern recognition in, 

351, 355-59 
data assimilation, 133 
data compression, 547-52; by 
singular value decomposition, 
126-27. See also image 
compression 

data mining, 351. See also data 
analysis; text mining 
data repositories, 919n 
data transmission, 547-49 
data types, in programming 
languages, 836 
data visualization, 843-47 
Dawson’s integral, 35 


DCT (discrete cosine transform), 814 
Deborah number, 666-67 
de Bruijn sequences, 563-64 
decision-feedback equalizer, 541 
decision making under uncertainty, 
133 

decision trees, 356-57 
Dedekind eta-function, 931 
defective matrix, 112-13 
deficiency zero theorem, 633-34 
definite integral, 14 
definitions first versus examples 
first, 900-901, 903 
deformation, 448; homogeneous, 

509. See also strain 
deformation gradient, 449, 508-9; 

tissue growth and, 614 
deformation rate of granular 
materials, 666, 671-72 
de Giorgi-Nash theorem, 432 
de Giorgi’s minimizing movements, 
226 

degree of node, 361-62 
degree of vertex, 557 
degrees of freedom, 379 
delay differential equations, 17, 53; 
in epidemiology, 691; Lambert W 
function and, 154 

delta function, 139-41; Green func- 
tions and, 85, 140; position oper- 
ator and, 413 
delta-function well, 414 
DEM (distinct element method), 
667-70 

density functional theory, 847, 851 
dependency problem, of interval 
arithmetic, 105 
dependent variables, 10, 181 
depletion interaction, 517 
depth-averaged systems of PDEs, 
714-15 

depth-first search, 758 
derivative-free optimization 
methods, 28, 289-90 
derivative markets, 640-44 
derivatives, 12-14; chain rule for, 14; 
of complex functions, 174; frac- 
tional, 18; product rule for, 14. 

See also automatic differentiation; 
partial derivatives 
derogatory matrix, 113 
descriptor systems, 88, 118 
detection theory, 544 
determinants, 21-22; computational 
complexity of, 46; rarely com- 
puted, 264 

deviatoric deformation, 666, 670 
deviatoric stress, 511 
DFE (disease-free equilibrium), 689, 
691, 693 

DFT. See discrete Fourier transform 
(DFT) 

diagonalizable matrices, 112-13; 

unitarily diagonalizable, 277 
diagonally dominant matrices, 263, 
275-76, 279, 345 

diagonal matrix, 21; max-plus, 797 
diameter of graph, 361, 363, 366-67 
Dido’s isoperimetric problem, 219 
difference equations, 18; of control 
system, 88; logistic map, 157-58, 
398-99, 401 

differential-algebraic equations, 18, 
88 

differential equations, 14-18; 
complex analysis and, 179-80; 
symmetry and, 190, 407-10. See 
also delay differential equations; 
fractional differential equations; 
ordinary differential equations 
(ODEs); partial differential equa- 
tions (PDEs); stochastic differential 
equations 

differential inclusions, 770 
differential operators: Schrödinger 
operators, 241-43, 245, 848-51; 
spectrum of, 240 
diffusion: in flames, 854; linear 
elliptic equations and, 198. See 
also reaction-diffusion equations 
diffusion-advection-reaction prob- 
lems, 312 

diffusion (heat) equation, 16-17, 142, 
156, 192, 241; attempted micro- 
scopic derivation of, 443; behavior 
of solution of, 195; Black-Scholes 
equation and, 137-38, 142; 

Burgers equation and, 138; 
coupled pair of, in capillary-fill 
device model, 864-65; Gaussian 
filter and, 353; Green’s theorem 
and, 857-59; ill-posed example of, 
204-5; initial-value problem for, 
193; unsteady, 307, 310 
diffusion in living organisms, 609; 
across renal capillary, 622; in 
heart, 625-26; membrane ion 
channels and, 618 
diffusion tensor: cardiac, 625-26; 

visualization of, 845 
digital imaging, 5-6; color spaces 
and, 808-13; compressed sensing 
in, 825-26; dimension reduction 
in, 28. See also image processing 


Digital Library of Mathematical 
Functions, 227 

digital message or medium, 546-47 
digital object identifier (DOI), 914, 
923-24 

dihedral groups, 405 
Dijkstra’s algorithm, 758-59, 807 
dilation, symmetry of, 404 
dimensional analysis, 90-93 
dimension of vector space, 22 
dimension reduction, 28-29, 117-19; 
bifurcations and, 395-96; of com- 
plex systems, 84; in computational 
fluid dynamics, 599; in data analy- 
sis, 28, 351, 354-55; of dynamical 
systems, 28, 118, 387-88, 395-96; 
in uncertainty quantification, 132. 
See also compressed sensing 
Dingle, Robert, 637-39 
Dirac delta function. See delta 
function 

Dirac equation, 142-44, 675 
Dirac notation, 412 
directed graph, 101-2, 563-64 
Dirichlet boundary conditions, 16, 
192; finite-difference method and, 
308-10; for Laplace’s equation, 
156, 201-2 

Dirichlet-to-Neumann map, 334 
discrepancy principle, 205 
discrete cosine transform (DCT), 814 
discrete Fourier transform (DFT), 
94-95, 105, 265, 534-35; 
diagonalization of circulant 
matrices by, 244 
discrete optimization, 568-69; 
traveling salesman problem, 565, 
568, 778-81. See also 
combinatorial optimization 
discrete spectrum of self-adjoint 
operator, 240; of Schrödinger 
operator, 242 

discretization, 95-96, 307-8 
disease-free equilibrium (DFE), 689, 
691, 693 

dispersion of waves, 194-95; in 
tsunami modeling, 719; in 
waveguide, 678 
dispersion relation, for Swift- 
Hohenberg equation, 460 
displacement, 506-7 
dissemination platforms, for 
research, 922-24 
dissipation inequalities, 198-99 
dissipative systems, 378. See also 
drag; friction; viscosity 


distance: between nodes of graph, 
361, 364. See also norms 
distinct element method (DEM), 
667-70 

distributed control, 532-33 
distributional solutions, 199 
distributions. See generalized 
functions (distributions) 
divergence of vector field, 27, 191 
divergence theorem, 27; integration 
by parts and, 197 
divergent series, 81, 212, 634-40 
divide and conquer algorithms, 
42-44, 569 

DNA: knots and links of, 752-54; 

systems approach to, 880 
DOI (digital object identifier), 914, 
923-24 

domain monotonicity, 240-41 
dominant balance, principle of, 
213-14, 217-18 
dot product, 27 
double bubble conjecture, 791 
double pendulum, 384, 391 
double poles, 177-78 
doubling map, 384-85, 390 
downward continuation, 857, 859-60 
drag, 378, 746-47; on golf ball, 
746-49 

drag crisis, 747-49 

drum: hearing the shape of, 17, 246; 

vibrational modes of, 137 
dual decomposition, 532 
duality: in calculus of variations, 224; 
in linear programming, 286-88, 
567-69; in nonlinear optimization, 
283, 569, 662; in semidefinite 
programming, 290 
dual space, 99 

Duffing oscillator: conservative, 190; 

damped, 186, 186n 
dynamical systems, 185-90, 383-93; 
data assimilation for, 133; dimen- 
sion reduction of state space of, 

28, 118, 387-88, 395-96; double 
pendulum, 384, 391; doubling 
map, 384-85, 390; equivariant, 

392; high-precision computations, 
929-30; historical background of, 
57, 383-84; linearization of, 386, 
393-94; nonsmooth, 769-71; 
piecewise-smooth, 401-2, 769-71. 
See also bifurcation theory; chaos; 
flows 

dynamic programming, 530-31, 
569-70; investment theory and, 

645 
dynamic programming languages, 
831-32 

dynamic programming principle, 
322-23 

dynamos, magnetohydrodynamic, 
481-83 

Eady problem, 497-98 
earth mover’s distance, 355 
earthquakes, tsunamis caused by, 
713, 715-18 

Earth system dynamics, 485-500; 
atmospheric properties in, 487-89; 
dynamical processes in, 492-98; 
fluid dynamics of atmosphere and 
oceans in, 490-92; introduction to, 
485; ocean-atmosphere coupling 
in, 498-500; oceanic properties in, 
489-90; outlook for, 500; sea- 
surface temperatures in, 498-500; 
temperature of atmosphere in, 
487-89, 491, 495; temperature of 
surface in, 485-88 
Eckart-Young theorem, 126 
Eckhaus instability, 463 
ECMWF global model, 710-11 
economics, 868-73; Leontief’s input- 
output models in, 279. See also 
finance 

eddies, 492, 494, 497. See also 
turbulence; vortices 
Eddington-Finkelstein coordinates, 
584-86 

edge detection, 353 
edge of graph, 101, 557 
effective medium theories, 500-505; 
bubbles in, 737. See also composite 
materials; homogenization 
efficiency, economic, 870-72 
efficient frontier, 648-52, 654, 657 
Eiffel, Alexandre Gustave, 747 
eigenfunctions, 17, 236; generalized, 
848; Mathieu functions, 159-60; 
of Sturm-Liouville problem, 185 
eigenvalue problems: generalized, 
271-72; of integral equations, 17; 
of matrices, 267-72; nonlinear, 

247; of PDEs, 17; quadratic, 247, 
272; Sturm-Liouville, 16, 185; 
in text mining, 889-90 
eigenvalues of linear operator, 25, 
236; calculation of, 243-45; 
divergent series and, 639; 
generalized, 848. See also 
Schrödinger operators; spectral 
theory 


eigenvalues of matrix, 267-72; of 
Hermitian matrix, 25, 267-72, 848; 
of KKT matrix, 662; of max-plus 
matrix, 798-99; multiplicity of, 

112; of nonnegative irreducible 
matrix, 279; polynomial roots found 
as, 36; of random matrix, 420-28; 
sensitivity to perturbation, 268 
eigenvector centrality, 364 
eigenvectors, 25, 236; calculated from 
nonlinear system of equations, 36; 
of KKT matrix, 662; of max-plus 
matrix, 798-99; in quantum theory, 
241, 412-13; of random matrix, 

420; Rayleigh-Ritz approximation 
of, 134 

Einstein-de Sitter universe, 588-89 
Einstein’s field equations, 144-46, 

579, 582-88; well-posedness of, 

683. See also general relativity 
Einstein summation convention, 128, 
130, 580 

Einstein tensor, 145, 680 
Ekman pumping, 495, 499-500 
elastic-ideally plastic response, 511, 
513 

elasticity: calculus of variations and, 
224-25; constitutive relation of, 
452-53; examples of, 513-15; 
Hooke’s law and, 149-50, 513; 
incremental loading and, 37; iso- 
tropic, 452, 454-55, 511; linear, 511; 
in liquid crystals, 522-23; of mem- 
branes, 521; plastic deformation 
and, 511; polymer structure and, 
519; stresses in cracked object 
and, 125. See also solid mechanics 
elasticity number of granular flow, 
667-68 

elasticity of utility, 645 
elastic modulus of granular materials, 
666-67 

elastic reciprocity, 514 
elastic regime of granular materials, 
668 

elastoplastic regime of granular 
materials, 668-70, 672 
electrical circuits: AᵀCA framework 
and, 663; mechanical analogies to, 
605-8; neuronal membrane repre- 
sented by, 874; RLC circuit, 15 
electrical impedance tomography, 
334-35 

electric field, 377-78; biological, 
visualization of, 846; membrane 
transport and, 618; of radar, 

861-63 


electricity pricing, 646 
electric permittivity: effective, 501-4, 
698-99, 701-2; of free space, 161, 
377 

electric potential, 377 
electromagnetic field tensor, 580 
electromagnetic potentials, 156, 161 
electromagnetic waves, 673-74; in 
composite medium, 699 
electromagnetism, classical, 377-78. 

See also Maxwell’s equations 
electronic structure of solids, 847-51 
electron-to-atom (e/a) ratio, 456 
elementary functions, 19; implicitly 
elementary, 153; polynomial 
approximations of, 759-61 
elliptic coordinates, wave equation 
in, 160 

elliptic functions, 20 
elliptic integrals, 230; moments of, 
931-32 

elliptic PDEs, 17, 198, 306-7; finite- 
difference methods for, 307-10; 
finite-element methods for, 

311-13; software for, 921-22; 
uniformly elliptic, 306, 317. 

See also Laplace’s equation 
El Niño, 499-500 
Emacs, 828, 837, 913 
emergent properties, 84, 374, 597, 
616; of Internet, 885; systems 
biology and, 879 
emulsions, 520-21 
energy: of bending, 220; Cahn-Hilli- 
ard equation and, 138; in contin- 
uum mechanics, 451; control 
system stability and, 531; diffusive 
term of PDE and, 138; elasticity 
and, 224; in general relativity, 
145-46; homogenization and, 225; 
minimum principles for, 663; in 
Newtonian mechanics, 376; in 
quantum mechanics, 111, 142, 

167, 411-12, 414-15, 417-19; in 
special relativity, 110-11, 142; in 
turbulence, 726-31; Willmore, 220 
energy balance in combustion, 852 
energy conservation, 107, 109, 147; 
in general relativity, 582, 588; in 
Hamiltonian systems, 295, 297, 
302, 405; in Lagrangian mechanics, 
381; in Newtonian mechanics, 376; 
shockwave and, 721; wave equa- 
tion and, 197; in weather predic- 
tion, 706 

energy-efficient algorithms, 842 
energy-efficient buildings, 763-67 
energy estimates, in solving PDEs, 

197 

energy levels, 241, 848; in crystals, 
849 

energy-momentum-stress tensor, 

582 

energy-momentum tensor, 144-46 
energy norm for elliptic boundary- 
value problem, 313 
energy of activation, for combustion, 
852-54 

engineering: historical background 
of, 56, 59, 63-71, 73; spectra of 
vibrations in, 236-37 
ensemble learning, 358 
ensembles, matrix, 420. See also 
random matrices 
EnsembleVis framework, 846 
enthalpy: shockwave and, 721; in 
vortex sound equation, 786 
entire function, 176 
entropy, 431-33; of colloid, 516; 
convex, 123; inequalities related 
to, 437-38; of information, 549-51; 
maximum entropy principle, 
131-32; of polymer, 519; shock 
wave and, 720-21, 723; topological, 
390, 483 

entropy estimates, for conservation 
laws, 198-99 

enzyme kinetics, 617, 629 
Eötvös number, 736 
epidemiology, 687-94; network 
analysis in, 368-69, 372 
equation of state: in general relativ- 
ity, 582, 588; in weather predic- 
tion, 706 

Equatorial Counter Current, 499 
equilibrium point: as minimum of 
potential energy, 376; of nonlinear 
system of ODEs, 37-38, 185-90, 
386. See also fixed point of 
dynamical system 

equilibrium problems, 293; coupled, 
345-46; AᵀCA framework for, 
662-63 

equioscillation, 30, 259. See also 
Remez algorithm 

equivalent martingale measure, 321 
equivariance, 392, 407 
Erdős number, 363 
Erdős-Rényi model, 365-66, 801 
ergodicity, 83; of random-matrix 
eigenvalues, 425; turbulent flows 
and, 726 

ergodic theorem, 851 
ergodic theory, 83, 392 


error analysis: backward, 26-27, 75, 
275; forward, 26-27; in numerical 
linear algebra, 274-75. See also 
rounding errors 

error-correcting codes, 548, 555-57 
error function, 19, 230; complemen- 
tary, 81, 230 

errors in modeling, classification of, 
53 

essential spectrum of linear 
operator, 240 

Euclidean space: mechanics in, 

107-8; n-dimensional, 22-23 
Euler, Leonhard, 59, 62, 64, 183, 635, 
773 

Euler-Bernoulli equation, 17 
Euler buckling, 514-15 
Euler equations, 146-47, 163, 192; in 
aerodynamics, 472; in fluid dynam- 
ics of sport, 599; of gas dynamics, 
316; shallow-water variant of, 
167-68 

Eulerian circuit, 562-63 
Eulerian description in continuum 
mechanics, 448-51, 457; granular 
materials and, 667, 671 
Eulerian tour, 562-63 
Euler-Lagrange equations, 134, 147, 
197-98, 220-21, 223, 379-81; 
derivation of, 379; numerical 
solution of ODEs and, 303-4 
Euler limit of Reynolds number, 
471-72 

Euler methods: with finite differ- 
ences, 310; for numerical solution 
of ODEs, 293-98, 300-301, 303 
Euler numbers, 91, 93, 227 
Euler-Poisson-Darboux equation, 

170 

Euler’s constant, 148, 228, 931 
Euler’s formula, 9, 173 
European mathematicians and 
government policy, 954-56 
evanescent waves, 674-75 
even function, 10 
event horizon, 585-87, 686 
evolutionarily stable state, 593 
evolution equations, 16-17, 241; 
coupled, 345-47; for granular 
materials, 672; in kinetic theory, 
430, 433; Korteweg-de Vries 
equation as, 150 
exchange of stability, 394 
excitable cells, 619-20; of heart, 
623-27 

expectation value, in quantum 
mechanics, 411, 413 


experimental applied mathematics, 
925-33; examples of, 926-32; 
introduction to, 925-26; limits of, 
932-33. See also computational 
experiments 

explicit Euler method, 293-96, 298 
exponential function: polynomial 
approximations of, 759-60; power 
series for, 20. See also matrix 
exponential 

exponential integral, 32 
exponential integrators, 301 
exponentially small function, 211 
exponential random graphs, 367 
externalities, market, 871-72 
extreme points, 90 

fabric tensor, 672 

factor analysis, in covariance matrix 
estimation, 652-53 
factorial function, 18. See also 
gamma function; Stirling’s 
approximation 
Falkner-Skan equation, 473 
Faraday’s law, 477 
Faraday tensor, 162 
fast Fourier transform (FFT), 94-95; 
in digital signal processing, 535, 
543; historical background of, 
75-76, 94; matrix factorization 
and, 265 

fast multipole method, 775-78 
FBP. See filtered back-projection (FBP) 
algorithm 

fear of mathematics, 950-52 
feasible point, 38, 282, 285-86 
feasible region, 286 
feasible set, 282, 285-87; convex, 283 
feasible solutions, 565 
feature selection techniques, 354-55 
feedback, 88-89, 523-24; in cellular 
regulation, 881; in communication 
systems, 551. See also control 
systems 

Feigenbaum constant, 399 
Feigin maps, 770 
Fenchel transform, 224 
Fermi-Dirac distribution, 416 
Fermi level, 850 
fermions, 416 
Feynman, Richard, 380 
Feynman-Kac formula, 642 
Ffowcs Williams-Hawkings equation, 
785 

FFT. See fast Fourier transform (FFT) 
fiber tractography, 845 
Fibonacci heap, 758 
Fibonacci numbers, 18; algorithms 
for computing, 43 
Fick’s law, 618 

fictitious forces, 380-81. See also 
Coriolis effect 
Fiedler eigenvector, 665 
field equations: for granular mater- 
ials, 666, 668, 670. See also 
Einstein’s field equations 
field of values, for eigenvalues of 
matrix, 268 

Filippov system, 769-70 
filtered back-projection (FBP) 
algorithm, 818-23, 868 
filtering problem, in stochastic 
analysis, 325-26 
filters: for dimension reduction, 
354-55; in image processing, 353; 
in nonlinear programming, 292; 
in signal processing (see signal 
processing) 

finance, 640-48; stochastic analysis 
in, 320-21, 326, 641-42, 644-45, 
647. See also portfolio theory 
financial crisis of 2008, 640-41, 
644-46 

financialization of commodity 
markets, 646 

financial networks, 372-73 
finite-difference methods: in cardiac 
modeling, 626; historical back- 
ground of, 75; for PDEs, 307-10, 
337; stability properties of, 
309-10; in tsunami modeling, 718; 
in weather prediction, 707-8 
finite differences, 95-96, 272 
finite-element methods, 96, 310-14; 
for calculating eigenvalues, 245; 
conforming, 313; discontinuous 
Galerkin, 313, 315f, 718; historical 
background of, 73-74; mixed, 313, 
662; optimization and, 662; soft- 
ware for, 921-22; in solid mech- 
anics, 513; stabilized, 312-13 
finite impulse response (FIR) filter, 
533-34, 536, 540, 543 
finite-volume methods, 314-16; in 
tsunami modeling, 718 
first-digit law, 135-37 
first law of thermodynamics, and 
atmosphere, 488-89, 491, 706 
first principles, working from, 33-34 
Fisher information matrix, 594 
Fisher metric tensor, 594 
Fisher’s equation, 17 
Fisher’s exact test, 577 


Fisher’s fundamental theorem of 
natural selection, 592-93 
five-body problem, 774, 792 
fixed-income markets, 642-43 
fixed-point iteration, 34-35, 37, 346 
fixed point of dynamical system, 
386-88, 393; neuronal, 874-75, 

877. See also equilibrium point 
fixed point of market, 871 
flames, 852-57; introduction to, 

852-53; multidimensional laminar, 
853-55; planar adiabatic, 853; 
turbulent, 348-50, 852, 855-56 

floating-point arithmetic, 96-97; high- 
precision, 835-36, 841, 925-26, 
929-32; IEEE standard for, 6-7, 97, 
105, 835; programming languages 
and, 835-36; summation algo- 
rithms and, 41, 843. See also 
rounding errors 
floor function, 42, 830 
flops, 43, 267, 839, 839n 
flow map, 383, 387 
flow on graph, maximizing, 558-60, 
564-65 

flows, 187-88, 393; Hamiltonian, 401; 
numerical methods and, 302-3; 
phase space and, 382, 401; of 
piecewise-smooth systems, 769-70 
fluid dynamics, 467-76; aircraft 
noise and, 783-86; of atmosphere 
and oceans, 490-92; biological 
systems and, 610; Cauchy-Riemann 
equations and, 139; of compress- 
ible flow, 87-88; computer graph- 
ics for, 844; contact line paradox 
in, 169; enlivening lectures on, 
934-35; historical development of, 
62; historical development of 
simulation in, 338; instabilities in, 
474-76; introduction to, 467-68; 
kinematics in, 449, 468-69; level 
set method in, 116; Maxwell’s kin- 
etic theory and, 431; scaling law 
for pressure drop in, 91, 93; 
shocks in, 122-24; singularities in, 
125; of sport, 598-604; Tricomi 
equation in, 170; two-phase flows 
in, 116; velocity potential in, 156; 
vorticity in, 469-72 (see also 
vortices). See also aerodynamics; 
boundary layer; continuum 
mechanics; Euler equations; 
magnetohydrodynamics; Navier- 
Stokes equations; Reynolds 
number; shallow-water equations; 
turbulence 


fluorescence capillary-fill device, 
864-66 

flux expulsion, 479 
flux freezing, 478 
flux function, 314-15 
foam drainage, 739 
foams, 737-41. See also bubbles 
focus-focus, 397-98 
Fokker-Planck equations, 434-36, 
438, 440, 443; for distribution of 
algae, 615 

Föppl-von Kármán equations, 614 
force: in biological systems, 610, 613, 
615; in continuum mechanics, 
450-51; in Newtonian mechanics, 
375-78; in turbulent dynamics, 

727. See also fictitious forces; 
stress 

force dipole, 615 

force-directed graph drawing, 103 
Ford-Fulkerson algorithm, 559-60 
forest, 102 

Formula 1 racing cars, 598, 605, 
608-9 

Forth, 834, 838t 
Fortran, 828-32, 834-39 
forward and backward linear 
prediction, 540, 543 
forward difference, 95-96 
forward error, 26-27 
forward map, 327-28 
forward problem, 50, 336 
forward propagation of uncertainty, 
132 

four-body problem, 774, 792-94 
four-color map theorem, 562 
Fourier-Galerkin spectral method, 
317-18 

Fourier series, 23, 196, 260-61 
Fourier transform, 104-5; in digital 
FIR filters, 534; evaluated by 
Cauchy’s residue theorem, 178-79; 
Lebesgue spaces and, 99-100; 
optical fields and, 674-75, 679; 
short-time, 100; in solving PDEs, 
196, 244. See also discrete Fourier 
transform (DFT); fast Fourier 
transform (FFT) 

fractal dimension, of Arctic melt 
ponds, 704 

fractal properties, of Lorenz 
attractor, 929-30 

fractal theory: development of, 926; 

in popular culture, 949-50 
fractional differential equations, 18 
fractional Gaussian noise, in Internet 
traffic, 885 
fracture mechanics, 125, 515 
Frank free energy, 522-23 
Fréchet derivative, 13 
Fredholm alternative, 26, 202-3, 213 
Fredholm integral equations, 17, 200, 
202-3 

Fredholm operators, 246 
free boundary conditions, of 
quantum graph, 247 
free boundary problems, 196, 221, 

325; of flame propagation, 854-55 
free discontinuity problems, 221, 225 
free particle, quantum mechanical, 

414 

French mathematicians and 
government policy, 954-55 
frequencies, distributing among 
phone towers, 561 
frequency of oscillation, 236 
Fresnel diffraction formula, 675-76 
Fresnel propagator, 676-77 
friction, 378; in granular mechanics, 
667-68, 670; swimming microorgan- 
ism and, 615; in weather predic- 
tion, 706. See also drag; viscosity 
Friedmann equation, 588-89 
Friedmann-Lemaître universes, 

588-89 

friendship paradox, 367 
Frobenius, method of, 180 
Frobenius norm, 25, 266 
Froude number, 853 
Fuchsian ODE, 184 
full orthogonalization method, 276 
full waveform inversion (FWI), 

332-34 

functional analysis, 99-101; control 
theory and, 529; integral equations 
and, 203; spectral theory and, 
236-248 

functional programming language, 

828, 831 

functionals, 134; calculus of 
variations and, 219 
functions, 10-11. See also elementary 
functions 

function spaces, 99-101 
fundamental solution of PDE, 125 
fundamental theorem of algebra, 18, 
250, 571 

fundamental theorem of calculus, 14 
funding of mathematics, making the 
case for, 953-61 

gain matrix, 709 

galactic dynamics, mean-field theory 
in, 433-35, 444 


galaxy formation, 590 
galaxy mass, 772 
Galerkin methods: discontinuous, 
313, 315f, 718; projection, 118; 
spectral, 316-17; for tsunami 
modeling, 718 

Galerkin orthogonality property, 312 
Galilean group, 109, 111, 375 
Galilean invariance, 107, 109-10 
Galileo, 61, 109-10, 336, 374, 505 
game theory: adaptation and, 

593-94, 596; control systems and, 
530-31; equilibrium problems in, 
293; financial markets and, 647; 
utility maximization and, 870 
Γ-convergence, 222-25 
gamma distribution, 231 
Gamma driver condition, 685 
Gamma freezing condition, 685 
gamma function, 19-20, 148, 174, 
228-29; analytic continuation of, 
179; confluent hypergeometric 
function and, 232; hypergeometric 
function and, 229; Riemann zeta 
function and, 229 
GAMS (General Algebraic Modeling 
System), 838 

gas dynamics equations: finite- 
volume methods for, 315-16 
gauge: Coulomb, 161; Lorenz, 161; 

in numerical relativity, 683, 686 
gauge transformations, 161-62 
Gauss, Carl Friedrich, 63, 65, 76, 94, 
227, 229, 231, 244 
Gaussian basis functions, 261 
Gaussian beam, 676 
Gaussian curvature, 220; leaf growth 
and, 614 

Gaussian (normal) distribution, 230; 
in likelihood function, 659; 
random matrices from, 420, 824; 
turbulent flows and, 726; 
uncertainty principle and, 927 
Gaussian elimination (GE), 35, 264; 
backward error analysis and, 75, 
275, 337; for banded matrices, 

272; computational cost of, 44, 
267; with partial pivoting, 265, 
273; for sparse matrices, 273 
Gaussian ensembles of random 
matrices (GOE and GUE), 420-25 
Gaussian filter, 353 
Gaussian noise: in Bayesian example, 
660; fractional, in Internet traffic, 
885; in linear quadratic optimal 
control, 529-30; in signal process- 
ing, 536-37, 544-45 


Gaussian wave packet, 414 
Gauss methods, for numerical 
solution of ODEs, 300, 302-5 
Gauss-Seidel iteration, 276, 279 
Gauss-Seidel multiphysics coupling, 
345, 348 

gene action and regulation, 880 
General Algebraic Modeling System 
(GAMS), 838 

generalized coordinates, 380-82 
generalized eigenvalue problem, 
271-72, 848 

generalized functions (distributions), 
140-41; Banach spaces of, 100-101 
generalized harmonic formalism, 
683-84, 686 

generalized minimal residual 
(GMRES) method, 276 
generalized momenta, 380, 382 
general relativity, 107, 111, 129-30, 
579-90; basic structure of, 579-83; 
black holes and, 585-87, 684, 
686-87; Cauchy problem in, 436; 
cosmology and, 587-90; field equa- 
tions in, 144-46, 579, 582-88, 683; 
further issues in, 590; numerical 
studies of, 145, 582-83, 680-87; 
Schwarzschild solution in, 583-87, 
684 

general solution of ODE, 182 
generating function, for Bessel 
functions, 137 

genetic algorithms, for traveling 
salesman problem, 781 
genomics, comparative visualization 
tool for, 845-46 

geodesic deviation equation, 582 
geodesics, 129, 144, 581; calculus of 
variations and, 219; of network, 
361, 364 

Geoduck, 787-90 
geometric mean, for Hermitian 
positive-definite matrices, 280 
geometric modeling, 575-76, 787-90 
geometric series, 174, 229 
geophysical inversion for near- 
surface properties, 327, 331-34 
geopotential, 491, 494 
geostrophic balance, 493-97 
Gerchberg-Saxton iterative 
algorithm, 677 

Gershgorin’s theorem, 263, 267-68 
giant component of graph, 361, 366, 
368-69 

Gibbard-Satterthwaite theorem, 893, 
895 

Ginibre ensemble, 420 





Gini index, 357 

Ginzburg-Landau equation, 462-65, 
467 

Ginzburg-Landau theory, 148-49, 

225 

Givens rotations, 265 
glide reflection, 403 
global optimization, 285, 795; in 
networks, 886-87 

global warming: sea ice and, 694-95. 

See also greenhouse effect 
gluing bifurcation, 400 
GMRES (generalized minimal 
residual) method, 276 
GNU MPC library, 835 
GNU MPFR library, 835 
GNU Octave, 832 
golden ratio, 18 

Golden-Thompson inequality, 280 
Goldman-Hodgkin-Katz equation, 
617-18 

Goldstine, Herman, 62-63, 73, 426 
Goldstone modes, 463, 466 
golf ball flight, 746-49; dimples and, 
749 

gonorrhea, 691-92 
Good Will Hunting (film), 947 
Google PageRank algorithm, 4, 48, 
276, 364, 755-57 
Gould, Stephen Jay, 591 
governmental policy, 953-61 
GPUs (graphics processing units), 832 
gradient flow equation, nonlinear, 

198 

gradient flows, 226 
gradient operator, 27, 115 
gradient projection, 291 
gradients, natural, 593-94 
gradient vector, 14, 27, 191; auto- 
matic computation of, 751 
Gragg’s method, 298 
Gram-Schmidt orthogonalization, 

265 

grand mean, 655 

granular flows, 665-73; multipolar 
effects in, 672-73; regimes of, 
667-71; size segregation in, 668, 
672-73 

granular materials, 665-66; foams 
and, 738-39 
graph coloring, 561 
graph databases, 373-74 
graph density, 361 
graphical user interfaces (GUIs), 
830-31 

graphics for research papers, 914, 
918 


graphics processing units (GPUs), 832 
graph theory, 101-3, 552-53, 

557-64; basic concepts of, 557; 
combinatorial optimization and, 
564-65; complex systems analyzed 
with, 84; Laplacian matrices in, 

662, 664-65; searching a graph 
and, 757-59; spectral analysis and, 
246-47; web graph and, 755-57. 
See also network analysis 
graph traversal, 757-59 
gravitational boosting, 926 
gravitational field: Green’s theorem 
and, 857-59; linear, 108; Poisson’s 
equation for, 307, 857 
gravitational potential, 775 
gravitational redshift, 583-85 
gravitational waves, 145-46, 582-83, 
687 

gravity: bubbles and, 736; as conser- 
vative force, 156, 377; in fluid 
dynamics of Earth, 490-91; 
Newtonian, 376-77, 771-72, 775. 
See also general relativity 
Grayson theorem, 116 
grazing bifurcations, 770-71 
greatest lower bound, 11
greedy algorithms: for cheap net- 
work problem, 558; for com- 
pressed sensing, 825; not gener- 
alizable, 566; for spanning tree 
problem, 566 
Greek alphabet, 8, 9t 
Green functions, 140, 858-60; for 
Laplace equation, 85, 125; for 
Poisson equation, 125, 140; for 
radar source, 861; resonances of 
Schrödinger operators and, 243;
for unit disk, 85 

greenhouse effect, 486-88; energy- 
efficient buildings and, 763; 
regulation of CO2 emissions and,
646. See also global warming 
Green’s theorem, 857-60 
Gröbner basis, 573-74, 577;
Buchberger’s algorithm for, 35 
Gross-Pitaevskii equation, 151 
group theory, 404-5; voting 
paradoxes and, 894-95 
growth factor, in Gaussian 
elimination, 275 
GUIs (graphical user interfaces), 
830-31 

gyres, ocean, 492-96 
gyroscope, reduced-order model of, 
119 


gyroscopic sensor of fruit fly, 744, 
746 

gyroscopic system, 272 

H2-norm, 529-30
H∞-norm, 529-31, 539
Haar condition, 256-57, 259, 261 
Haar measure, 420 
Hadamard, Jacques, 50, 72, 204, 328 
Hadamard matrix product, 801 
Hadley circulation, 491-92, 499 
half-planes, 9 

Hall’s marriage theorem, 560 
hallucinations, visual, 878 
Halmos, Paul, 3, 909, 939 
Hamilton, William Rowan, 64, 66, 374 
Hamilton circuit, 564, 779 
Hamiltonian: in classical mechanics, 
109, 382; Dirac, 143; in Hamilton- 
Jacobi equation, 191; of network, 
367; in optimal control problem, 
323 

Hamiltonian flows, 401 
Hamiltonian matrix, 21; algebraic 
Riccati equation and, 165-66 
Hamiltonian mechanics, 382 
Hamiltonian operator, 167, 411-12, 
847-48; approximation methods 
using, 418-19; modeled by random 
matrix, 427; for multi-electron 
atom, 417; periodic, 849, 851 
Hamiltonian systems: integral curves 
of, 183; Noether's theorem for,
405; numerical solution of differential
equations for, 295-97,
302-4; orbits of, 189-90, 401;
oscillatory behavior in, 304;
Painlevé equations written as, 164
Hamilton-Jacobi-Bellman equation, 
198, 322; in portfolio theory, 320, 
644-45 

Hamilton-Jacobi equation, 191, 199 
Hamilton’s equations, 295, 382;
bifurcation theory and, 401 
Hamilton’s principle, 380 
Hamming code, 548, 555-57 
Hankel matrices, 254-55 
Hankel’s loop integral, 179 
hapten, 864, 864n 

Hardy, G. H., 3, 70, 72, 579, 636, 946 
Hardy space, real, 100 
Hardy-Weinberg equilibrium, 578-79 
harmonic analysis: approximation 
theory and, 255; in image process- 
ing, 815; kinetic theory and, 437 
harmonic coordinates, 683 
harmonic functions, 155-56, 201-2 
harmonic oscillator, 376; damped,
378; quantum mechanical, 185,
231, 412, 848

Hartman-Grobman theorem, 186 
Hartmann flow, 480 
Hashin-Shtrikman bounds, 702 
heap, of graph traversal algorithm, 
758-59 

heaps of pieces model, 797-800 
heart, 623-27, 846 

heat. See thermal energy of combustion 
heat conduction: backward, 204-5;
in composite medium, 103 
heat equation. See diffusion (heat) 
equation 

heat flux, in continuum mechanics,
451 

Heaviside, Oliver, 59, 66, 70, 635, 673 
Heaviside function, 850 
heavy-ball methods, 288 
heavy-tailed distributions, in Internet 
traffic, 885-87 
Hebbian plasticity, 878-79 
Heisenberg uncertainty principle, 57, 
413 

Helfrich-Canham free energy, 520-21 
helicity: conservation of, 472;
magnetic, 478, 482-83 
Helmholtz, Hermann von, 472, 475,
808 

Helmholtz equation, 156, 207-8, 233, 
307, 674; modified with refractive 
index, 678 

Helmholtz’s laws of vortex motion,
472 

herd immunity, 689-90 
Herglotz function, 699, 702 
Herglotz wave function, 208 
Hermite equation, 185 
Hermite-Gaussian beams, 676 
Hermite polynomials, 122, 231-32; in 
optics, 676; and quantum mechan- 
ical harmonic oscillator, 231, 412 
Hermitian matrices, 21; eigenvalues 
of, 25, 267-72, 848; inequalities 
on, 280; Lanczos iterative method 
for, 277; positive-definite, factor- 
ization of, 264-66, 273; in 
quantum mechanics, 412-13; 
Rayleigh quotient of, 134. See also 
self-adjoint matrices 
Hertzian elastic contact, 515, 668 
Hessenberg matrix. See upper 
Hessenberg matrix 
Hessian matrix, 14, 90, 285; in 
Hamiltonian systems, 304; in 
nonlinear optimization, 289, 291 


heteroclinic orbits, 189-90, 384, 387, 
393, 464 

hierarchical organization, in 
networks, 363 

high contact (smooth fit) principle, 
324-25 

high-frequency trading, 646-47 
high-performance computing,
839-43, 840f; in China, 954 
high-precision arithmetic, 835-36, 
841, 925-26, 929-32 
Hilbert’s Nullstellensatz, 571 
Hilbert spaces, 24; functionals on, 
134; in quantum physics, 107, 167, 
412-13; reproducing kernel, 100; 
spectral theory in, 238-39 
Hilbert’s sixteenth problem, 630 
history of applied mathematics, 
55-78; classroom use of, 934-35; 
before Industrial Revolution,
59-63; introduction to, 55-59; late 
nineteenth century to World War II, 
66-72; mathematical tables and,
66, 74-75; in nineteenth century, 
63-66; periods of, 59; pioneers in, 
58-59; during and after World 
War II, 72-78 

HITS (hyperlink-induced topic 
search) algorithm, 4-5, 756 
HIV/AIDS, 692-94; viral image 
reconstruction, 815 
Hodgkin-Huxley model, 874-76 
Hohmann transfer ellipses, 926 
holomorphic functions. See analytic 
functions 
holonomy, 581 

homoclinic bifurcations, 190, 393, 
396-98, 400-401 
homoclinic orbits, 190, 384, 387 
homoclinic snaking, 401 
homoclinic tangencies, 398 
homoclinic tangles, 398 
homogeneous material, 510-11 
homogenization, 103, 120, 193, 225, 
501; biological tissues and, 611; 
bubbles and, 737; granular mater- 
ials and, 665; inverse, 702-3; sea 
ice microstructure and, 697-98, 
703. See also effective medium 
theories 

homotopy. See continuation 
Hooke’s law, 149-50, 376, 505, 513 
Hopf bifurcation, 388-89, 395, 400; 
in epidemiology, 691; in Hodgkin- 
Huxley model, 875; Lorenz equa- 
tions and, 159; pattern formation 
and, 462-63; symmetry and,
409-10; van der Pol oscillator and, 
189 

Horner’s method, 46-47, 760-61 
horseshoe, Smale, 190, 384, 390-91 
hospitals, optimal sensor location in, 
763-67 

Householder reflectors, 265 
H theorem, 431-32, 441 
Hubble parameter, 587 
hubs, 362-64, 369-70 
hull. See convex hull 
Hurwitz zeta function, 929 
Huygens’s principle, 676 
hybrid models, of cell-to-cell 
communications, 882 
hybrid systems, 103-4; piecewise- 
smooth, 401-2, 769-70 
hydraulic jump, 715, 718 
hydrodynamic distribution, 431-32 
hydrodynamic instability, of flames, 
854-56 

hydrodynamic limit of Boltzmann 
equation, 443 

hydrodynamics. See fluid dynamics 
hydrodynamic theory of flame 
propagation, 855-56 
hydrogen atom, 237, 415 
hydrostatic balance, in atmosphere, 
487, 489, 494 

hydrostatic models, in weather 
prediction, 706 

hydrostatic pressure, in tsunami 
modeling, 715-16 
hyperasymptotics, 638-40 
hyperbolic conservation laws, 86-88, 
122-24 

hyperbolic fixed point, 386-87 
hyperbolic PDEs, 17, 307; nonlinear, 
in tsunami modeling, 715-16, 718; 
in numerical relativity, 682-86; 
quasilinear, 86-87, 122-24; soft- 
ware for, 920. See also wave 
equation 

hyperbolic solution, of dynamical 
system, 394 

hypercube, volume of, 29 
hypergeometric equation, 184-85 
hypergeometric functions, 19, 185, 
229-30, 233; confluent, 232 
hypersphere, volume of, 28-29 
hypocoercivity, 438, 443 
hypoellipticity, 438, 443 
hypoplastic models, 667, 671-72 

ice. See sea ice 
ice ages, 487 

ice-albedo feedback, 695-96, 703 
ideal gas law, 516-17
ideals, 571 
identity matrix, 21 
identity transformation, 405 
IIR (infinite impulse response) filter,
536, 540-41 

ill-conditioned problems, 26; inverse 
problems as, 328-29 
ill-posed problems, 26, 50; inverse 
problems as, 201, 204-5, 328-29; 
regularization methods for, 205-6, 
208, 329, 867 

image compression, 28, 813-14; by 
singular value decomposition, 
126-27. See also JPEG compression 
image denoising, 814-15 
image inpainting, 225, 814-15. See 
also cloning, in image retouching 
image processing, 813-16; calculus 
of variations in, 225, 815; cloning 
tools for, 5-6; color and, 813; for 
data representation, 353-54; 
dimension reduction in, 28; IPOL 
project for publishing about, 
922-23; level set method in, 116; 
wavelets in, 31. See also digital 
imaging; visualization 
Image Processing OnLine (IPOL), 
922-23 

imaginary part, 8, 173 
imaginary unit, 8, 173 
imitation, 596 

immersed interface method, 745 
impedance imaging, 733-34 
impedance tomography, 334-35 
implicit differential equations,
181-82 

implicit Euler method, 294-96, 300 
implicit Euler scheme, gradient flow 
and, 226 

implicitization problem, 571-72, 576 
Inada conditions, 645 
incidence matrix, edge-node, 664 
inclusion principle, in interval 
analysis, 105-6, 791 
incomplete beta function, 231 
incomplete gamma functions, 231 
incompressibility constraint, 598 
incompressible flow. See Navier- 
Stokes equations 
incompressible materials, 450 
incremental loading, 37 
indefinite integral, 14 
independence, model of, 577 
independent-component analysis,
543 

independent variable, 10, 181 


index contraction, 128 
indicial equation, 184-85 
induced norm, 24-25 
industrial mathematics: airport bag- 
gage screening, 866-68; history of, 
57-59, 66, 70, 72, 78; pregnancy 
testing kit modeling, 864-66; 
teaching and, 940-43 
inerters, 604-9 

inertia, and gravitational force, 579, 
581 

inertial frames, 107, 109-11, 130,
375 

inertia number of granular flow, 
667-68 

inertia of Hermitian matrix, 271 
inertia tensor, 130 
infectious diseases, 368-69, 372, 
687-94 

infimum (inf), 11
infinite eigenvalues, 271 
infinite impulse response (IIR) filter,
536, 540-41 

infinite series, 175; convergence of, 
11, 174-75. See also power series 
infix notation, 833 
inflection points. See saddle points 
influenza, 689-91 
information asymmetry, 872-73 
information geometry, 592-93 
information measures, 549-50, 552 
information retrieval, 355-56 
information theory, 73, 545-52; 
adaptation and, 593, 595; kinetic 
theory and, 437-38, 441; sampling 
rate in, 826-27 

initial conditions: for ODEs, 15; for 
PDEs, 16, 192 

initial-value problem, 15, 182, 187, 
293 

inner multiplication, 128 
inner product, 22 

inner product space, 22; norm on, 23 
insect flight, 743-46 
insertion algorithm, 41 
integer linear optimization, 568-69 
integer relation detection, 926 
integrable differential equations,
151, 193, 384 

integral, 14; Cauchy integral, 180; 
Cauchy principal-value integral, 
180 

integral equations, 17, 200-208; 
computerized tomography and, 
206-7; coupled Volterra of second 
kind, 865; historical background 
on, 201-3; ill-posed problems and,
204-6; introduction to, 200-201; 
inverse scattering and, 207-8; 
numerical solution of, 203-4; 
singular, 180; stochastic, 319 
integral transforms, 104-5; for solv- 
ing PDEs, 196. See also Fourier 
transform; Laplace transform 
integrated circuit design, 804-8 
integrating factors, 183 
integration by parts, 14, 197 
integro-differential equations, 17; 
coupled nonlinear Volterra, 865; 
in kinetic theory, 433; for neuron 
firing rates, 877 

interferometric synthetic-aperture 
radar, 863 

interior-point methods: in conic 
optimization, 290; in linear 
programming, 287-88; in non- 
linear programming, 291 
intermittency in dynamical systems, 
400-401 

Internet: architecture of, 883-87; 
collaboration using, 915; writing 
for, 901-3. See also web page 
ranking; Web sites 
Internet traffic, long-range 
dependence in, 885-86 
interpolation, 29-30, 248-55; for 
model order reduction, 118-19; 
multivariate, 261-62; orthogonal 
basis functions for, 257-58 
interpreter, 830 
interval, space-time, 110 
interval analysis, 105-6; computer- 
aided proofs via, 790-95 
interval arithmetic, 105, 790 
interval bisection, 791-94 
invariance properties, 51-52 
invariant manifolds, 383, 386-87 
invariant probability measure, 83 
invariants, and conservation laws, 
106-12 

invariant set of map, 83 
invariant subspace, 25 
inverse iteration, 270 
inverse of square matrix, 21
inverse problems, 50, 200-201,
327-35; in baggage screening,
866-68; Bayesian approach to, 133, 
329-30; for composite materials, 
504-5; concepts of, 327-30; as 
data fitting, 328, 332; of electrical 
impedance tomography, 334-35; 
geophysical, for near-surface properties,
327, 331-34; ill-conditioned,
328-29; ill-posedness of, 328-29;
integral equations in, 201, 204-8; 
level set method for, 116; linear 
and nonlinear, 328; outlook for, 
335; a priori information and 
preferences in, 329; of scattering, 
207-8, 327, 335; in spectral theory, 
236; Tikhonov regularization in, 
329; for tsunami source estima- 
tion, 716-17; uncertainty quanti- 
fication in, 132-33, 330; of X-ray 
computed tomography, 206-7,
327, 330. See also tomography 
inverse scattering transform, 151 
inverse-square law, 377-78 
inverse synthetic-aperture radar 
(ISAR), 862, 864 
involutory matrix, 265 
ion channels, 617-20, 624-25, 627; 
of neurons, 874 

IPOL (Image Processing Online), 
922-23 

iPython. See Project Jupyter 
irreducible matrix, 279 
irrotational fields, 156 
Ising integrals, 930-31 
isolated essential singularity, 177 
isomonodromy problem, 164 
isoperimetric inequality, 89 
isoperimetric problems, 219-20, 224 
isosurface extraction, 844 
isothermal chemical network, 630, 
634 

isotropic elasticity, 452, 454-55, 511 
isotropic material, 510-11 
isotropic space, 156 
isotropy group, 510 
isotropy subgroups, 408-10 
iteration, 34-35; convergence of, 34, 
50-51 

iterative algorithms, 48 
iterative hard thresholding, 825 
iterative methods in numerical linear 
algebra, 275-77, 279 
iterative refinement, 264, 342, 841 
Ito processes, 319-20, 641 

Jacobian matrix, 37, 121; computa- 
tion of, by automatic differenti- 
ation, 750-52; of coupled system, 
345-48; finite difference approxi- 
mation to, 122, 347; linearization 
of dynamical system using, 37-38, 
121-22, 186, 386-88, 393-94, 397; 
transformation of tensor compo- 
nents and, 129 
Jacobi iteration, 34, 276, 279 


jamming, 739 

Jaumann derivative, 666, 671 
Java, 829f, 831, 836, 838 
Jeffery-Hamel flow, 470 
Jensen’s inequality, 90 
John’s ultrahyperbolic equation, 867 
Jordan block, 238 

Jordan canonical form, 112-13; func- 
tions of matrices in terms of, 98 
Jordan curve theorem, 562 
Joukowski mapping, 86 
JPEG compression, 28, 547, 812-15, 
823 

Julia, 829f, 831, 832f, 833, 835, 838 
Jupyter Notebook, 832f, 833, 919 
Jurassic Park (film), 949-50 

Kac, Mark, 17, 75, 246, 434, 436-37, 
439-40 

Kadomtsev-Petviashvili equation,
151 

Kalman-Bucy filtering theorem, 326 
Kalman filter, 529-30, 545; optimal 
sensor location and, 763-64 
Kalman gain vector, 539 
Kantorovich, Leonid, 56, 68, 72, 89, 
225-26, 750 

Kantorovich condition, 791 
Kantorovich inequality, 280 
Karatsuba algorithm, 43-44 
Karmarkar’s algorithm, 287 
k-core of graph, 361, 363
Kedem-Katchalsky equation, 622 
Keldysh equation, 170 
Keller-Segel model, 465 
Kelvin, Lord (William Thomson), 66, 
446, 472, 752 

Kelvin-Helmholtz instability, 474-75, 
728 

Kelvin mode, 475 
Kelvin problem, 740 
Kelvin waves, 498, 500 
Kepler conjecture, 791, 922 
Kepler interface, 922 
Kepler’s third law, 772 
kernel: of integral equation, 200; of 
integral transform, 104 
kets, 412 
kidneys, 620-23 

kidney transplants, matching for,
555 

Killing vector fields, 111, 582, 587 
kinematics, 448-49; algebraic geom- 
etry of, 574-75, 767-69 
kinematic viscosity, 469 
kinetic energy, 376; in quantum 
mechanics, 411 


kinetics, chemical, 627-34 
kinetic theory, 428-46; birth of, 
428-31; challenges for, 442-44; 
collisional relaxation and, 431-34, 
443; collisionless relaxation and, 
433-34, 443; entropy in, 431-33, 
437-38; Landau damping and, 
433-34, 440, 442; landmarks of, 
440-42; mathematical tools and 
trends in, 437-39; models in, 
435-37; problems that drive the 
field of, 434-35

Kirchhoff boundary conditions, 247 
Kirchhoff’s formula, 193-94 
Kirchhoff’s law of thermal radiation,
486-87 

Kirchhoff’s voltage and current laws, 
15, 662-63, 665; percolation and, 
698 

KKT (Karush-Kuhn-Tucker) condi- 
tions, 285-88 
KKT matrix, 661-65 
Klein, Felix, xi, 59, 65-67, 69 
Klein four-group, and voting 
systems, 894 

Klein-Gordon equation, 142-44; dis- 
persion relationship for, 194-95 
k-means algorithm, 358
knapsack problem, 565, 568, 570 
knots and links, 752; of 
macromolecules, 752-55 
Knudsen number, 438, 667, 670, 672 
Knuth, Donald, 48, 837, 906, 913 
Kohn-Sham model, 847, 851 
Korteweg-de Vries equation, 17,
150-51, 192, 195; Painlevé
equations and, 164; spectral 
properties of, 247 
Kretschmann scalar, 584 
Kronecker delta, 10t
Kruskal diagram, 586 
Kruskal-Szekeres coordinates, 
585-87 

Krylov subspace methods, 276-77; 
computational complexity and, 
341; in multiphysics problems, 
346-48; Newton’s method and,
122, 346-47; in solving differential 
equations, 301, 308 
Krylov subspaces, 113-14, 118 
Kuhn length, 518 
Kullback-Leibler divergence, 355, 
593-95 

Kummer functions, 232 
Kuramoto-Sivashinsky equation,
138, 640 

Kuratowski’s theorem, 562 
Kutta-Joukowski hypothesis, 473
k-way tensor, 578 

L2-norm, 23; least-squares approximation
with, 30
L∞ approximation, 30, 259-60
L∞-norm, 23
Lp-norm, 23 
LAB color space, 811-12 
Lady Windermere’s fan, 296-97 
Lagrange, Joseph Louis, 62-65, 773 
Lagrange form of interpolating 
polynomial, 249 

Lagrange multipliers, 38-39, 221; 
constrained least squares and,
662; in continuous optimization, 
285-86, 290-92; in portfolio 
optimization, 649-50 
Lagrange strain, 508 
Lagrangian, 108, 147, 379; symmetry 
of, 381 

Lagrangian density function, 197 
Lagrangian derivative, 162, 490 
Lagrangian description in continuum 
mechanics, 448-50; granular 
materials and, 667 
Lagrangian duality, 569 
Lagrangian function, 39, 285-86,
291, 662 

Lagrangian mechanics, 108-9, 
379-82; generalized to special 
relativity, 110 

Lagrangian relaxation, 808, 825 
Laguerre-Gaussian beams, 676, 678 
Laguerre polynomials, 122, 231-32; 
in optics, 676; in wave functions, 
415 

λ-matrices, 272

Lambert W function, 17, 20, 151-55 
Lamé constants, 521
laminar flame speed, 853 
laminar flows, 724; in transition to 
turbulence, 728 

Lanczos algorithm, 277; in latent 
semantic indexing, 890 
Landau damping, 433-34, 440, 442 
Landau equations, 461-62 
landslides, 668, 719 
Langlands program, 591 
La Niña, 499
LAPACK, 280, 832, 838 
Laplace, Pierre Simon, 63 
Laplace-Beltrami operator, 220, 240, 
246 

Laplace operator, 306. See also 
Laplacian 

Laplace-Runge-Lenz vector, 377 


Laplace’s equation, 16, 155-56, 
191-92; analytic functions and,
174; bounded solution in unit 
sphere, 234; Cauchy-Riemann 
equations and, 139; coordinate 
systems for separability of, 234; 
eigenvalue problem for, 17; as 
elliptic PDE, 17, 307; Green func- 
tion for, 85, 125; integral equa- 
tions and, 201-2; for irrotational 
flow, 146; marginal sea ice zone 
and, 703; ocean surface waves and, 
493; singular behavior of, 124-25; 
transport problems governed by, 
500-501 

Laplace transform, 105, 174; inverted 
using Cauchy’s residue theorem, 
179; in time variable of PDEs, 196 
Laplacian, 16, 27, 155, 191; nonlinear 
variants of, 156; in PDEs other 
than Laplace's equation, 156 
Laplacian matrices, 370, 662, 664-65 
lapse rate, adiabatic, 489 
laser beams, 676 
Las Vegas algorithm, 48 
latent semantic indexing, 888-90 
LaTeX, 42, 838, 913-16, 919
latexdiff, 914, 915f
lattice: Banach, 100; symmetries of, 
404. See also crystal lattice 
Laurent’s theorem, 176-77 
Lax, Peter, 1, 76-78, 338 
Lax equivalence theorem, 299, 310 
Lax-Milgram theorem, 311 
Lax pair, 151, 164 
leaky integrate-and-fire model, 876 
leapfrog scheme, 707 
learning, neural correlates of, 878-79 
least action, principle of, 379-80 
least-mean-squares algorithm, 539 
least-squares approximation, 30, 
256-57; Chebyshev series for, 
258-59; Fourier series in, 260-61 
least-squares problem: constrained, 
662; linear, 273-74, 277, 538-39 
least upper bound, 11
Lebesgue constant, 249, 254, 260 
Lebesgue measure, and stochastic 
analysis, 319 
Lebesgue spaces, 99-100 
lectures for the public, 935 
Lefschetz, Solomon, 57-58, 76 
Legendre-Fenchel transform, 663 
Legendre functions, 233-34;
associated, 414 
Legendre-Galerkin spectral 
approximation, 317 


Legendre polynomials, 23, 122, 
231-32, 257; in approximating 
solutions of PDEs, 316-17 
Legendre's equation, 233 
lemmas, 899, 903 
Leonardo da Vinci, 505 
Leonardo’s paradox, 736 
Leontief's input-output models, 279 
Leslie matrices, 278-79 
level set method, 114-16; for data 
visualization, 844; for PDEs, 194 
Lie algebra: of Killing vector fields, 
582; in normal form theory, 388; 
of special unitary group SU(2), 413
Lie bracket, 21 
Lie groups, 405 
Liesegang patterns, 459, 466 
lift, 746-47 

Lighthill, James, 722, 783-86 
Lighthill stress tensor, 784-85
likelihood: in algebraic statistics,
577; in Bayesian inference, 596, 
658-59, 661. See also maximum- 
likelihood estimate 
limit: commutativity of, 82; of func- 
tion, 11; of sequence, 11; of 
sequence in a normed vector 
space, 24 

limit cycles, 189, 389, 391; of 
van der Pol equation, 388 
Lincoln (film), 944 
Lindstedt-Poincaré perturbation
theory, 929 

linear algebra, 25-26; teaching of, 
939-40. See also linear systems; 
matrix; numerical linear algebra 
and matrix analysis; vector spaces 
linear congruential pseudorandom 
numbers, 762 

linear convergence of iteration, 34 
linear function, 10 
linear functional, 99 
linear independence, 22 
linearization, 37-38; of dynamical 
systems, 386 

linear least-squares problem, 273-74, 
277, 538 

linear momentum. See momentum 
linear multistep methods, for 
numerical solution of ODEs, 
298-300 

linear operators, 24-25; bounded, 25, 
238; in continuum mechanics, 
447-48; in quantum mechanics, 
411-13 (see also Schrödinger
operators) 
linear programming, 282, 286-88; in
compressed sensing, 825; as conic
optimization, 290; as convex optimization,
285; duality in, 286-88,
567-69; historical development of,
56, 72-73, 89; (mixed-) integer,
568-69; integer problems in,
567-69; relaxations in, 570; for 
traveling salesman problem, 780. 
See also continuous optimization 

linear stability analysis, 38, 185-87 
linear systems: computational cost of 
solving, 44-45, 267, 272-73; condi- 
tion number and, 26; criteria for 
existence of solution, 26; matrix 
inversion and, 21, 273; in max-plus 
semifield, 797; overdetermined 
and underdetermined, 273-74; 
sparse, 272-73. See also Gaussian 
elimination (GE); numerical linear 
algebra and matrix analysis 
linear time-invariant (LTI) operators, 
533-36 

linear transformations. See linear 
operators 

lines, method of, 194 
linkage classes of chemical network, 
633-34 

linkages, 574-75 

links (in knot theory), 752; of macro- 
molecules, 752-54 
links (in robotics), 767-69 
LINPACK, 280, 830, 832 
Liouville equation, 438 
Liouville measure, 430-31 
Liouville’s theorem, 176 
Lipschitz condition, 11, 35, 187;
one-sided, 294 
liquid crystals, 521-23 
Lisp, 828-29, 833-34, 838 
literate programming, 837, 915 
lithotripsy, 724 
little-oh notation, 12, 211-12 
local search, 569 
logarithm, 173-75; principal, 10 
log-Euclidean mean, 280 
logistic equation, 52-53, 156-58,
691; solutions of, 183, 187-88; 
stochastic, 325 

logistic map, 18, 157-58, 398-99, 
401-2 

log-optimal portfolio, 552 
loop shaping, 526 

Lorentz force, 161-62, 377, 380, 382; 
in magnetohydrodynamics,
476-77, 479-80 
Lorentz group, 110 


Lorentz transformations, 110, 130, 
580; of Maxwell’s equations, 162 
Lorenz attractor: existence of, 791;
fractal properties of, 929-30 
Lorenz equations, 158-59, 391 
Lorenz gauge, 161 
Lorenz maps, 391-92, 397, 399 
lossless compression, 547-49, 814 
lossy compression, 547-50, 812 
Lotka-Volterra system: as competi- 
tion model, 188; as predator-prey 
model, 15-16, 71 
lower bound, 11
lower semicontinuity, 222 
lower triangular matrix, 21
Löwner (partial) ordering, 280
low-rank approximation, in fast 
multipole method, 776, 778 
LSQR algorithm, 277 
LTI (linear time-invariant) operators, 
533-36 

lubrication theory, 169 
LU factorization, 264-65, 275. See 
also Gaussian elimination (GE) 
Lyapunov, Aleksandr, 57, 69 
Lyapunov equation, 166, 168-69 
Lyapunov exponents, 83; in max-plus 
model, 799-800 

Lyapunov functions: for adaptation, 
593-94; in chemical reaction net- 
work theory, 634; control systems 
and, 531-33; for neuron network 
model, 877; stability of endemic 
state and, 692-93 
Lyapunov-Schmidt reduction,
395-96 

Lyapunov-stable equilibrium, 186, 
386 

Mach, Ernst, 720 
Mach angle, 723 

Mach number, 721-22; aircraft noise 
and, 783-86 
Mach reflection, 722 
Mach surface, 722 
Mach wave, 722 
Maclaurin expansion, 175 
macromolecules, knotting and 
linking of, 752-55 
magnetic buoyancy, 479, 483-84 
magnetic diffusivity, 477 
magnetic field, 377-78; of Earth, 481; 
gravitational collapse and, 478; of 
sun, 476, 479-81, 483 
magnetic helicity, 478 
magnetic induction equation,
476-78, 481-83 


magnetic pressure, 479 
magnetic reconnection, 479 
magnetic resonance imaging (MRI), 
816, 826; visualization methods 
for, 845 

magnetic Reynolds number, 477-79, 
482-84 

magnetic tension, 480, 484 
magnetoacoustic waves, 480-81 
magnetohydrodynamics, 476-85; 
current state of, 484; instabilities 
in, 483-84; of perfectly conducting 
fluids, 478 

magnetohydrodynamic waves,
480-81 

magnetorotational instability, 484 
Mahler measure, 931 
Maier-Saupe theory, 521-22 
Malthusian matrix, 592-93 
Mandelbrot, Benoit, 885, 926 
manifolds, 127-30 
Maple, 33, 832, 932 
mappings, 10-11 
maps: with a gap, 770; piecewise- 
linear, 770-71; piecewise-smooth, 
769-71 

Marčenko-Pastur law, 423, 426
market portfolio, 651-52, 656 
markets, 871; failures of, 871-72 
Markov chain Monte Carlo 
algorithms, 661 

Markov chains, 116-17; Bayesian 
application of, 133; communica- 
tion network and, 801; homoge- 
neous, 116-17; PageRank scores 
and, 755-56 

Markovian control process, 322 
Markovian stochastic differential 
equations, 642 

Markowitz, Harry, 273, 648, 651 
Markowitz pivoting, in Gaussian 
elimination, 273 
Markstein length, 854, 856 
Marmousi model, 333-34 
martingales: commodities modeled 
as, 646; moral hazard and, 873; 
optimal control and, 324; option 
pricing and, 321 
mass, dark matter and, 771-74 
mass action, law of, 616 
mass action kinetics, 628-30, 633-34; 
Bayesian approach to rate constant 
of, 659-60 

mass balance, 449-50, 457-58, 666;
of fuel in combustion, 852 
mass conservation: in atmosphere 
and ocean, 491, 494-95; in 
Navier-Stokes equations, 598; in
reaction network, 632; shockwave 
and, 721; in tsunami modeling,
715; in weather prediction, 706 
mass density, 506 
mass transport, 225-26 
matching: as linear program, 567-68; 
maximum, 565, 567; stable,
553-55; of workers to jobs, 560 
matching-pursuit algorithm, 825 
material constants, 451-52 
material frame indifference, principle 
of, 453-54, 666 
material properties, in AᵀCA
framework, 663 

Mathematica, 33, 833, 838t, 932 
mathematical modeling. See 
modeling 

mathematical programming, 283 
Mathieu functions, 159-60 
Mathieu’s equation, 159-60, 184, 234 
MATLAB, 33, 40, 106, 269, 273, 
832-34, 836; backslash operator, 
47, 833 

matrix, 20-22; banded, 272; bidiag- 
onal, 271; circulant, 21, 244, 268; 
companion, 36, 268-69; condition 
number for, 26, 263-64; defective, 
112-13; derogatory, 113; diagonal, 
21; diagonalizable, 112-13, 277; 
diagonally dominant, 263, 275-76, 
279, 345; eigenvalues of (see eigenvalues
of matrix); functions of,
97-99; Hamiltonian, 21; Hermitian
(see Hermitian matrices); Hessenberg
(see upper Hessenberg matrix);
Hessian (see Hessian matrix);
historical background of, 66, 68,
70, 73, 75, 263; identity, 21; ill- 
conditioned, 263; inverse of, 21 
(see also matrix inversion); invo- 
lutory, 265; irreducible, 279; 
Jacobian (see Jacobian matrix); 
Jordan canonical form of, 112-13; 
λ-matrices, 272; Laplacian, 370,
662, 664-65; Leslie, 278-79; lower 
triangular, 21; minimal polynomial 
of, 112-13; M-matrices, 279; non- 
derogatory, 269; nonnegative, 
278-79; nonnormal, 277-78; non- 
self-adjoint, 238; nonsingular, 21, 
26, 263, 266-67; normal, 238, 268, 
277; orthogonal (see orthogonal 
matrices); permanent of, 22, 45-46; 
positive-definite, 21; powers of,
113; pseudo-unitary, 278; random 
(see random matrices); rank of, 26,
126, 578; reducible, 279; self- 
adjoint (see self-adjoint matrices); 
similarity transformation of, 112; 
spark of, 824-25; sparse, 272-73; 
stochastic, 116, 279; symmetric 
(see symmetric matrices); symplec- 
tic, 166; Toeplitz, 21, 51, 538, 543; 
transition, 116-17, 756; transpose 
of, 21; triangular (see upper tri- 
angular matrix); tridiagonal, 270, 
272; unitary (see unitary matrices); 
upper Hessenberg, 113-14, 270; 
upper trapezoidal, 264; upper triangular,
21, 43; Vandermonde,
255, 824; well-conditioned, 263 
matrix absolute value, 266, 280 
matrix analysis. See numerical linear 
algebra and matrix analysis 
matrix completion problems, 280, 
827 

matrix decomposition, 264n 
matrix ensembles, 420. See also 
random matrices 

matrix exponential, 38, 97-98, 107, 
184, 293, 301; inequalities involv- 
ing, 280 

matrix factorizations, 264-66; non- 
negative, 890. See also singular 
value decomposition (SVD) 
matrix inequalities, 279-80 
matrix inversion: avoidance of, in 
solving linear system, 21, 273; 
computational cost of, 44, 273; 
condition number for, 26, 263 
matrix logarithm, 98 
matrix monotone function, 280 
matrix multiplication, 21; computa- 
tional complexity of, 12, 44-45, 
267, 578 

matrix norms, 25, 263; unitarily 
invariant, 266 
matrix polynomials, 272 
matrix sign function, 98, 166 
matroids, 566, 569 
max-flow min-cut problem, 558-60, 
564-67 

max-flow min-cut theorem, 566-68 
maximum entropy principle, 131-32 
maximum-likelihood estimate, 660, 
708; in PET scan reconstruction, 
821-22. See also likelihood 
maximum norm, 23 
maximum of function, 13; of n vari- 
ables, 14 

maximum principle for stochastic 
control problem, 323-24 


maximum principles for PDEs, 17, 
198 

max-plus algebra, 795-800 
Maxwell, James Clerk, 66, 429-31, 
523, 773-74, 808 

Maxwell-Boltzmann distribution, 416 
Maxwell conditions, 430 
Maxwell fluid, 666-67, 671 
Maxwell Garnett formula, 501-3 
Maxwellian distributions, 431 
Maxwell multipole, 155 
Maxwell’s equations, 160-62, 192; 
in curved space-time, 581; Dirac 
equation and, 143-44; wave solu- 
tions of, 673-74 
McCabe complexity, 836 
mean-field theory, 84, 433-36, 439, 
443; communication networks and, 
801; of electronic structure of 
solids, 851; in magnetohydro- 
dynamics, 482-83; neuron net- 
works and, 876 
mean filter, 353 
Mean Girls (film), 944, 947 
mean-value theorem, 13 
measles, 689 

measure differential inclusions, 770 
medical imaging, 327, 816-23. See 
also magnetic resonance imaging 
(MRI); positron emission tomog- 
raphy (PET); ultrasound imaging; 
X-ray computed tomography (CT) 
Mehler-Fock transform, 234 
Melnikov methods, 398 
membrane ion channels, 617-20, 
624-25, 627; of neurons, 874 
membrane potential, 619-20; 
cardiac, 624-25 

membranes, 520-21; semipermeable, 
in kidney, 621-23 

memory: long-term, 878-79; working, 
877-78 

Menger’s theorem, 567 
Mersenne twister, 762 
Merton, Robert, 320-21, 323, 641, 
643-45, 649 

mesoscopic scale, 429-30 
metals: deformation of, 511; liquid, 
magnetic fields in, 476, 480-83 
metamaterials, 502 
metric, 24; in general relativity, 
144-45, 680; on Riemannian 
manifold, 127 
metric space, 24 

metric tensor, 128-30; in general 
relativity, 580, 582 
Michaelis-Menten kinetics, 617, 629 
microgyroscope, reduced-order
model of, 119 
Millennium Bridge, 236 
min-cut, and chip design, 806 
minimal polynomial of matrix, 
112-13 

minimal residual (MINRES) method, 
276 

minimal surface equation, 198 
minimal surfaces, 89, 198, 219-20, 
737, 740, 791 

minimax approximation, 259-60 
minimax methods, 221 
minimax theorem, 662 
minimum cut, 560 
minimum-degree algorithm, 273 
minimum dissipation theorem, for 
Stokes flows, 471 

minimum mean cycle problem, 808 
minimum of function, 13; of n 
variables, 14 

minimum principles, symmetric 
framework for, 661-65 
Minkowski space-time, 107, 110-11, 
130; Einstein’s field equations and, 
145-46, 685 
Minkowski tensor, 130 
minor of matrix, 22 
MINRES (minimal residual) method, 
276 

mixed-integer linear optimization, 
568-69 

mixed mathematics, 55, 55f, 59-61, 
63-66 

mixed-mode oscillations, 400-401 
mixed-norm L^p spaces, 100
MizBee, 845-46 
M-matrices, 279 

mock-Chebyshev interpolation, 250 
modeling, 2, 52-54; historical devel- 
opment of, 56, 71-72, 77; multi- 
physics, 53-54, 345-50; philo- 
sophical reflection about, 58; in 
teaching applied mathematics, 
935-38, 940-43 
mode locking, 395, 400 
model-predictive control, 530, 532 
model reduction, 117-19. See also 
dimension reduction 
models: Bayesian parameter estima- 
tion for, 658-61; coupling of, 350; 
for inverse problems, 327; in net- 
work analysis, 365-68; for optimi- 
zation problems, 281-82; symmet- 
ries of, 404; uncertainties relating 
to, 658; validation of, 2, 54, 131, 
340, 343 


modularity, of graph partition, 365, 
367 

modulation equations, 462-63, 467 
modulation spaces, 100 
modulus of complex number, 9, 173 
molecular dynamics, 456-58 
Moler, Cleve, 830, 832-33 
moment map, algebraic, 572, 577,
579

momentum: canonical, 108; in
continuum mechanics, 450-51,
454, 457; generalized, 380, 382;
in Newtonian mechanics, 375;
in quantum mechanics, 411,
413-14; in special relativity, 663.
See also angular momentum
momentum balance, 450-51, 666;
in fluid dynamics, 469, 598
momentum conservation, 108, 111, 
381, 405, 450-51; in general rela- 
tivity, 582; in solid mechanics, 506; 
in tsunami modeling, 715 
momentum space wave function, 414 
Monge, Gaspard, 59, 63, 65, 225 
Monge-Ampère equation, 307,
309-10

monomial basis, 30
monotone systems, control of,
532-33

Monte Carlo methods, 48; in Bayes- 
ian inference, 661; development of, 
57; in finance, 642; random graphs 
and, 367; random number gener- 
ation for, 761-62; in uncertainty 
quantification, 132-33, 340 
Moore-Penrose pseudoinverse, 274, 
809 

Moore’s law, 337, 340, 804, 839 
moral hazard, 872-73 
Morris-Lecar model, 619-20 
mortgage-backed securities, 644 
motifs: of biological models, 84; of 
networks, 361-63 
Motz problem, 125 
mountain pass lemma, 221 
MPEG compression, 547 
MRI. See magnetic resonance imaging 
(MRI) 

Mullins-Sekerka problem, 139 
multichannel filtering, 543-44 
multigraph, 101 

multigrid methods, 277, 308, 766 
multi-index, 306 
multiphysics modeling, 53-54, 
345-50 

multiple-precision arithmetic. See 
high-precision arithmetic 


multiple-recursive pseudorandom 
numbers, 762 

multiple scales, perturbation method 
of, 216-17 

multiplication operator, 239, 244 
multipole expansion, 775 
multipulse homoclinic orbits, 398 
multiresolution analysis, 31 
multiscale modeling, 53-54, 103, 
119-20, 225; of biological systems, 
882 

multiscale problem, in continuum 
mechanics, 455-58 
multivalued functions, 10 
multivariate functions, 11; approxi- 
mation of, 261-62 
Mumford-Shah model, 225 
MUSIC direction-of-arrival algorithm, 
542 

mutation in populations, 595 
mutual information, 550, 552 

nabla, 14 

narrow-band level set methods, 115 
Nash, John, 946-47 
Nash equilibrium, 593, 647, 870 
natural gradients, 593-95 
natural selection, 591-93 
Navier-Stokes equations, 162-63,
192, 452, 467-70; biological systems
and, 610, 612, 615-16;
boundary layer and, 748-49; Bur- 
gers equation and, 138; contact 
line paradox and, 169; difficulties 
with, 469-70; divergent series and, 
640; Euler equations and, 146, 163; 
exact solutions of, 470; for flame 
propagation, 852, 854-55; golf ball 
dimples and, 748-49; Hooke’s law 
and, 150; insect flight and, 743-45; 
in modeling sports, 598-602; non- 
dimensional version of, 93; quasi- 
static limit of, 471; Taylor-Couette 
flow and, 406-7, 410; turbulence 
and, 727; turbulent flame simula- 
tion and, 349; in weather predic- 
tion, 706 

Navier-Stokes fluid, 451-53, 468 
N-body problem: central configur- 
ations and, 773-74; continuum 
model and, 772-73; dark matter 
and, 771-74; fast multipole simu- 
lation of, 775-78; interval analysis 
applied to, 792-94; relative equi- 
libria in, 792-94 
(n + 2)-body ring problem, 929 
n-dimensional Euclidean space, 22;
norm in, 23 

nearest-neighbor method, in classi- 
fication, 356 

neighborhood of vertex, 557 
Neimark-Sacker bifurcation, 395 
nematic phase of liquid crystal, 522 
Nernst-Planck equation, 618 
Nernst potential, 617-19, 625 
network analysis, 360-74; applica- 
tions of, 370-74, 557-64; consen- 
sus formation in, 370; definitions 
and notation for, 361; heterogen- 
eity in, 362, 365, 368-70; models 
in, 365-68; outlook for, 374; pro- 
cesses in, 368-70; properties in, 
361-65; robustness in, 369-70; 
software tools for, 373-74; spread- 
ing in, 368-69, 371-72. See also 
graph theory 

networks: of chemical reactions, 
628-34; electrical and mechanical 
analogies of, 605-8; of neurons, 
876-78; timetables for rail, 796.
See also electrical circuits
Neumann boundary conditions, 16, 
192; approximate solution of ellip- 
tic PDE with, 317; for Laplace’s 
equation, 156 

neural network algorithms, 357 
neuroscience, mathematical, 873-79; 
introduction to, 873-74; networks 
in, 876-78; piecewise-linear map 
in, 770; plasticity in, 878-79; single- 
cell dynamics in, 874-76. See also 
brain 

neutral theory of evolution, 595 
neutron star, 478, 484, 687 
Newell-Whitehead-Segel equation,
462

Newton, Isaac, 62 
Newton-Cartan space-time, 111 
Newton form for interpolating 
polynomial, 249 
Newtonian fluid, 469, 612, 666 
Newtonian gravity, 376-77, 771-72, 
775 

Newtonian mechanics, 374-78 
Newton-Kantorovich theorem, 37 
Newton-Krylov methods, 346-48 
Newton polygon, 89 
Newton’s first law of motion, 375 
Newton’s method, 34, 37, 120-22,
293; in algebraic geometry, 574;
for algebraic Riccati equations,
166, 168; for coupled systems,
346-47; implementation issues, 40;
inexact, 346-47; interval version of,
791-92; in numerical solution of
ODEs, 301, 304-5; in optimization,
288-91; periodic orbits and, 929;
for pth root of unity, 51
Newton’s second law of motion, 181, 
375; in rotating frame of reference, 
490-91 

Newton’s third law of motion, 375 
NIST Handbook of Mathematical 
Functions, 137, 227, 235 
nodes of graph, 101; degree of, 
361-62 

Noether’s theorem, 107-9, 381-82, 
405 

noise: in biological systems, 882; 
bubble-associated, 737; white, 325. 
See also aircraft noise; Gaussian 
noise 

nominal stress, 509 
nonautonomous ODEs, 182, 182n,
184

nonderogatory matrix, 269 
nonhyperbolic equilibria, 186-87 
nonlinear equations: interval analysis 
for solution of, 791-92; Newton’s 
method and, 120-22, 293 
nonlinear inclusions, computer-aided 
solutions of, 794-95 
nonlinear optimization, and adaptive 
dynamics, 593-94, 596 
nonlinear programming, 283, 290-92.
See also continuous optimization
nonlinear Schwarz method, 347-48 
nonnegative matrices, 278-79 
nonnegative matrix factorization,
890

non-Newtonian fluids, shear local- 
ization in, 739 
nonnormal matrices, 277-78 
non-self-adjoint matrices, 238 
nonsingular matrices, 21, 263; condi- 
tion number and, 26; testing for, 
266-67 

nonsmooth dynamics, 769-71 
nonsmooth optimization, 283, 286 
Nordmark map, 770-71 
Nordsieck vector, 298 
normal distribution. See Gaussian 
(normal) distribution 
normal equations, 30, 256, 274, 277, 
538-39; weighted, 662 
normal form theory, 388, 400; pat- 
tern formation and, 462, 464 
normal matrices, 238, 268, 277 
normal stress, 510 


norms, 23-25, 99-100; H_2 and
H_∞-norms, 529. See also Frobenius
norm 

notation, 8, 9t-10t; in mathematical 
writing, 898-99, 903; programming 
language influences on, 833-34 
notification tree, 558 
NP (complexity class), 45-46 
NP-complete problems, 45 
NP-hard problems, 45; of combina- 
torial optimization, 565, 567-69, 
886; in compressed sensing, 822; 
of graph partitioning, 364; Hamil- 
ton circuit and, 779; Steiner trees 
and, 807; traveling salesman prob- 
lem as, 778, 780 
Npath metric, 836 
nuclear fusion, 476 
nullclines, 188 
null space, 25-26 
null-space method, 662 
null-space property, and compressed 
sensing, 824 
Nullstellensatz, 571 
Numb3rs (TV series), 943-47, 952 
numerical algebraic geometry, 574 
numerical analysis: historical devel- 
opment of, 62-63, 72-75, 77, 337; 
in kinetic theory, 439; in spectral 
theory, 245. See also approxima- 
tion of functions; computational 
science; continuous optimization; 
numerical linear algebra and 
matrix analysis; numerical solu- 
tion of ODEs; numerical solution 
of PDEs 

numerical linear algebra and matrix 
analysis, 263-81; computational 
cost in, 44-45, 267, 272-73, 841; 
condition numbers in, 263-64,
266, 268; distance to singularity in,
126, 266-67; eigenvalue problems
in, 267-72; error analysis in,
274-75; iterative methods in,
275-77; matrix factorizations in,
264-66; matrix inequalities in, 
279-80; nonnormality in, 277-78; 
nonsingularity in, 263, 266-67; 
notation for, 263; outlook for, 281; 
overdetermined and underdeter- 
mined systems in, 273-74; pseudo- 
spectra in, 277-78; software for, 
280, 832, 921t; sparse systems in,
272-73; structured matrices in, 
278-79 

numerical relativity, 145, 582-83, 
680-87 
numerical solution of ODEs, 293-305;
for boundary-value problems,
304-5; consistency of, 294, 298;
global error in, 296-97; local error
in, 295-97; nonstiff problems in,
297-99; stiff problems in, 294-96,
298-302; structure-preserving
methods in, 302-4

numerical solution of PDEs, 194,
306-19; adaptivity in, 313-14;
finite-difference methods in,
307-10, 337; finite-element
methods in, 310-14; finite-volume
methods in, 314-16; software for, 
920-22; spectral methods in, 
316-18 

numerical stability, 26-27; historical 
background of, 72, 75, 77, 337-38; 
recurrence relations and, 38; of 
signal-processing algorithms, 540 
numerical weather prediction, 705-12 
Nyquist rate, 533, 542 

objective function, 281-82, 285-87 
object-oriented programming, 831, 
833 

observables, quantum mechanical, 
412-13 

obstacle problem, 196, 221 
ocean. See Earth system dynamics; 
sea ice 

ocean waves, 167-68, 493; bubble- 
associated noise in, 737 
octree structure, 776 
odd function, 10 
ODEs. See ordinary differential 
equations (ODEs) 

Ohm’s law, 477, 663-65; cardiac 
diffusion tensor and, 625; modi- 
fied for ion channels, 617-18 
Okada model, 716 
open ball, 12 
open disk, 12 
open set, 12 

open-source software, 919-22 
operations research, 56, 72, 75 
operator norm, 25, 238 
operators, 24-25; adjoint, 205; 
bounded, 25, 238; quantum 
mechanical, 411-13 (see also 
Schrödinger operators); self-adjoint,
239-41, 412, 848
optical fibers, 678-79 
optics, 673-80; free-space propa- 
gation in, 673-78; frequency- 
dependent properties in, 679-80; 
of guided waves, 678-79;
phase retrieval in, 677-78, 827;
ray model in, 676-77; time 
dependence in, 679-80 
optimal control, 524; economic 
theory and, 869-70; large-scale 
systems and, 532-33; linear 
quadratic Gaussian, 529-30; 
nonlinear systems and, 531-32; 
sensor location problem and,
764; stochastic, 322-24. See
also control systems
optimal interpolation, 708-9 
optimal stopping, of stochastic 
process, 324-25 

optimal stopping theorem, 324-25 
optimal transport, 439 
optimal truncation of divergent 
series, 635-38, 640 
optimization: in digital chip design, 
804-8; finite-element methods and, 
662; global, 795; in image process- 
ing, 815; in Internet design deci- 
sions, 886-87; in nonnegative 
matrix factorization, 890; of 
portfolio, 320-21; software for,
284, 838, 921t; uncertainty
quantification for, 133. See also 
adaptation; combinatorial opti- 
mization; continuous optimiza- 
tion; discrete optimization 
option pricing, 320-21, 641-42 
orbits, planetary, 377 
orbits of dynamical systems: dense, 
389-92; heteroclinic, 189-90, 384, 
387, 393, 464; homoclinic, 190,
384, 387. See also dynamical
systems

order: of numerical method for 
ODEs, 297; of ODE, 15; of PDE, 191. 
See also big-oh notation; little-oh 
notation 

order notation, 12 
order reduction. See dimension 
reduction 
order stars, 300 
ordinary differential equations 
(ODEs), 14-16, 181-90; in A^TCA
framework, 663; autonomous, 
36-38, 182-83, 187, 189; boundary 
conditions for, 15; boundary-value 
problems for, 16, 185, 304-5; 
dynamical systems of, 383 (see 
also dynamical systems); equilibria 
of, 37-38, 185-90, 386; existence 
and uniqueness of solutions of, 35, 
187, 439; first-order, 15, 182-83; 
higher-order converted to first-order,
36, 182; linear, 15, 183-84;
with rough coefficients, 439; 
singular points of, 184-85, 235; 
symmetry for, 407. See also 
numerical solution of ODEs 
orthogonal functions, 22-23; as basis 
functions for approximation, 
257-58; Mathieu functions as, 160 
orthogonal group O(2), 405
orthogonal invariant ensemble, 420 
orthogonal matching pursuit algo- 
rithm, 825 

orthogonal matrices, 21; nearest to 
given matrix, 266; polar decom- 
position and, 266; preference for, 
in numerical algorithms, 263; in 
robotics, 767 

orthogonal polynomials, 22-23, 122, 
231-32; least-squares approxima- 
tion with, 30, 257-58; in numerical 
solution of PDEs, 316-18; random- 
matrix eigenvalue distributions 
and, 424-25 

orthogonal Procrustes problem, 266 
orthogonal transformations, of ran- 
dom matrices, 420 
orthogonal vectors, 22; Cauchy- 
Schwarz inequality and, 23 
orthonormal set of vectors, 22 
oscillations, eigenvalues and, 236-37 
oscillator: Duffing, 186, 186n, 190; 
harmonic, 376, 378; quantum 
harmonic, 185, 231, 412, 848; 
van der Pol, 189, 384; weakly 
anharmonic, 216-17; Winfree, 
928-29 

oscillator arrays, chimera states in, 
928 

outer multiplication, 128 
output variables, 88 

P (complexity class), 45-46 
#P (complexity class), 46 
Padé approximants, 19, 47, 252-55;
implicit Runge-Kutta methods
and, 300

Padé-Laplace method, 255
PageRank algorithm, 4, 48, 276, 364, 
755-57 

Painlevé equations, 163-65, 180, 185,
235; random-matrix eigenvalues
and, 424-25; WKB analysis and,
639

Painlevé property, 163, 235
Palais-Smale compactness condition, 
221 
paper, mathematical: reading and
understanding, 903-6; workflow 
for producing, 912-16; writing, 
897-903 

parabolic PDEs, 17, 307; software 
for, 921-22; spectral collocation 
method for, 318 
parallel computing, 831-32 
parallel transport, 129, 581 
parameter estimation: in signal 
processing, 544-45; solving 
nonlinear inclusions for, 794-95 
parameter-fitting problem, 658 
paraxial approximation, 675-78 
paraxial wave equation, 675, 678 
Pareto condition, 893 
Pareto optimum, 870-71 
Parlett recurrence, 99 
partial derivatives, 14; automatic 
computation of, 750-51; finite- 
difference approximation of, 96 
partial differential equations (PDEs), 
16-17, 190-200; in A^TCA framework,
663; behavior of solutions
of, 194-96; boundary or initial 
conditions for, 16, 192; classifica- 
tion of, 17, 306-7; of continuum 
mechanics, 446-47; divergent 
series and, 640; exact solutions of, 
193; historical background of, 62, 
72, 74-75, 77; homogenization of, 
103, 193; in image processing, 5-6, 
815-16; important examples of, 
16-17, 191-92, 306-7; integrable, 
151, 193; mixed elliptic-hyperbolic, 
307; nonlinear, 17, 151; notation 
for, 191; perturbation methods 
for, 193-94; propagation speeds 
associated with, 195; quasilinear 
hyperbolic, 86-87, 122-24; soft- 
ware for, 838, 920-22; technical 
methods for, 196-99; well-posed 
problems of, 199. See also bound- 
ary-value problems for PDEs; 
calculus of variations; numerical 
solution of PDEs 
partial fractions, 19 
partial sums, 11
Pascal, 829-30, 834, 838t 
path, 361, 557 
path integrals, 380, 427, 676 
pattern formation, 458-67; in bio- 
logical tissues, 881-82; PDEs and, 
195, 459-60, 462, 464; symmetry 
and, 405-7, 409-10, 460-62 
pattern recognition, 351, 355-59 


pattern-search optimization 
methods, 289 

Pauli exclusion principle, 416-17,
848

Pauli spin matrices, 142-43, 413 
PDEs. See partial differential 
equations (PDEs) 

PDF (Portable Document Format), 31, 
905-6, 913-15 

Pearson, Karl, 57, 66, 68, 931 
penalty functions, 292 
penalty term, of inverse problem, 329 
pencil, 271 

pendulum: double, 384, 391; Hamil- 
tonian equations for, 295; inverted, 
741-43 

perceptron, 357 

percolation theory, 698; composite 
material and, 503; sea ice and, 
700-701, 703-5; spreading in 
networks and, 368 
period-adding cascade, 770-71 
period-doubling bifurcations, 157, 
395, 398-99; cardiac, 626 
periodic orbits, 189, 389; in chaotic 
system, 389; Lindstedt-Poincaré
method for computing, 929; for 
Lorenz model, 929; in (n + 2)-body 
ring problem, 929; of van der Pol 
equation, 388 

permanent of matrix, 22; computa- 
tional complexity of, 45-46 
permittivity. See electric permittivity 
permittivity tensor, 698-99 
permutations, 404-5, 553; voting 
paradoxes and, 894 
Perron-Frobenius theorem, 279 
Perron vector, 279 
perturbation theory, 208-18; asymp- 
totic expansions and, 210-12, 
244-45; basic example of, 209-10; 
convergent and divergent series in, 
212; for eigenvalues, 244-45, 
268-69; in numerical linear 
algebra, 266-68; for PDEs, 193; 
in quantum mechanics, 419; for 
regular perturbation problems, 
212-13; for singular perturbation 
problems, 213-18 

PET (positron emission tomography), 
816-23 

phase, of complex number, 173 
phase locking, in oscillator arrays, 
928 

phase-plane analysis, 188-89 
phase portrait, 385-86 
phase retrieval, 677-78, 827 


phase separation in binary alloys, 
138-39 

phase space, 182, 382; in kinetic 
theory, 429, 431 

phase transitions: Γ-convergence
and, 222-23; in melting of Arctic 
sea ice, 704; in percolation theory, 
698; in superconductivity, 225 
philosophy of mathematics, 58 
photonic crystals, 502 
photoreceptors, 808 
Photoshop, Adobe, 6, 31, 812 
physiology, 616-23 
Picard iteration, 35, 346 
Picard-Lindelöf theorem, 187
piecewise polynomials, 30-31, 
251-52, 258, 305, 313-14 
piecewise-smooth dynamical 
systems, 401-2, 769-71 
π-numbers, 91-93
pitchfork bifurcation, 395 
pivoting, 265; in Cholesky factoriza- 
tion, 265; in Gaussian elimination, 
265, 273, 275; in QR factorization, 
265, 274 

planar graph, 561-62 
Planck constant, and perturbation 
theory, 208-9 
Planck equation, 415 
planetary orbits, 377 
plane waves, 194-95, 207-8, 674-75 
plasma-β, 479-80
plasma physics: mean-field theory 
in, 433-36, 443; oscillating gas 
bubble and, 736. See also 
magnetohydrodynamics 
plasmon resonance, 502 
plasmons, surface, 679 
plastic deformation, 511, 513; of 
granular materials, 667-68, 670-71 
plasticity, synaptic, 878-79 
Plateau borders, 737, 740 
Plateau problem, 219-20, 737 
PL/I, 829f, 830, 838t 
p-norms, 23, 99-100 
Pochhammer’s symbol, 229 
Poincaré, Henri, 57, 69-70, 189-90,
383-84, 393, 428, 432, 591, 635
Poincaré-Bendixson theorem, 189;
van der Pol equation and, 389
Poincaré group, 110-11
Poincaré map, 388-89, 391-92;
piecewise-linear, 770
point at infinity, 173, 571 
Poiseuille flow, 470; in biological 
systems, 612, 615; instability in, 
474 





Poisson distribution: of degrees in 
network, 366; of radioactive decay, 
817-18, 821-22 

Poisson’s equation, 16, 155, 307; dis- 
cretized, cost of solving, 45; for 
electrostatic potential in crystal, 
851; for gravitational field, 307, 
857; Green function for, 125, 140; 
in image manipulation, 6; non- 
linear, 197; with square integrable 
Laplacian, 197 
Poisson’s formula, 193 
Poisson’s ratio, 511 
polar coordinates, 9 
polar decomposition, 265-66; in 
continuum mechanics, 449, 452 
poles, 177-78, 235; of rational func- 
tion, 19; of transfer function, 524 
policy, governmental, 953-61 
Polish notation, 833 
Pólya, George, 39, 404
polyalgorithms, 47, 341-42 
polymers, 518-20 
polynomials, 18-19; algorithms for 
evaluating, 46-47; approximation 
with, 29-31, 248-52, 759-61; 
computing roots of, 36, 269; 
matrix polynomials, 272. See also 
algebraic geometry; Chebyshev 
polynomials; orthogonal poly- 
nomials; piecewise polynomials 
polynomial time (class P), 45-46 
polytopes, convex hull as, 90 
Pontryagin maximum principle, 869 
popular culture, mathematics in, 
943-52 

popular mathematics books, writing, 
906-12 

population genetics, 592, 594-95 
population models, logistic equation 
in, 156-57, 183 

porous medium equation, 195, 199 
Portable Document Format (PDF), 31, 
905-6, 913-15 

portfolio theory, 320-23, 640, 
644-45, 648-57; basic mean- 
variance analysis in, 648-52; 
extended mean-variance analysis 
in, 656-57; information theory 
and, 552; practical techniques in, 
652-57 

positive-definite matrix, 21; 
eigenvalues of, 25; Hermitian, 
factorization of, 264-65, 273 
positive-real function, 606 
positive systems, control of, 532-33 


positron emission tomography (PET), 
816-23 

posterior distribution, 658-60 
poster preparation, 915 
PostScript, 834, 838t, 913-14, 916 
potential: in classical mechanics, 147, 
376; of conservative vector field, 
156; electric, 377; electromagnetic, 
156, 161; gravitational, 775; in 
Schrödinger operator, 242, 245;
Schrödinger’s equation with, 167
potential energy, 376 
potential theory: electrical imped- 
ance tomography and, 334; 
integral equations and, 201-3 
power law: for degree distribution of 
network, 84, 362, 368; for Internet 
router topology, 885-86; for Inter- 
net traffic, 885 

power method for computing 
eigenvalues, 270, 275; HITS 
algorithm and, 4-5; PageRank 
scores and, 756 
power series, 20, 174-75 
Prandtl, Ludwig, 59, 66, 68, 70-71,
82, 725, 731, 748

Prandtl-Kármán theory, 724-25, 731
preconditioning of linear systems, 
276-77, 308-9, 341, 344; for 
coupled systems, 346-48 
predator-prey model, Lotka- 
Volterra, 15-16, 188, 922f 
preferential-attachment model, 
367-68 

prefix notation, 833 
pregnancy testing kit, 864-66 
primal-dual methods: in linear pro- 
gramming, 286-88; in nonlinear 
programming, 291, 662; in semi- 
definite programming, 290 
principal component analysis, 28, 
354, 542 

principal part of Laurent expansion, 
177-78 

principal rates of strain, 468 
principal value of complex function, 
175 

principle of dominant balance, 
213-14, 217-18 

principle of least action, 379-80 
principle of material frame 
indifference, 453-54, 666 
principle of virtual work, 512-13 
prior distribution, 659-60 
prisoner's dilemma, 594 
probability and statistics: historical 
impact of, 56-57, 66, 70, 72;
uncertainty quantification and,
131-33. See also statistics 
probability functions, 230-31 
probability invariant vector, of 
Markov chain, 117 
probability theory, and stochastic 
analysis, 319 

problem-solving environments 
(PSEs), 832-33, 917-20 
problem specification, 27-28 
Procrustes problem, orthogonal, 266 
programming languages, 828-39; 
complex arithmetic in, 834-35; 
early days of, 828-30; high- 
precision arithmetic in, 835-36; 
modern era of, 829f, 830-31; 
parallel computing and, 831-32; 
pitfalls for user of, 834-35; 
problem-solving environments, 
832-33, 917-20; timeline of and 
influences between, 829f. See also 
algorithms; software 
projective space, 571 
projective varieties, 571 
Project Jupyter, 832f, 833, 919, 920f 
proof: computer-aided, 790-95, 922; 
by contradiction, 40. See also 
writing about mathematics 
propagation speeds, PDEs and, 195 
propagator, 140 
prospect theory, 869-70 
proteins, knotted, 754-55 
Prüfer code, 558

PSEs (problem-solving environments), 
832-33, 917-20 
pseudocode, 41, 833 
pseudograph, 101 

pseudoinverse, Moore-Penrose, 274, 
809 

pseudorandom numbers, 761-62,
842

pseudo-Riemannian manifold, 107, 
111, 127, 130, 144 
pseudospectral radius, 278 
ε-pseudospectrum, 238, 277-78
pseudo-unitary matrices, 278 
publications in applied mathematics, 
55; history of journals in, 63-64, 
69, 74 

public goods, 872 
pure shear flow, 469 
Python, 8, 33, 829f, 831, 833-35, 
837-39 


q-quadratic convergence, 121-22 
QR algorithm, 270-71, 275 
QR factorization, 264-65; for
overdetermined and under- 
determined systems, 274 
QRS (Quinn-Rand-Strogatz) 
constant, 928-29 
q-superlinear convergence, 122 
quadratic convergence, 34 
quadratic drag, 378 
quadratic eigenvalue problem, 247, 
272 

quadratic equation: approximating 
solutions of, 37; cancellation in 
formula for solutions of, 97; 
evaluating discriminant of, 835 
quadratic programming, 282; 
sequential, 291 

quadrupole sound sources, 784 
quantum chaos, 422, 426-28 
quantum chemistry, 237 
quantum electrodynamics, 144, 162, 
419 

quantum field theory, 111, 144,
419; Ising integrals in, 930-31;
Langlands program and, 591 
quantum graphs, 247 
quantum information theory, 552 
quantum mechanics, 411-19; 
approximation methods in,
418-19, 456; bound states in, 241,
414-15, 848, 851; controversial 
issues in, 418; Dirac equation in, 
142-44; divergent series and, 639; 
measurement in, 412-13, 418; for 
multiple-particle systems, 416-17; 
spectral lines and, 237, 415; statis- 
tical interpretation of, 411-13, 418; 
summary of, 417-18; symmetries 
in, 107, 416; transitions between 
states in, 419. See also Schrödinger
operators; Schrödinger’s equation
quantum numbers, 415 
quantum waveguides, 242, 247 
quasiconvexity, 222 
quasigeostrophic flow, 496-98 
quasilinear hyperbolic PDEs, 86-87, 
122-24 

quasi-Newton methods, 121-22, 289, 
291 

quasispecies equation, 595 
quicksort algorithm, 48 
QZ algorithm, 272 

R, 833-34, 837, 838t 
racing cars, Formula 1, 598, 605, 
608-9 

radar, 543-44, 826-27 
radar imaging, 860-64 


Radau methods, 300-301 
radial basis functions, 261-62 
radiative forcing, 488 
radio signals: Heaviside’s divergent 
series and, 635; information theory 
and, 546; processing of, 540-41; 
sub-Nyquist sampling of, 826 
radius of convergence, 20, 174-75 
Radon transform, 206-7, 330,
862-63, 868
rainbow, 233, 635-36 
rainbow color map, 811-12 
ramble integrals, 931 
random graphs, 365-68 
randomized algorithms, 48 
random matrices, 419-28; charac- 
teristic polynomial of, 425; for 
compressed sensing, 824-25; 
condition number of, 426; for 
covariance matrix estimation, 
653-54; eigenvalues of, 423-25 
random number generation, 551, 
761-62, 842 

random Schrödinger operators,
850-51 

random walks: of Escherichia coli, 
615; experimental results on, 931; 
on integers, 116-17; polymer con- 
figurations and, 518-20; in popula- 
tion dynamics, 595; random-matrix 
eigenvalue distributions and, 426 
range of linear transformation, 25-26 
range reduction, in approximation of 
functions, 760 

rank: ideal basis and, 341; of matrix, 
26, 126, 578; of reaction network, 
631-32, 634; of tensor, 129, 578 
Rankine-Hugoniot relations, 122-24, 
721, 854 

rarefaction shock, 123 
rate-of-strain tensor, 449, 468-69;
flame propagation and, 854
rational functions, 19; approximation 
with, 248, 252-55 
rational interpolation methods, 
118-19 

rational Krylov methods, 277 
rational mechanics, 3, 59, 61-64 
Raychaudhuri equation, 583, 588 
Rayleigh, Lord (John William Strutt),
1, 66, 502

Rayleigh-Bénard convection, 384,
458-59, 463, 476 
Rayleigh criterion for centrifugal 
instability, 475, 484 
Rayleigh number, 476; Lorenz 
equations and, 158 


Rayleigh-Plesset equation, 735 
Rayleigh quotient, 134 
Rayleigh range, 676 
Rayleigh-Ritz approximation, 134, 
240; in density functional theory, 
851 

Rayleigh-Schrödinger series, 213
Rayleigh-Taylor instability, 474;
of bubbles, 736
reachability problem, 104 
reaction-diffusion equations, 16-17, 
192, 630; pattern formation and, 
195, 459, 463-66; in tissue 
modeling, 881 

reaction network, chemical, 628-34 
reaction rate functions, 628-29 
reaction rate in combustion, 852 
reaction vectors, 631 
reading and understanding a paper, 
903-6 

real part, 8, 173 
real-time tomography, 866-68 
rearrangement invariant spaces, 100 
rebinning methods, in real-time 
tomography, 868 

receding-horizon control, 530, 532 
recombination in populations, 595 
rectangle packing problem, 805 
recurrence relations, 18, 38 
recurrence relations, three-term for 
orthogonal polynomials, 22-23, 
122, 231 

recursive least-squares algorithms, 
539-40 

recursive summation, 41 
recursive utility, 322 
redshift: cosmological, 589;
gravitational, 583-85
reduced-order models. See 
dimension reduction 
reducible matrix, 279 
reflectional symmetries, 403-5 
reflectivity function, 861-62 
refractive index, 678-80 
regression techniques in pattern 
recognition, 356-57 
regular graph, 557 
regularity theory, 197; calculus of 
variations and, 223 
regularization methods, 205-6, 208, 
329, 867 

regular singular point, of ODE, 
184-85 

reinforcement learning, 596-97 
relative entropy, 550, 552, 593 
relativity. See general relativity; 
special relativity 
relaxation: algebraic, 571; in combinatorial
optimization, 567,
569-70 

relaxation oscillations, 218 
relaxation rate, of viscoelastic fluid, 
666 

relevant dimensions, 92-93 
Remez algorithm, 30, 259, 759-60 
replicator equation, 592-96 
representation theorem, 858-59 
representation theory, 405 
reproducible research in computa- 
tional mathematics, 54, 837, 
916-25 

resampling of efficient frontier, 657 
ResearchCompendia, 923-24 
residues, 177-78 
resolvent, 278 

resonances of Schrödinger operators,
243 

restricted isometry property, 824-26 
resurgence, 638-39 
return map, 384, 391, 396-97 
return to asset, definition of, 648 
reverse Polish notation (RPN), 833-34 
reversible reaction network, 633 
Reynolds number, 82, 91, 93, 378, 
452, 469; in aerodynamics, 474; for 
arterial flow, 612; golf ball flight 
and, 747-49; high limit of, 471-72; 
insect flight and, 743-44; low limit 
of, 471; magnetic, 477-79, 482-84; 
perturbation theory and, 208-9; 
for swimming microorganism, 615; 
turbulence and, 724-25, 727-30 
RGB color space, 810-13 
Riccati equation, 15, 165-67; filtering 
problem and, 326; in linear quad- 
ratic control, 530; optimal sensor 
location and, 764-66; Sylvester and 
Lyapunov equations in solving,
168; WKB method using, 214
Ricci curvature, 130, 439 
Ricci identity, 581 
Ricci scalar, 581 

Ricci tensor, 129, 144-45, 581-83, 
587; in cardiac modeling, 626 
Richardson, Lewis Fry, 75, 337, 485, 
706-7, 725, 730 

Richardson extrapolation, 298, 301 
Riemann curvature tensor, 129, 
144-45, 581 

Riemann, Georg Friedrich Bernhard, 
175, 219, 229, 316, 916 
Riemann-Hilbert problem, 180-81; 
eigenvalue distribution functions 
and, 424 


Riemann hypothesis, 229, 237 
Riemannian geometry, and general 
relativity, 580-83 
Riemannian manifold, 127-30; 
cardiac anisotropy as, 626; 
eigenvalues of Laplace-Beltrami 
operator and, 246; isometric 
embedding problem of, 170 
Riemann mapping theorem, 85, 179 
Riemann problem, 316 
Riemann sheet, 173 
Riemann sphere, 571-72 
Riemann zeta function, 20, 175, 229; 
analytic continuation of, 179; 
eigenvalues of random Hermitian 
matrix and, 237; Riemann’s 
computational methods for, 916 
rigid body motion: of boat, 600; 
material frame indifference and, 
666 

rigid motions of the plane, 403-4 
rings of Saturn, 773-74 
risk, systemic, in financial markets, 
645-46 

risk aversion, in microeconomics, 

869 

RLC electric circuit, 15 
road network, 757 
Robertson-Walker metric, 587-89 
Robinson, Julia, 778, 780 
robotics, 767-69 
robust optimization, 133, 292 
Rodrigues formula, 231-32 
Rosenbrock methods, 301 
Rossby number, 493, 495-96 
Rossby radius, 498 
Rossby waves, 498, 500 
rotating coordinate systems, 380-81, 
490-91. See also Coriolis effect 
rotational joints, 767-68 
rotational momentum. See angular 
momentum 

rotational symmetries, 403-5 
rotation matrix, in robotics, 767-68 
rotations of inertial frames, 375 
rotation tensor, 449 
rounding, 7, 96-97; in combinatorial 
optimization, 570 
rounding errors, 7, 41, 53, 96-97, 
275, 337, 835; historical back- 
ground of, 73, 75, 77, 274-75; 
interval arithmetic and, 105, 790 
route planning, 757-59 
rowing, Olympic, 601-2 
RPN (reverse Polish notation), 833-34 
rule of fives, 697, 700-701, 704 


Runge, Carl, 56, 62, 66-68, 75, 250, 
297 

Runge-Kutta methods, 297-303 
Runge’s phenomenon, 250 

saddle, 186, 188-90 
saddle-focus, 397-98 
saddle-node bifurcation, 394, 
400-401, 928 

saddle-point matrix, 661-62, 665 
saddle points, 13-14, 386; in calculus 
of variations, 221 
Sage, 833 

sailing yacht design, 599-601 
sample variance, accurate 
computation of, 46 
sampling rate: Shannon-Nyquist 
theorem and, 826; sub-Nyquist, 
826-27 

Samuelson, Paul, 641 
Scala, 829f, 831, 838 
scalar, 20, 129 

scalar fields, visualization of, 844 
scalar potential, 377; electromag- 
netic, 161; gravitational, 775 
scalar product, 27 

scale-free networks, 84, 362, 367-68, 
373, 886 

scaling: in combinatorial optimiza- 
tion, 570; dimensional, 90-93 
scattering: from bounded scatterer, 
207; inverse problem of, 207-8, 
327, 335; in kinetic theory, 430, 
433, 436; quantum chaotic, 428; 
from radar target, 861-64; in 
seismic exploration, 331; spectral 
implications of, 242-43 
scattering states, quantum 
mechanical, 414-15, 848 
scheduling of tasks, with max-plus 
methods, 796-99 
Scheme, 829, 838t, 839 
Schrödinger operators, 241-43, 245, 
848; periodic, 849, 851; random, 
850-51 

Schrödinger’s equation, 156, 167, 

192, 411-12, 414; Dirac equation 
and, 142-43, 675; as evolution 
equation, 241; Green’s theorem 
and, 857-59; for harmonic oscil- 
lator, 185, 231, 412; nonlinear, 

151, 164, 167; Painlevé equations 
and, 164; perturbation methods 
for, 212-16; scattering theory for, 
242-43; solid state physics and, 
847; in three dimensions, 414-15; 
WKB methods for solving, 214-16 
Schur complement, 347-48, 661-62 
Schur decomposition, 270, 272, 111 
Schur vectors, and algebraic Riccati 
equation, 165-66 
Schwarz-Christoffel mappings, 

85-86 

Schwarzschild solution, 583-87; 

for black hole, 684 
Schwarz symmetrization, 223-24 
scientific computing, 336, 350. See 
also computational science 
Scilab, 832 

sea ice, 505; climate models and, 
695-98, 704; multiscale structure 
of, 696; percolation theory and, 
700-701, 703; permeability of, 697, 
701; surface geometry of, 703-5; 
weather prediction and, 712 
searching text, 887-91 
secant methods, 121-22 
secant variety, 578 
second-difference matrix, 272, 279 
second law of thermodynamics, 
431-32, 451. See also entropy 
secrecy of information, 551, 556-57 
secular terms, in asymptotic 
expansion, 216-17 
segmentation, in image processing, 
353, 814-15 

Segre variety, 573, 577-78 
Segre-Veronese variety, 573, 576 
seismic exploration, 327, 331-34 
seismic imaging, 857-60 
seismic travel-time tomography, 
330-31 

seismic waves, for tsunami source 
inversion, 716 

self-adjoint matrices, 21; spectral 
theorem for, 239; spectra of, 238, 
247. See also Hermitian matrices 
self-adjoint operators, 239-41, 412, 
848 

self-concordancy, 290 
self-consistent field algorithm, 851 
self-similar fluid flows, 470 
self-similar Internet traffic rates, 
885-87 

semicircle law, 423-25 
semiclassical analysis, 244-45 
semiclassical limit, 427-28, 438 
semiconductors, doped, 850 
semidefinite programming, 283, 290 
semidefinite relaxations, 570 
semifields, 795-800 
semiflow, 188-89 
semigroup arguments, in kinetic 
theory, 438 


sensitive dependence on initial con- 
ditions, 82-83, 158-59, 384-85, 
389, 391-92; weather and, 485 
sensitivity analysis, 132, 248, 842; of 
rational interpolant, 254. See also 
condition number 
sensitivity function, 525-28 
sensitivity index, 132 
sensitivity matrix, 304 
sensor location, optimal, 763-67 
separation of variables: in ODE, 183; 
in PDE, 185, 234 

sequence: asymptotic, 211; Cauchy, 
24; limit of, 11, 24 
sequential quadratic programming, 
291 

series solutions, 31-32, 179-80, 184 
set cover, minimizing, 565, 568 
sets, 12 

set-valued arithmetic. See interval 
arithmetic 

shallow-water equations, 167-68; 
numerical solution of, 718-19; 
in tsunami modeling, 715-16, 
718-19. See also Korteweg- 
de Vries equation 
Shannon, Claude, 73, 432, 438, 
545-46, 548-49, 551 
Shannon-Nyquist theorem, 826 
shape design, level set method for, 
116 

shape feature of images, 353-54 
shape parameter, in multivariate 
approximation, 261-62 
Sharkovskii’s theorem, 157-58, 402 
Sharpe ratio, 645, 650-51, 657 
shear, simple, 454-55 
shear bands: in foams, 739; in granu- 
lar materials, 668-70, 672-73 
shear flow: pure, 469; in transition to 
turbulence, 728-29, 731 
shear localization, in non-Newtonian 
fluids, 739 

shear modulus, of foam, 738 
shear strain, 508, 511; in granular 
materials, 667-68 
shear stress, 510-11; in blood 
vessels, 612; in foams, 739; in 
granular materials, 667-68, 670 
shear viscosity, 666 
shell theory, and leaf growth, 614 
Sherman-Morrison formula, 266 
Sherman-Morrison-Woodbury 
formula, 266, 539 
shift-register pseudorandom 
numbers, 762 
Shilnikov flows, 397-98 


shock-capturing methods, 718 
shocks (shockwaves), 122-24, 
195-96, 720-24; of Burgers 
equation, 138, 196; Clausius- 
Duhem inequality and, 451; 
examples of, 723-24; normal, 
720-22; oblique, 721-23; scalar 
conservation laws and, 196, 199, 
720-21; shallow-water equations 
and, 715, 718 
shooting method, 304-5 
shrinkage estimators, Bayesian, 653 
SIAM. See Society for Industrial and 
Applied Mathematics (SIAM) 
sideband instability, 461, 463, 

465-67 

signal processing, 533-45; adaptive 
beamforming in, 541-44; adaptive 
filters in, 537-41, 544; algorithmic 
implementation of, 539-40; 
analog-to-digital conversion for, 
533; Bayesian, 541, 544; broad- 
band beamforming in, 543-44; 
channel equalization in, 540-41; 
compressed sensing and, 824; 
correlation in, 536-37; Fourier 
transforms in, 534-35, 543; Gauss- 
ian noise in, 536-37, 544-45; 
impulse response filtering in, 
533-34, 536; multichannel filtering 
in, 543-44; parameter estimation 
in, 544-45; tracking in, 545; 
uncertainty principle in, 927; 
z-transforms in, 535-36, 543 
signal transduction, cellular, 880-81 
signature of the metric, 130 
similarity transformation: condition 
number and, 263; definition of, 

112; to Jordan canonical form, 
112-13; unitary, 270 
simple harmonic motion, 149 
simple poles, 177-78 
simple shear, 454-55 
simplex method: computational cost 
of, 44, 284, 287; derivative-free, 
289-90; history of, 283 
simulation. See computational 
science 

single-pixel camera, 825-26 
single-scattering approximation, 
861-62 
singulant, 638 

singular integral equation, 180 
singularities, 124-25; of complex 
functions, 124, 174, 177-79 
singularities, space-time, 125, 583-89; 
in numerical relativity, 683-86 
singular matrices, distance to, 126, 
266-67 

singular perturbation theory, 32, 
213-18, 467; neuronal dynamics 
and, 875-76, 878 

singular points of ODEs, 184-85, 235 
singular solution of ODE, 182 
singular value decomposition (SVD), 

28, 126-27, 265, 578; computation 
of, 271; ill-conditioning and, 329; 
in latent semantic indexing, 888-90; 
in linear least-squares problem, 274; 
polar decomposition and, 266; for 
principal components analysis, 354; 
regularization scheme using, 

205-6 

sinks, 386, 391 

SIR (susceptible-infectious-removed) 
models, 368, 688-91 
SIS (susceptible-infectious-susceptible) 
models, 688, 691-92 
Six Feet Under (TV series), 950 
skewness of random variable, 725-26 
The Sky at Night (TV series), 953 
slack variable, 282 
slide preparation, 915 
sliding vector field, 770 
slip surfaces, 722 
Slutsky relations, 869, 871 
Smale, Steven, 190, 383-84, 390, 428, 
574, 792 

Smale’s horseshoe, 190, 384, 390-91 
small-scene approximation, 862 
small-world model, 367 
small-world property, 363-64, 

366-68, 372-73 

smectic phase of liquid crystal, 

522-23 

smooth curve, in complex plane, 175 
smoothed analysis, 44 
smooth fit (high contact) principle, 
324-25 

smooth functions, 13-14; function 
spaces and, 100 
snow crystals, 932 
Sobel operators, 353 
Sobolev inequalities, 432, 438 
Sobolev spaces, 100; in calculus of 
variations, 219-20; Einstein’s field 
equations and, 145; in kinetic 
theory, 441 

social networks, 371-72; behavior 
contagion in, 369-70, 372; clus- 
tering coefficients of, 362-63; 
evolving, 800-803; friendship 
paradox in, 367; small-world 
phenomenon in, 363 


Society for Industrial and Applied 
Mathematics (SIAM), 7-8, 956-59 
soft matter, 516-23 
software: for network analysis, 
373-74; for numerical linear 
algebra, 280, 832, 921t; open- 
source, 919-22; for optimization, 
284, 838, 921t; for PDEs, 838, 
920-22; writing stage of, 2. See 
also algorithms; programming 
languages 

Sokhotski-Plemelj formula, 180 
solid mechanics, 505-16; conceptual 
map of, 506; fundamental con- 
cepts of, 507-10; global formula- 
tions of, 512-13; governing equa- 
tions in, 510-12; historical back- 
ground of, 505-6; material studied 
by, 506; selected examples of, 
513-16. See also continuum 
mechanics; elasticity 
solids, electronic structure of, 

847-51 

solitary waves, 150-51; in foam drain- 
age, 739; homoclinic snaking and, 
401; in oscillator arrays, 928 
solitons, 151, 195; Painlevé equations 
and, 164 
solvents, 272 
solving equations, 49-51 
Sommerfeld radiation condition, 207 
sonar, passive, 543 
sonoluminescence, 736 
SOR (successive overrelaxation) 
iteration, 75, 276 
sound: aircraft noise, 783-86; 

bubbles and, 735-37 
source, in dynamical system, 386 
source coding, 547 
source coding theorem, quantum 
version of, 552 

Southern Oscillation, 499-500 
space-time curvature, 579, 581-82 
space-time metric, 680 
spanning tree problem, 564, 566 
span of set of vectors, 22 
spark of matrix, 824-25 
sparse interpolation, 254-55 
sparse matrices, 272-73; ideal basis 
and, 341 

sparse modeling, 815 
sparse vector, 823, 825. See also 
compressed sensing 
spatial filters, 542 
spatiotemporal symmetry, 409 
special functions, 19-20, 227-35; 
areas of active research in, 235; 


Painlevé equations and, 164, 235; 
as solutions of second-order linear 
ODEs, 184-85 

special orthogonal group SO(2), 405 
special relativity, 107, 110-11, 130, 
580; Dirac equation and, 142; 
Maxwell’s equations and, 162 
species-formation-rate function, 
629-32 

spectral abscissa, 277 
spectral convergence, 317 
spectral decomposition, 270 
spectral gaps, 849-50 
spectral geometry, 246 
spectral measure in Stieltjes integral 
representation, 698-700, 702 
spectral methods for PDEs, 316-18; 

in weather prediction, 708, 710 
spectral norm, 25 
spectral projection, 239 
spectral radius, 25, 238; convergence 
of stationary iteration and, 279; 
matrix norms and, 268; M-matrices 
and, 279; powers of matrix and, 

113 

spectral theorem, 239 
spectral theory, 25, 236-48; applica- 
tions of, 236-38, 246-47; calcula- 
ting eigenvalues and, 243-45; 
introduction to, 236; kinetic theory 
and, 437; nonlinear, 247; Schrö- 
dinger operators in, 241-43, 848; 
of self-adjoint operators, 848 
spectroscopic lines, 237, 415 
spectrum of Hermitian matrix, 848 
spectrum of linear operator, 25, 
238-39; classification of, 240 
spectrum of random modes, 726, 
728-29 

spectrum of self-adjoint operator, 
848 

sphere packing, 517-18 
spherical coordinates, 9 
spherical harmonics, 233-34, 414-15; 

in weather prediction, 708 
spin angular momentum, 415-16 
spinors, 413; Dirac, 142-43 
spiral waves, 463-64; in heart, 624, 
626-27; in oscillator arrays, 928 
spline interpolation, 251-52 
splines, 31, 251-52; in geometric 
modeling, 787-90 
splittings: of Hamiltonian, 303; of 
matrix, 275-76, 279; of objective 
function in optimization, 292; of 
operator, 345-46; of space-time, 
681 
sport: fluid dynamics of, 598-604; 

network analysis of, 373 
spotlight synthetic-aperture radar, 
863 

spring constant, 149 
spring system with damping, 15 
Squire-Landau jet, 470 
stability: convergence and, 75, 298, 
309, 462; of matrices, 279; struc- 
tural, of dynamical systems, 384, 
387-88; of vibrations, 236-37. See 
also equilibrium point; fixed point 
of dynamical system; numerical 
stability 

stable manifolds, 189-90 
stable manifold theorem, 387 
stable matching, 553-55 
stagnation point flow: in fluid 
dynamics, 470; in magneto- 
hydrodynamics, 478-79, 483 
state space: of control system, 88; 

of dynamical system, 118 
state variables, 88, 383 
stationary iterative methods, 275-76, 
279 

stationary points, 13-14, 393 
statistical distribution over phase 
space, 429 

statistical mechanics: distributions 
of, 416; random-matrix eigenvalue 
distributions and, 426 
statistics: algebraic, 576-78; online 
access to code for, 918-19. See also 
probability and statistics 
steady fluid flow, 468 
steady state vector, of Markov chain, 
117 

steepest-descent methods, 288-89, 
291, 637 

steering vectors, 542-43 
Stefan-Boltzmann law, 486-87 
Stefan problem, 196, 221 
Stegun, Irene, 74-75, 227 
Stein equation, 168 
Steiner symmetrization, 223-24 
Steiner trees, in digital chip design, 
807-8 

Stieltjes integral representation for 
effective parameter, 698-700 
stiff differential equations, 294-96, 
298-302 

stiffness matrix, 662-63 
Stirling numbers, 227-28 
Stirling’s approximation, 12, 81, 153, 
228, 635 

stochastic analysis, 319-26; applica- 
tions to finance, 320-21, 326, 


641-42, 644-45, 647; filtering 
theory in, 325-26; likelihood 
function derived from, 661; 
optimal control and, 322-24; 
optimal stopping and, 324-25; 

PDEs of nonnegative form in, 

307; random-matrix eigenvalue 
distributions and, 426; in systems 
biology, 880 

stochastic block models of random 
graphs, 367 

stochastic differential equations, 319, 
321-22, 326, 641-43. See also 
stochastic analysis 
stochastic gradient methods, 290, 
292-93 

stochastic matrices, 116; M-matrices 
and, 279 

stochastic optimization, 198, 290, 
292-93; financial markets and, 647 
stochastic processes. See Brownian 
motion; Markov chains 
stocks. See finance; portfolio theory 
stoichiometric compatibility class, 
633-34 

stoichiometric subspace, 631-34 
Stokes, George, 66, 446, 471, 635-36, 
639 

Stokes equations, 471, 615, 697 
Stokes flows, 471 
Stokes line, 636-40 
Stokes multiplier, 639 
Stokes number, 667 
Stokes phenomenon, 636, 638-39 
Stokes’s theorem, 27 
stopping time, 324 
storage functions, 531 
Störmer-Verlet method, 302-4 
strain, 149, 506-9, 511. See also 
deformation 
strain compatibility, 512 
strain ellipsoid, 449 
strain-rate tensor, 449, 468-69; 

flame propagation and, 854 
strange attractors, 391-92; homo- 
clinic tangencies and, 398 
Strassen’s method for matrix 
multiplication, 44, 54, 578 
stream function, 307, 468-69 
streamline-diffusion finite-element 
method, 312-13 
streamlines, 468-69 
strength of materials. See solid 
mechanics 

stress, 149-50, 506, 509-10; concen- 
trated at a hole, 513-14; on elastic 
materials, 452-55, 511, 513-14; 


molecular basis of, 458; residual, in 
living tissues, 614; viscous, 727-28 
stress-energy tensor, 680 
stress equilibrium, 511-12 
stress tensor, 451, 469, 509; Light- 
hill, 784-85; Navier-Stokes equa- 
tions and, 598; remodeling of bone 
and, 613; symmetric, 666; viscous, 
784, 852 

stretching tensor, 449, 451 
stretch ratio, 508-10 
stretch tensors, 449 
stripmap synthetic-aperture radar, 
863 

structural stability, 384, 387-88 
structure, preserving, 51-52 
structured programming, 829, 837 
Sturm-Liouville problem, 16, 185; 
Mathieu functions and, 160; for 
quantum harmonic oscillator, 185 
Sturm sequences, 271 
subdifferential, 286 
subgraph, 101 
subgroup, 405 
subordinate norm, 25 
subspace of vector space, 22 
successive approximation, method 
of, 35 

successive overrelaxation (SOR) 
iteration, 75, 276 
summation, algorithms for, 40-42 
summation convention, 128, 130, 

163, 379, 507, 580, 784 
sun: magnetic field of, 476, 479-81, 
483; mass of, 771-72; Schwarzs- 
child solution and, 583; tempera- 
ture of Earth and, 485-87, 491; 
three-body problem and, 773 
superasymptotics, 635-37 
superconductivity, 148-49, 225 
support vector machines, 357 
supremum (sup), 11 
supremum norm, 23, 99 
Surface Evolver, 738, 740 
surface plasmons, 679 
surface rebinning, 868 
surface tension, 736-37, 739 
SVD. See singular value 
decomposition (SVD) 

Sverdrup balance, 496 
Swift-Hohenberg equation, 401, 
459-62, 464-65, 467 
swimming microorganisms, 614-16 
swimsuits, high-tech, 602-4 
Sylvester equation, 40, 166, 168-69 
Sylvester’s inertia theorem, 271 
symbolic computing, 32-33, 828; 

hybrid symbolic-numeric, 787-90 
symbolic dynamics, 390 
symmetric groups, 405 
symmetric linear operators, 239 
symmetric matrices, 21; diagonaliza- 
tion of, 112; eigenvalues of, 25; 
random ensembles of, 420. See 
also Hermitian matrices 
symmetrization, in calculus of 
variations, 223-24 
symmetry, 402-10; conservation 
laws and, 107-11, 381-82, 405; 
in continuum mechanics, 452; 
definition of, 404; differential 
equations and, 190, 407-10; 
dynamical systems and, 392; of the 
Lagrangian, 381; problem solving 
with, 39; in quantum mechanics, 
107, 416; reflectional, 403-5; of 
tensor indices, 580, 582; voting 
systems and, 894-95 
symmetry breaking, 111, 405, 407-10 
symmetry-breaking bifurcation, 878 
symmetry groups, 404-5; voting 
paradoxes and, 894-95 
Symm’s integral equation, 202 
symplectic flow, 302 
symplectic matrix, and algebraic 
Riccati equation, 166 
symplectic methods for numerical 
solution of ODEs: for Hamiltonian 
systems, 302-3; symplectic Euler 
method, 295-97, 303 
synteny, 845 

synthetic-aperture radar (SAR), 862-64 
systems biology, 879-83 

Takens-Bogdanov bifurcations, 
399-400 

Tam-Danielson window, 866-67 
tangent bifurcation, 394 
tangles, 754 
Taussky, Olga, 75 
tautochrone, 201 
taxi cab norm, 23 
Taylor-Couette flow, 406-7, 410 
Taylor-Proudman theorem, 494 
Taylor series, 13-14; for complex 
function, 20, 175; to linearize an 
inverse problem, 328; with matrix 
argument, 98; in numerical 
solution of ODEs, 293-95, 297; 
optimization techniques using, 

285, 288-90; remainder term 
for truncated series, 13, 29 
Taylor’s theorem, 175 


Taylor vortices, 474, 476, 728 
TCP/IP (transmission control protocol/ 
Internet protocol), 884, 886-87 
teaching applied mathematics, 

933-43; SIAM input to policy 
on, 958 

temperature distribution: in compos- 
ite materials, 103, 120; in sea ice, 
697. See also Earth system 
dynamics 

temporal graphs, 361 
tendencies, in evolution problem, 
345-46 

tensile instability, 515-16 
tensor decomposition, 578 
tensor fields, 448; visualization of, 
844-45 

tensor product patch, 576 
tensor product splines, 787-89 
tensors, 127-30; in continuum 
mechanics, 447-48; in general 
relativity, 580 

terminant integrals, 638-39 
terrorism, 803 
TeX, 837-38, 913, 915-16 
text mining, 887-91 
Theorema Egregium, 521, 614 
thermal creep effect, 431 
thermal energy of combustion, 

852-53 

thermal expansion, 511 
thermal wind, 495 
thermodynamics. See first law of 
thermodynamics; second law of 
thermodynamics 
thermostat: optimal location of, 
763-67; piecewise-smooth system 
controlled by, 769-70 
thin-film equation, 169-70 
Thomas-Fermi equation, 16 
three-body problem, 377, 773-74 
through variable, 605 
tidal forces, 579, 582, 585 
Tikhonov regularization, 206, 208, 

329, 867 

time irreversibility, 432, 456-57 
time-periodic solutions, 409 
time reversal invariance, and fields in 
interior, 857, 859-60 
time series: correlations in, 426. See 
also signal processing 
time-stepping schemes, 707, 719 
TIOBE Programming Community 
Index, 838 

tissues, biological: cardiac, 625-26; 
growth of, 613-14; modeling the 
properties of, 611 


Toeplitz matrices, 21, 51; in signal 
processing, 538, 543 
Toeplitz operators, 246 
tomography: borehole electrical, 505; 
electrical impedance, 334-35; 
positron emission (PET), 816-23; 
seismic travel-time, 330-31; for 
X-ray baggage screening, 866-68; 
X-ray computed (CT), 206-7, 327, 
330, 816-17 
TOP500, 280, 337, 954 
topological entropy, 390, 483 
topology: in image processing, 815. 

See also knots and links 
toric models, 577 
toric varieties, 572-73, 576-78 
toroidal harmonics, 234 
torque, 376 
total variation, 814-15 
tour, 778. See also traveling salesman 
problem 

tower of exponentials, 154 
trace, 25 
tracking, 545 

traction, 450-51, 509-10, 512-13 
traffic, conservation law for, 88 
transcendental functions, 19, 164, 
227, 759, 926 

transcription of genetic code, 880 
transcritical bifurcation, 394 
transfer functions: in control theory, 
524-26, 528-29; in data visualiza- 
tion, 844; optical, 674-75, 679 
transformation optics, 335, 733-34 
transition matrix, of Markov chain, 
116-17, 756 

transitivity of network. See clustering 
coefficient 

translational symmetries, 403-5 
translation of genetic code, 880 
transport: collisional, 433; collision- 
less, 433-34; linear elliptic equa- 
tions and, 198 
transportation network, 557 
transport equation, 307, 433-39 
transport properties of composite 
materials, 500-505, 698-700 
transpose of matrix, 21 
traveling salesman problem, 565, 

568, 778-81 

travel-time tomography, 330-31 
tree function, 153-54 
trees, 102, 557; enumerating, 558; of 
graph traversal algorithms, 758; 
spanning tree minimization 
problem, 564, 566 
Trefethen, Lloyd N., 75, 77, 277 
trefoil knot, 752 
triangle inequality, 23-24; error 
estimation with, 340; traveling 
salesman problem and, 779-80 
triangular matrix. See upper 
triangular matrix 
triangular truncation, 708 
Tricomi equation, 17, 170-71, 307 
tridiagonal matrix, 270, 272 
tropical mathematics, 795-96 
Truesdell, Clifford, 3, 62, 446 
true stress, 509-10 
tsunami deposits, 717-18 
tsunami modeling, 493, 712-20; 

numerical, 718-19; uses of, 716-18 
Tukey, John, 76-77, 94, 94n 
turbo equalization, 541 
turbulence, 724-32; aircraft noise 
and, 784, 786; bacterial, 615-16; in 
boundary layer, 748; cardiac, 626; 
decay law of, 729; dynamics of, 
727-28; in fluid dynamics of sport, 
598-99; homogeneous, 728-30; 
inhomogeneous, 730-32; instabili- 
ties leading to, 475; in pattern- 
forming systems, 465, 467; random- 
ness and structure in, 725-27; 
transition to, 728; transporting 
angular momentum from accreting 
mass, 484. See also vortices 
turbulent bore, 715-16, 718 
turbulent flames, 348-50, 852, 855-56 
turbulent reacting flows, 348-50 
turbulent wave front of tsunami, 
715-17 

Turing, Alan, 56, 72-73, 76 
Turing bifurcation, 878 
Turing instabilities, 459-60, 463-65, 
802 

turning points: of Airy’s equation, 

233; in bifurcation theory, 398-99; 
in WKB method, 214-16, 639-40 
twiddles, 12 
typesetting, 913 

U.K. mathematicians and 
government policy, 959-61 
ultrasound imaging, 826-27 
unbounded linear operator, 238-39 
uncertainty principle, 57, 413, 

926-27 

uncertainty quantification, 131-33, 
340-41, 658; in inverse problems, 
132-33, 330; tsunami hazard 
assessment as, 717 
unconstrained optimization, 283, 

285, 288-90 


uncoupling, 35-36 
uniform asymptotic expansions, 82 
uniform distribution function, 761 
uniformly elliptic PDE, 306, 317 
uniformly hyperbolic PDE, 307 
uniformly parabolic PDE, 307 
uniform norm, 23 
uniform pseudorandom numbers, 

762 

uniform random numbers, 761 
unitarily invariant norm, 266 
unitary matrices, 21; preference for, 
in numerical algorithms, 263; in 
QR factorization, 264-65; random 
ensembles of, 420 
unitary transformations, stable 
algorithms using, 275 
unit roundoff, 41, 97, 275, 835 
unwinding number, 152 
upper bound, 11 

upper Hessenberg matrix, 113-14, 
270 

upper trapezoidal matrix, 264 
upper triangular matrix, 21; 

algorithm for inverse of, 43 
upward continuation, 859 
urinary concentration, 622-23 
U.S. science policy and funding, 
957-58 

utility function: collective choice and, 
870-71; maximizing, 868-70; port- 
folio optimization and, 320-21, 
644-45, 650 

validation of model, 2, 54, 131, 340, 
343 

value function: of optimal stochastic 
control problem, 322; of optimal 
stopping problem, 324; in port- 
folio theory, 650 

Vandermonde determinant, 33-34; 
random-matrix eigenvalue 
distributions and, 423-24, 426 
Vandermonde matrix: in compressed 
sensing, 824; in sparse interpola- 
tion, 255 

van der Pol, Balthasar, 57, 59, 70 
van der Pol equation, 189, 384, 
388-89 

van der Waals-Cahn-Hilliard theory 
of phase transitions, 223 
variance: of sample, accurate compu- 
tation of, 46; sensitivity analysis 
and, 132 

variational assimilation, 708-9 
variational methods in spectral 
theory, 240-41 


variational principle, 134; Einstein’s 
field equations and, 145; in quan- 
tum mechanics, 418. See also cal- 
culus of variations; Euler-Lagrange 
equations 

variational problems: elliptic bound- 
ary-value problem restated as, 313; 
PDEs in, 197-98; variational 
inequality problems, 325 
variation of parameters, method of, 
184 

varieties, algebraic, 570-79 
vector calculus, 27; historical 
background of, 69 
vector fields: conservative, 156; in 
continuum mechanics, 447-48; 
of dynamical systems, 383, 387; 
on manifolds, 127; of ODEs, 182, 
187-89; sliding, 770; in solid 
mechanics, 507; visualization 
of, 844 

vector potential, electromagnetic, 
161, 380 

vector product, 27 
vectors, 20-21; angle between, 23 
vector space model, for text retrieval, 
888-89 

vector spaces, 22-24 
vector triple product, 27 
vehicle suspensions, 607-9 
velocity estimation, 860 
verification, 54, 340, 343, 531, 900, 
918; of weather forecasts, 711 
Veronese variety, 573, 576 
version control, 914-16, 924, 942 
vertex covers, 568 
vertex of graph, 103, 557 
vibrations, eigenvalues and, 236-37 
virial expansion, 517 
virtual work, principle of, 512-13 
virus dynamics of HIV, 693-94 
viscoelastic fluids, 453-54, 666, 671 
visco-elastoplasticity, 667, 671 
viscoplasticity, 670-71 
viscosity: aircraft noise and, 784; of 
blood, 612-13; boundary layer and, 
82; drag due to, 378; Euler equa- 
tions and, 146-47; of flame, 852; 
kinematic, 469, 724; Maxwell’s 
kinetic theory and, 431; Navier- 
Stokes equations and, 163, 451, 
468-69; quasilinear hyperbolic 
PDEs and, 124; Reynolds number 
and, 378; thin-film equation and, 
169; turbulence and, 724, 727-28, 
731. See also Reynolds number 





viscosity number of granular flow, 
667 

viscosity solution: of Hamilton- 
Jacobi equation, 199; of Monge- 
Ampère equation, 310 
viscous conservation law, 198-99 
viscous fingering, 934 
visualization, 843-47. See also digital 
imaging; image processing 
Vlasov equation and its variants, 
433-39, 441-43 

volatility models, in finance, 642 
Volterra integro-differential equa- 
tions, coupled nonlinear, 865 
volume rendering, direct, 844 
von Kármán, Theodore, 59, 67, 69, 
71, 73, 720, 725 
von Kármán flow, 470 
von Mises, Richard, 59, 67, 69-70, 

720 

von Neumann, John: applied math- 
ematics and, 56-59, 73; computa- 
tional science and, 336-37, 350; 
economics and, 71, 644, 650, 869; 
error analysis and, 77; foams and, 
740; Monte Carlo method and, 57; 
random number generation and, 
762; shockwaves and, 720; spec- 
tral theory and, 239-40, 426 
vortex ring, 471-72 
vortex shedding, 726, 786 
vortex sheet, 474-75 
vortex sound equation, 785-86 
vortices: in atmosphere, 492, 496; 
Burgers vortex, 470; in heart, 624; 
mean-field system of, 434; optical 
phase, 675; superconductivity and, 
225; in Taylor-Couette system, 
406-7, 410; Taylor vortices, 474, 
476, 728; transient instability 
leading to, 475 

vorticity, 469, 786; dynamics of, 
471-72; in flames, 855; potential, 
quasigeostrophic, 496-98; shock 
waves in fluids and, 722; turbu- 
lence and, 727 
vorticity equation, 469, 473; 

barotropic, 707 
vorticity vector, 727 
voting systems, 891-95. See also 
collective choice 
Voyager mission, 926 

walk, 562-63 
Walker circulation, 499 


Wasserstein metric, 355 
wave equation, 16-17, 156, 171, 192, 
241; behavior of solutions of, 194; 
coordinate systems for separability 
of, 234; in elliptic coordinates, 160; 
energy estimates and, 197; exact 
solutions of, 193; as linearized 
shallow-water equation, 168; uni- 
formly hyperbolic, 307. See also 
acoustic wave equation 
wavefronts in neuronal populations, 
877 

wave function, quantum mechanical, 
167, 411-16 

waveguides, 242, 247, 678-79 
wavelets, 31; function spaces and, 
100; image processing and, 812, 
814-15, 823, 826 

wavelike solutions of PDEs, 194-95 
wave packet, 414 
wave phenomena, 134 
weakly reversible reaction network, 
633-34 

weather, 485, 491-92; Lorenz 
equations and, 158-59 
weather prediction, numerical, 
705-12; historical background of, 
75, 706-7; visualization of, 846 
Weather Research and Forecasting 
Model, 711 

Weber functions, used in WKB 
method, 215 
Weber number, 736 
Weber parabolic cylinder functions, 
232 

web page ranking, 755-57; with 
Google PageRank, 4, 48, 276, 364, 
755-57; with HITS algorithm, 4-5, 
756 

Web sites: dissemination platforms, 
922-24; images displayed on, 28, 
812 

Weierstrass’s theorem, 29 
weighted graph, 103; spanning tree 
minimization problem for, 564, 

566 

weighted network, 361-62 
weighted spaces, 100 
weight function: in inner product, 22; 
for orthogonal basis functions, 
257-58 

Weil, André, 70, 591 
well-conditioned problems, 26 
well-posed problems, 50, 72, 199, 204, 
328; in numerical relativity, 682-83 


Weyl’s law, 240-41, 246 
Weyl tensor, 582-83, 587 
white noise, 325 
Whittaker functions, 232 
Wiener, Norbert, 71, 319, 545, 641 
Wiener-Hopf equations, 538 
Wiener-Hopf technique, 181 
Wiener integral, 319 
Wiener-Khinchin theorem, 537 
Wiener process, 319. See also 
Brownian motion 
Wiener solution, 538-39 
Wigner random matrices, 420, 
422-23, 425-27 

Wilkinson, James, 56, 73, 75, 77, 

275 

Willmore surfaces, 220 
Winfree oscillators, 928-29 
Wishart ensemble, 420, 422-24, 426 
witness set, 574 

WKB (Wentzel, Kramers, Brillouin) 
methods, 214-16, 637, 639-40 
workflow in producing a math- 
ematical paper, 912-16 
workflows for computational 
research, 922 

working memory, neural correlate 
of, 877-78 

wormhole, gravitational, 587, 685 
wrapper approach to feature 
selection, 354-55 
Wright ω function, 152 
writing about mathematics, 897-903; 
for the general public, 906-12; 
workflow for a paper, 912-16 
Wulff shape, 220 

Xampling, 826-27 
X-ray computed tomography (CT), 
206-7, 327, 330, 816-17; for 
baggage screening, 866-68 
X-ray transform, 817-19, 867 
XYZ color space, 810-12 

yacht design, 599-601 
yield surface, 511 
Youla parametrization, 528-29 
Young, Thomas, 636, 808 
Young-Laplace law, 520 
Young’s modulus, 149, 511 

zebrafish, 459 
Zeno solution, 104 
z-transform, 535-36, 543 



